Homework 3 - Exploring Data#

Written by Todd Gureckis

So far in this course we’ve learned a bit about using Jupyter, covered some Python basics, been introduced to Pandas dataframes, and made a start on some of the ways to plot data.

In this lab, we are going to put those skills into practice to try to “explore” a data set. At this point we are not really going to be doing much in terms of inferential statistics. Our goals are just to be able to formulate a question and then take the steps necessary to compute descriptive statistics and plots that might give us a sense of the answer. You might call it “Answering Questions with Data.”

First, we load the basic packages we will be using in this tutorial. Remember how we import the modules using an abbreviated name (import XC as YY). This is to reduce the amount of text we type when we use the functions.

Note: %matplotlib inline is an example of something specific to Jupyter called a ‘magic command’ (a line magic, since it starts with a single %). It makes plots appear within the notebook instead of opening a separate window.

%matplotlib inline 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
from datetime import datetime
import random
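
As a small illustration of what the abbreviated name buys us, here is a minimal sketch. The list values is made up for the example; np is simply a shorter handle for the numpy module.

```python
import numpy as np  # "np" is just a shorter name for the numpy module

values = [2, 4, 6]  # made-up numbers for illustration

# np.mean is the same function we would otherwise write as numpy.mean
mean_value = np.mean(values)
print(mean_value)
```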

Reminders of basic pandas functions#

As a reminder of some of the pandas basics, let’s revisit the data set of professor salaries we have considered over the last few lectures.

salary_data = pd.read_csv(
    "https://teaching.gureckislab.org/spring24/labincp/data/salary.csv",
    sep=",",
    header="infer",
)

Notice that the dataframe is now called salary_data instead of df. It could have been called anything, as it is just our variable to work with. However, you’ll want to be careful with your naming when copying commands and other code from earlier work.

Peek at the data#

salary_data.head()
   salary  gender  departm  years   age  publications
0   86285       0      bio   26.0  64.0            72
1   77125       0      bio   28.0  58.0            43
2   71922       0      bio   10.0  38.0            23
3   70499       0      bio   16.0  46.0            64
4   66624       0      bio   11.0  41.0            23

Get the size of the dataframe#

The result is given in (rows, columns) format.

salary_data.shape
(77, 6)
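
Because .shape is a tuple, it can be unpacked into two separate variables. This is a minimal sketch on a made-up dataframe (toy_df and its values are hypothetical, not the salary data).

```python
import pandas as pd

# A tiny made-up dataframe standing in for the real data
toy_df = pd.DataFrame({"salary": [50000, 60000, 70000], "years": [1, 5, 9]})

# .shape is a (rows, columns) tuple, so it unpacks into two names
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```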

Access a single column#

salary_data[["salary"]]
salary
0 86285
1 77125
2 71922
3 70499
4 66624
... ...
72 53662
73 57185
74 52254
75 61885
76 49542

77 rows × 1 columns
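
Note that the double brackets used above return a dataframe with one column, whereas single brackets return a Series. The sketch below shows the difference on a made-up dataframe (toy_df is hypothetical).

```python
import pandas as pd

toy_df = pd.DataFrame({"salary": [50000, 60000], "years": [1, 5]})

as_series = toy_df["salary"]    # single brackets: a one-dimensional Series
as_frame = toy_df[["salary"]]   # double brackets: a DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

Either form works with .describe(); the dataframe form just keeps the column header in the output.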

Compute some descriptive statistics#

salary_data[["salary"]].describe()
salary
count 77.000000
mean 67748.519481
std 15100.581435
min 44687.000000
25% 57185.000000
50% 62607.000000
75% 75382.000000
max 112800.000000

salary_data[["salary"]].count()  # how many non-missing values are in this column?
salary    77
dtype: int64
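
One detail worth knowing is that .count() only counts non-missing values, so it can differ from the number of rows when a column contains NaN. A minimal sketch with a made-up column (toy_df is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
toy_df = pd.DataFrame({"age": [30.0, np.nan, 45.0]})

n_rows = len(toy_df)                     # counts every row
n_non_missing = toy_df["age"].count()    # skips NaN
n_missing = toy_df["age"].isna().sum()   # counts only the NaN entries

print(n_rows, n_non_missing, n_missing)
```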

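Creating a new column based on the values of others#

# create the new column (initialized to 0), then overwrite it
# with publications divided by years, computed row by row
salary_data["pubperyear"] = 0
salary_data["pubperyear"] = salary_data["publications"] / salary_data["years"]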
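
When a new column is a ratio, it is worth checking what happens if the denominator is zero. The sketch below uses made-up numbers (toy_df is hypothetical; whether the salary data contains any zero values for years is not something this example assumes) and shows one possible way to turn the resulting infinite values into missing values.

```python
import numpy as np
import pandas as pd

# Made-up data where one person has 0 years of experience
toy_df = pd.DataFrame({"publications": [10, 3, 4], "years": [5.0, 0.0, 2.0]})

# Dividing by 0.0 gives inf rather than raising an error
toy_df["pubperyear"] = toy_df["publications"] / toy_df["years"]

# One option: treat those infinite ratios as missing values
toy_df["pubperyear"] = toy_df["pubperyear"].replace([np.inf, -np.inf], np.nan)
print(toy_df)
```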

Selecting only certain rows#

sub_df = salary_data.query("salary > 90000 & salary < 100000")
sub_df
    salary  gender  departm  years   age  publications  pubperyear
14   97630       0     chem   34.0  64.0            43    1.264706
30   92951       0    neuro   11.0  41.0            20    1.818182
54   96936       0  physics   15.0  50.0            17    1.133333

sub_df.describe()
            salary  gender      years        age  publications  pubperyear
count      3.00000     3.0   3.000000   3.000000      3.000000    3.000000
mean   95839.00000     0.0  20.000000  51.666667     26.666667    1.405407
std     2525.03802     0.0  12.288206  11.590226     14.224392    0.363458
min    92951.00000     0.0  11.000000  41.000000     17.000000    1.133333
25%    94943.50000     0.0  13.000000  45.500000     18.500000    1.199020
50%    96936.00000     0.0  15.000000  50.000000     20.000000    1.264706
75%    97283.00000     0.0  24.500000  57.000000     31.500000    1.541444
max    97630.00000     0.0  34.000000  64.000000     43.000000    1.818182
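
The same kind of row selection can also be written with a boolean mask instead of .query(). This is a minimal sketch on made-up salaries (toy_df is hypothetical) showing that the two forms select the same rows.

```python
import pandas as pd

toy_df = pd.DataFrame({"salary": [85000, 92000, 97000, 101000]})

# Selection written as a query string
via_query = toy_df.query("salary > 90000 & salary < 100000")

# The same selection as a boolean mask; each comparison needs its own parentheses
via_mask = toy_df[(toy_df["salary"] > 90000) & (toy_df["salary"] < 100000)]

print(via_query.equals(via_mask))
```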

Get the unique values of a column#

salary_data["departm"].unique()
array(['bio', 'chem', 'geol', 'neuro', 'stat', 'physics', 'math'],
      dtype=object)

How many unique department values are there?

salary_data["departm"].unique().size
7

or

len(salary_data["departm"].unique())
7
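
Two related pandas methods are often handy here: .nunique() counts the distinct values directly, and .value_counts() reports how often each one occurs. A minimal sketch on a made-up Series (departments is hypothetical):

```python
import pandas as pd

departments = pd.Series(["bio", "chem", "bio", "math", "bio"])

n_unique = departments.nunique()      # number of distinct values
counts = departments.value_counts()   # how many times each value appears

print(n_unique)
print(counts)
```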

Breaking the data into subgroups#

male_df = salary_data.query("gender == 0").reset_index(drop=True)
female_df = salary_data.query("gender == 1").reset_index(drop=True)

Recombining subgroups#

pd.concat([female_df, male_df], axis=0).reset_index(drop=True)
    salary  gender  departm  years   age  publications  pubperyear
0    59139       1      bio    8.0  38.0            23    2.875000
1    52968       1      bio   18.0  48.0            32    1.777778
2    55949       1     chem    4.0  34.0            12    3.000000
3    58893       1    neuro   10.0  35.0             4    0.400000
4    53662       1    neuro    1.0  31.0             3    3.000000
..     ...     ...      ...    ...   ...           ...         ...
71   82142       0     math    9.0  39.0             9    1.000000
72   70509       0     math   23.0  53.0             7    0.304348
73   60320       0     math   14.0  44.0             7    0.500000
74   55814       0     math    8.0  38.0             6    0.750000
75   53638       0     math    4.0  42.0             8    2.000000

76 rows × 7 columns
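
Notice that the recombined dataframe has 76 rows, while salary_data has 77. One plausible explanation (an assumption, not something the output above confirms) is that one row has a gender value that is neither 0 nor 1, so neither query selects it. The sketch below reproduces that situation on made-up data (toy_df is hypothetical) and shows how .value_counts(dropna=False) could be used to check.

```python
import pandas as pd

# Made-up data where one row has an unexpected gender code
toy_df = pd.DataFrame({"salary": [50000, 60000, 70000], "gender": [0, 1, 2]})

group0 = toy_df.query("gender == 0").reset_index(drop=True)
group1 = toy_df.query("gender == 1").reset_index(drop=True)
recombined = pd.concat([group1, group0], axis=0).reset_index(drop=True)

# The row coded 2 belongs to neither subgroup, so it is silently lost
print(len(toy_df), len(recombined))

# Listing every value (including missing ones) reveals the odd entry
gender_counts = toy_df["gender"].value_counts(dropna=False)
print(gender_counts)
```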

Scatter plot two columns#

ax = sns.regplot(x=salary_data.age, y=salary_data.salary)
ax.set_title("Salary and age")
[Figure: "Salary and age", a scatter plot of salary against age with a fitted regression line]

Histogram of a column#

sns.displot(salary_data["salary"])
[Figure: histogram of the salary column]

You can also place two histograms side by side in the same figure to compare them more easily.

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 3), sharey=True, sharex=True)
# Left panel: male salaries in 5000-wide bins. The upper bound is padded by one
# bin width so that the highest salary is not left out of the last bin.
sns.histplot(
    male_df["salary"],
    ax=ax1,
    bins=range(male_df["salary"].min(), male_df["salary"].max() + 5000, 5000),
    kde=False,
    color="b",
)
ax1.set_title("Salary distribution for males")
# Right panel: female salaries, binned the same way.
sns.histplot(
    female_df["salary"],
    ax=ax2,
    bins=range(female_df["salary"].min(), female_df["salary"].max() + 5000, 5000),
    kde=False,
    color="r",
)
ax2.set_title("Salary distribution for females")
ax1.set_ylabel("Number of faculty")
for ax in (ax1, ax2):
    sns.despine(ax=ax)
fig.tight_layout()
[Figure: two side-by-side histograms, "Salary distribution for males" and "Salary distribution for females"]

Group the salary column using the department name and compute the mean#

salary_data.groupby("departm")["salary"].mean()
departm
bio        63094.687500
chem       66003.454545
geol       73548.500000
math       60920.875000
neuro      76465.600000
physics    67987.000000
stat       67242.800000
Name: salary, dtype: float64
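
If more than one summary statistic per group is wanted, .agg() accepts a list of statistic names. A minimal sketch on made-up data (toy_df and its values are hypothetical):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "departm": ["bio", "bio", "chem", "chem"],
        "salary": [60000, 70000, 50000, 80000],
    }
)

# One row per department, one column per requested statistic
summary = toy_df.groupby("departm")["salary"].agg(["mean", "median", "count"])
print(summary)
```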

Group the age column using the department name and compute the modal age of the faculty#

First let’s check the age of everyone.

salary_data.age.unique()
array([64., 58., 38., 46., 41., 60., 53., 40., 50., 43., 56., 61., 65.,
       nan, 45., 48., 34., 37., 44., 39., 49., 59., 32., 35., 51., 42.,
       31., 33.])

Ok, there are a few people who don’t have an age so we’ll need to drop them using .dropna() before computing the mode.

Since grouped data doesn’t seem to come with a mode function by default, we can write our own custom function and use it as a descriptive statistic with the .apply() command. Here is an example of how that works.

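def my_mode(my_array):
    # np.bincount expects non-negative integers, so cast the (float) ages first
    counts = np.bincount(np.asarray(my_array, dtype=int))
    # the index of the largest count is the most frequent age;
    # if several ages tie, the smallest one is returned
    mode = np.argmax(counts)
    return mode, counts[mode]


# we need to drop the rows with missing values before computing the mode
salary_data.dropna().groupby("departm")["age"].apply(my_mode)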
departm
bio        (38, 4)
chem       (34, 3)
geol       (37, 1)
math       (33, 1)
neuro      (41, 3)
physics    (50, 2)
stat       (32, 2)
Name: age, dtype: object
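
As an alternative sketch, the .mode() method that pandas provides on a Series can be applied within each group. This is not the approach used above, just another way to get the modal value; toy_df is made up for illustration, and when several values tie this version keeps the smallest.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "departm": ["bio", "bio", "bio", "chem", "chem"],
        "age": [38.0, 38.0, 41.0, 34.0, np.nan],
    }
)

# Drop rows with a missing age, then take the first (smallest) mode per group
modal_age = (
    toy_df.dropna(subset=["age"])
    .groupby("departm")["age"]
    .agg(lambda ages: ages.mode().iloc[0])
)
print(modal_age)
```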

OkCupid Data#

This document is an analysis of a public dataset of almost 60000 online dating profiles. The dataset has been published in the Journal of Statistics Education, Volume 23, Number 2 (2015) by Albert Y. Kim et al., and its collection and distribution was explicitly allowed by OkCupid president and co-founder Christian Rudder. Using these data is therefore ethically and legally acceptable; this is in contrast to another recent release of a different OkCupid profile dataset, which was collected without permission and without anonymizing the data (more on the ethical issues in this Wired article).

profile_data = pd.read_csv(
    "https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles_revised.csv.zip?raw=true",
    compression="zip",
)

Citibike Data#

As you know, Citibikes are the bike share system in NYC. What you might not realize is that a lot of the data about citibikes is made public by the city. As a result it is a fun dataset to download and to explore. Although this dataset is not exactly “cognition and perception” it is readily available and is a great way to train up your exploratory data analysis skills!

Loading Data#

The Citibike data are provided on a per-month basis as a .zip file. As a result, we first have to download the file and unzip it. Luckily, Python can do that for you!
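
To give a sense of what unzipping and then reading involves, here is a minimal self-contained sketch using the standard-library zipfile module. It builds a tiny zip archive in memory rather than downloading anything; the file name toy-tripdata.csv and the columns ride_id and duration_min are made up for the example and are not the real Citibike columns.

```python
import io
import zipfile

import pandas as pd

# Build a tiny zip archive in memory that holds one CSV file
csv_text = "ride_id,duration_min\nA1,12\nB2,7\n"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("toy-tripdata.csv", csv_text)
buffer.seek(0)

# Open the archive, then hand the CSV file inside it to pandas
with zipfile.ZipFile(buffer) as archive:
    with archive.open("toy-tripdata.csv") as csv_file:
        toy_trips = pd.read_csv(csv_file)

print(toy_trips)
```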

Remember, we use pd.read_csv() to load a .csv file into a Pandas dataframe. In this case our Citibike dataframe will be called trip_data. Be sure to read the code and try to get a sense of what it is doing. Ask questions if you are unsure.

trip_data = pd.read_csv(
    "https://teaching.gureckislab.org/spring24/labincp/data/JC-202401-citibike-tripdata.csv"
)

This command loads the data from January 2024 (the file is called JC-202401-citibike-tripdata.csv). Can you guess how we know the date of the data?

Turning in homeworks#

When you are finished with this notebook, save your work in order to turn it in. To do this, select File->Download As…->PDF.

You can turn in your assignments on Gradescope (this will be described in class).