6. The Format and Structure of Digital Data#

Note

This chapter was authored by Todd M. Gureckis and Kelsey Moty and is released under the license for this book.

6.1. Video Lecture#

This video provides a complementary overview of this chapter. Some things are mentioned in the chapter but not in the video, and vice versa. Together they give an overview of this unit, so please both read and watch.

6.2. How to think about data and organize it#

Data is an important concept in science. The first step is that we measure something. In a previous chapter we discussed issues in measurement, including different types of scales, units, etc. The main thing, though, is that data is born when a number is assigned to some observation we make. Here is a lonely single number which perhaps measures something:
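For example (the value here is made up):

measurement = 98.6   # a single, hypothetical measurement (e.g., a body temperature reading)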

Things get more interesting when we make multiple observations, so we have many data points. In order to do anything with more than one number, though, we start running into the question of how to organize things. How do we keep track of which number was collected first, second, or third, for instance? If we have just a big jumble of numbers we can easily get lost.

Before we get too many numbers we need to start thinking abstractly about organizing our measurements into some type of collection. In our previous chapter on the basics of Python we discussed the concept of collections. Collections are things in the native Python language that organize multiple numbers, strings, or even other collections. These include lists (which organize things in some order), dictionaries (which do not preserve order but allow “looking up” the value of a number by a special address known as a key), and sets (which are unordered collections of things useful for counting and performing set operations like union, etc.).

For instance, if we measured a number repeatedly in time, a list might be a useful way to organize the values:
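A minimal sketch, with made-up numbers:

measurements = [120, 118, 122]   # three hypothetical readings, in the order they were collected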

Let’s imagine the numbers above represent some measurement on a person on three different days (Monday, Tuesday, Wednesday). It might be their blood pressure or temperature. Learning a lot about one person is nice and fits cleanly into a list. However, things often get more complex. What if instead we have multiple people we are recording information from, so the data starts looking two dimensional? Perhaps there are 3 people in the study and each has 3 measurements. In that case we might then organize the data as a list of lists (or a matrix):
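For example, with made-up numbers (each inner list is one person):

data = [[120, 118, 122],   # person 1
        [130, 128, 131],   # person 2
        [115, 117, 116]]   # person 3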

Although a matrix is a nice way to organize multiple observations made on multiple people it can get a little bit confusing. Are the rows or the columns the subjects in the example above? You have to remember to write that down otherwise you could easily get confused. What is this data? What does it describe? For this reason we might look beyond standard Python collections to more complex structures.

For example, you are all likely familiar with the concept of a spreadsheet (e.g., Microsoft Excel or Google Sheets). These are nicer than matrices because they have named rows and columns, and so just by looking at the structure of the spreadsheet you can often learn more about the structure of the data. Column names are sometimes known as metadata because they are data themselves that describe other data.

This is much nicer because the column names help us remember what each measurement records and also what is contained in the rows.

The two dimensional structure of a spreadsheet is generally the most complex type of data that we will deal with. However, just so you are aware, as data gets really big it can make sense to take other organizational approaches. For instance, a website that had millions of users reviewing movies might not want to make a long spreadsheet recording each user, the rating they gave, and the movie the rating was about. Stop and think for a moment how you could do that in a spreadsheet. Perhaps you could make each row a user and each column a movie. However, as you know, Netflix and other sites have hundreds of thousands of TV shows and movies, so the data would be really unwieldy. Also, most of the data would be “missing” because people only provide a rating for a small subset of the total number of movies. As a result, big websites adopt alternative ways of organizing data, including the use of what are known as relational databases.

Relational databases are made up of multiple spreadsheets (effectively) where each one represents only a smaller piece of the data. In the figure above the green words are columns of a smaller spreadsheet. This database is thus organized into four separate spreadsheets (or “tables”). For instance, the ratings table has a unique id number (i.e., row number) and then the rating a person gave, a user id of who gave it, and a movie id for which movie it was about. Separately there is a movies table which has its own id (a unique id for each movie), the title or name of the movie, and a description/summary of the movie. The orange lines reflect links where the value in one column of one table connects with or refers to the entries of another one. This can be a much more efficient way to organize the data.

The main point this example brings forward is that the way you organize your data is something you really have to think about and plan from the beginning. I’ve found that this topic is actually intuitively interesting to many students. The reason is that we love organizing our homes and offices. It feels great. When it comes to data, taking the same mentality, that “fresh” feeling of being organized, is really key to making good scientific analyses that do not have bugs. Problems with data organization can even be deadly (How not to kill people with spreadsheets). If you get really interested in organizing data, there are entire books [Glu13] and fields of study about how to best organize data (e.g., library and information sciences). The choices you make in how to organize your data at one point in time really influence how easy it is to do certain analyses later. More on this later.

6.3. Common file formats for data#

Data often come in files with a particular formatted structure. This structure makes it easier for computer programs to read in the data and manipulate it. In this section we will go over a couple of the common data formats you might encounter in traditional psychological research.

6.3.1. Excel Workbooks (.xls, .xlsx files)#

XLSX (or XLS in older versions of Excel) is the default file format used by Excel. Under the hood, Excel Workbooks are built using a highly structured markup language called Extensible Markup Language (XML). Essentially what this means is that while you are using Excel's graphical interface to edit your data, Excel is adding a bunch of XML tags behind the scenes to the XLSX file so it knows how to properly format the data each time you open it up. All of these XML tags are what allow you to change font colors and sizes, add graphs and images, and use formulas.

This is what you see when you use Excel:

And this is what the exact same Workbook looks like behind the scenes:

But this complexity can also make it difficult to use XLSX files with any other software or programming language that isn’t Excel. For one, not all programs can open XLSX files, so these data files can make your data inaccessible to other people. Two, all of this special formatting can sometimes lead to problems when you read in data from XLSX files (e.g., converting variables to weird formats). To avoid issues like these, it is preferable to store your data using plain-text formats (such as CSV and TSV files).

6.3.2. Comma-separated Values (.csv files)#

CSV, or comma-separated value files, on the other hand, are written using a plain-text format. No special tags, just text. The only formatting in a CSV file is the commas used to delimit the values in different columns and the new lines used to indicate rows. You can see this if you open up a CSV file using Notepad (or another text editor).

Here’s the same dataset as before but now as a CSV file:

This means that what you can store within a CSV file is quite limited (no images or special formatting, for example). But it also means that this kind of file is great for storing and sharing data. CSV files can be opened by any software, making it accessible to others. Given that CSV files inherently have no formatting, some of the issues that come with using XLSX files never arise.
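To make this concrete, here is a rough sketch of what the first few rows of a small CSV file look like as raw text (these lines mimic the salary dataset used later in this chapter; the exact formatting of a real file may differ):

salary,gender,departm,years,age,publications
86285,0,bio,26,64,72
77125,0,bio,28,58,43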

6.3.3. Tab-separated Values (.tsv files)#

TSV files are very similar to CSV files, except that instead of using commas to delimit the values in a dataset, they use tabs (a special type of spacing, the same type you get when you hit the tab key on your computer).

See also

It is a little bit beyond the scope of our class, but another file format you might encounter is JSON, which stands for JavaScript Object Notation. JSON is similar to a Python dictionary and is stored in a plain text file like a CSV. However, JSON is somewhat more flexible than the two-dimensional structure of a spreadsheet. Often one analysis step is to convert JSON into a 2D spreadsheet-type structure. Here is a helpful guide to JSON in python.
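A minimal sketch of what that conversion step can look like, using a small made-up JSON string and the pandas library introduced later in this chapter:

import json
import pandas as pd

# a tiny, made-up JSON string describing two observations
json_text = '[{"name": "anne", "age": 30}, {"name": "ben", "age": 25}]'
records = json.loads(json_text)     # parse the text into Python lists/dictionaries
df = pd.json_normalize(records)     # flatten into a 2D, spreadsheet-like dataframe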

6.4. Exporting from spreadsheet programs#

6.4.1. Google Sheets#

To export a .csv file from Google Sheets, you click on File > Download > Comma-separated values (.csv).

If you created a Google Sheet with multiple sheets, you will have to save each individual sheet as a .csv file because .csv files do not support multiple sheets. To do this, you will have to click on each sheet and go through the save process for each sheet.

6.4.2. Excel#

To create a .csv file from Excel, click on File (found on the top left of the menu bar).

A new menu will appear. From there, select “Save as” and then choose where on your computer you want to save the file. A pop-up window will open, and from there, you can choose what to name the file and what kind of file type to save it as. To save it as a .csv, use the dropdown menu (labeled “Save as type”) to select CSV as the file type.

6.5. Uploading csv files to JupyterHub#

Many of the datasets you will be working with this semester will already be available for you to use on JupyterHub. However, if you want to work with your own datasets, you will need to upload them yourselves to the cluster.

To do that, go to the “Files” tab on JupyterHub and use the “Upload” button.

6.6. The Pandas library and the concept of a dataframe API#

Throughout this class there are several libraries (i.e., tools which extend the capabilities of the core Python language) which we will use a lot. One of those is Pandas. Pandas is an open-source project for Python that provides data analysis and manipulation tools. Pandas gives you a way to interact with a dataset organized like a spreadsheet (meaning columns and rows which are possibly named or labeled with metadata) in order to perform sophisticated data analyses. Pandas is really the “backbone” of the Python data science and modeling environment. Pandas could be thought of as a language (or API) for unlocking tabular data. If you want to become better at data analysis with Python there is no single package I can think of that is worth more of your time learning.

Although there is no required book for this class, “Learning the Pandas Library: Python Tools for Data Munging, Analysis and Visualization” by Matt Harrison [Har16] is highly recommended, both because it is short and to the point and because it explains key Pandas concepts very clearly. In addition, the Pandas project includes a very helpful user guide which explains many of the key concepts.

Pandas can be confusing at first!

I was aware of Pandas for several years and never really “understood it.” However, when it clicked it opened a universe of data analysis to me. It takes some patience in part because data manipulation is a very complex topic at least at the conceptual level.

The goal of this guide is not to give a complete description of everything in Pandas but to give you a sense of how you use Pandas (with code) to do the types of tasks you would typically do in a spreadsheet like Excel or Google Sheets. In addition, we show a few of the key features of Pandas that are not that easy to do in Excel but are very useful for behavioral data analysis.

The first step of understanding Pandas is to load it:

import numpy as np
import pandas as pd

The two code lines above are the typical way to load in the Pandas library. Numpy is a related library that Pandas is built on top of (it really is like a Russian doll where one project enables the next!). Traditionally, pandas is imported ‘as’ pd and numpy ‘as’ np. This means that when you are reading code online, if you see pd.somefunction() you can probably assume that the function is part of the pandas library because the function begins with the pd. syntax.

6.7. Reading data (e.g., csvs) into Python using Pandas#

Before getting further with Pandas it helps to first understand how to read some data into a dataframe object or how to create a dataframe from scratch. Usually you are reading data in from some other file (e.g., a CSV file) and “loading” it into Pandas. Pandas has many different ways for reading in different file types. Here, we will use pd.read_csv() because we will mostly be working with .csv files in this class.

When reading in your .csv file, there are two things you absolutely have to do:

First, you need to store the data into a variable. Otherwise, you won’t be able to work with your data. Here, we called the dataframe “df” but you can name it whatever you want.

# incorrect: won't store your data
pd.read_csv('salary.csv')

# correct: creates a dataframe called df that you can work with 
df = pd.read_csv('salary.csv')

Second, you need to tell pd.read_csv where the file you are trying to import is located. The path can point to a file located in a folder in your local environment (e.g., on your computer, on JupyterHub) or to a file available online.

To point to a file available online, you’ll put the link to the file.

df = pd.read_csv('http://gureckislab.org/courses/fall19/labincp/data/salary.csv')

If you are pointing to a file on JupyterHub (or to a file on your computer, if you have Python installed on your computer), you’ll need to specify the path to the file. If the .csv file and your notebook are in the same folder, you only have to put the name of the file.

df = pd.read_csv('salary.csv')

If the file is located in a different folder than the notebook you are working with, you will have to specify which folder(s) that the computer needs to look at to find it.

df = pd.read_csv('folder1/folder2/salary.csv')

Sometimes, your .csv file might be saved in a folder that’s not in the same folder as your notebook. For example, say you have a folder called “project”. In that folder, there is a folder called “code” that contains your notebooks/python code, as well as a folder called “data” that contains your data (.csv) files. To import your .csv file, you need to use .. to tell your computer to go up one folder in your directory (get out of the “code” folder) in order for it to find the “data” folder.

df = pd.read_csv('../data/salary.csv')

By default, pd.read_csv assumes that the file uses commas to differentiate between different entries in the file. But you can also explicitly tell pd.read_csv that you want to use commas as the delimiter.

df = pd.read_csv('salary.csv', sep = ",")
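The same sep argument is how you would read the tab-separated (.tsv) files described earlier; a minimal sketch, assuming a hypothetical file named salary.tsv:

df = pd.read_csv('salary.tsv', sep='\t')   # '\t' is the tab character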

pd.read_csv also assumes by default that the first row in your .csv files lists the names for each column of data (the column headers). You can explicitly tell pd.read_csv to do this by writing:

df = pd.read_csv('salary.csv', sep = ",", header = 'infer')

Sometimes, datasets may have multiple headers (e.g., both the first and second rows of the dataset list column names). pd.read_csv allows you to keep both rows as headers by modifying the header argument with a list of integers. Remember that 0 actually means the first row.

df = pd.read_csv('salary.csv', sep = ",", header = [0,1])

When creating your own datasets, it’s generally best practice to give your columns headers so that it’s easier for people looking at your data (including yourself!) to know what’s in the dataset. However, you may occasionally come across datasets that do not have headers. To tell pd.read_csv there’s no header to import, set header to None:

df = pd.read_csv('salary.csv', sep = ",", header = None)

Warning

When creating your own dataset, refrain from using characters like space or period (.) in the column names. This will make things easier for you down the line when using Pandas for statistical modeling.

6.8. Viewing the data in a dataframe#

So now we have loaded a dataset into a variable called df. Now we might like to look at the data to check it was properly read in and also to learn more about the structure of this dataset. Perhaps the simplest method is simply to type the name of the dataframe variable by itself in a code cell:

df = pd.read_csv('http://gureckislab.org/courses/fall19/labincp/data/salary.csv')
df
salary gender departm years age publications
0 86285 0 bio 26.0 64.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23
... ... ... ... ... ... ...
72 53662 1 neuro 1.0 31.0 3
73 57185 1 stat 9.0 39.0 7
74 52254 1 stat 2.0 32.0 9
75 61885 1 math 23.0 60.0 9
76 49542 1 math 3.0 33.0 5

77 rows × 6 columns

This outputs a “table” view of the dataframe showing the column names, and several of the rows of the dataset. It doesn’t show you all of the data at once because in many large files this would be too much to make sense of.

You can also specifically request the first several rows of the dataframe:

df.head()
salary gender departm years age publications
0 86285 0 bio 26.0 64.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23

or the last several rows:

df.tail()
salary gender departm years age publications
72 53662 1 neuro 1.0 31.0 3
73 57185 1 stat 9.0 39.0 7
74 52254 1 stat 2.0 32.0 9
75 61885 1 math 23.0 60.0 9
76 49542 1 math 3.0 33.0 5

The “head” of the dataframe is the top. The “tail” of the dataframe is the bottom.

We can also access individual rows and columns of a data frame.

6.8.1. Accessing individual columns#

To access a single column you can index it like a dictionary in Python using the column name as a key:

df['salary']
0     86285
1     77125
2     71922
3     70499
4     66624
      ...  
72    53662
73    57185
74    52254
75    61885
76    49542
Name: salary, Length: 77, dtype: int64
df['age']
0     64.0
1     58.0
2     38.0
3     46.0
4     41.0
      ... 
72    31.0
73    39.0
74    32.0
75    60.0
76    33.0
Name: age, Length: 77, dtype: float64

Another allowed syntax is to use the name of the dataframe variable and a .columnname. For instance:

df.salary
0     86285
1     77125
2     71922
3     70499
4     66624
      ...  
72    53662
73    57185
74    52254
75    61885
76    49542
Name: salary, Length: 77, dtype: int64

However, the dictionary key-lookup method is preferred because it is possible that the name of a column is the same as a dataframe method (see below), and that causes Python to get confused.
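As a hypothetical illustration of that kind of clash, imagine a dataframe with a column literally named count. Because dataframes already have a built-in .count() method, the dot syntax gives you the method rather than your column, while the bracket syntax works as expected:

clash_df = pd.DataFrame({"count": [1, 2, 3]})
clash_df['count']   # the column of data
clash_df.count      # the built-in .count() method, not the column!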

See also

When we learn about visualizing data, describing data, and linear regression you will see how the column structure of dataframes becomes very powerful.

6.8.2. Accessing individual rows#

Since the Python bracket notation is used to look up columns, a special command is needed to access rows. The best way to look up a single row is to use .iloc[] where you pass the integer row number you want to access (zero indexed). So to get the first row you type:

df.iloc[0]
salary          86285
gender              0
departm           bio
years              26
age                64
publications       72
Name: 0, dtype: object

Here is the 5th row:

df.iloc[4]
salary          66624
gender              0
departm           bio
years              11
age                41
publications       23
Name: 4, dtype: object

or counting three backwards from the end:

df.iloc[-3]
salary          52254
gender              1
departm          stat
years               2
age                32
publications        9
Name: 74, dtype: object

6.8.3. Indexes and Columns#

Also note that there are two special elements of a normal dataframe called the column index and the row index (or just index). The row index is the column on the left that has no name but seems like a counter of the rows (e.g., 72, 73, 74, …). The row index is useful in Pandas dataframes for looking things up by row. Although you can index a row by counting (access the fifth row for instance), the index can be made on arbitrary types of data including strings, etc… You don’t need to know a ton about indexes to use Pandas typically but every once in a while they come up so it is useful to know the special nature of the index column.

df.index
RangeIndex(start=0, stop=77, step=1)
df.columns
Index(['salary', 'gender', 'departm', 'years', 'age', 'publications'], dtype='object')

The above code shows how to find the row index and column index.

You can change the row index to another column:

df.set_index('departm')
salary gender years age publications
departm
bio 86285 0 26.0 64.0 72
bio 77125 0 28.0 58.0 43
bio 71922 0 10.0 38.0 23
bio 70499 0 16.0 46.0 64
bio 66624 0 11.0 41.0 23
... ... ... ... ... ...
neuro 53662 1 1.0 31.0 3
stat 57185 1 9.0 39.0 7
stat 52254 1 2.0 32.0 9
math 61885 1 23.0 60.0 9
math 49542 1 3.0 33.0 5

77 rows × 5 columns

Or to reset it to a counter that counts the row number use .reset_index().

df2=df.set_index('departm')
df2.reset_index()
departm salary gender years age publications
0 bio 86285 0 26.0 64.0 72
1 bio 77125 0 28.0 58.0 43
2 bio 71922 0 10.0 38.0 23
3 bio 70499 0 16.0 46.0 64
4 bio 66624 0 11.0 41.0 23
... ... ... ... ... ... ...
72 neuro 53662 1 1.0 31.0 3
73 stat 57185 1 9.0 39.0 7
74 stat 52254 1 2.0 32.0 9
75 math 61885 1 23.0 60.0 9
76 math 49542 1 3.0 33.0 5

77 rows × 6 columns

Note the syntax we’ve used a few times here… we referenced the df variable which is the variable we created to store the data from our file. Then the .functionname() is known as a method of the data frame which is provided by pandas. For instance the .head() method prints out the first five rows of the data frame. The .tail() method prints out the last five rows of the data frame. There are many other methods available on data frames that you access by calling it using the .functionname() syntax. The next parts of this chapter explain some of the most common ones.

6.9. Adding and deleting things from a dataframe#

Sometimes after we read in a dataset we might want to add new rows or columns or delete rows and columns from a dataframe. One way to do this is to edit the original CSV file that we read in. However, there is an important principle I want to emphasize throughout this class:

ALWAYS do everything in code

What do I mean by do everything in code? What I mean is that if you go to your data file that you got from some place and then by hand delete some of the data in Google Sheets or Excel, no one will know that you did that. There is no record of it. Once you save the file the data will be deleted and no one will know you did this. Instead, if you keep your data as “raw” as possible and modify it using code, your code will document ALL of the steps you did in your analysis including the step of deleting data. “Excluding” (not DELETEing) data is sometimes justified but the important thing is we want to document all our steps honestly and truthfully when doing analysis. Using code to do every single step of an analysis helps us accomplish this.

6.9.1. Deleting rows#

To delete a row you can use the .drop() method to drop a particular item using its index value. The .drop() method is not an “in place” operation; instead, it returns a new dataframe with the particular rows removed.

“In place” operations

From time to time using pandas you will hear about a method being described as “in place” or “not in place.” In place means that the change you are making is made to the dataframe variable itself. So for instance if you have a dataframe stored in a variable named df and you call df.drop([1]) it will drop the row corresponding to the index value 1. However, if you look at df again using, for instance, df.head() you will see that it will not have changed. To save the changes with a “not in place” operation you need to store the results in a new variable. For instance, df2 = df.drop([1]) will make a copy of df, drop the row corresponding to index 1, and then store the result in df2. If drop were an “in place” operation it would have actually modified df and so you wouldn’t need to store the result in a new variable.

Here is how to use it. Suppose we have the salary dataframe from above:

df.head()
salary gender departm years age publications
0 86285 0 bio 26.0 64.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23

If we want to delete the first row we can drop it using the index it has, in this case 0:

df2=df.drop([0])
df2.head()
salary gender departm years age publications
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23
5 64451 0 bio 23.0 60.0 44

Notice how I “saved” the modified dataframe into a new variable called df2 using the equals sign. Then if we call the .head() method on df2 we can see that the first row has been removed. df, the original dataframe variable, remains unchanged.

You can also remove multiple rows by their index value at once:

df.drop([0,2,4,6]).head()
salary gender departm years age publications
1 77125 0 bio 28.0 58.0 43
3 70499 0 bio 16.0 46.0 64
5 64451 0 bio 23.0 60.0 44
7 59344 0 bio 5.0 40.0 11
8 58560 0 bio 8.0 38.0 8

Here I dropped the rows with index 0,2,4,6, and also show an example of chaining dataframe methods.

Chaining methods

Dataframes in pandas are what is known as an object-oriented structure. This means that most of the functionality of a dataframe is tied to the variables themselves. That is why in the previous examples we called df.method() like df.drop() instead of pd.drop() (calling from the base pandas library). Given this, most pandas methods return either the dataframe itself or a copy of the dataframe that has been altered. Thus you can “chain” operations and methods together to make the code more concise. Chaining means calling multiple methods in a row on a single line of code. df.drop([0,2,4,6]).head() calls the drop() method on the dataframe and then calls the head() method on the resulting modified data frame. This means you don’t have to store every intermediate step of your data manipulation into a new variable.

6.9.2. Deleting columns#

To delete a column instead of a row you can also use the .drop() method, using an additional argument that refers to the “axis” you are dropping from.

df.head()
salary gender departm years age publications
0 86285 0 bio 26.0 64.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23

Here are a couple examples of dropping one or more columns by name. Note that the case of the column name must match and also you need to specify axis=1 to refer to dropping columns instead of rows.

df.drop('years',axis=1).head()
salary gender departm age publications
0 86285 0 bio 64.0 72
1 77125 0 bio 58.0 43
2 71922 0 bio 38.0 23
3 70499 0 bio 46.0 64
4 66624 0 bio 41.0 23
df.drop(['years','age'],axis=1).head()
salary gender departm publications
0 86285 0 bio 72
1 77125 0 bio 43
2 71922 0 bio 23
3 70499 0 bio 64
4 66624 0 bio 23

6.9.3. Adding rows#

Using pandas it is best to resist the urge to add rows one at a time to a dataframe. For various reasons this is not the ideal way to use pandas. Instead you can combine the rows of two different dataframes into one. This might be useful in psychology, for instance, if each participant in your experiment had their own data file and you want to read each file into a dataframe and then combine them together to make one “uber” dataframe with all the data from your experiment.

Here are two simple dataframes that we can combine:

df1 = pd.DataFrame({"age": [10,27,45,23,21], "salary": [0,23000,100000,35000,60000]})
df2 = pd.DataFrame({"age": [60,70,53,56,80], "salary": [50000,23000,60000,135000,0]})

df_combined = pd.concat([df1,df2])

df_combined
age salary
0 10 0
1 27 23000
2 45 100000
3 23 35000
4 21 60000
0 60 50000
1 70 23000
2 53 60000
3 56 135000
4 80 0

This only works because they have the same columns. If they have different columns then the missing entries of either are filled in with NaN which is the code for “missing values” in pandas.

df1 = pd.DataFrame({"age": [10,27,45,23,21], "salary": [0,23000,100000,35000,60000]})
df2 = pd.DataFrame({"age": [60,70,53,56,80], "height": [5.2,6.0,5.7,3.4,4.6]})

df_combined = pd.concat([df1,df2])

df_combined
age salary height
0 10 0.0 NaN
1 27 23000.0 NaN
2 45 100000.0 NaN
3 23 35000.0 NaN
4 21 60000.0 NaN
0 60 NaN 5.2
1 70 NaN 6.0
2 53 NaN 5.7
3 56 NaN 3.4
4 80 NaN 4.6

We will talk about dealing with “missing” values shortly, but basically missing values in pandas allow for incomplete rows: you might not have information about every single field of a row, so you can use NaN (which stands for Not-a-Number in computer speak) to represent the missing values.
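As a sketch of the experiment scenario mentioned above, if each participant’s data lived in its own CSV file you could read each file and stack them into one dataframe (the file names here are hypothetical):

# hypothetical per-participant data files
files = ['subject1.csv', 'subject2.csv', 'subject3.csv']
all_data = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)  # ignore_index renumbers the rows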

6.9.4. Adding columns#

As we have considered adding rows, now let’s consider adding columns. This is actually pretty easy. You just assign some values to a new column name. First we will create a data frame with two random columns:

df = pd.DataFrame({"col_1": np.random.rand(10), "col_2": np.random.rand(10)})
df
col_1 col_2
0 0.016417 0.555351
1 0.323619 0.868004
2 0.744349 0.767006
3 0.023376 0.915717
4 0.431546 0.761252
5 0.668849 0.522965
6 0.199476 0.161559
7 0.071968 0.866112
8 0.452259 0.650265
9 0.449017 0.797710

Now we simply assign a new column df['sum'] and define it to be the sum of two columns.

df['sum'] = df['col_1']+df['col_2']
df
col_1 col_2 sum
0 0.016417 0.555351 0.571768
1 0.323619 0.868004 1.191623
2 0.744349 0.767006 1.511355
3 0.023376 0.915717 0.939093
4 0.431546 0.761252 1.192798
5 0.668849 0.522965 1.191814
6 0.199476 0.161559 0.361035
7 0.071968 0.866112 0.938080
8 0.452259 0.650265 1.102524
9 0.449017 0.797710 1.246726

You can also define new columns to be a constant value:

df['constant'] = 2
df
col_1 col_2 sum constant
0 0.189631 0.014450 0.204081 2
1 0.455379 0.781008 1.236386 2
2 0.879715 0.426790 1.306504 2
3 0.942081 0.905498 1.847578 2
4 0.189411 0.631864 0.821275 2
5 0.813499 0.069108 0.882607 2
6 0.884211 0.467572 1.351783 2
7 0.232354 0.593110 0.825464 2
8 0.762218 0.176046 0.938265 2
9 0.872711 0.053765 0.926476 2

There are of course some limitations and technicalities here but for the most part you just name a new column and define it as above.

6.9.5. Deleting rows with missing data#

Since we are talking about adding and removing rows and columns it can also make sense to discuss removing rows with missing data. You might want to for example drop any trial from an experiment where a subject didn’t give a response before a timer deadline. In this case the column coding what the response was might be “empty” or “missing” and you would want to use the NaN value to indicate it was missing. To totally delete rows with any missing value use the dropna() function:

df = pd.DataFrame({"age": [10,27,45,23,21], "salary": [0,23000,None,35000,60000]})
df
age salary
0 10 0.0
1 27 23000.0
2 45 NaN
3 23 35000.0
4 21 60000.0

As you can see the salary value is missing for row with index 2 (age 45). To drop any row with a missing value:

df.dropna()
age salary
0 10 0.0
1 27 23000.0
3 23 35000.0
4 21 60000.0

There are several other tricks for dropping missing data besides this. For example, you can delete rows with only a specific column value missing, etc… However for now this should work for us.
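For example, pandas lets you restrict the check to particular columns with the subset argument; a minimal sketch:

df.dropna(subset=['salary'])   # drop a row only if its 'salary' value is missing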

6.9.6. Getting individual values and changing them#

Sometimes you want to just extract a single value or row from a dataframe or change a single value. The .iat method lets you pull out a single value from the data frame given the position of the index and column. The .iat method is one of the very weird parts of Pandas which has no real good explanation. Instead of the parentheses used to call normal methods, .iat and a few others use square brackets. One way to think about it is that .iat is like looking something up in a normal Python array but it also seems like a method attached to the dataframe.

df = pd.read_csv('http://gureckislab.org/courses/fall19/labincp/data/salary.csv')
df
salary gender departm years age publications
0 86285 0 bio 26.0 64.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23
... ... ... ... ... ... ...
72 53662 1 neuro 1.0 31.0 3
73 57185 1 stat 9.0 39.0 7
74 52254 1 stat 2.0 32.0 9
75 61885 1 math 23.0 60.0 9
76 49542 1 math 3.0 33.0 5

77 rows × 6 columns

This gets the value at row 0 column 0 which is the salary of the first person in the data frame.

df.iat[0,0]
86285

This gets the value in row 6, column 4 (the age column), which is the age of the seventh person in the data frame.

df.iat[6,4]
53.0

A slightly more reader-friendly option is to use .at[], which is a method that lets you look things up using the names of columns.

df.at[10, 'age']
40.0

You can also use this to change the value of a particular cell.

First we see that the person in the first row has age 64.

df.head()
salary gender departm years age publications
0 86285 0 bio 26.0 64.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23

Using .at[] we set it to 100.

df.at[0,'age']=100

Then when we look at the result the age has been changed in the dataframe df.

df.head()
salary gender departm years age publications
0 86285 0 bio 26.0 100.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23

The way to remember this is that methods that look things up use square brackets. And if the method begins with an i (like .iat[]) it will look it up by integer index (i.e., the number of the column or row). Otherwise .at[] looks up by row (or a named index) and column name.

6.9.7. Checking the size/dimensions of your data#

One common thing you need to do is to verify the number of rows and columns in your dataframe. The .shape property can help with that.

df = pd.read_csv('http://gureckislab.org/courses/fall19/labincp/data/salary.csv')
df.shape
(77, 6)

This tells us that the dataframe contained in variable df has 77 rows and 6 columns. This is often helpful when you first read in a dataset to verify it has all the data you expected to be there!

Methods versus properties

The .shape property doesn’t include a final () unlike other methods we have learned about like .drop() which required parentheses. This reflects that shape is known as a property of the dataframe while .drop() is a method. It can be confusing why some things are properties and others are methods. However, it is often helpful to think about them as the distinction between nouns and verbs in language. Properties (nouns) are static descriptors of a dataset such as the size of the dataset or the column names. In contrast, methods (verbs) are things that require computation or modification of the dataframe like deleting things or performing computations on columns or rows.

Ultimately the steps we just covered recreate much of what you do with the graphical user interface in Excel (change cell values, add/delete rows and columns, etc.). The real power of Pandas comes from more complex things you can do with dataframes. That is what we will explore in the next section.

6.10. Things you can do with dataframes#

The main goal of getting your data into a dataframe is that it enables several methods for manipulating your data in powerful ways.

6.10.1. Sorting#

Often it can help us understand a dataset better if we can sort the rows of the dataset according to the values in one or more columns. For instance, in the salary data set we have been considering it is hard to know who the highest and lowest paid faculty are. One approach would be to sort the values.

df = pd.read_csv('http://gureckislab.org/courses/fall19/labincp/data/salary.csv')
df.head()
salary gender departm years age publications
0 86285 0 bio 26.0 64.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23

We can sort this dataset in ascending order with:

df.sort_values('salary')
salary gender departm years age publications
23 44687 0 chem 4.0 34.0 19
22 47021 0 chem 4.0 34.0 12
76 49542 1 math 3.0 33.0 5
53 51391 0 stat 5.0 35.0 8
74 52254 1 stat 2.0 32.0 9
... ... ... ... ... ... ...
14 97630 0 chem 34.0 64.0 43
24 104828 0 geol NaN 50.0 44
29 105761 0 neuro 9.0 39.0 30
41 106412 0 stat 23.0 53.0 29
28 112800 0 neuro 14.0 44.0 33

77 rows × 6 columns

Now we can easily see from this output that 44,687 is the lowest salary and 112,800 is the highest. sort_values() is not an “in place” operation, so the original dataframe is still unsorted and we have to store the sorted result in a new dataframe variable if we want to keep working with it.
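For example, you might write something like this to keep a sorted copy around:

df_sorted = df.sort_values('salary')   # store the sorted copy in a new variable
df_sorted.head()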

df.head()  # still unsorted
salary gender departm years age publications
0 86285 0 bio 26.0 64.0 72
1 77125 0 bio 28.0 58.0 43
2 71922 0 bio 10.0 38.0 23
3 70499 0 bio 16.0 46.0 64
4 66624 0 bio 11.0 41.0 23

We can sort the other way by adding an additional parameter:

df.sort_values('salary',ascending=False)
salary gender departm years age publications
28 112800 0 neuro 14.0 44.0 33
41 106412 0 stat 23.0 53.0 29
29 105761 0 neuro 9.0 39.0 30
24 104828 0 geol NaN 50.0 44
14 97630 0 chem 34.0 64.0 43
... ... ... ... ... ... ...
74 52254 1 stat 2.0 32.0 9
53 51391 0 stat 5.0 35.0 8
76 49542 1 math 3.0 33.0 5
22 47021 0 chem 4.0 34.0 12
23 44687 0 chem 4.0 34.0 19

77 rows × 6 columns

And if you sort by two columns it will do them in order (so the first listed column is sorted first then the second):

df.sort_values(['salary','age'])
salary gender departm years age publications
23 44687 0 chem 4.0 34.0 19
22 47021 0 chem 4.0 34.0 12
76 49542 1 math 3.0 33.0 5
53 51391 0 stat 5.0 35.0 8
74 52254 1 stat 2.0 32.0 9
... ... ... ... ... ... ...
14 97630 0 chem 34.0 64.0 43
24 104828 0 geol NaN 50.0 44
29 105761 0 neuro 9.0 39.0 30
41 106412 0 stat 23.0 53.0 29
28 112800 0 neuro 14.0 44.0 33

77 rows × 6 columns

In this data set it is mostly the same to sort by salary first and then age, because most people don’t have the same salary so that already provides an order. However, if we do it the other way, i.e., age first then salary, it will order by people’s age and then, for the people who are the same age, sort by salary.

df.sort_values(['age','salary'])
salary gender departm years age publications
72 53662 1 neuro 1.0 31.0 3
74 52254 1 stat 2.0 32.0 9
52 53656 0 stat 2.0 32.0 4
56 72044 0 physics 2.0 32.0 16
76 49542 1 math 3.0 33.0 5
... ... ... ... ... ... ...
15 82444 0 chem 31.0 61.0 42
0 86285 0 bio 26.0 64.0 72
14 97630 0 chem 34.0 64.0 43
16 76291 0 chem 29.0 65.0 33
18 64762 0 chem 25.0 NaN 29

77 rows × 6 columns

As you can see in this shortened output there are several people who are 32 in the database and their salaries are ordered from smallest to biggest.

Sorting is easy to do in Pandas but also easy to do in Excel because there is a “sort” tool in such programs.

6.10.2. Arithmetic#

Perhaps one of the most useful features of dataframes (and spreadsheets) is the ability to create formulas that compute new values based on the rows and columns. For instance if you had a dataframe that had rows for students and each column was the grade on an assignment a common operation might be to compute the average grade as a new column. Let’s take a look at a simple example of this and then discuss arithmetic operations in Pandas more generally.

grades_df = pd.DataFrame({'student':['001','002','003'], 'assignment1': [90, 80, 70], 'assignment2': [82,84,96], 'assignment3': [89,75,89]})
grades_df
student assignment1 assignment2 assignment3
0 001 90 82 89
1 002 80 84 75
2 003 70 96 89

This is not necessarily the easiest way to enter this data (you might prefer to use a spreadsheet for that), but you could, for instance, read in a csv to load the grades. Next you would want to create the average grade for each student.

grades_df['average']=(grades_df['assignment1']+grades_df['assignment2']+grades_df['assignment3'])/3
grades_df
student assignment1 assignment2 assignment3 average
0 001 90 82 89 87.000000
1 002 80 84 75 79.666667
2 003 70 96 89 85.000000

So you can see here we added up the columns for assignments 1, 2, and 3 and then divided by three. Then we assigned that resulting value to a new column called average. You might wonder how Pandas knew to do this for all three students. The answer is the power of broadcasting, a feature of many programming languages that automatically detects when you are doing arithmetic operations on collections of numbers and then does that operation for each entry rather than just the first one.

We can also broadcast the addition of constant values to a column. For instance, to give all the students a five point bonus we could do this:

grades_df['average']=grades_df['average']+5
grades_df
student assignment1 assignment2 assignment3 average
0 001 90 82 89 92.000000
1 002 80 84 75 84.666667
2 003 70 96 89 90.000000

Again, here it added 5 to each entry of the average column rather than just one entry or the first row.

Basically any math function can be computed from the columns. You might also be interested in functions you could compute down the columns rather than across them; however, we will consider those in more detail in the later chapter on Describing Data.
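As a quick preview, computing down a column looks like this (using the grades dataframe from above):

grades_df['assignment1'].mean()   # the average of one column, computed down the rows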

6.10.3. Slicing#

A very useful and common feature for manipulating data is slicing. Slicing refers to selecting out subsets of a dataset for further analysis. For example we might want to plot only the salaries of the women in this data set. To do this we want to “slice” out a subset of the data and analyze that further. We saw slicing before in a previous chapter on Python programming with strings and lists where you can “slice” pieces of a collection out. Pandas takes this several steps further making it more functional for data composed of multiple rows and columns.

If you remember we can slice a piece of a list by specifying the starting element and the ending element and including the : operator to indicate that we want the values “in between”:

a_list = ['a','b','c','d','e','f','g']
a_list[1:4]
['b', 'c', 'd']

We can do the same thing using the .iloc[] method on a data frame. However, since a Pandas dataframe is two dimensional, we can specify two slice ranges, one for the rows and one for the columns. iloc[] is similar to the iat[] we saw above in that it is used via square brackets (because it is a lookup indexing operation), and the i part refers to the fact that we are providing the integer location of the rows and columns (i.e., using numbers to say which rows and columns we want).

df.iloc[2:4,0:3]
salary gender departm
2 71922 0 bio
3 70499 0 bio

That example takes rows 2 and 3 and columns 0, 1, and 2 (remember Python is zero indexed so the first row or column is numbered zero).

We can also slice using column names using just loc[]:

df.loc[1:2,"gender":]
gender departm years age publications
1 0 bio 28.0 58.0 43
2 0 bio 10.0 38.0 23

Similar to .iat[] and .at[], iloc[] uses integer lookup and loc[] uses index and column names. You might wonder why we still used numbers for the rows in the example above. This is because this dataframe still has a row index which is integer based, so those numbers are the index labels. Indexes, as noted above, are an important part of dataframes.

6.10.4. Selecting#

Related to slicing is “selecting”, which is grabbing subsets of a dataframe’s rows based on the values in one or more columns. This is different from slicing, which takes little chunks out of a larger dataframe using indexes or column names. Here we are interested in selecting rows that meet a particular criterion. For instance, in the professor salary dataset we read in, we might want to select only the rows that represent women in the dataset. Let’s look at an example of how to do that and then discuss the principle.

First we need to discuss the concept of logical operations which are broadcast along the columns. For instance if we write something like

df['age']>50
0      True
1      True
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75     True
76    False
Name: age, Length: 77, dtype: bool

We get a column of True/False values (also known as Boolean values since they take on two values) which reflects, for each row, a test of whether the age value is greater than 50. If it is, then True is entered into the new column, and if it isn’t then False is entered in.

We can write more complex logical operations as well. For instance:

(df['age']>50) & (df['age']<70)
0      True
1      True
2     False
3     False
4     False
      ...  
72    False
73    False
74    False
75     True
76    False
Name: age, Length: 77, dtype: bool

This expression does a logical and due to the & symbol and will be true if the age is greater than 50 AND less than 70.

The examples so far used a single column, but we can also make combinations using multiple columns. For instance, we could select all the rows corresponding to professors who are under 50 and whose gender is coded as 1.

(df['age']<50) & (df['gender']==1)
0     False
1     False
2     False
3     False
4     False
      ...  
72     True
73     True
74     True
75    False
76     True
Length: 77, dtype: bool

If you want to make an “or” you use the ‘|’ (pipe) character instead of the ‘&’.
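For instance (a made-up criterion combining two conditions with “or”):

(df['age'] < 35) | (df['age'] > 65)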

Now that we know how to build these boolean columns, we can use one to select subsets of the original dataframe:

df[df['age']<35]
salary gender departm years age publications
22 47021 0 chem 4.0 34.0 12
23 44687 0 chem 4.0 34.0 19
52 53656 0 stat 2.0 32.0 4
56 72044 0 physics 2.0 32.0 16
70 55949 1 chem 4.0 34.0 12
72 53662 1 neuro 1.0 31.0 3
74 52254 1 stat 2.0 32.0 9
76 49542 1 math 3.0 33.0 5

The previous line selects all the professors under 35. Notice the syntax here as it is kind of sensible. On the outer part we have df[] and in the middle of the bracket we provide the logical column/series as we just discussed. You can break them into two steps if you like:

under35=df['age']<35
df[under35]
salary gender departm years age publications
22 47021 0 chem 4.0 34.0 12
23 44687 0 chem 4.0 34.0 19
52 53656 0 stat 2.0 32.0 4
56 72044 0 physics 2.0 32.0 16
70 55949 1 chem 4.0 34.0 12
72 53662 1 neuro 1.0 31.0 3
74 52254 1 stat 2.0 32.0 9
76 49542 1 math 3.0 33.0 5

This makes clear that we define the “rule” for what things are true and false in this column (under35) and then use it to select rows from the original dataframe.

You use this a lot in data analysis because often a data file from a single subject in an experiment has trials you want to skip or analyze, or you might use it to select trials from particular subjects, or trials that meet a certain requirement (e.g., if a reaction time was too long or something). Thus it is important to bookmark this concept and we will return to it several times throughout the semester.
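Putting the pieces together, here is a minimal sketch of using the combined condition from above to actually pull out the matching rows:

subset = df[(df['age'] < 50) & (df['gender'] == 1)]   # rows meeting both conditions
subset.head()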

An even simpler syntax for this is provided by the .query() method on Pandas data frames. For example, using this command this is how you would select all the professors under age 35 in the salary dataset:

df.query('age<35')
salary gender departm years age publications
22 47021 0 chem 4.0 34.0 12
23 44687 0 chem 4.0 34.0 19
52 53656 0 stat 2.0 32.0 4
56 72044 0 physics 2.0 32.0 16
70 55949 1 chem 4.0 34.0 12
72 53662 1 neuro 1.0 31.0 3
74 52254 1 stat 2.0 32.0 9
76 49542 1 math 3.0 33.0 5

This is nice because it does the same thing as the above without all the extra brackets which sometimes contribute to typos. Here is a slightly more complex version using the logical and operator:

df.query("age<35 & departm=='chem'")
salary gender departm years age publications
22 47021 0 chem 4.0 34.0 12
23 44687 0 chem 4.0 34.0 19
70 55949 1 chem 4.0 34.0 12
df.query("age<35 & departm=='chem' & publications == 12")
salary gender departm years age publications
22 47021 0 chem 4.0 34.0 12
70 55949 1 chem 4.0 34.0 12

So nice!

6.10.5. Iteration#

Iteration refers to stepping through either the rows or the columns of your dataframe one by one and doing something with each row or column. We have encountered iteration before when we learned about Python for loops. If you remember, in the typical for loop we iterated over a sequence of values. For example, this code iterates over a range of values and prints each one.

for i in range(10):
    print(i)
0
1
2
3
4
5
6
7
8
9

We can iterate over both rows and columns of a dataframe since the format is two dimensional. This prints out the name of each column.

for column in df:
    print(column)
salary
gender
departm
years
age
publications

The above method only gets you the individual column names. To get the data within each column, you use a special method called .iteritems(). For example:

for columnname, columndata in df.iteritems():
    print(columnname)
    print('---')
    print(columndata)
salary
---
0     86285
1     77125
2     71922
3     70499
4     66624
      ...  
72    53662
73    57185
74    52254
75    61885
76    49542
Name: salary, Length: 77, dtype: int64
gender
---
0     0
1     0
2     0
3     0
4     0
     ..
72    1
73    1
74    1
75    1
76    1
Name: gender, Length: 77, dtype: int64
departm
---
0       bio
1       bio
2       bio
3       bio
4       bio
      ...  
72    neuro
73     stat
74     stat
75     math
76     math
Name: departm, Length: 77, dtype: object
years
---
0     26.0
1     28.0
2     10.0
3     16.0
4     11.0
      ... 
72     1.0
73     9.0
74     2.0
75    23.0
76     3.0
Name: years, Length: 77, dtype: float64
age
---
0     64.0
1     58.0
2     38.0
3     46.0
4     41.0
      ... 
72    31.0
73    39.0
74    32.0
75    60.0
76    33.0
Name: age, Length: 77, dtype: float64
publications
---
0     72
1     43
2     23
3     64
4     23
      ..
72     3
73     7
74     9
75     9
76     5
Name: publications, Length: 77, dtype: int64
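A brief version note: in more recent releases of pandas, .iteritems() has been deprecated in favor of .items(), which behaves the same way. If the code above raises an error on your installation, this equivalent should work:

for columnname, columndata in df.items():
    print(columnname)
    print('---')
    print(columndata)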

Finally, if you want to step through row-by-row use the .iterrows() method.

for row in df.iterrows():
    print(row)
(0, salary          86285
gender              0
departm           bio
years              26
age                64
publications       72
Name: 0, dtype: object)
(1, salary          77125
gender              0
departm           bio
years              28
age                58
publications       43
Name: 1, dtype: object)
(2, salary          71922
gender              0
departm           bio
years              10
age                38
publications       23
Name: 2, dtype: object)
(3, salary          70499
gender              0
departm           bio
years              16
age                46
publications       64
Name: 3, dtype: object)
(4, salary          66624
gender              0
departm           bio
years              11
age                41
publications       23
Name: 4, dtype: object)
(5, salary          64451
gender              0
departm           bio
years              23
age                60
publications       44
Name: 5, dtype: object)
(6, salary          64366
gender              0
departm           bio
years              23
age                53
publications       22
Name: 6, dtype: object)
(7, salary          59344
gender              0
departm           bio
years               5
age                40
publications       11
Name: 7, dtype: object)
(8, salary          58560
gender              0
departm           bio
years               8
age                38
publications        8
Name: 8, dtype: object)
(9, salary          58294
gender              0
departm           bio
years              20
age                50
publications       12
Name: 9, dtype: object)
(10, salary          56092
gender              0
departm           bio
years               2
age                40
publications        4
Name: 10, dtype: object)
(11, salary          54452
gender              0
departm           bio
years              13
age                43
publications        7
Name: 11, dtype: object)
(12, salary          54269
gender              0
departm           bio
years              26
age                56
publications       12
Name: 12, dtype: object)
(13, salary          55125
gender              0
departm           bio
years               8
age                38
publications        9
Name: 13, dtype: object)
(14, salary          97630
gender              0
departm          chem
years              34
age                64
publications       43
Name: 14, dtype: object)
(15, salary          82444
gender              0
departm          chem
years              31
age                61
publications       42
Name: 15, dtype: object)
(16, salary          76291
gender              0
departm          chem
years              29
age                65
publications       33
Name: 16, dtype: object)
(17, salary          75382
gender              0
departm          chem
years              26
age                56
publications       39
Name: 17, dtype: object)
(18, salary          64762
gender              0
departm          chem
years              25
age               NaN
publications       29
Name: 18, dtype: object)
(19, salary          62607
gender              0
departm          chem
years              20
age                45
publications       34
Name: 19, dtype: object)
(20, salary          60373
gender              0
departm          chem
years              26
age                56
publications       43
Name: 20, dtype: object)
(21, salary          58892
gender              0
departm          chem
years              18
age                48
publications       21
Name: 21, dtype: object)
(22, salary          47021
gender              0
departm          chem
years               4
age                34
publications       12
Name: 22, dtype: object)
(23, salary          44687
gender              0
departm          chem
years               4
age                34
publications       19
Name: 23, dtype: object)
(24, salary          104828
gender               0
departm           geol
years              NaN
age                 50
publications        44
Name: 24, dtype: object)
(25, salary          71456
gender              0
departm          geol
years              11
age                41
publications       32
Name: 25, dtype: object)
(26, salary          65144
gender              0
departm          geol
years               7
age                37
publications       12
Name: 26, dtype: object)
(27, salary          52766
gender              0
departm          geol
years               4
age                38
publications       32
Name: 27, dtype: object)
(28, salary          112800
gender               0
departm          neuro
years               14
age                 44
publications        33
Name: 28, dtype: object)
(29, salary          105761
gender               0
departm          neuro
years                9
age                 39
publications        30
Name: 29, dtype: object)
(30, salary          92951
gender              0
departm         neuro
years              11
age                41
publications       20
Name: 30, dtype: object)
(31, salary          86621
gender              0
departm         neuro
years              19
age                49
publications       10
Name: 31, dtype: object)
(32, salary          85569
gender              0
departm         neuro
years              20
age                46
publications       35
Name: 32, dtype: object)
(33, salary          83896
gender              0
departm         neuro
years              10
age                40
publications       22
Name: 33, dtype: object)
(34, salary          79735
gender              0
departm         neuro
years              11
age                41
publications       32
Name: 34, dtype: object)
(35, salary          71518
gender              0
departm         neuro
years               7
age                37
publications       34
Name: 35, dtype: object)
(36, salary          68029
gender              0
departm         neuro
years              15
age                45
publications       33
Name: 36, dtype: object)
(37, salary          66482
gender              0
departm         neuro
years              14
age                44
publications       42
Name: 37, dtype: object)
(38, salary          61680
gender              0
departm         neuro
years              18
age                48
publications       20
Name: 38, dtype: object)
(39, salary          60455
gender              0
departm         neuro
years               8
age                38
publications       49
Name: 39, dtype: object)
(40, salary          58932
gender              0
departm         neuro
years              11
age                41
publications       49
Name: 40, dtype: object)
(41, salary          106412
gender               0
departm           stat
years               23
age                 53
publications        29
Name: 41, dtype: object)
(42, salary          86980
gender              0
departm          stat
years              23
age                53
publications       42
Name: 42, dtype: object)
(43, salary          78114
gender              0
departm          stat
years               8
age                38
publications       24
Name: 43, dtype: object)
(44, salary          74085
gender              0
departm          stat
years              11
age                41
publications       33
Name: 44, dtype: object)
(45, salary          72250
gender              0
departm          stat
years              26
age                56
publications        9
Name: 45, dtype: object)
(46, salary          69596
gender              0
departm          stat
years              20
age                50
publications       18
Name: 46, dtype: object)
(47, salary          65285
gender              0
departm          stat
years              20
age                50
publications       15
Name: 47, dtype: object)
(48, salary          62557
gender              0
departm          stat
years              28
age                58
publications       14
Name: 48, dtype: object)
(49, salary          61947
gender              0
departm          stat
years              22
age                58
publications       17
Name: 49, dtype: object)
(50, salary          58565
gender              0
departm          stat
years              29
age                59
publications       11
Name: 50, dtype: object)
(51, salary          58365
gender              0
departm          stat
years              18
age                48
publications       21
Name: 51, dtype: object)
(52, salary          53656
gender              0
departm          stat
years               2
age                32
publications        4
Name: 52, dtype: object)
(53, salary          51391
gender              0
departm          stat
years               5
age                35
publications        8
Name: 53, dtype: object)
(54, salary            96936
gender                0
departm         physics
years                15
age                  50
publications         17
Name: 54, dtype: object)
(55, salary            83216
gender                0
departm         physics
years                11
age                  37
publications         19
Name: 55, dtype: object)
(56, salary            72044
gender                0
departm         physics
years                 2
age                  32
publications         16
Name: 56, dtype: object)
(57, salary            64048
gender                0
departm         physics
years                23
age                  53
publications          4
Name: 57, dtype: object)
(58, salary            58888
gender                0
departm         physics
years                26
age                  56
publications          7
Name: 58, dtype: object)
(59, salary            58744
gender                0
departm         physics
years                20
age                  50
publications          9
Name: 59, dtype: object)
(60, salary            55944
gender                0
departm         physics
years                21
age                  51
publications          8
Name: 60, dtype: object)
(61, salary            54076
gender                0
departm         physics
years                19
age                  49
publications         12
Name: 61, dtype: object)
(62, salary          82142
gender              0
departm          math
years               9
age                39
publications        9
Name: 62, dtype: object)
(63, salary          70509
gender              0
departm          math
years              23
age                53
publications        7
Name: 63, dtype: object)
(64, salary          60320
gender              0
departm          math
years              14
age                44
publications        7
Name: 64, dtype: object)
(65, salary          55814
gender              0
departm          math
years               8
age                38
publications        6
Name: 65, dtype: object)
(66, salary          53638
gender              0
departm          math
years               4
age                42
publications        8
Name: 66, dtype: object)
(67, salary          53517
gender              2
departm          math
years               5
age                35
publications        5
Name: 67, dtype: object)
(68, salary          59139
gender              1
departm           bio
years               8
age                38
publications       23
Name: 68, dtype: object)
(69, salary          52968
gender              1
departm           bio
years              18
age                48
publications       32
Name: 69, dtype: object)
(70, salary          55949
gender              1
departm          chem
years               4
age                34
publications       12
Name: 70, dtype: object)
(71, salary          58893
gender              1
departm         neuro
years              10
age                35
publications        4
Name: 71, dtype: object)
(72, salary          53662
gender              1
departm         neuro
years               1
age                31
publications        3
Name: 72, dtype: object)
(73, salary          57185
gender              1
departm          stat
years               9
age                39
publications        7
Name: 73, dtype: object)
(74, salary          52254
gender              1
departm          stat
years               2
age                32
publications        9
Name: 74, dtype: object)
(75, salary          61885
gender              1
departm          math
years              23
age                60
publications        9
Name: 75, dtype: object)
(76, salary          49542
gender              1
departm          math
years               3
age                33
publications        5
Name: 76, dtype: object)

A substantially faster version of .iterrows() is .itertuples(), which returns a slightly different representation of each row:

for row in df.itertuples():
    print(row)
Pandas(Index=0, salary=86285, gender=0, departm='bio', years=26.0, age=64.0, publications=72)
Pandas(Index=1, salary=77125, gender=0, departm='bio', years=28.0, age=58.0, publications=43)
Pandas(Index=2, salary=71922, gender=0, departm='bio', years=10.0, age=38.0, publications=23)
Pandas(Index=3, salary=70499, gender=0, departm='bio', years=16.0, age=46.0, publications=64)
Pandas(Index=4, salary=66624, gender=0, departm='bio', years=11.0, age=41.0, publications=23)
Pandas(Index=5, salary=64451, gender=0, departm='bio', years=23.0, age=60.0, publications=44)
Pandas(Index=6, salary=64366, gender=0, departm='bio', years=23.0, age=53.0, publications=22)
Pandas(Index=7, salary=59344, gender=0, departm='bio', years=5.0, age=40.0, publications=11)
Pandas(Index=8, salary=58560, gender=0, departm='bio', years=8.0, age=38.0, publications=8)
Pandas(Index=9, salary=58294, gender=0, departm='bio', years=20.0, age=50.0, publications=12)
Pandas(Index=10, salary=56092, gender=0, departm='bio', years=2.0, age=40.0, publications=4)
Pandas(Index=11, salary=54452, gender=0, departm='bio', years=13.0, age=43.0, publications=7)
Pandas(Index=12, salary=54269, gender=0, departm='bio', years=26.0, age=56.0, publications=12)
Pandas(Index=13, salary=55125, gender=0, departm='bio', years=8.0, age=38.0, publications=9)
Pandas(Index=14, salary=97630, gender=0, departm='chem', years=34.0, age=64.0, publications=43)
Pandas(Index=15, salary=82444, gender=0, departm='chem', years=31.0, age=61.0, publications=42)
Pandas(Index=16, salary=76291, gender=0, departm='chem', years=29.0, age=65.0, publications=33)
Pandas(Index=17, salary=75382, gender=0, departm='chem', years=26.0, age=56.0, publications=39)
Pandas(Index=18, salary=64762, gender=0, departm='chem', years=25.0, age=nan, publications=29)
Pandas(Index=19, salary=62607, gender=0, departm='chem', years=20.0, age=45.0, publications=34)
Pandas(Index=20, salary=60373, gender=0, departm='chem', years=26.0, age=56.0, publications=43)
Pandas(Index=21, salary=58892, gender=0, departm='chem', years=18.0, age=48.0, publications=21)
Pandas(Index=22, salary=47021, gender=0, departm='chem', years=4.0, age=34.0, publications=12)
Pandas(Index=23, salary=44687, gender=0, departm='chem', years=4.0, age=34.0, publications=19)
Pandas(Index=24, salary=104828, gender=0, departm='geol', years=nan, age=50.0, publications=44)
Pandas(Index=25, salary=71456, gender=0, departm='geol', years=11.0, age=41.0, publications=32)
Pandas(Index=26, salary=65144, gender=0, departm='geol', years=7.0, age=37.0, publications=12)
Pandas(Index=27, salary=52766, gender=0, departm='geol', years=4.0, age=38.0, publications=32)
Pandas(Index=28, salary=112800, gender=0, departm='neuro', years=14.0, age=44.0, publications=33)
Pandas(Index=29, salary=105761, gender=0, departm='neuro', years=9.0, age=39.0, publications=30)
Pandas(Index=30, salary=92951, gender=0, departm='neuro', years=11.0, age=41.0, publications=20)
Pandas(Index=31, salary=86621, gender=0, departm='neuro', years=19.0, age=49.0, publications=10)
Pandas(Index=32, salary=85569, gender=0, departm='neuro', years=20.0, age=46.0, publications=35)
Pandas(Index=33, salary=83896, gender=0, departm='neuro', years=10.0, age=40.0, publications=22)
Pandas(Index=34, salary=79735, gender=0, departm='neuro', years=11.0, age=41.0, publications=32)
Pandas(Index=35, salary=71518, gender=0, departm='neuro', years=7.0, age=37.0, publications=34)
Pandas(Index=36, salary=68029, gender=0, departm='neuro', years=15.0, age=45.0, publications=33)
Pandas(Index=37, salary=66482, gender=0, departm='neuro', years=14.0, age=44.0, publications=42)
Pandas(Index=38, salary=61680, gender=0, departm='neuro', years=18.0, age=48.0, publications=20)
Pandas(Index=39, salary=60455, gender=0, departm='neuro', years=8.0, age=38.0, publications=49)
Pandas(Index=40, salary=58932, gender=0, departm='neuro', years=11.0, age=41.0, publications=49)
Pandas(Index=41, salary=106412, gender=0, departm='stat', years=23.0, age=53.0, publications=29)
Pandas(Index=42, salary=86980, gender=0, departm='stat', years=23.0, age=53.0, publications=42)
Pandas(Index=43, salary=78114, gender=0, departm='stat', years=8.0, age=38.0, publications=24)
Pandas(Index=44, salary=74085, gender=0, departm='stat', years=11.0, age=41.0, publications=33)
Pandas(Index=45, salary=72250, gender=0, departm='stat', years=26.0, age=56.0, publications=9)
Pandas(Index=46, salary=69596, gender=0, departm='stat', years=20.0, age=50.0, publications=18)
Pandas(Index=47, salary=65285, gender=0, departm='stat', years=20.0, age=50.0, publications=15)
Pandas(Index=48, salary=62557, gender=0, departm='stat', years=28.0, age=58.0, publications=14)
Pandas(Index=49, salary=61947, gender=0, departm='stat', years=22.0, age=58.0, publications=17)
Pandas(Index=50, salary=58565, gender=0, departm='stat', years=29.0, age=59.0, publications=11)
Pandas(Index=51, salary=58365, gender=0, departm='stat', years=18.0, age=48.0, publications=21)
Pandas(Index=52, salary=53656, gender=0, departm='stat', years=2.0, age=32.0, publications=4)
Pandas(Index=53, salary=51391, gender=0, departm='stat', years=5.0, age=35.0, publications=8)
Pandas(Index=54, salary=96936, gender=0, departm='physics', years=15.0, age=50.0, publications=17)
Pandas(Index=55, salary=83216, gender=0, departm='physics', years=11.0, age=37.0, publications=19)
Pandas(Index=56, salary=72044, gender=0, departm='physics', years=2.0, age=32.0, publications=16)
Pandas(Index=57, salary=64048, gender=0, departm='physics', years=23.0, age=53.0, publications=4)
Pandas(Index=58, salary=58888, gender=0, departm='physics', years=26.0, age=56.0, publications=7)
Pandas(Index=59, salary=58744, gender=0, departm='physics', years=20.0, age=50.0, publications=9)
Pandas(Index=60, salary=55944, gender=0, departm='physics', years=21.0, age=51.0, publications=8)
Pandas(Index=61, salary=54076, gender=0, departm='physics', years=19.0, age=49.0, publications=12)
Pandas(Index=62, salary=82142, gender=0, departm='math', years=9.0, age=39.0, publications=9)
Pandas(Index=63, salary=70509, gender=0, departm='math', years=23.0, age=53.0, publications=7)
Pandas(Index=64, salary=60320, gender=0, departm='math', years=14.0, age=44.0, publications=7)
Pandas(Index=65, salary=55814, gender=0, departm='math', years=8.0, age=38.0, publications=6)
Pandas(Index=66, salary=53638, gender=0, departm='math', years=4.0, age=42.0, publications=8)
Pandas(Index=67, salary=53517, gender=2, departm='math', years=5.0, age=35.0, publications=5)
Pandas(Index=68, salary=59139, gender=1, departm='bio', years=8.0, age=38.0, publications=23)
Pandas(Index=69, salary=52968, gender=1, departm='bio', years=18.0, age=48.0, publications=32)
Pandas(Index=70, salary=55949, gender=1, departm='chem', years=4.0, age=34.0, publications=12)
Pandas(Index=71, salary=58893, gender=1, departm='neuro', years=10.0, age=35.0, publications=4)
Pandas(Index=72, salary=53662, gender=1, departm='neuro', years=1.0, age=31.0, publications=3)
Pandas(Index=73, salary=57185, gender=1, departm='stat', years=9.0, age=39.0, publications=7)
Pandas(Index=74, salary=52254, gender=1, departm='stat', years=2.0, age=32.0, publications=9)
Pandas(Index=75, salary=61885, gender=1, departm='math', years=23.0, age=60.0, publications=9)
Pandas(Index=76, salary=49542, gender=1, departm='math', years=3.0, age=33.0, publications=5)

6.11. Data organization - tidy and wide formats#

Now that you have seen how to import your data and the basic operations of a dataframe, you might think that you're ready to start analyzing and visualizing your data. Yet there's still quite a bit of work to be done cleaning up your dataset before you do this. Some of this cleanup might include fixing typos in your datasets or filtering out participants with incomplete data. Even more importantly, you'll need to make sure that your data is organized and structured in a consistent manner. One way of organizing your data, referred to as tidy format (a term coined by Hadley Wickham), is particularly helpful for facilitating data analysis.

To clarify what tidy format is, we will also talk about another common format of data called wide format that isn’t very tidy. While there are some reasons to use wide format over tidy format when programming (e.g., some functions in R require data to be in wide format), wide format is more commonly used when working with interface-based software (such as Excel and SPSS).

The clearest way of demonstrating the differences between wide versus tidy formats is by simply looking at datasets of both kinds. Take the example dataset below. It shows quiz scores for four participants across three timepoints plus some demographic data. It’s currently in wide format:

import pandas as pd 

data = [['tom', 26, "m", 12, 15, 15], 
        ['nick', 23, "m", 10, 9, 12], 
        ['julie', 18, "f", 15, 13, 14], 
        ['angela', 21, "f", 10, 10, 12]] 
df_wide = pd.DataFrame(data, columns = ['name', 'age', 'gender', 'time1', 'time2', 'time3']) 
df_wide 
name age gender time1 time2 time3
0 tom 26 m 12 15 15
1 nick 23 m 10 9 12
2 julie 18 f 15 13 14
3 angela 21 f 10 10 12

In wide format, each individual has their own row, and all data pertaining to that individual is contained within that row. For example, the first row has all of the data for Tom, including his scores across each of the three timepoints.

Now let’s compare this to tidy format:

df_tidy = pd.melt(df_wide, id_vars=['name', 'age', 'gender'], var_name='timepoint', value_name='score') 
df_tidy = df_tidy.sort_values(ascending=False, by='name')
df_tidy
name age gender timepoint score
0 tom 26 m time1 12
4 tom 26 m time2 15
8 tom 26 m time3 15
1 nick 23 m time1 10
5 nick 23 m time2 9
9 nick 23 m time3 12
2 julie 18 f time1 15
6 julie 18 f time2 13
10 julie 18 f time3 14
3 angela 21 f time1 10
7 angela 21 f time2 10
11 angela 21 f time3 12

In this format, each observation now has its own row. That is, we now have three separate rows for Tom, with each row reflecting a single quiz score from one of the three timepoints.

6.11.1. What is tidy data, and why use it?#

Pulling directly from Wickham's book chapter on tidy data (his book is written for R users, but the principles of tidying data work the same across languages), there are three rules that work together to make a dataset tidy ([WG17]):

  1. Each observation must have its own row (observations could be each person, each timepoint, etc.)

  2. Each variable must have its own column (variables are some kind of measurement: gender, age, score, etc.)

  3. Each value must have its own cell (values are the actual measurements: female, 23 years, 12 points, etc.)

Above, a visual representation of each of the three rules for tidy data: variables are in columns, observations are in rows, and values are in cells. Image borrowed from R for Data Science.

So going back to our dataset from before, we can see that it follows each of these three rules. Each combination of person and timepoint has its own row (Rule 1). Every variable measured (gender, age, score, etc.) has its own column (Rule 2), and each cell within the dataframe contains only one value (Rule 3).

Let’s look at some examples where one or more of these rules are being violated.

data_untidy = [['tom', "26_m", 12, 15, 15], 
              ['nick', "23_m", 10, 9, 12], 
              ['julie', "18_f", 15, 13, 14], 
              ['angela', "21_f", 10, 10, 12]] 
df_untidy = pd.DataFrame(data_untidy, columns = ['name', 'age_gender', 'time1', 'time2', 'time3'])
df_untidy
name age_gender time1 time2 time3
0 tom 26_m 12 15 15
1 nick 23_m 10 9 12
2 julie 18_f 15 13 14
3 angela 21_f 10 10 12

In the example above, this dataset is messy because two variables (age and gender) have been coded into the same column. Having multiple variables in the same column in this way makes it much harder to calculate the average age of participants or get a gender breakdown for the sample.

But this can easily be fixed by splitting the column into two.

df_untidy[['age', 'gender']] = df_untidy.age_gender.str.split("_", expand=True)
del df_untidy['age_gender']

df_untidy
name time1 time2 time3 age gender
0 tom 12 15 15 26 m
1 nick 10 9 12 23 m
2 julie 15 13 14 18 f
3 angela 10 10 12 21 f

Here’s another case of untidy data, similar to one we’ve seen before.

data_messy = [['tom', 26, 12, 15, 15, 17, 18, 20], 
              ['nick', 23, 10, 19, 12, 14, 11, 18], 
              ['julie', 18, 15, 23, 14, 18, 12, 19], 
              ['angela', 21, 10, 11, 12, 14, 15, 30]] 
df_messy = pd.DataFrame(data_messy, columns = ['name', 'age', 't1_min', 't1_max', 't2_min', 't2_max', 't3_min', 't3_max'])
df_messy
name age t1_min t1_max t2_min t2_max t3_min t3_max
0 tom 26 12 15 15 17 18 20
1 nick 23 10 19 12 14 11 18
2 julie 18 15 23 14 18 12 19
3 angela 21 10 11 12 14 15 30

This dataset violates the rules of a tidy dataset because a) there’s more than one observation in each row, and b) we have multiple columns reflecting the same variable (the distinction between a minimum and maximum score).

But we can fix it using pd.melt in combination with a few other operations. In the tidied version below, each variable (name, age, time, min score, and max score) has its own column.

df_messy = pd.melt(df_messy, id_vars=['name', 'age'], var_name='time_minmax', value_name='score')

df_messy[['time','minmax']] = df_messy.time_minmax.str.split("_", expand = True)
del df_messy['time_minmax']

df_messy = df_messy.pivot_table(index=['name','age','time'], columns='minmax', values='score')
df_messy.reset_index(drop=False, inplace=True)
df_messy
minmax name age time max min
0 angela 21 t1 11 10
1 angela 21 t2 14 12
2 angela 21 t3 30 15
3 julie 18 t1 23 15
4 julie 18 t2 18 14
5 julie 18 t3 19 12
6 nick 23 t1 19 10
7 nick 23 t2 14 12
8 nick 23 t3 18 11
9 tom 26 t1 15 12
10 tom 26 t2 17 15
11 tom 26 t3 20 18

Don't worry if the actual code to make the data tidy doesn't make sense right away (you will get there). We'll talk more later about strategies you can use to wrangle data into tidy formats using Python. What you should take away from these examples is how differently these datasets are structured before and after we tidied them.

Cleaning up and rearranging datasets is an inevitable part of doing psychological research. Lots of the data you work with may come to you in untidy formats, and you will have to clean it up before you can analyze it. The good news is that knowing these principles of tidy, organized datasets will serve you well as you start to design your own experiments! The more you build tidy principles into how your data is recorded during the experiment, the more time you will save later when it comes to processing your data in Python.

Also importantly, this discussion highlights one major purpose of the Pandas library: helping you clean up and reorganize data. Commands like pd.melt() and pd.pivot_table() are fairly complex operations that alter your view of a dataset and make it easier to perform certain types of analysis. These reorganizations can be very difficult to do by hand but are made much easier by the dataframe tools that Pandas provides.
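As a small illustration of how these two commands relate, here is a minimal sketch (using the df_tidy dataframe created above) showing that pd.pivot_table() can reverse the earlier pd.melt() and recover the wide layout:

# A minimal sketch: pivot_table undoes the earlier melt, recovering a wide layout.
# Assumes df_tidy from the example above is still defined.
df_back_to_wide = df_tidy.pivot_table(index=['name', 'age', 'gender'],
                                      columns='timepoint',
                                      values='score').reset_index()
df_back_to_wide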

6.11.2. A case example: organizing data for a psychological experiment#

The discussion so far might strike you as somewhat abstract. However, principles of data formatting and organization are very important for psychological research. Let's consider the case most central to this course: organizing a data file for a psychological experiment conducted on several participants and composed of multiple trials and trial types. This is perhaps the most typical data format we will encounter in this work, and it is somewhat different from survey data and other kinds of data that often have a more complex structure.

To keep things simple, let's imagine that we are doing a decision making experiment. On each trial of the experiment the subject is offered a choice between a gamble and a sure amount of money. For instance, the subject might be presented with an option like "Would you prefer $10 for sure, or a 50% chance to win $20 and otherwise nothing?" If you think about it a bit, this specific pair of options gives the same value on average, because the expected value of the uncertain option is \(0.5 \times \$20 + 0.5 \times \$0 = \$10\). If people were perfectly rational they should be indifferent between these two options. However, a common phenomenon is that people show a bias called risk aversion: they prefer a certain reward over a risky gamble. Thus, even though on average they get the same amount, people do not want to take the gamble for the larger amount of money. There are several explanations for this, but one intuitive one is that people don't like regret, and the feeling of getting nothing is so bad that it makes people prefer the certain $10.
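If it helps, here is the same expected value calculation written out as a tiny snippet of Python (the numbers are the ones from the example above):

p_win = 0.5            # probability of winning the gamble
win_amount = 20        # value of the winning outcome
lose_amount = 0        # value of the losing outcome
expected_value = p_win * win_amount + (1 - p_win) * lose_amount
print(expected_value)  # 10.0, the same as the certain option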

So imagine we do an experiment attempting to explore risk aversion. We might present people with several of these types of gambles across multiple trials. The reason is that we want more than one measurement, and we want to vary things like the probability of the higher reward in the risky gamble (e.g., instead of a 50% chance we might want to explore a 10% chance of winning). In addition, we might want to change the magnitudes of the rewards across different trials. Finally, we will want to run multiple people so that our conclusions aren't tied to one person. How should we structure our data file for this? You might think that we should first write the code for the experiment and then consider the data format. However, I find it is usually helpful to do the opposite: we start with the data format in mind right from the start.

Let’s look back at the key idea in a “tidy” dataset because we definitely want to be tidy here.

  1. Each observation must have its own row (observations could be each person, each timepoint, etc.)

  2. Each variable must have its own column (variables are some kind of measurement: gender, age, score, etc.)

  3. Each value must have its own cell (values are the actual measurements: female, 23 years, 12 points, etc.)

So each observation would be a trial, and each trial needs its own row.

Each variable would be information about the trial, both the details of what the subject saw on that trial (e.g., the magnitude of the rewards and gambles) and also how they responded (reaction time, what choice they made, etc…).

The rule that each value must have its own cell means that each of these numbers needs its own cell and should not be combined with other values. So let's make a pretend dataframe from scratch to illustrate this example.

So let's imagine that there are 10 stimuli in the experiment. We can describe each one by the value of the certain option and the values of the high and low rewards for the gamble. This means each stimulus/trial in the experiment can be described with four numbers:

  • certain value

  • upper risky value

  • lower risky value

  • probability of upper risky value

We can define a few alternatives (let's say enough for 4 trials) using a Python dictionary:

stim={"certain": [10, 15, 25, 50], 
      "upper_risk": [20, 40, 30, 32], 
      "lower_risk": [0, 23, 29, 0], 
      "prob": [0.5, 0.5, 0.25, 0.75]}

Next we can pretend to generate some responses from one subject.

responses = {"reaction_time":[200, 400, 500, 600], 
             "choice": [0, 1, 1, 0]}

Now we can build up a dataframe for one subject. We will do this by first defining a Python dictionary called trials which includes all the variables associated with one measurement (trial number, the value of the certain option, the probability, etc…) as well as all the measurements made on that trial (reaction time, choice). These start out at the beginning of the cell as empty lists, and then a for loop below fills them in.

subject_id = "some_subject_id0"

trials = {'subject_id':[],
          'trial_num':[],
          'certain':[],
          'upper_risk':[],
          'lower_risk':[],
          'prob':[],
          'reaction_time':[],
          'choice':[]}

for i in range(4):
    trials['subject_id'].append(subject_id)
    trials['trial_num'].append(i)
    trials['certain'].append(stim['certain'][i])
    trials['upper_risk'].append(stim['upper_risk'][i])
    trials['lower_risk'].append(stim['lower_risk'][i])
    trials['prob'].append(stim['prob'][i])
    trials['reaction_time'].append(responses["reaction_time"][i])
    trials['choice'].append(responses["choice"][i])

trials
{'certain': [10, 15, 25, 50],
 'choice': [0, 1, 1, 0],
 'lower_risk': [0, 23, 29, 0],
 'prob': [0.5, 0.5, 0.25, 0.75],
 'reaction_time': [200, 400, 500, 600],
 'subject_id': ['some_subject_id0',
  'some_subject_id0',
  'some_subject_id0',
  'some_subject_id0'],
 'trial_num': [0, 1, 2, 3],
 'upper_risk': [20, 40, 30, 32]}

The result is a Python dictionary that maps column names to the data that goes in those columns, which is perfect for initializing a Pandas dataframe:

df=pd.DataFrame(trials)
df
subject_id trial_num certain upper_risk lower_risk prob reaction_time choice
0 some_subject_id0 0 10 20 0 0.50 200 0
1 some_subject_id0 1 15 40 23 0.50 400 1
2 some_subject_id0 2 25 30 29 0.25 500 1
3 some_subject_id0 3 50 32 0 0.75 600 0

This is great and shows a very natural way to structure the data from an experiment. Each trial is a row in the dataframe. Each column either describes what the stimulus was on that trial or how the participant responded. In addition, two columns code the trial number and the subject id. This is critical for analysis because we might be interested in order effects across trials, so we need to know which trials came first or last. In addition, we want to keep subjects separate. Eventually we will build up a more complex dataframe that includes multiple subjects. For example:

stim={"certain": [10, 15, 25, 50], 
      "upper_risk": [20, 40, 30, 32], 
      "lower_risk": [0, 23, 29, 0], 
      "prob": [0.5, 0.5, 0.25, 0.75]}

trials = {'subject_id':[],
          'trial_num':[],
          'certain':[],
          'upper_risk':[],
          'lower_risk':[],
          'prob':[],
          'reaction_time':[],
          'choice':[]}

def generate_subject_data(subject_id, trials):
    for i in range(4):
        trials['subject_id'].append(subject_id)
        trials['trial_num'].append(i)
        trials['certain'].append(stim['certain'][i])
        trials['upper_risk'].append(stim['upper_risk'][i])
        trials['lower_risk'].append(stim['lower_risk'][i])
        trials['prob'].append(stim['prob'][i])
        trials['reaction_time'].append(np.random.randint(200,1000))
        trials['choice'].append(np.random.randint(0,2))
        
for subj in ["some_subject_id0","some_subject_id1","some_subject_id2"]:
    generate_subject_data(subj,trials)
    
exp_df=pd.DataFrame(trials)
exp_df
subject_id trial_num certain upper_risk lower_risk prob reaction_time choice
0 some_subject_id0 0 10 20 0 0.50 993 1
1 some_subject_id0 1 15 40 23 0.50 587 0
2 some_subject_id0 2 25 30 29 0.25 626 1
3 some_subject_id0 3 50 32 0 0.75 835 0
4 some_subject_id1 0 10 20 0 0.50 679 0
5 some_subject_id1 1 15 40 23 0.50 205 1
6 some_subject_id1 2 25 30 29 0.25 798 0
7 some_subject_id1 3 50 32 0 0.75 742 1
8 some_subject_id2 0 10 20 0 0.50 729 1
9 some_subject_id2 1 15 40 23 0.50 790 0
10 some_subject_id2 2 25 30 29 0.25 207 0
11 some_subject_id2 3 50 32 0 0.75 526 1

Ok, take a careful look at this output. This is a dataframe with 3 subjects' worth of data. The first column is the subject id. The second is the trial number; notice how it starts at 0, counts up to 3, and then restarts when a new subject id begins. Next are the stimulus definitions (certain, upper_risk, lower_risk, prob, etc…) and the responses. This is exactly the way people usually code a data file for a cognitive or perceptual experiment. It is "tidy" and includes all the information in a handy form we can use to analyze the data later. Here we generated the data with a simple program written in the cell above, but generally we write a more complex program to handle this, including showing the stimulus to the subject, recording their reaction time, and so forth. This simple program is a proxy for that more complex program but shows you the important principle of how to organize your data in preparation for analysis. The analysis itself is what we will contemplate over the next several chapters.
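One practical note: because exp_df is a single tidy dataframe, the whole experiment can be written to one .csv file and read back later for analysis using the file reading and writing functions discussed earlier. A minimal sketch (the file name here is just an illustration):

exp_df.to_csv('gamble_experiment.csv', index=False)   # write one tidy file for the whole experiment
df_loaded = pd.read_csv('gamble_experiment.csv')      # reload it later for analysis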

6.12. The Split-Apply-Combine Workflow#

Perhaps one of the most common data analysis patterns is known as the split-apply-combine workflow. Supporting this workflow is one of the more interesting and powerful features of Pandas (and other dataframe libraries).

The split-apply-combine workflow refers to a very common sequence of data analysis steps. Usually a dataset arrives with multiple sub-units inside it. For example, we might get a .csv file that has all the trials of an experiment from several subjects. Often we want to perform analyses that aggregate across these various categories. For example, we might want to analyze the performance of each subject. Or, alternatively, we might want to analyze the difficulty of each item on a test.

Both of these analyses require us to split the larger dataframe up into pieces and perform an analysis on each piece. The step where you perform this analysis is called apply. So we break the bigger dataframe up into pieces and apply an analysis to each piece. The final step is to combine the results of those analyses back into a new dataframe for further analysis.
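As a quick preview (assuming the exp_df dataframe from the experiment example above is still defined), splitting by subject and computing each subject's average reaction time looks like this; the .groupby method used here is explained over the rest of this section:

exp_df.groupby('subject_id')['reaction_time'].mean()  # one mean reaction time per subject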

Fig. 6.1 tries to emphasize these steps:

../../_images/split-apply-combine.png

Fig. 6.1 Illustration of the split-apply-combine workflow!#

On the left we begin with our original dataframe (df), which has two columns. One column is categorical and takes on the values 'A', 'B', or 'C', while the second column is numeric. The first step of the split-apply-combine workflow is to split the original dataframe into smaller groups based on the value of one or more of the columns. For this example it makes sense to break the original dataframe into smaller dataframes called "groups" based on the value of col1 (that is to say, whether the row is in category 'A', 'B', or 'C'). The method that does this in Pandas is .groupby and we will discuss it in a moment. The result of the groupby operation is a partitioning of the original dataframe into smaller sets (called groups), which are just the rows of the original dataframe reorganized. The reorganization is such that all the rows that share the same value on the specified column are grouped together. For instance, in Fig. 6.1 all the rows that had value 'A' in col1 are placed into the same group, separate from all the rows that had value 'B' in col1, and so forth. This is the split operation.

Next there are several things we can do with each of the groups. We can iterate over them in a for loop, for instance (more on that later), but there are also methods that will quickly compute various descriptive statistics for each group, or let you "apply" a custom function to each group. This custom function can be quite complex, so it allows you to do a lot of additional processing on the groups. The syntax for this tends to be very compact and helps to limit bugs, so it is helpful to learn.

Let's begin with the simple example in Fig. 6.1:

Here is the dataframe:

sac_df=pd.DataFrame({"col1":['A','B','C','C','B','B','A'],"col2": [1,2,3,4,2,5,3]})
sac_df
col1 col2
0 A 1
1 B 2
2 C 3
3 C 4
4 B 2
5 B 5
6 A 3

Next we will group the rows by the values in col1:

sac_df.groupby('col1')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11809eac8>

The result here doesn’t print out the individual groups but instead returns a pandas.core.groupby.generic.DataFrameGroupBy object. However we can print out the groups using iteration:

for name, group_df in sac_df.groupby('col1'):
    print(name)
    print('-----')
    print(group_df)
    print()
    print()
A
-----
  col1  col2
0    A     1
6    A     3


B
-----
  col1  col2
1    B     2
4    B     2
5    B     5


C
-----
  col1  col2
2    C     3
3    C     4

As you can see, the rows have been sorted into groups based on the value in col1. Since there are three distinct/unique values in col1 there are three groups here but if there were more distinct values there would be more groups.

Next we call .sum(), which applies the sum to each group. We can chain these operations in a sequence:

sac_df.groupby('col1').sum()
col2
col1
A 4
B 9
C 7

The result here is a new dataframe where the sum of the values of col2 for the different subgroups ‘A’,’B’, ‘C’ has been computed.

Tip

Check the results by hand to make sure you understand what has been calculated here!

Ok, this example is fairly simple. What if we consider a more complex dataframe with more columns?

sac_df_2=pd.DataFrame({"col1":['A','B','C','C','B','B','A'],"col2":['A','A','A','B','B','B','B'],"col3": [1,2,3,4,2,5,3],"col4": [1.1,2.3,12.,16,22.21,9,0.5]})
sac_df_2
col1 col2 col3 col4
0 A A 1 1.10
1 B A 2 2.30
2 C A 3 12.00
3 C B 4 16.00
4 B B 2 22.21
5 B B 5 9.00
6 A B 3 0.50

If we .groupby() column 1 (col1) we get the same groups as before, but notice that each group now includes all of the columns:

for name, group_df in sac_df_2.groupby('col1'):
    print(name)
    print('-----')
    print(group_df)
    print()
    print()
A
-----
  col1 col2  col3  col4
0    A    A     1   1.1
6    A    B     3   0.5


B
-----
  col1 col2  col3   col4
1    B    A     2   2.30
4    B    B     2  22.21
5    B    B     5   9.00


C
-----
  col1 col2  col3  col4
2    C    A     3  12.0
3    C    B     4  16.0

And now the apply-combine step computes the sum for all the numeric columns.

sac_df_2.groupby('col1').sum()
col3 col4
col1
A 4 1.60
B 9 33.51
C 7 28.00

Of course you don’t have to apply the sum to all the columns. If you just want to analyze col4 you can select it out before applying the sum:

sac_df_2.groupby('col1')['col4'].sum()
col1
A     1.60
B    33.51
C    28.00
Name: col4, dtype: float64

You can also group using more than one column as the grouping factors. For example in sac_df_2 there are actually two columns that have discrete values and so you might want to make groups that are all combinations of those two factors. If so you just pass a list of grouping columns to .groupby():

for name, group_df in sac_df_2.groupby(['col1','col2']):
    print(name)
    print('-----')
    print(group_df)
    print()
    print()
('A', 'A')
-----
  col1 col2  col3  col4
0    A    A     1   1.1


('A', 'B')
-----
  col1 col2  col3  col4
6    A    B     3   0.5


('B', 'A')
-----
  col1 col2  col3  col4
1    B    A     2   2.3


('B', 'B')
-----
  col1 col2  col3   col4
4    B    B     2  22.21
5    B    B     5   9.00


('C', 'A')
-----
  col1 col2  col3  col4
2    C    A     3  12.0


('C', 'B')
-----
  col1 col2  col3  col4
3    C    B     4  16.0

Then the sum operation will apply to all combinations.

sac_df_2.groupby(['col1','col2']).sum()
col3 col4
col1 col2
A A 1 1.10
B 3 0.50
B A 2 2.30
B 7 31.21
C A 3 12.00
B 4 16.00

The output here looks a little different because the rows are now organized hierarchically (the specific term is that the index is hierarchical). But we can call .reset_index(), a function we mentioned earlier, to remove the hierarchical index and flatten things back to a standard dataframe.

sac_df_2.groupby(['col1','col2']).sum().reset_index()
col1 col2 col3 col4
0 A A 1 1.10
1 A B 3 0.50
2 B A 2 2.30
3 B B 7 31.21
4 C A 3 12.00
5 C B 4 16.00

The final result here is itself a dataframe and so can be used for further analysis and plotting.
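For instance, here is a minimal sketch of feeding the combined result into a quick bar plot (this assumes a plotting backend such as matplotlib is available, as it is in a typical notebook):

summary = sac_df_2.groupby(['col1', 'col2']).sum().reset_index()
summary.plot(x='col1', y='col4', kind='bar')  # one bar per group combination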

6.12.1. Changing the “apply” step#

In the examples we have considered so far, we summed the columns within each group. However, there are actually several types of things you can do to groups in the "apply" step:

  • Aggregation: computes a descriptive statistic like a sum, mean, max, min, etc…

  • Transformation: perform a group-specific computation like standardizing (z-scoring) within a group

  • Filtration: discard data from small groups or filter data from a group based on the sum or mean in that group (e.g., removing outliers in a group)

6.12.1.1. Aggregation#

So far we have computed the sum of the columns within each group. However, there are several other aggregation functions you can use. For example, to compute the median instead:

sac_df_2.groupby('col1').median()
col3 col4
col1
A 2.0 0.8
B 2.0 9.0
C 3.5 14.0

Here is a list of common aggregation functions:

Table 6.1 Example aggregation functions#

| Function | Description |
| --- | --- |
| .mean() | Compute mean of groups |
| .sum() | Compute sum of group values |
| .size() | Compute group sizes |
| .count() | Compute count of group |
| .std() | Standard deviation of groups |
| .var() | Compute variance of groups |
| .sem() | Standard error of the mean of groups |
| .describe() | Generates descriptive statistics |
| .first() | Compute first of group values |
| .last() | Compute last of group values |
| .nth() | Take nth value, or a subset if n is a list |
| .min() | Compute min of group values |
| .max() | Compute max of group values |

In addition, there is a special .agg() (or .aggregate()) method to which you can pass multiple functions, including functions from other libraries:

sac_df_2.groupby('col1').agg(np.sum)
col3 col4
col1
A 4 1.60
B 9 33.51
C 7 28.00

The previous example used the numpy np.sum() function. You can also provide multiple functions at once (here applied to the simpler sac_df from before):

sac_df
col1 col2
0 A 1
1 B 2
2 C 3
3 C 4
4 B 2
5 B 5
6 A 3
sac_df.groupby('col1').agg([np.sum, np.mean, np.std])
col2
sum mean std
col1
A 4 2.0 1.414214
B 9 3.0 1.732051
C 7 3.5 0.707107

As you can see, this computes multiple statistics for each group.
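You can also ask .agg() to apply different functions to different columns by passing a dictionary that maps column names to functions. A small sketch using the sac_df_2 dataframe from above:

sac_df_2.groupby('col1').agg({'col3': 'sum', 'col4': 'mean'})  # sum col3, average col4 within each group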

See also

You can read more about aggregation in the Pandas docs on Aggregation.

6.12.1.2. Transform#

Aggregation methods take a grouped dataframe and compute a single summary number (per column) for each group. In contrast, transform applies a function to each column of each group and usually returns a column of the same size. This allows you to change the data within each subgroup. A common use of this in behavioral science is to z-score data within a group (for example, to detect outliers). Since that is the most common use case we'll focus on it here.

sac_df_2.groupby('col1').transform(lambda x: (x-x.mean())/x.std())
col3 col4
0 -0.707107 0.707107
1 -0.577350 -0.875545
2 -0.707107 -0.707107
3 0.707107 0.707107
4 -0.577350 1.089743
5 1.154701 -0.214198
6 0.707107 -0.707107

The transform method takes as its argument a lambda function, which is a way to write short one-line functions in Python. This function takes one argument (x), which is the data corresponding to each column of each group. It then computes the z-score of that data within the group and column, and the results are combined into a new dataframe. This new dataframe has the same column names as the original, but we might want to rename them:

transformed_df=sac_df_2.groupby('col1').transform(lambda x: (x-x.mean())/x.std())
transformed_df=transformed_df.rename(columns={'col3':'col3_trans', 'col4': 'col4_trans'})
transformed_df
col3_trans col4_trans
0 -0.707107 0.707107
1 -0.577350 -0.875545
2 -0.707107 -0.707107
3 0.707107 0.707107
4 -0.577350 1.089743
5 1.154701 -0.214198
6 0.707107 -0.707107

To add these transformed columns back into our original dataframe as new columns, we can use the .join() method, which is like the pd.concat() function we learned about earlier but combines the columns from another dataframe rather than appending rows:

sac_df_2.join(transformed_df)
col1 col2 col3 col4 col3_trans col4_trans
0 A A 1 1.10 -0.707107 0.707107
1 B A 2 2.30 -0.577350 -0.875545
2 C A 3 12.00 -0.707107 -0.707107
3 C B 4 16.00 0.707107 0.707107
4 B B 2 22.21 -0.577350 1.089743
5 B B 5 9.00 1.154701 -0.214198
6 A B 3 0.50 0.707107 -0.707107

The end result is that we z-scored col3 and col4 within the groups implied by col1.

See also

You can read more about transformations in the Pandas docs on Transformation.

6.12.1.3. Filtration#

With the filtration apply step you provide a function that returns True or False for each group. Any group for which the function returns True is kept in the new dataframe, and any group for which it returns False is dropped. For example, this step removes rows from the dataframe that belong to groups with only one or two members:

dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
dff
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 b
5 5 b
6 6 c
7 7 c
dff.groupby('B').filter(lambda x: len(x) > 2)
A B
2 2 b
3 3 b
4 4 b
5 5 b

Notice ‘a’ and ‘c’ entries have been removed because those groups have only two members each.
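You can also filter on a statistic of each group rather than its size. For example, here is a minimal sketch that keeps only the groups whose mean value of column A is greater than 3 (using the same dff dataframe):

dff.groupby('B').filter(lambda x: x['A'].mean() > 3)  # keeps groups 'b' and 'c', drops 'a'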

See also

You can read more about filtering in the Pandas docs on Filtration.

6.13. Further Reading and Resources#

6.14. References#

Glu13

R. Glushko, editor. The Discipline of Organizing. MIT Press, 2013.

Har16

M. Harrison. Learning the Pandas Library: Python Tools for Data Munging, Analysis, and Visual: First Edition. hairysun.com, 2016.

WG17

H. Wickham and G. Grolemund. R for Data Science. O'Reilly, 2017.


1

Internally, pandas is primarily organized using columns. This just means that the computer code underneath pandas is efficient at adding columns but slow at adding new rows one at a time. Thus, adding one row at a time is very inefficient. As a result, it is usually best to gather all your rows first and then create the dataframe in one step.
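As a minimal sketch of this advice (the column names and values below are made up purely for illustration), collect the rows first, for example as a list of dictionaries, and then build the dataframe once at the end:

import pandas as pd

rows = []
for i in range(5):
    rows.append({'trial_num': i, 'response': i * 2})  # hypothetical values
df_all_at_once = pd.DataFrame(rows)  # create the dataframe in a single step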