Filtering and groupby#

Context#

We spent yesterday getting comfortable in pandas. We opened some data, explored it, and ran a few functions. Today we are going to expand on what we are doing by learning a few powerful ways to interrogate our data – filtering and grouping. These manipulations help us get at the specific part of a dataset that we need.

Loading our data#

We’ll be using the same USGS water dataset as lesson 3. Here we read and pre-process the data.

import pandas as pd

# This line shortens the output
pd.set_option('display.max_rows', 10)
# Read in the data
water_vars = pd.read_csv('./data/englewood_3_12_21_usgs_water.tsv', sep='\t', skiprows=30)
# There are a lot of variables here, so let's shorten our dataframe to a few variables
water_vars = water_vars[['datetime', '210920_00060', '210922_00010', '210924_00300', '210925_00400']]
# Get rid of the first row of hard-coded datatype info
water_vars = water_vars.drop(0)
# Rename the columns from their USGS codes to more human-readable names
name_codes = {'210920_00060': 'discharge','210922_00010': 'temperature', '210924_00300': 'dissolved oxygen', '210925_00400': 'pH'}
water_vars = water_vars.rename(columns=name_codes)
# Convert columns with numbers to a numeric type
water_vars['discharge'] = pd.to_numeric(water_vars['discharge'])
water_vars['temperature'] = pd.to_numeric(water_vars['temperature'])
water_vars['dissolved oxygen'] = pd.to_numeric(water_vars['dissolved oxygen'])
water_vars['pH'] = pd.to_numeric(water_vars['pH'])

Today I’m also going to add two fake columns of data, “dam release” and “safety level”, that will help us as we go through some of the new concepts.

import random
water_vars['dam release'] = [random.choice([True, False]) for x in range(len(water_vars))]
water_vars['safety level'] = [random.choice(['low', 'medium', 'high']) for x in range(len(water_vars))]
water_vars
datetime discharge temperature dissolved oxygen pH dam release safety level
1 2021-03-12 00:00 44.5 8.1 8.3 8.1 False high
2 2021-03-12 00:15 44.5 8.1 8.2 8.1 True high
3 2021-03-12 00:30 44.5 8.1 8.2 8.1 True high
4 2021-03-12 00:45 44.5 8.1 8.1 8.1 True medium
5 2021-03-12 01:00 44.5 8.1 8.1 8.1 False medium
... ... ... ... ... ... ... ...
142 2021-03-13 11:15 42.6 6.7 9.8 7.9 True high
143 2021-03-13 11:30 42.6 6.7 9.9 7.9 True medium
144 2021-03-13 11:45 42.6 6.7 10.2 7.9 False medium
145 2021-03-13 12:00 46.5 6.7 10.3 7.9 True medium
146 2021-03-13 12:15 NaN 6.6 10.3 7.9 False high

146 rows × 7 columns

Filtering#

So let’s start with a general question. We learned about booleans and comparisons on Monday of this week.

What would happen if we used a comparison on a pandas dataframe?

Give this a think: water_vars['dissolved oxygen'] == 8.2

The answer? We get back another pandas object. It is the same size and shape as the column used in the comparison, but it is full of Boolean values telling us where the comparison was true.
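We can see this behavior with a tiny made-up dataframe (the values here are invented for illustration):

```python
import pandas as pd

# A tiny made-up dataframe to experiment with
df = pd.DataFrame({'temperature': [8.1, 7.9, 8.1]})

# Comparing a column to a value returns booleans, one per row
mask = df['temperature'] == 8.1
print(mask.tolist())  # → [True, False, True]
```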

So what is the use of a list of booleans? Let’s start with a simplified list of booleans and experiment.

import random
# Create just a list of booleans, instead of a full dataframe
random_booleans = [random.choice([True, False]) for x in range(146)]
# Use the list of booleans with square brackets, sort of like a key in a dictionary
water_vars[random_booleans]
datetime discharge temperature dissolved oxygen pH dam release safety level
1 2021-03-12 00:00 44.5 8.1 8.3 8.1 False high
3 2021-03-12 00:30 44.5 8.1 8.2 8.1 True high
4 2021-03-12 00:45 44.5 8.1 8.1 8.1 True medium
5 2021-03-12 01:00 44.5 8.1 8.1 8.1 False medium
6 2021-03-12 01:15 44.5 8.1 8.1 8.1 False medium
... ... ... ... ... ... ... ...
133 2021-03-13 09:00 46.5 6.8 8.9 7.8 False medium
134 2021-03-13 09:15 46.5 6.8 9.0 7.9 True low
140 2021-03-13 10:45 44.5 6.7 9.6 7.9 True high
141 2021-03-13 11:00 42.6 6.7 9.7 7.9 False medium
142 2021-03-13 11:15 42.6 6.7 9.8 7.9 True high

68 rows × 7 columns

That returned only the rows where the value in the list was True!

This is exciting, because this is the basis for filtering rows in pandas. Let's try it again with our conditional statement from above, water_vars['dissolved oxygen'] == 8.2:

# make a dataframe indicating where dissolved oxygen == 8.2
boolean_dataframe = water_vars['dissolved oxygen'] == 8.2
# returns only the rows where dissolved oxygen == 8.2
water_vars[boolean_dataframe]
datetime discharge temperature dissolved oxygen pH dam release safety level
2 2021-03-12 00:15 44.5 8.1 8.2 8.1 True high
3 2021-03-12 00:30 44.5 8.1 8.2 8.1 True high
27 2021-03-12 06:30 48.5 7.1 8.2 7.9 False low
28 2021-03-12 06:45 48.5 7.0 8.2 7.9 True high
29 2021-03-12 07:00 48.5 7.0 8.2 7.9 False low
... ... ... ... ... ... ... ...
121 2021-03-13 06:00 46.5 7.2 8.2 7.9 True high
122 2021-03-13 06:15 46.5 7.2 8.2 7.9 False medium
123 2021-03-13 06:30 46.5 7.1 8.2 7.9 False high
124 2021-03-13 06:45 46.5 7.1 8.2 7.9 True high
125 2021-03-13 07:00 46.5 NaN 8.2 7.9 False medium

24 rows × 7 columns

While we walked through it in several steps, this filtering is most commonly done in a single line:

water_vars[water_vars['dissolved oxygen'] == 8.2]
datetime discharge temperature dissolved oxygen pH dam release safety level
2 2021-03-12 00:15 44.5 8.1 8.2 8.1 True high
3 2021-03-12 00:30 44.5 8.1 8.2 8.1 True high
27 2021-03-12 06:30 48.5 7.1 8.2 7.9 False low
28 2021-03-12 06:45 48.5 7.0 8.2 7.9 True high
29 2021-03-12 07:00 48.5 7.0 8.2 7.9 False low
... ... ... ... ... ... ... ...
121 2021-03-13 06:00 46.5 7.2 8.2 7.9 True high
122 2021-03-13 06:15 46.5 7.2 8.2 7.9 False medium
123 2021-03-13 06:30 46.5 7.1 8.2 7.9 False high
124 2021-03-13 06:45 46.5 7.1 8.2 7.9 True high
125 2021-03-13 07:00 46.5 NaN 8.2 7.9 False medium

24 rows × 7 columns

📝 Check your understanding

Write a cell of code that returns only the rows of the water_vars dataframe where discharge is less than or equal to 46.

Using and and or#

You can continue to make your filter more specific by using multiple filters. The syntax that we used in the first lesson was:

num_cats = 3
num_dogs = 2
print(num_cats > 2 and num_dogs > 2)
print(num_cats > 2 or num_dogs > 2)
False
True

While the meaning of and and or is the same in pandas, the syntax is slightly different. In pandas:

  • Python's and is written as & in pandas

  • Python's or is written as | in pandas

You also have to be sure to wrap each individual conditional statement in parentheses ().
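To see why the parentheses matter, here's a minimal sketch with a made-up dataframe. Without them, Python applies & before the comparisons, which raises an error:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Correct: each comparison wrapped in parentheses
result = df[(df['a'] > 1) & (df['b'] < 6)]

# Incorrect: & binds more tightly than >, so `1 & df['b']`
# is evaluated first and pandas raises a ValueError
try:
    df[df['a'] > 1 & df['b'] < 6]
except ValueError as err:
    print(type(err).__name__)  # → ValueError

print(result)
```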

# Cells where pH < 8 and dam release was True
water_vars[(water_vars['pH'] < 8) & (water_vars['dam release'] == True)]
datetime discharge temperature dissolved oxygen pH dam release safety level
28 2021-03-12 06:45 48.5 7.0 8.2 7.9 True high
31 2021-03-12 07:30 48.5 6.9 8.3 7.9 True high
33 2021-03-12 08:00 48.5 6.8 8.4 7.9 True medium
34 2021-03-12 08:15 48.5 6.7 8.4 7.9 True low
36 2021-03-12 08:45 46.5 6.7 8.6 7.9 True low
... ... ... ... ... ... ... ...
139 2021-03-13 10:30 44.5 6.8 9.6 7.9 True low
140 2021-03-13 10:45 44.5 6.7 9.6 7.9 True high
142 2021-03-13 11:15 42.6 6.7 9.8 7.9 True high
143 2021-03-13 11:30 42.6 6.7 9.9 7.9 True medium
145 2021-03-13 12:00 46.5 6.7 10.3 7.9 True medium

29 rows × 7 columns

water_vars[(water_vars['pH'] < 8) | (water_vars['dam release'] == True)]
datetime discharge temperature dissolved oxygen pH dam release safety level
2 2021-03-12 00:15 44.5 8.1 8.2 8.1 True high
3 2021-03-12 00:30 44.5 8.1 8.2 8.1 True high
4 2021-03-12 00:45 44.5 8.1 8.1 8.1 True medium
9 2021-03-12 02:00 44.5 8.0 8.0 8.1 True low
12 2021-03-12 02:45 44.5 7.9 7.9 8.1 True medium
... ... ... ... ... ... ... ...
142 2021-03-13 11:15 42.6 6.7 9.8 7.9 True high
143 2021-03-13 11:30 42.6 6.7 9.9 7.9 True medium
144 2021-03-13 11:45 42.6 6.7 10.2 7.9 False medium
145 2021-03-13 12:00 46.5 6.7 10.3 7.9 True medium
146 2021-03-13 12:15 NaN 6.6 10.3 7.9 False high

98 rows × 7 columns

There is more than one way to do this type of filtering in pandas, but it is fine to start with one method and feel really comfortable with it before moving to the others. This article has a pretty nice overview of some other methods.
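One of those alternative methods worth a peek is .query(), which writes the same filter as a string (and, inside the string, plain and/or are allowed). A sketch on a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'pH': [7.9, 8.1, 7.8],
    'release': [True, False, True],
})

# Bracket-based filtering, as above
brackets = df[(df['pH'] < 8) & (df['release'] == True)]

# The same filter written with .query()
queried = df.query('pH < 8 and release == True')

print(brackets.equals(queried))  # → True
```

Note that column names containing spaces, like our 'dam release' column, have to be wrapped in backticks inside the query string.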

📝 Check your understanding

How is the following line of code filtering the water_vars dataset?

water_vars[(water_vars['discharge'] <= 46) | (water_vars['dam release'] == True)]

Setting an Index#

Whether you are using pandas, Excel, or a SQL database, the index is an important part of how your data is organized.

The most important property of an index is that it is unique, meaning no two rows of data are allowed to have the same index.

This is important for keeping the data organized, not just practically but also conceptually.

One of the powerful features of pandas is that you aren't required to have a number as the index of a dataframe. Let's look at this with an example dataframe:

grades = pd.DataFrame(
    {
        'name': ['Beatriz S', 'Sara H', 'Joel T', 'Ari T', 'Hassan A'],
        'history': [78, 69, 80, 91, 79],
        'science': [85, 80, 80, 73, 91],
        'music': [81, 81, 90, 73, 89],
    }
)
grades
name history science music
0 Beatriz S 78 85 81
1 Sara H 69 80 81
2 Joel T 80 80 90
3 Ari T 91 73 73
4 Hassan A 79 91 89

In this dataframe we have the default indexes 0 through 4. We could pretty smoothly use these if we wanted and just remember that Beatriz is index 0, Sara is index 1, and so on.

Alternatively, we could skip the step of remembering index numbers and just use the names of the students as the index. We do that with the .set_index() method.

grades.set_index('name')
history science music
name
Beatriz S 78 85 81
Sara H 69 80 81
Joel T 80 80 90
Ari T 91 73 73
Hassan A 79 91 89
# update the original dataframe with the new index
grades = grades.set_index('name')

.iloc vs. .loc#

So now that we have names as indexes, how do we select specific rows?

We can still use the method we already know, .iloc, with an integer index number, and we will get the same result.

grades.iloc[1]
history    69
science    80
music      81
Name: Sara H, dtype: int64

Alternatively, we can use a similar but slightly different method: .loc. .loc allows us to find specific rows based on an index label instead of an integer position.

grades.loc['Hassan A']
history    79
science    91
music      89
Name: Hassan A, dtype: int64

If you want a deep dive on .iloc vs. .loc this article is for you.

📝 Check your understanding

Write 2 lines of code to get Ari's grades from the grades dataframe. In one line use .iloc and in the other use .loc.

Explain to your partner the difference between .iloc and .loc.

One more example - datetime indices#

To move away from our toy example back to our real data, we have a great column in the water_vars table just waiting to be an index – datetime.

Dates and times generally make fantastic indices because, if the world is working in proper order, each moment in time happens only once. That guarantees their values are unique, which is why datetimes are such common indices.

water_vars = water_vars.set_index('datetime')
water_vars
discharge temperature dissolved oxygen pH dam release safety level
datetime
2021-03-12 00:00 44.5 8.1 8.3 8.1 False high
2021-03-12 00:15 44.5 8.1 8.2 8.1 True high
2021-03-12 00:30 44.5 8.1 8.2 8.1 True high
2021-03-12 00:45 44.5 8.1 8.1 8.1 True medium
2021-03-12 01:00 44.5 8.1 8.1 8.1 False medium
... ... ... ... ... ... ...
2021-03-13 11:15 42.6 6.7 9.8 7.9 True high
2021-03-13 11:30 42.6 6.7 9.9 7.9 True medium
2021-03-13 11:45 42.6 6.7 10.2 7.9 False medium
2021-03-13 12:00 46.5 6.7 10.3 7.9 True medium
2021-03-13 12:15 NaN 6.6 10.3 7.9 False high

146 rows × 6 columns

water_vars.loc['2021-03-12 13:00']
discharge             40.8
temperature            7.4
dissolved oxygen      11.6
pH                     8.1
dam release          False
safety level        medium
Name: 2021-03-12 13:00, dtype: object
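One caveat: set_index('datetime') leaves our index as plain strings. If we additionally convert it with pd.to_datetime(), pandas treats it as a true datetime index and lets us select with partial date strings. A sketch with made-up readings:

```python
import pandas as pd

readings = pd.DataFrame({
    'datetime': ['2021-03-12 00:00', '2021-03-12 00:15', '2021-03-13 00:00'],
    'discharge': [44.5, 44.5, 42.6],
})
readings = readings.set_index('datetime')

# Convert the string index into a true DatetimeIndex
readings.index = pd.to_datetime(readings.index)

# A partial date string now selects every row from that day
march_12 = readings.loc['2021-03-12']
print(len(march_12))  # → 2
```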

📝 Check your understanding

Write a line of code to get the row corresponding to March 13th 2021 at 8:00am.

Groupby#

While we have 146 individual readings in our dataset, sometimes we don't care about each individual reading. Instead, we probably care about the aggregate of a specific group of readings.

For example:

  • Given the average temperature of every county in the US, what is the average temperature in each state?

  • Given a list of the opening dates of every Chuck E Cheese store, how many Chuck E Cheeses were opened each year? 🧀

In pandas (and tabular data in general) we answer questions like these with groupby.

Breaking groupby into conceptual parts#

In addition to the dataframe, there are three main parts to a groupby:

  1. Which variable we want to group together

  2. How we want to group

  3. The variable we want to see in the end

Without getting into syntax yet, we can start by identifying these in our two example questions.

Given the average temperature of every county in the US, what is the average temperature in each state?

  • Which variable to group together? -> We want to group counties into states

  • How do we want to group? -> Take the average

  • What variable do we want to look at? -> Temperature

Given a list of the opening dates of every Chuck E Cheese stores, how many Chuck E Cheeses were opened each year?

  • Which variable to group together? -> We want to group individual days into years

  • How do we want to group? -> Count them

  • What variable do we want to look at? -> Number of stores
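Jumping ahead to syntax for a moment, the first question would look something like this in pandas (the states, counties, and temperatures here are all invented for illustration):

```python
import pandas as pd

# Made-up county-level average temperatures
temps = pd.DataFrame({
    'state': ['CO', 'CO', 'NM', 'NM'],
    'county': ['Denver', 'Boulder', 'Taos', 'Santa Fe'],
    'temperature': [10.0, 8.0, 12.0, 14.0],
})

# Group counties into states (which), take the mean (how), of temperature (what)
state_means = temps.groupby('state')['temperature'].mean()
print(state_means)
```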

📝 Check your understanding

Identify each of three main groupby parts in the following scenario:

Given the hourly temperatures for a location over the course of a month, what were the daily highs?

  1. Which variable to group together?

  2. How do we want to group?

  3. What variable do we want to look at?

groupby syntax#

We can take these groupby concepts and translate them into syntax. The first two parts (which variable to group and how we want to group) are required by pandas. The third one is optional.

Starting with just the two required variables, the general syntax is:

DATAFRAME.groupby(WHICH_GROUP).AGGREGATION()

Words in all capitals are variables. We’ll go into each part a little more below.

water_vars.groupby("safety level").min()
discharge temperature dissolved oxygen pH dam release
safety level
high 40.8 6.6 8.0 7.8 False
low 39.0 6.7 8.0 7.8 False
medium 39.0 6.7 7.9 7.8 False

'WHICH_GROUP'#

This variable can be any of the columns in the dataframe whose values can be put into discrete piles. We used 'safety level' because it can easily be put into piles – low, medium, and high.

We are still allowed to group on other columns; it's just that the practicality can become questionable.

water_vars.groupby("pH").min()
discharge temperature dissolved oxygen dam release safety level
pH
7.8 46.5 6.8 8.3 False high
7.9 42.6 6.6 8.2 False high
8.0 42.6 6.9 8.0 False high
8.1 40.8 7.4 7.9 False high
8.2 39.0 8.1 12.5 False high

Note

Because they are right next to each other, it can be really tempting to put the variable you are most interested in seeing in place of WHICH_GROUP. If you do that with a column that can't be sensibly grouped you'll get confusing results or an error, so double-check your WHICH_GROUP variable if you are debugging.

Aggregation Functions#

The functions you can use in a groupby are limited, but there are still lots of options. Common ones include:

  • .count() - find the total number of rows

  • .min() - find the minimum value of those rows

  • .max() - find the maximum value of those rows

  • .mean() - find the mean value of those rows

  • .sum() - find the sum of the values of those rows
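Each of these drops into the same pattern; here's a quick sketch on a made-up dataframe showing several of them side by side:

```python
import pandas as pd

df = pd.DataFrame({
    'level': ['low', 'low', 'high'],
    'flow': [10.0, 20.0, 5.0],
})

# The grouped object can be reused with different aggregation functions
grouped = df.groupby('level')['flow']

counts = grouped.count()  # number of rows per group
mins = grouped.min()      # smallest flow per group
means = grouped.mean()    # average flow per group
totals = grouped.sum()    # total flow per group

print(means)
```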

The third input#

In the conceptual explanation we often also specify which variable we want to look at. In pandas this isn't always as important, since by default it gives us the full dataframe. If for some reason we do need to specify the exact variable of interest (e.g., we have a very large dataframe, or we are making a plot), we do that by adding the column name in after the groupby.

DATAFRAME.groupby(WHICH_GROUP)[INTEREST_VAR].AGGREGATION()

# Group our water dataframe by safety level and take the mean of the discharge values for each safety level group
water_vars.groupby("safety level")['discharge'].mean()
safety level
high      45.329787
low       44.476364
medium    44.648837
Name: discharge, dtype: float64

📝 Check your understanding

  1. What is the maximum pH value during low safety level?

  2. What is the mean discharge when dam releases are happening?

Breaking down the process#

There is a lot that happens in a single groupby step, and it can be a lot to take in. One way to mentally organize this process is to think about split-apply-combine.

split-apply-combine breaks down the groupby process into those three steps:

  1. SPLIT the full data set into groups. Split is related to the question Which variable to group together?

  2. APPLY the aggregation function to the individual groups. Apply is related to the question How do we want to group?

  3. COMBINE the aggregated data into a new dataframe
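To make those steps concrete, here's a sketch (with a made-up dataframe) that does the split, apply, and combine steps by hand and then checks the result against the one-line groupby:

```python
import pandas as pd

df = pd.DataFrame({
    'safety': ['low', 'high', 'low', 'high'],
    'discharge': [40.0, 50.0, 44.0, 46.0],
})

# 1. SPLIT the dataframe into one piece per group
pieces = {level: piece for level, piece in df.groupby('safety')}

# 2. APPLY the aggregation function to each piece individually
means = {level: piece['discharge'].mean() for level, piece in pieces.items()}

# 3. COMBINE the per-group results into a new pandas object
combined = pd.Series(means)

# groupby performs all three steps in a single line
one_liner = df.groupby('safety')['discharge'].mean()
print((combined == one_liner).all())  # → True
```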