Filtering and Groupby Practice#
Part 1#
Run the following cells to import and clean the USGS data for the practice problems in Part 1.
import random
import pandas as pd
# Read in the data
water_vars = pd.read_csv('../data/englewood_3_12_21_usgs_water.tsv', sep='\t', skiprows=30)
# There are a lot of variables here, so let's shorten our dataframe to a few variables
water_vars = water_vars[['datetime', '210920_00060', '210922_00010', '210924_00300', '210925_00400']]
# Get rid of the first row of hard-coded datatype info
water_vars = water_vars.drop(0)
# Rename the columns from their USGS codes to more human-readable names
name_codes = {'210920_00060': 'discharge','210922_00010': 'temperature', '210924_00300': 'dissolved oxygen', '210925_00400': 'pH'}
water_vars = water_vars.rename(columns=name_codes)
# Convert columns with numbers to a numeric type
water_vars['discharge'] = pd.to_numeric(water_vars['discharge'])
water_vars['temperature'] = pd.to_numeric(water_vars['temperature'])
water_vars['dissolved oxygen'] = pd.to_numeric(water_vars['dissolved oxygen'])
water_vars['pH'] = pd.to_numeric(water_vars['pH'])
water_vars['dam release'] = [random.choice([True, False]) for x in range(len(water_vars))]
water_vars['safety level'] = [random.choice(['low', 'medium', 'high']) for x in range(len(water_vars))]
Filtering#
Look at the output of `water_vars['safety level'] == 'high'`. What is the output telling us? Was the safety level high during the observation at index 3?
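The mechanics can be sketched with a tiny hypothetical stand-in for `water_vars` (the real data comes from the USGS file loaded above):

```python
import pandas as pd

# Hypothetical miniature stand-in for water_vars
water_vars = pd.DataFrame({
    'temperature': [7.5, 8.2, 6.9, 9.1],
    'safety level': ['low', 'high', 'high', 'medium'],
})

# A comparison produces a boolean Series: True where the condition holds
mask = water_vars['safety level'] == 'high'
print(mask.tolist())  # [False, True, True, False]

# Passing the mask back into the dataframe keeps only the True rows
high_rows = water_vars[mask]
print(len(high_rows))  # 2
```

The boolean Series has one entry per row, so indexing with it filters the dataframe down to the rows where the condition is `True`.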
Create a dataframe showing which observations in `water_vars` recorded temperature less than 8.
Create a dataframe showing which observations in `water_vars` recorded either dissolved oxygen less than or equal to 8 or pH greater than 8.
Create a dataframe showing which observations in `water_vars` recorded discharge between 46 and 49.
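The filtering patterns these three problems call for can be sketched with hypothetical sample values (the real columns come from the USGS file):

```python
import pandas as pd

# Hypothetical miniature stand-in for water_vars
water_vars = pd.DataFrame({
    'temperature': [7.5, 8.2, 6.9, 9.1],
    'dissolved oxygen': [7.8, 9.0, 8.0, 10.2],
    'pH': [7.9, 8.4, 7.2, 7.6],
    'discharge': [45.0, 47.5, 48.9, 50.3],
})

# Simple comparison: rows colder than 8
cold = water_vars[water_vars['temperature'] < 8]

# "or" conditions use | with parentheses around each comparison
oxy_or_ph = water_vars[(water_vars['dissolved oxygen'] <= 8) | (water_vars['pH'] > 8)]

# "between" can be two & conditions, or the .between() helper (inclusive by default)
mid_discharge = water_vars[water_vars['discharge'].between(46, 49)]
```

The parentheses around each comparison matter: without them, `|` and `&` bind more tightly than the comparisons and raise an error.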
Named Indexes#
The following bit of code adds a new column of random letters to the dataframe, called `'sample id'`. Run it before doing the next problems.
import string
letter_ids = [''.join(random.choice(string.ascii_lowercase) for i in range(5)) for x in range(len(water_vars))]
water_vars['sample id'] = letter_ids
Set the new `'sample id'` column as the index of the `water_vars` dataframe.
Pick a sample id and get the data for just that row.
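A minimal sketch of both steps, using hypothetical sample ids in place of the random ones generated above:

```python
import pandas as pd

# Hypothetical stand-in: a few rows with made-up 5-letter sample ids
water_vars = pd.DataFrame({
    'discharge': [45.0, 47.5, 48.9],
    'sample id': ['abcde', 'fkghi', 'jklmn'],
})

# Promote the 'sample id' column to be the index
water_vars = water_vars.set_index('sample id')

# .loc selects a row by its index label
row = water_vars.loc['fkghi']
print(row['discharge'])  # 47.5
```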
We have mostly been looking at boolean comparisons with numbers, but you can also interrogate strings with special methods and get a list of booleans as a result. Consider the following bit of code:
water_vars.index.str.contains('k')
This code returns True/False for each index label, telling us which sample ids contain the letter k. (Note that the `.str` accessor only works on a string index; if `'sample id'` has not been set as the index yet, this line raises `AttributeError: Can only use .str accessor with string values!`.)
Using what you know about pandas filtering, use `water_vars.loc[]` and the code above (`water_vars.index.str.contains('k')`) to filter the dataframe so it returns only the rows whose index contains the letter k.
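A sketch of the pattern, again with hypothetical sample ids standing in for the random ones:

```python
import pandas as pd

# Hypothetical stand-in with made-up sample ids as the index
water_vars = pd.DataFrame(
    {'discharge': [45.0, 47.5, 48.9]},
    index=['abcde', 'fkghi', 'jklmn'],
)

# Boolean mask over the index, then .loc keeps only the True rows
has_k = water_vars.index.str.contains('k')
k_rows = water_vars.loc[has_k]
print(list(k_rows.index))  # ['fkghi', 'jklmn']
```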
Groupby#
Fill in the three groupby questions for the following scenario:
Given the environmental sustainability index (ESI) for every country, what is the minimum ESI on each continent?
Which variable to group together?
How do we want to group?
What variable do we want to look at?
Fill in the three groupby questions for the following scenario:
Given the graduation rates of many high schools, what is the average graduation rate in public vs. private schools?
Which variable to group together?
How do we want to group?
What variable do we want to look at?
Find the mean values of all of the columns at each safety level
Find the mean values of all of the columns at each discharge (less meaningful, but just for practice)
Find the number of non-nan observations for each column at each safety level
Display the mean values of just the temperature column at each safety level
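The four tasks above can be sketched with a hypothetical miniature version of `water_vars` (note the deliberate missing temperature value, so `.count()` has something to reveal):

```python
import pandas as pd

# Hypothetical stand-in; None becomes NaN in the temperature column
water_vars = pd.DataFrame({
    'temperature': [7.5, 8.2, None, 9.1],
    'discharge': [45.0, 47.5, 48.9, 47.5],
    'safety level': ['low', 'high', 'high', 'low'],
})

# Mean of every numeric column within each safety level
print(water_vars.groupby('safety level').mean(numeric_only=True))

# Grouping on a numeric column like discharge works the same way
print(water_vars.groupby('discharge').mean(numeric_only=True))

# .count() reports the number of non-NaN observations per column in each group
print(water_vars.groupby('safety level').count())

# Select one column before aggregating to see just temperature
print(water_vars.groupby('safety level')['temperature'].mean())
```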
Part 2#
For the next set of practice problems we are going to use the WAS 2020 SARP data. Run the following lines of code to import the data.
was_2020_filepath = "../data/SARP 2020 final.xlsx"
was_2020 = pd.read_excel(was_2020_filepath, "INPUT", skipfooter=7)
was_2020
[Output preview: 929 rows × 116 columns. Columns include `Can #`, `Snake can #`, `Location`, `State`, `Box`, `Date`, `Time`, `Altitude (m)`, `Temp. Celsius`, `Wind Spd (mph)`, ..., `alpha-Pinene (B/MS)`, `beta-Pinene (B/MS)`, `Box.1`, and `CH4 (ppmv height).1`. Many of the chemistry columns contain -888 values.]
Question 1#
A) Using `was_2020`, find data samples where `'CO2 (ppmv)'` was greater than 420.
B) How many data samples had CO2 over 420?
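Both parts can be sketched with a hypothetical stand-in for `was_2020`:

```python
import pandas as pd

# Hypothetical miniature stand-in for was_2020
was_2020 = pd.DataFrame({'CO2 (ppmv)': [415.2, 421.7, 433.0, 418.9]})

# A) boolean filter keeps only the high-CO2 rows
high_co2 = was_2020[was_2020['CO2 (ppmv)'] > 420]

# B) len() (or .shape[0]) counts the matching samples
print(len(high_co2))  # 2
```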
Question 2#
A) What is the mean value of `'Isoprene (E/B)'` across all the samples?
B) How many data samples had values of `'Isoprene (E/B)'` below 10?
C) Print out the values of isoprene from the previous part of the question which are less than 10. Do you notice anything unusual about the values?
D) Looking at the values in the previous question you might see that several values are marked with -888. This is likely a nodata value (but it's always good to check the data documentation to be sure).
Knowing that there are nodata values, go back and amend your code to find the rows where isoprene is less than 10 but also not equal to -888. Now how many rows are there?
E) What is the true mean value of isoprene?
This question was brought to you today as a gentle reminder to always be on the lookout for nodata 🙂
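The effect of the nodata flag can be sketched with hypothetical isoprene values (assuming, as in the real file, that -888 marks nodata):

```python
import pandas as pd

# Hypothetical values; -888 stands in for the nodata flag
was_2020 = pd.DataFrame({'Isoprene (E/B)': [5.0, -888.0, 7.0, 12.0, -888.0]})

iso = was_2020['Isoprene (E/B)']

# The naive mean is dragged far negative by the -888 flags
print(iso.mean())

# Exclude the nodata rows with & before counting values below 10
valid = iso[(iso < 10) & (iso != -888)]
print(len(valid))  # 2

# The true mean uses all values that are not the nodata flag
print(iso[iso != -888].mean())  # 8.0
```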
Question 3#
There are a lot of data columns in this dataset, many more than we can reasonably view in the preview. To help us explore our data better we might want to view a smaller subset of the data at once.
A) Create a dataframe that includes only the following columns from the `was_2020` dataset: `'Can #'`, `'State'`, `'Date'`, `'CO (ppbv)'`, and `'OCS (MS)'`.
B) Let’s say you would like to view all of the columns of CFCs together. There are several columns that all start with CFC-
: ‘CFC-12 (C/D)’, ‘CFC-11 (D/MS)’, ‘CFC-113 (D)’, ‘CFC-114 (MS)’.
Create a dataframe that includes just the columns whose names start with “CFC-“.
Google help: try "pandas filter by column name string". This StackOverflow post also has several options. The first few use a method called "list comprehension" that we haven't talked about in lecture yet. Further down are a few other options to try that don't use that technique.
C) Use the subset of data you created in part B and return just the columns of CFCs and only the rows of samples that were taken in Arizona.
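All three parts can be sketched with a hypothetical stand-in for `was_2020` that carries just a couple of the real column names:

```python
import pandas as pd

# Hypothetical miniature stand-in for was_2020
was_2020 = pd.DataFrame({
    'Can #': [5044, 7338],
    'State': ['Arizona', 'Wisconsin'],
    'CFC-12 (C/D)': [1.0, 2.0],
    'CFC-11 (D/MS)': [3.0, 4.0],
    'CO (ppbv)': [100, 120],
})

# A) subset by an explicit list of column names
subset = was_2020[['Can #', 'State', 'CO (ppbv)']]

# B) keep columns whose names start with "CFC-"
cfc_cols = was_2020.loc[:, was_2020.columns.str.startswith('CFC-')]

# C) same columns, but only the rows where State is Arizona
cfc_az = was_2020.loc[was_2020['State'] == 'Arizona',
                      was_2020.columns.str.startswith('CFC-')]
print(cfc_az.shape)  # (1, 2)
```

`.loc` accepts a boolean array on either axis, which is why the same `str.startswith` mask works for selecting columns.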
Question 4#
`groupby` questions
A) What are the mean values of `'CO (ppbv)'` from each box?
B) What is the minimum value of `'CH4 (ppmv height)'` for each `Weather` condition?
C) How many samples were taken in each state?
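The three aggregations can be sketched with a hypothetical stand-in for `was_2020`:

```python
import pandas as pd

# Hypothetical miniature stand-in for was_2020
was_2020 = pd.DataFrame({
    'Box': [37, 37, 26],
    'Weather': ['sunny', 'cloudy', 'sunny'],
    'State': ['Arizona', 'Arizona', 'Wisconsin'],
    'CO (ppbv)': [100.0, 120.0, 90.0],
    'CH4 (ppmv height)': [1.9, 2.0, 2.1],
})

# A) mean CO per box
print(was_2020.groupby('Box')['CO (ppbv)'].mean())

# B) minimum CH4 per weather condition
print(was_2020.groupby('Weather')['CH4 (ppmv height)'].min())

# C) number of samples per state (.size() counts rows, NaN or not)
print(was_2020.groupby('State').size())
```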
The final two questions continue with the WAS 2020 dataset; the water dataset reappears only in the worked example in Question 6.
Question 5#
`.groupby` also allows you to group on multiple fields. So instead of asking "what is the mean value of CO in each state?", we can be more specific and ask, "what is the mean value of CO in each state at every wind speed?"
We do this by passing a `list` of columns to `.groupby()` instead of just a single column string:
`.groupby(["State", "Wind Spd (mph)"])`
A) Try using this syntax to find the mean value of CO in each state at every wind speed
B) What was the mean value of CO in Georgia when the wind speed was 8 mph?
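A sketch with hypothetical values (the Georgia/8 mph lookup in part B is the same pattern):

```python
import pandas as pd

# Hypothetical miniature stand-in for was_2020
was_2020 = pd.DataFrame({
    'State': ['Georgia', 'Georgia', 'Arizona'],
    'Wind Spd (mph)': [8, 8, 5],
    'CO (ppbv)': [100.0, 120.0, 90.0],
})

# Grouping on a list of columns produces one group per combination
mean_co = was_2020.groupby(['State', 'Wind Spd (mph)'])['CO (ppbv)'].mean()
print(mean_co)

# The result has a MultiIndex; look up one combination with a tuple in .loc
print(mean_co.loc[('Georgia', 8)])  # 110.0
```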
Question 6#
The plain `.groupby` aggregation methods apply a single aggregation function at a time: we can see either the mean of every column, or the max of every column, and so on.
To use different aggregation functions for different columns, we drop the single aggregation method and instead use `.agg()`.
Here is an example using the water dataset:
water_vars.groupby("dam release").agg(
max_discharge=pd.NamedAgg(column='discharge', aggfunc='max'),
mean_doxy=pd.NamedAgg(column="dissolved oxygen", aggfunc='mean')
)
| dam release | max_discharge | mean_doxy |
|---|---|---|
| False | 48.5 | 9.428788 |
| True | 48.5 | 9.383750 |
Use the `.agg()` function to find the maximum values of `'CH4 (ppmv height)'`, the mean values of `'Toluene (E/B/MS)'`, and the total number of samples for each state.
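The same `NamedAgg` pattern as the water example, sketched with a hypothetical stand-in for `was_2020` (the output column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-in for was_2020
was_2020 = pd.DataFrame({
    'State': ['Arizona', 'Arizona', 'Wisconsin'],
    'CH4 (ppmv height)': [1.9, 2.0, 2.1],
    'Toluene (E/B/MS)': [10.0, 20.0, 30.0],
})

# NamedAgg lets each output column use its own aggregation function;
# 'count' on any column gives the number of samples per group
summary = was_2020.groupby('State').agg(
    max_ch4=pd.NamedAgg(column='CH4 (ppmv height)', aggfunc='max'),
    mean_toluene=pd.NamedAgg(column='Toluene (E/B/MS)', aggfunc='mean'),
    n_samples=pd.NamedAgg(column='CH4 (ppmv height)', aggfunc='count'),
)
print(summary)
```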