Filtering and Groupby Answers#
Note: This answer sheet is incomplete
Part 1#
Run the following cells to import and clean the USGS data for the practice problems in Part 1.
import random
import pandas as pd
# Read in the data
water_vars = pd.read_csv('../data/englewood_3_12_21_usgs_water.tsv', sep='\t', skiprows=30)
# There are a lot of variables here, so let's shorten our dataframe to a few variables
water_vars = water_vars[['datetime', '210920_00060', '210922_00010', '210924_00300', '210925_00400']]
# Get rid of the first row of hard-coded datatype info
water_vars = water_vars.drop(0)
# Rename the columns from their USGS codes to more human-readible names
name_codes = {'210920_00060': 'discharge','210922_00010': 'temperature', '210924_00300': 'dissolved oxygen', '210925_00400': 'pH'}
water_vars = water_vars.rename(columns=name_codes)
# Convert columns with numbers to a numeric type
water_vars['discharge'] = pd.to_numeric(water_vars['discharge'])
water_vars['temperature'] = pd.to_numeric(water_vars['temperature'])
water_vars['dissolved oxygen'] = pd.to_numeric(water_vars['dissolved oxygen'])
water_vars['pH'] = pd.to_numeric(water_vars['pH'])
water_vars['dam release'] = [random.choice([True, False]) for x in range(len(water_vars))]
water_vars['safety level'] = [random.choice(['low', 'medium', 'high']) for x in range(len(water_vars))]
Filtering#
Look at the output of `water_vars['safety level'] == 'high'`. What is the output telling us? Was the safety level high during the observation at index 3?
Create a dataframe showing which observations in `water_vars` recorded temperature less than 8.
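A sketch of the boolean-mask pattern on a toy stand-in frame; with the real data this would be `water_vars[water_vars['temperature'] < 8]`:

```python
import pandas as pd

# Toy stand-in frame; with the real data the answer is:
# water_vars[water_vars['temperature'] < 8]
demo = pd.DataFrame({'temperature': [8.1, 7.9, 6.7, 8.3]})
cold = demo[demo['temperature'] < 8]  # keep only rows where the mask is True
print(cold)
```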
Create a dataframe showing which observations in `water_vars` recorded either dissolved oxygen less than or equal to 8 or pH greater than 8.
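Two conditions are combined with `|` (or), each wrapped in parentheses. A toy sketch; with the real data: `water_vars[(water_vars['dissolved oxygen'] <= 8) | (water_vars['pH'] > 8)]`:

```python
import pandas as pd

# Toy stand-in frame; values made up for illustration
demo = pd.DataFrame({'dissolved oxygen': [8.3, 7.9, 9.8, 8.0],
                     'pH': [8.1, 7.9, 7.9, 7.8]})
# Keep rows where either condition holds; each condition needs parentheses
result = demo[(demo['dissolved oxygen'] <= 8) | (demo['pH'] > 8)]
print(result)
```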
Create a dataframe showing which observations in `water_vars` recorded discharge between 46 and 49.
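A range check can use `&` with two comparisons, or the `.between()` helper (inclusive of both ends by default). A toy sketch; with the real data: `water_vars[water_vars['discharge'].between(46, 49)]`:

```python
import pandas as pd

# Toy stand-in frame; .between(46, 49) is inclusive of both endpoints
demo = pd.DataFrame({'discharge': [44.5, 46.5, 48.5, 50.0]})
mid = demo[demo['discharge'].between(46, 49)]
print(mid)
```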
Named Indexes#
The following bit of code adds a new column of random letters to the dataframe and assigns it to a new column called `'sample id'`. Run it before doing the next problems.
import string
letter_ids = [''.join(random.choice(string.ascii_lowercase) for i in range(5)) for x in range(146)]
water_vars['sample id'] = letter_ids
Use the new `'sample id'` column and set it as the index of the `water_vars` dataframe.
water_vars = water_vars.set_index('sample id')
water_vars
datetime | discharge | temperature | dissolved oxygen | pH | dam release | safety level | |
---|---|---|---|---|---|---|---|
sample id | |||||||
cnqbq | 2021-03-12 00:00 | 44.5 | 8.1 | 8.3 | 8.1 | False | high |
yllcy | 2021-03-12 00:15 | 44.5 | 8.1 | 8.2 | 8.1 | True | high |
hjmnm | 2021-03-12 00:30 | 44.5 | 8.1 | 8.2 | 8.1 | True | medium |
vzbfl | 2021-03-12 00:45 | 44.5 | 8.1 | 8.1 | 8.1 | False | low |
ufpyz | 2021-03-12 01:00 | 44.5 | 8.1 | 8.1 | 8.1 | False | high |
... | ... | ... | ... | ... | ... | ... | ... |
plzun | 2021-03-13 11:15 | 42.6 | 6.7 | 9.8 | 7.9 | True | high |
hbseu | 2021-03-13 11:30 | 42.6 | 6.7 | 9.9 | 7.9 | False | low |
ylkka | 2021-03-13 11:45 | 42.6 | 6.7 | 10.2 | 7.9 | True | medium |
axxkh | 2021-03-13 12:00 | 46.5 | 6.7 | 10.3 | 7.9 | False | medium |
qlkna | 2021-03-13 12:15 | NaN | 6.6 | 10.3 | 7.9 | False | low |
146 rows × 7 columns
Pick a sample id and get the data for just that row.
# print out your dataframe to pick an index that exists
# water_vars.loc['ceule']
We have mostly been looking at boolean comparisons with numbers, but you can also interrogate strings with special methods and get a list of booleans as a result. Consider the following bit of code:
water_vars.index.str.contains('k')
array([False, False, False, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
True, False, True, False, False, False, False, False, True,
False, False, True, False, False, True, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
True, True, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, True,
False, False, True, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, True,
True, True, False, False, False, False, False, False, False,
False, True, False, True, False, False, False, True, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, False, False, True,
True, True])
This code is telling us True/False for which indexes contain the letter k.
Using what you know about pandas filtering, use `water_vars.loc[]` and the code above (`water_vars.index.str.contains('k')`) to filter the dataframe to only return the rows whose index contains the letter k.
water_vars.loc[water_vars.index.str.contains('k')]
datetime | discharge | temperature | dissolved oxygen | pH | dam release | safety level | |
---|---|---|---|---|---|---|---|
sample id | |||||||
mwlek | 2021-03-12 02:45 | 44.5 | 7.9 | 7.9 | 8.1 | False | low |
jorkv | 2021-03-12 04:30 | 46.5 | 7.5 | 8.0 | 8.1 | True | high |
ochke | 2021-03-12 05:00 | 46.5 | 7.4 | 8.0 | 8.0 | False | medium |
vunjk | 2021-03-12 06:30 | 48.5 | 7.1 | 8.2 | 7.9 | True | high |
fskas | 2021-03-12 07:15 | 48.5 | 6.9 | 8.2 | 7.9 | False | medium |
jknmz | 2021-03-12 08:00 | 48.5 | 6.8 | 8.4 | 7.9 | False | high |
uloxk | 2021-03-12 08:15 | 48.5 | 6.7 | 8.4 | 7.9 | True | medium |
ltktt | 2021-03-12 13:30 | 40.8 | 7.6 | 11.9 | 8.1 | False | medium |
tguik | 2021-03-12 13:45 | 40.8 | 7.7 | 12.1 | 8.1 | False | medium |
dzkcn | 2021-03-12 20:00 | 46.5 | 8.2 | 10.3 | 8.0 | False | medium |
ofkkd | 2021-03-12 20:45 | 46.5 | 8.0 | 9.9 | 7.9 | True | medium |
ivezk | 2021-03-12 23:00 | 44.5 | 7.7 | 8.9 | 7.9 | True | low |
jzkqk | 2021-03-13 02:45 | 44.5 | 7.6 | 8.2 | 8.0 | True | medium |
yakqi | 2021-03-13 03:00 | 44.5 | 7.6 | 8.2 | 8.0 | True | medium |
kfnyd | 2021-03-13 03:15 | 44.5 | 7.6 | 8.2 | 8.0 | True | high |
jwkqv | 2021-03-13 05:30 | 46.5 | 7.3 | 8.2 | 7.9 | False | low |
lanzk | 2021-03-13 06:00 | 46.5 | 7.2 | 8.2 | 7.9 | True | medium |
ekkur | 2021-03-13 07:00 | 46.5 | NaN | 8.2 | 7.9 | False | high |
dknty | 2021-03-13 11:00 | 42.6 | 6.7 | 9.7 | 7.9 | False | high |
ylkka | 2021-03-13 11:45 | 42.6 | 6.7 | 10.2 | 7.9 | True | medium |
axxkh | 2021-03-13 12:00 | 46.5 | 6.7 | 10.3 | 7.9 | False | medium |
qlkna | 2021-03-13 12:15 | NaN | 6.6 | 10.3 | 7.9 | False | low |
Groupby#
Fill in the three groupby questions for the following scenario:
Given the environmental sustainability index (ESI) for every country, what is the minimum ESI on each continent?
Which variable to group together? Continent.
How do we want to group? Using the minimum.
What variable do we want to look at? ESI.
Fill in the three groupby questions for the following scenario:
Given the graduation rates of many high schools, what is the average graduation rate in public vs. private schools?
Which variable to group together? School type (public vs. private).
How do we want to group? Using the mean.
What variable do we want to look at? Graduation rate.
Find the mean values of all of the columns at each safety level
water_vars.groupby("safety level").mean(numeric_only=True)
Note: calling `.mean()` without `numeric_only=True` raises an error here (`TypeError: Could not convert 2021-03-12 00:00... to numeric`), because the `datetime` column holds strings that pandas cannot average. Passing `numeric_only=True` tells pandas to skip the non-numeric columns.
Find the mean values of all of the columns at each discharge (less meaningful, but just for practice)
water_vars.groupby("discharge").mean(numeric_only=True)
temperature | dissolved oxygen | pH | dam release | |
---|---|---|---|---|
discharge | ||||
39.0 | 8.385714 | 12.957143 | 8.200000 | 0.571429 |
40.8 | 8.027273 | 12.309091 | 8.136364 | 0.272727 |
42.6 | 7.261538 | 10.807692 | 7.992308 | 0.461538 |
44.5 | 7.665385 | 8.778846 | 8.005769 | 0.480769 |
46.5 | 7.314583 | 8.820000 | 7.922000 | 0.360000 |
48.5 | 7.008333 | 8.216667 | 7.933333 | 0.833333 |
Find the number of non-nan observations for each column at each safety level
water_vars.groupby("safety level").count()
datetime | discharge | temperature | dissolved oxygen | pH | dam release | |
---|---|---|---|---|---|---|
safety level | ||||||
high | 46 | 46 | 46 | 46 | 46 | 46 |
low | 53 | 53 | 51 | 53 | 53 | 53 |
medium | 47 | 46 | 47 | 47 | 47 | 47 |
Display the mean values of just the temperature column at each safety level
water_vars.groupby("safety level")['temperature'].mean()
safety level
high 7.426087
low 7.590196
medium 7.512766
Name: temperature, dtype: float64
Part 2#
For the next set of practice problems we are going to use the WAS 2020 SARP data. Run the following lines of code to import the data.
was_2020_filepath = "../data/SARP 2020 final.xlsx"
was_2020 = pd.read_excel(was_2020_filepath, "INPUT", skipfooter=7)
Question 1#
A) Using `was_2020`, find data samples where `'CO2 (ppmv)'` was greater than 420.
was_2020[was_2020['CO2 (ppmv)'] > 420]
Can # | Snake can # | Location | State | Box | Date | Time | Altitude (m) | Temp. Celsius | Wind Spd (mph) | ... | 3-Ethyltoluene (MS) | 4-Ethyltoluene (MS) | 2-Ethyltoluene (MS) | 1,3,5-Trimethylbenzene (MS) | 1,2,4-Trimethylbenzene (MS) | alpha-Pinene (B/MS) | beta-Pinene (B/MS) | Unnamed: 113 | Box.1 | CH4 (ppmv height).1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7338 | 3716 | Buffalo Park, Flagstaff, Arizona | Arizona | 37 | 2020-04-24 | 11:44:00 | 2100 | 18 | 5 | ... | 12 | 16 | 15 | 14 | 22 | 31 | 9 | NaN | 37 | 1.916 |
3 | 5012 | 3702 | Buffalo Park, Flagstaff, Arizona | Arizona | 37 | 2020-05-01 | 11:16:00 | 2100 | 19 | 16 | ... | -888 | -888 | -888 | -888 | -888 | 22 | -888 | NaN | 37 | 1.895 |
5 | 5061 | 3703 | Downtown Sedona, Arizona | Arizona | 37 | 2020-05-01 | 12:40:00 | 1350 | 26 | 8 | ... | -888 | 3 | -888 | -888 | 5 | 26 | -888 | NaN | 37 | 1.907 |
13 | 6123 | 3721 | Mars Hill/Lowell Observatory, Flagstaff, Arizona | Arizona | 37 | 2020-06-03 | 12:50:00 | 2200 | 27 | 8 | ... | 4 | -888 | -888 | -888 | 6 | 67 | 16 | NaN | 37 | 1.887 |
15 | 7335 | 3711 | Downtown Sedona, Arizona | Arizona | 37 | 2020-06-21 | 17:20:00 | 1350 | 36 | 12 | ... | 34 | 15 | 13 | 16 | 51 | 16 | -888 | NaN | 37 | 1.902 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
898 | 9172 | 2211 | Elizabeth Park, central Bellingham, Washington | Washington | 22 | 2020-06-05 | 10:59:00 | 22 | 16.1 | 5 | ... | 5 | -888 | -888 | -888 | 7 | 5 | -888 | NaN | 22 | 2.041 |
902 | 9168 | 2208 | My backyard, rural Bellingham, Washington | Washington | 22 | 2020-07-15 | 10:00:00 | 153 | 17.2 | 0-2mph | ... | 4 | 3 | 3 | -888 | 8 | 121 | 26 | NaN | 22 | 1.916 |
905 | 5109 | 2601 | Carl Traeger middle school, Oshkosh, Wisconsin | Wisconsin | 26 | 2020-04-15 | 10:49:00 | 243.84 | -2.2 | 7 | ... | 55 | 25 | 17 | 19 | 64 | 47 | 11 | NaN | 26 | 1.968 |
906 | 7027 | 2616 | South Park, Oshkosh, Wisconsin | Wisconsin | 26 | 2020-04-15 | 11:01:00 | 231.64 | -1.1 | 3 | ... | -888 | -888 | -888 | -888 | 4 | -888 | -888 | NaN | 26 | 1.965 |
908 | 5131 | 2602 | South Park, Oshkosh, Wisconsin | Wisconsin | 26 | 2020-04-22 | 12:26:00 | 231.64 | 7.2 | 17 | ... | -888 | -888 | -888 | -888 | -888 | 3 | -888 | NaN | 26 | 1.953 |
420 rows × 116 columns
B) How many data samples had CO2 over 420?
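Counting rows is just `len()` of the filtered frame; with the real data, `len(was_2020[was_2020['CO2 (ppmv)'] > 420])` (the preview above shows the real answer is 420 rows). A toy sketch of the pattern:

```python
import pandas as pd

# Toy stand-in frame; with the real data:
# len(was_2020[was_2020['CO2 (ppmv)'] > 420])
demo = pd.DataFrame({'CO2 (ppmv)': [410.0, 421.5, 430.2, 415.0]})
n_high = len(demo[demo['CO2 (ppmv)'] > 420])
print(n_high)  # 2
```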
Question 2#
A) What is the mean value of `'Isoprene (E/B)'` across all the samples?
was_2020['Isoprene (E/B)'].mean()
346.80193756727664
B) How many data samples had values of `'Isoprene (E/B)'` below 10?
was_2020[was_2020['Isoprene (E/B)'] < 10]['Isoprene (E/B)']
0 4
26 5
27 -888
35 5
66 6
...
911 5
912 -888
915 5
917 4
920 4
Name: Isoprene (E/B), Length: 157, dtype: int64
C) Print out the values of isoprene from the previous part of the question which are less than 10. Do you notice anything unusual about the values?
D) Looking at the values in the previous question you might see that several values are marked with -888. This is likely a nodata value (but it's always good to check the data documentation to be sure).
Knowing that there are nodata values, go back and amend your code to find the rows where isoprene is less than 10 but also not equal to -888. Now how many rows are there?
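Both conditions combine with `&`; with the real data: `was_2020[(was_2020['Isoprene (E/B)'] < 10) & (was_2020['Isoprene (E/B)'] != -888)]`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame containing some -888 nodata values
demo = pd.DataFrame({'Isoprene (E/B)': [4, -888, 5, 12, -888, 6]})
# Keep rows below 10 that are NOT the nodata marker
low = demo[(demo['Isoprene (E/B)'] < 10) & (demo['Isoprene (E/B)'] != -888)]
print(len(low))  # 3
```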
E) What is the true mean value of isoprene?
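Dropping the -888 rows before averaging gives the true mean; with the real data: `was_2020[was_2020['Isoprene (E/B)'] != -888]['Isoprene (E/B)'].mean()`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame; the -888 nodata values would drag the mean down
demo = pd.DataFrame({'Isoprene (E/B)': [4, -888, 5, 12, -888, 6]})
valid = demo[demo['Isoprene (E/B)'] != -888]['Isoprene (E/B)']
print(valid.mean())  # 6.75 for this toy data
```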
This question was brought to you today as a gentle reminder to always be on the lookout for nodata 🙂
Question 3#
There are a lot of data columns in this dataset, many more than we can reasonably view in the preview. To help us explore our data better we might want to view a smaller subset of the data at once.
A) Create a dataframe that includes only the following columns from the `was_2020` dataset: `'Can #'`, `'State'`, `'Date'`, `'CO (ppbv)'`, and `'OCS (MS)'`.
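Selecting several columns at once takes a list of column names; with the real data: `was_2020[['Can #', 'State', 'Date', 'CO (ppbv)', 'OCS (MS)']]`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame with a few of the real column names (values made up)
demo = pd.DataFrame({'Can #': [7338, 5012], 'State': ['Arizona', 'Arizona'],
                     'Date': ['2020-04-24', '2020-05-01'],
                     'CO (ppbv)': [128.0, 141.2], 'OCS (MS)': [500, 510],
                     'Extra column': [1, 2]})
# Pass a list of column names to keep just those columns
subset = demo[['Can #', 'State', 'Date', 'CO (ppbv)', 'OCS (MS)']]
print(subset.columns.tolist())
```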
B) Let's say you would like to view all of the columns of CFCs together. There are several columns whose names start with `CFC-`: 'CFC-12 (C/D)', 'CFC-11 (D/MS)', 'CFC-113 (D)', 'CFC-114 (MS)'.
Create a dataframe that includes just the columns whose names start with “CFC-“.
Google help: try “pandas filter by column name string”. This stackoverflow also has several options. The first few use a method called “list comprehension” that we haven’t talked about in lecture yet. Further down are a few other options to try that don’t use that technique.
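One option that avoids list comprehension is `.columns.str.startswith()` as a column mask inside `.loc`; with the real data: `was_2020.loc[:, was_2020.columns.str.startswith('CFC-')]`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame using the real CFC column names (values made up)
demo = pd.DataFrame({'Can #': [1], 'CFC-12 (C/D)': [500], 'CFC-11 (D/MS)': [230],
                     'CFC-113 (D)': [70], 'CFC-114 (MS)': [16]})
# Boolean mask over column names selects every column starting with 'CFC-'
cfcs = demo.loc[:, demo.columns.str.startswith('CFC-')]
print(cfcs.columns.tolist())
```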
C) Use the subset of data you created in part B and return just the columns of CFCs and only the rows of samples that were taken in Arizona.
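`.loc` can take a row condition and a column mask at the same time; with the real data: `was_2020.loc[was_2020['State'] == 'Arizona', was_2020.columns.str.startswith('CFC-')]`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame; values made up for illustration
demo = pd.DataFrame({'State': ['Arizona', 'Texas'],
                     'CFC-12 (C/D)': [500, 510],
                     'CFC-11 (D/MS)': [230, 235]})
# Row filter (Arizona only) and column filter (CFC- columns) in one .loc call
az_cfcs = demo.loc[demo['State'] == 'Arizona', demo.columns.str.startswith('CFC-')]
print(az_cfcs)
```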
Question 4 - `groupby` questions#
A) What are the mean values of `'CO (ppbv)'` from each box?
was_2020.groupby('Box')['CO (ppbv)'].mean()
Box
1 128.652174
3 141.217391
4 125.565217
6 114.333333
7 138.695652
8 204.454545
9 182.045455
10 122.217391
12 240.041667
14 121.818182
15 136.850000
16 127.083333
17 147.826087
19 137.863636
22 108.695652
23 151.958333
26 119.625000
27 157.739130
28 150.391304
29 152.304348
30 115.909091
31 136.125000
32 120.565217
34 118.380952
35 154.318182
36 159.541667
37 111.727273
38 120.285714
40 121.652174
41 197.045455
46 155.136364
49 102.125000
50 132.363636
52 102.625000
54 101.400000
55 107.521739
56 155.590909
57 143.000000
60 112.333333
61 110.260870
62 133.391304
Name: CO (ppbv), dtype: float64
B) What is the minimum value of `'CH4 (ppmv height)'` for each `Weather` condition?
was_2020.groupby('Weather')['CH4 (ppmv height)'].min()
Weather
Clear 1.892
Clear and sunny 1.882
Clear skies and sunny 1.936
Clear skies, sunny, warm wind 1.933
Clear/Sunny 1.976
...
sunny sky but hazy 1.980
sunny, but a bit hazy 1.950
sunny, hazy 1.993
sunny, slightly hazy 1.976
windy, about to rain 1.932
Name: CH4 (ppmv height), Length: 87, dtype: float64
C) How many samples were taken in each state?
was_2020.groupby("State")["Can #"].count()
State
Arizona 22
California 204
Colorado 64
Connecticut 23
Florida 22
Georgia 23
Illinois 22
Kentucky 24
Louisiana 21
Massachusetts 46
Minnesota 24
Missouri 22
New Jersey 23
New York 94
Ohio 22
Oregon 24
Texas 113
Utah 23
Virginia 66
Washington 23
Wisconsin 24
Name: Can #, dtype: int64
For the final two questions, the worked example in Question 6 briefly switches back to the water dataset.
Question 5#
`.groupby` also allows you to group on multiple fields. So instead of asking "what is the mean value of CO in each state?", we can be more specific and ask, "what is the mean value of CO in each state at every wind speed?"
We do this by passing a list of column names to `.groupby()` instead of a single column string:
.groupby(["State", "Wind Spd (mph)"])
A) Try using this syntax to find the mean value of CO in each state at every wind speed
was_2020.groupby(["State", "Wind Spd (mph)"])['CO (ppbv)'].mean()['Illinois']
Wind Spd (mph)
3 128.0
4 124.5
5 229.5
6 122.5
7 151.0
8 153.0
9 79.0
10 97.0
11 115.0
15-25 94.0
15-26 108.0
15-27 116.0
17-26 109.0
18-26 111.0
Name: CO (ppbv), dtype: float64
B) What was the mean value of CO in Georgia when the wind speed was 8 mph?
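The grouped result has a MultiIndex, so one way to pull out a single combination is `.loc[('Georgia', 8)]` (this assumes the wind speed is stored as the number 8; the real column mixes numbers and range strings like '15-25'). A toy sketch:

```python
import pandas as pd

# Toy stand-in frame; with the real data:
# was_2020.groupby(["State", "Wind Spd (mph)"])['CO (ppbv)'].mean().loc[('Georgia', 8)]
demo = pd.DataFrame({'State': ['Georgia', 'Georgia', 'Illinois'],
                     'Wind Spd (mph)': [8, 8, 8],
                     'CO (ppbv)': [150.0, 160.0, 153.0]})
means = demo.groupby(['State', 'Wind Spd (mph)'])['CO (ppbv)'].mean()
print(means.loc[('Georgia', 8)])  # 155.0 for this toy data
```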
Question 6#
The `.groupby` pattern we have used so far applies a single aggregation function at once – we can see either the mean of every column, or the max of every column, and so on.
To use different aggregation functions for different columns, we drop the aggregation function and instead use `.agg()`.
Here is an example using the water dataset:
water_vars.groupby("dam release").agg(
max_discharge=pd.NamedAgg(column='discharge', aggfunc='max'),
mean_doxy=pd.NamedAgg(column="dissolved oxygen", aggfunc='mean')
)
max_discharge | mean_doxy | |
---|---|---|
dam release | ||
False | 48.5 | 9.466667 |
True | 48.5 | 9.343243 |
Use the `.agg()` function to find the maximum values of `'CH4 (ppmv height)'`, the mean values of `'Toluene (E/B/MS)'`, and the total number of samples for each state.
was_2020.groupby("State").agg(
    max_ch4=pd.NamedAgg(column='CH4 (ppmv height)', aggfunc='max'),
    mean_toluene=pd.NamedAgg(column='Toluene (E/B/MS)', aggfunc='mean'),
    n_samples=pd.NamedAgg(column='Can #', aggfunc='count'),
)