Filtering and Groupby Answers#
Note: This answer sheet is incomplete
Part 1#
Run the following cells to import and clean the USGS data for the practice problems in Part 1.
import random
import pandas as pd
# Read in the data
water_vars = pd.read_csv('../data/englewood_3_12_21_usgs_water.tsv', sep='\t', skiprows=30)
# There are a lot of variables here, so let's shorten our dataframe to a few variables
water_vars = water_vars[['datetime', '210920_00060', '210922_00010', '210924_00300', '210925_00400']]
# Get rid of the first row of hard-coded datatype info
water_vars = water_vars.drop(0)
# Rename the columns from their USGS codes to more human-readible names
name_codes = {'210920_00060': 'discharge','210922_00010': 'temperature', '210924_00300': 'dissolved oxygen', '210925_00400': 'pH'}
water_vars = water_vars.rename(columns=name_codes)
# Convert columns with numbers to a numeric type
water_vars['discharge'] = pd.to_numeric(water_vars['discharge'])
water_vars['temperature'] = pd.to_numeric(water_vars['temperature'])
water_vars['dissolved oxygen'] = pd.to_numeric(water_vars['dissolved oxygen'])
water_vars['pH'] = pd.to_numeric(water_vars['pH'])
water_vars['dam release'] = [random.choice([True, False]) for x in range(len(water_vars))]
water_vars['safety level'] = [random.choice(['low', 'medium', 'high']) for x in range(len(water_vars))]
Filtering#
Look at the output of `water_vars['safety level'] == 'high'`. What is the output telling us? Was the safety level high during the observation at index 3?
Create a dataframe showing which observations in `water_vars` recorded temperature less than 8.
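A sketch of the boolean-mask pattern on a toy stand-in frame; with the real data this would be `water_vars[water_vars['temperature'] < 8]`:

```python
import pandas as pd

# Toy stand-in frame; with the real data the answer is:
# water_vars[water_vars['temperature'] < 8]
demo = pd.DataFrame({'temperature': [8.1, 7.9, 6.7, 8.3]})
cold = demo[demo['temperature'] < 8]  # keep only rows where the mask is True
print(cold)
```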
Create a dataframe showing which observations in `water_vars` recorded either dissolved oxygen less than or equal to 8 or pH greater than 8.
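Two conditions are combined with `|` (or), each wrapped in parentheses. A toy sketch; with the real data: `water_vars[(water_vars['dissolved oxygen'] <= 8) | (water_vars['pH'] > 8)]`:

```python
import pandas as pd

# Toy stand-in frame; values made up for illustration
demo = pd.DataFrame({'dissolved oxygen': [8.3, 7.9, 9.8, 8.0],
                     'pH': [8.1, 7.9, 7.9, 7.8]})
# Keep rows where either condition holds; each condition needs parentheses
result = demo[(demo['dissolved oxygen'] <= 8) | (demo['pH'] > 8)]
print(result)
```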
Create a dataframe showing which observations in `water_vars` recorded discharge between 46 and 49.
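A range check can use `&` with two comparisons, or the `.between()` helper (inclusive of both ends by default). A toy sketch; with the real data: `water_vars[water_vars['discharge'].between(46, 49)]`:

```python
import pandas as pd

# Toy stand-in frame; .between(46, 49) is inclusive of both endpoints
demo = pd.DataFrame({'discharge': [44.5, 46.5, 48.5, 50.0]})
mid = demo[demo['discharge'].between(46, 49)]
print(mid)
```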
Named Indexes#
The following bit of code adds a new column of random letters to the dataframe and assigns it to a new column called `'sample id'`. Run it before doing the next problems.
import string
letter_ids = [''.join(random.choice(string.ascii_lowercase) for i in range(5)) for x in range(146)]
water_vars['sample id'] = letter_ids
Use the new `'sample id'` column and set it as the index of the `water_vars` dataframe.
water_vars = water_vars.set_index('sample id')
water_vars
datetime | discharge | temperature | dissolved oxygen | pH | dam release | safety level | |
---|---|---|---|---|---|---|---|
sample id | |||||||
cnqbq | 2021-03-12 00:00 | 44.5 | 8.1 | 8.3 | 8.1 | False | high |
yllcy | 2021-03-12 00:15 | 44.5 | 8.1 | 8.2 | 8.1 | True | high |
hjmnm | 2021-03-12 00:30 | 44.5 | 8.1 | 8.2 | 8.1 | True | medium |
vzbfl | 2021-03-12 00:45 | 44.5 | 8.1 | 8.1 | 8.1 | False | low |
ufpyz | 2021-03-12 01:00 | 44.5 | 8.1 | 8.1 | 8.1 | False | high |
... | ... | ... | ... | ... | ... | ... | ... |
plzun | 2021-03-13 11:15 | 42.6 | 6.7 | 9.8 | 7.9 | True | high |
hbseu | 2021-03-13 11:30 | 42.6 | 6.7 | 9.9 | 7.9 | False | low |
ylkka | 2021-03-13 11:45 | 42.6 | 6.7 | 10.2 | 7.9 | True | medium |
axxkh | 2021-03-13 12:00 | 46.5 | 6.7 | 10.3 | 7.9 | False | medium |
qlkna | 2021-03-13 12:15 | NaN | 6.6 | 10.3 | 7.9 | False | low |
146 rows × 7 columns
Pick a sample id and get the data for just that row.
# print out your dataframe to pick an index that exists
# water_vars.loc['ceule']
We have mostly been looking at boolean comparisons with numbers, but you can also interrogate strings with special methods and get a list of booleans as a result. Consider the following bit of code:
water_vars.index.str.contains('k')
array([False, False, False, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
True, False, True, False, False, False, False, False, True,
False, False, True, False, False, True, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
True, True, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, True,
False, False, True, False, False, False, False, False, False,
False, False, True, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, True,
True, True, False, False, False, False, False, False, False,
False, True, False, True, False, False, False, True, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, True, False, False, True,
True, True])
This code is telling us True/False for which indexes contain the letter k.
Using what you know about pandas filtering, use `water_vars.loc[]` and the code above (`water_vars.index.str.contains('k')`) to filter the dataframe to only return the rows whose index contains the letter k.
water_vars.loc[water_vars.index.str.contains('k')]
datetime | discharge | temperature | dissolved oxygen | pH | dam release | safety level | |
---|---|---|---|---|---|---|---|
sample id | |||||||
mwlek | 2021-03-12 02:45 | 44.5 | 7.9 | 7.9 | 8.1 | False | low |
jorkv | 2021-03-12 04:30 | 46.5 | 7.5 | 8.0 | 8.1 | True | high |
ochke | 2021-03-12 05:00 | 46.5 | 7.4 | 8.0 | 8.0 | False | medium |
vunjk | 2021-03-12 06:30 | 48.5 | 7.1 | 8.2 | 7.9 | True | high |
fskas | 2021-03-12 07:15 | 48.5 | 6.9 | 8.2 | 7.9 | False | medium |
jknmz | 2021-03-12 08:00 | 48.5 | 6.8 | 8.4 | 7.9 | False | high |
uloxk | 2021-03-12 08:15 | 48.5 | 6.7 | 8.4 | 7.9 | True | medium |
ltktt | 2021-03-12 13:30 | 40.8 | 7.6 | 11.9 | 8.1 | False | medium |
tguik | 2021-03-12 13:45 | 40.8 | 7.7 | 12.1 | 8.1 | False | medium |
dzkcn | 2021-03-12 20:00 | 46.5 | 8.2 | 10.3 | 8.0 | False | medium |
ofkkd | 2021-03-12 20:45 | 46.5 | 8.0 | 9.9 | 7.9 | True | medium |
ivezk | 2021-03-12 23:00 | 44.5 | 7.7 | 8.9 | 7.9 | True | low |
jzkqk | 2021-03-13 02:45 | 44.5 | 7.6 | 8.2 | 8.0 | True | medium |
yakqi | 2021-03-13 03:00 | 44.5 | 7.6 | 8.2 | 8.0 | True | medium |
kfnyd | 2021-03-13 03:15 | 44.5 | 7.6 | 8.2 | 8.0 | True | high |
jwkqv | 2021-03-13 05:30 | 46.5 | 7.3 | 8.2 | 7.9 | False | low |
lanzk | 2021-03-13 06:00 | 46.5 | 7.2 | 8.2 | 7.9 | True | medium |
ekkur | 2021-03-13 07:00 | 46.5 | NaN | 8.2 | 7.9 | False | high |
dknty | 2021-03-13 11:00 | 42.6 | 6.7 | 9.7 | 7.9 | False | high |
ylkka | 2021-03-13 11:45 | 42.6 | 6.7 | 10.2 | 7.9 | True | medium |
axxkh | 2021-03-13 12:00 | 46.5 | 6.7 | 10.3 | 7.9 | False | medium |
qlkna | 2021-03-13 12:15 | NaN | 6.6 | 10.3 | 7.9 | False | low |
Groupby#
Fill in the three groupby questions for the following scenario:
Given the environmental sustainability index (ESI) for every country, what is the minimum ESI on each continent?
Which variable to group together? Continent.
How do we want to group? Using the minimum.
What variable do we want to look at? ESI.
Fill in the three groupby questions for the following scenario:
Given the graduation rates of many high schools, what is the average graduation rate in public vs. private schools?
Which variable to group together? School type (public vs. private).
How do we want to group? Using the mean.
What variable do we want to look at? Graduation rate.
Find the mean values of all of the columns at each safety level
water_vars.groupby("safety level").mean(numeric_only=True)
Note: calling `.mean()` without `numeric_only=True` raises an error here (`TypeError: Could not convert 2021-03-12 00:00... to numeric`), because the `datetime` column holds strings that pandas cannot average. Passing `numeric_only=True` tells pandas to skip the non-numeric columns.
Find the mean values of all of the columns at each discharge (less meaningful, but just for practice)
water_vars.groupby("discharge").mean(numeric_only=True)
temperature | dissolved oxygen | pH | dam release | |
---|---|---|---|---|
discharge | ||||
39.0 | 8.385714 | 12.957143 | 8.200000 | 0.571429 |
40.8 | 8.027273 | 12.309091 | 8.136364 | 0.272727 |
42.6 | 7.261538 | 10.807692 | 7.992308 | 0.461538 |
44.5 | 7.665385 | 8.778846 | 8.005769 | 0.480769 |
46.5 | 7.314583 | 8.820000 | 7.922000 | 0.360000 |
48.5 | 7.008333 | 8.216667 | 7.933333 | 0.833333 |
Find the number of non-nan observations for each column at each safety level
water_vars.groupby("safety level").count()
datetime | discharge | temperature | dissolved oxygen | pH | dam release | |
---|---|---|---|---|---|---|
safety level | ||||||
high | 46 | 46 | 46 | 46 | 46 | 46 |
low | 53 | 53 | 51 | 53 | 53 | 53 |
medium | 47 | 46 | 47 | 47 | 47 | 47 |
Display the mean values of just the temperature column at each safety level
water_vars.groupby("safety level")['temperature'].mean()
safety level
high 7.426087
low 7.590196
medium 7.512766
Name: temperature, dtype: float64
Part 2#
For the next set of practice problems we are going to use the WAS 2020 SARP data. Run the following lines of code to import the data.
was_2020_filepath = "../data/SARP 2020 final.xlsx"
was_2020 = pd.read_excel(was_2020_filepath, "INPUT", skipfooter=7)
Question 1#
A) Using `was_2020`, find data samples where `'CO2 (ppmv)'` was greater than 420.
was_2020[was_2020['CO2 (ppmv)'] > 420]
Can # | Snake can # | Location | State | Box | Date | Time | Altitude (m) | Temp. Celsius | Wind Spd (mph) | ... | 3-Ethyltoluene (MS) | 4-Ethyltoluene (MS) | 2-Ethyltoluene (MS) | 1,3,5-Trimethylbenzene (MS) | 1,2,4-Trimethylbenzene (MS) | alpha-Pinene (B/MS) | beta-Pinene (B/MS) | Unnamed: 113 | Box.1 | CH4 (ppmv height).1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 7338 | 3716 | Buffalo Park, Flagstaff, Arizona | Arizona | 37 | 2020-04-24 | 11:44:00 | 2100 | 18 | 5 | ... | 12 | 16 | 15 | 14 | 22 | 31 | 9 | NaN | 37 | 1.916 |
3 | 5012 | 3702 | Buffalo Park, Flagstaff, Arizona | Arizona | 37 | 2020-05-01 | 11:16:00 | 2100 | 19 | 16 | ... | -888 | -888 | -888 | -888 | -888 | 22 | -888 | NaN | 37 | 1.895 |
5 | 5061 | 3703 | Downtown Sedona, Arizona | Arizona | 37 | 2020-05-01 | 12:40:00 | 1350 | 26 | 8 | ... | -888 | 3 | -888 | -888 | 5 | 26 | -888 | NaN | 37 | 1.907 |
13 | 6123 | 3721 | Mars Hill/Lowell Observatory, Flagstaff, Arizona | Arizona | 37 | 2020-06-03 | 12:50:00 | 2200 | 27 | 8 | ... | 4 | -888 | -888 | -888 | 6 | 67 | 16 | NaN | 37 | 1.887 |
15 | 7335 | 3711 | Downtown Sedona, Arizona | Arizona | 37 | 2020-06-21 | 17:20:00 | 1350 | 36 | 12 | ... | 34 | 15 | 13 | 16 | 51 | 16 | -888 | NaN | 37 | 1.902 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
898 | 9172 | 2211 | Elizabeth Park, central Bellingham, Washington | Washington | 22 | 2020-06-05 | 10:59:00 | 22 | 16.1 | 5 | ... | 5 | -888 | -888 | -888 | 7 | 5 | -888 | NaN | 22 | 2.041 |
902 | 9168 | 2208 | My backyard, rural Bellingham, Washington | Washington | 22 | 2020-07-15 | 10:00:00 | 153 | 17.2 | 0-2mph | ... | 4 | 3 | 3 | -888 | 8 | 121 | 26 | NaN | 22 | 1.916 |
905 | 5109 | 2601 | Carl Traeger middle school, Oshkosh, Wisconsin | Wisconsin | 26 | 2020-04-15 | 10:49:00 | 243.84 | -2.2 | 7 | ... | 55 | 25 | 17 | 19 | 64 | 47 | 11 | NaN | 26 | 1.968 |
906 | 7027 | 2616 | South Park, Oshkosh, Wisconsin | Wisconsin | 26 | 2020-04-15 | 11:01:00 | 231.64 | -1.1 | 3 | ... | -888 | -888 | -888 | -888 | 4 | -888 | -888 | NaN | 26 | 1.965 |
908 | 5131 | 2602 | South Park, Oshkosh, Wisconsin | Wisconsin | 26 | 2020-04-22 | 12:26:00 | 231.64 | 7.2 | 17 | ... | -888 | -888 | -888 | -888 | -888 | 3 | -888 | NaN | 26 | 1.953 |
420 rows × 116 columns
B) How many data samples had CO2 over 420?
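Counting rows is just `len()` of the filtered frame; with the real data, `len(was_2020[was_2020['CO2 (ppmv)'] > 420])` (the preview above shows the real answer is 420 rows). A toy sketch of the pattern:

```python
import pandas as pd

# Toy stand-in frame; with the real data:
# len(was_2020[was_2020['CO2 (ppmv)'] > 420])
demo = pd.DataFrame({'CO2 (ppmv)': [410.0, 421.5, 430.2, 415.0]})
n_high = len(demo[demo['CO2 (ppmv)'] > 420])
print(n_high)  # 2
```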
Question 2#
A) What is the mean value of `'Isoprene (E/B)'` across all the samples?
was_2020['Isoprene (E/B)'].mean()
346.80193756727664
B) How many data samples had values of `'Isoprene (E/B)'` below 10?
was_2020[was_2020['Isoprene (E/B)'] < 10]['Isoprene (E/B)']
0 4
26 5
27 -888
35 5
66 6
...
911 5
912 -888
915 5
917 4
920 4
Name: Isoprene (E/B), Length: 157, dtype: int64
C) Print out the values of isoprene from the previous part of the question which are less than 10. Do you notice anything unusual about the values?
D) Looking at the values in the previous question you might see that several values are marked with -888. This is likely a nodata value (but it's always good to check the data documentation to be sure).
Knowing that there are nodata values, go back and amend your code to find the rows where isoprene is less than 10 but also not equal to -888. Now how many rows are there?
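Both conditions combine with `&`; with the real data: `was_2020[(was_2020['Isoprene (E/B)'] < 10) & (was_2020['Isoprene (E/B)'] != -888)]`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame containing some -888 nodata values
demo = pd.DataFrame({'Isoprene (E/B)': [4, -888, 5, 12, -888, 6]})
# Keep rows below 10 that are NOT the nodata marker
low = demo[(demo['Isoprene (E/B)'] < 10) & (demo['Isoprene (E/B)'] != -888)]
print(len(low))  # 3
```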
E) What is the true mean value of isoprene?
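Dropping the -888 rows before averaging gives the true mean; with the real data: `was_2020[was_2020['Isoprene (E/B)'] != -888]['Isoprene (E/B)'].mean()`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame; the -888 nodata values would drag the mean down
demo = pd.DataFrame({'Isoprene (E/B)': [4, -888, 5, 12, -888, 6]})
valid = demo[demo['Isoprene (E/B)'] != -888]['Isoprene (E/B)']
print(valid.mean())  # 6.75 for this toy data
```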
This question was brought to you today as a gentle reminder to always be on the lookout for nodata 🙂
Question 3#
There are a lot of data columns in this dataset, many more than we can reasonably view in the preview. To help us explore our data better we might want to view a smaller subset of the data at once.
A) Create a dataframe that includes only the following columns from the `was_2020` dataset: `'Can #'`, `'State'`, `'Date'`, `'CO (ppbv)'`, and `'OCS (MS)'`.
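Selecting several columns at once takes a list of column names; with the real data: `was_2020[['Can #', 'State', 'Date', 'CO (ppbv)', 'OCS (MS)']]`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame with a few of the real column names (values made up)
demo = pd.DataFrame({'Can #': [7338, 5012], 'State': ['Arizona', 'Arizona'],
                     'Date': ['2020-04-24', '2020-05-01'],
                     'CO (ppbv)': [128.0, 141.2], 'OCS (MS)': [500, 510],
                     'Extra column': [1, 2]})
# Pass a list of column names to keep just those columns
subset = demo[['Can #', 'State', 'Date', 'CO (ppbv)', 'OCS (MS)']]
print(subset.columns.tolist())
```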
B) Let's say you would like to view all of the columns of CFCs together. There are several columns whose names start with `CFC-`: 'CFC-12 (C/D)', 'CFC-11 (D/MS)', 'CFC-113 (D)', 'CFC-114 (MS)'.
Create a dataframe that includes just the columns whose names start with “CFC-“.
Google help: try “pandas filter by column name string”. This stackoverflow also has several options. The first few use a method called “list comprehension” that we haven’t talked about in lecture yet. Further down are a few other options to try that don’t use that technique.
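One option that avoids list comprehension is `.columns.str.startswith()` as a column mask inside `.loc`; with the real data: `was_2020.loc[:, was_2020.columns.str.startswith('CFC-')]`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame using the real CFC column names (values made up)
demo = pd.DataFrame({'Can #': [1], 'CFC-12 (C/D)': [500], 'CFC-11 (D/MS)': [230],
                     'CFC-113 (D)': [70], 'CFC-114 (MS)': [16]})
# Boolean mask over column names selects every column starting with 'CFC-'
cfcs = demo.loc[:, demo.columns.str.startswith('CFC-')]
print(cfcs.columns.tolist())
```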
C) Use the subset of data you created in part B and return just the columns of CFCs and only the rows of samples that were taken in Arizona.
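`.loc` can take a row condition and a column mask at the same time; with the real data: `was_2020.loc[was_2020['State'] == 'Arizona', was_2020.columns.str.startswith('CFC-')]`. A toy sketch:

```python
import pandas as pd

# Toy stand-in frame; values made up for illustration
demo = pd.DataFrame({'State': ['Arizona', 'Texas'],
                     'CFC-12 (C/D)': [500, 510],
                     'CFC-11 (D/MS)': [230, 235]})
# Row filter (Arizona only) and column filter (CFC- columns) in one .loc call
az_cfcs = demo.loc[demo['State'] == 'Arizona', demo.columns.str.startswith('CFC-')]
print(az_cfcs)
```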
Question 4 - `groupby` questions#
A) What are the mean values of `'CO (ppbv)'` from each box?
was_2020.groupby('Box')['CO (ppbv)'].mean()
Box
1 128.652174
3 141.217391
4 125.565217
6 114.333333
7 138.695652
8 204.454545
9 182.045455
10 122.217391
12 240.041667
14 121.818182
15 136.850000
16 127.083333
17 147.826087
19 137.863636
22 108.695652
23 151.958333
26 119.625000
27 157.739130
28 150.391304
29 152.304348
30 115.909091
31 136.125000
32 120.565217
34 118.380952
35 154.318182
36 159.541667
37 111.727273
38 120.285714
40 121.652174
41 197.045455
46 155.136364
49 102.125000
50 132.363636
52 102.625000
54 101.400000
55 107.521739
56 155.590909
57 143.000000
60 112.333333
61 110.260870
62 133.391304
Name: CO (ppbv), dtype: float64
B) What is the minimum value of `'CH4 (ppmv height)'` for each `Weather` condition?
was_2020.groupby('Weather')['CH4 (ppmv height)'].min()
Weather
Clear 1.892
Clear and sunny 1.882
Clear skies and sunny 1.936
Clear skies, sunny, warm wind 1.933
Clear/Sunny 1.976
...
sunny sky but hazy 1.980
sunny, but a bit hazy 1.950
sunny, hazy 1.993
sunny, slightly hazy 1.976
windy, about to rain 1.932
Name: CH4 (ppmv height), Length: 87, dtype: float64
C) How many samples were taken in each state?
was_2020.groupby("State")["Can #"].count()
State
Arizona 22
California 204
Colorado 64
Connecticut 23
Florida 22
Georgia 23
Illinois 22
Kentucky 24
Louisiana 21
Massachusetts 46
Minnesota 24
Missouri 22
New Jersey 23
New York 94
Ohio 22
Oregon 24
Texas 113
Utah 23
Virginia 66
Washington 23
Wisconsin 24
Name: Can #, dtype: int64
For the final two questions, the worked example in Question 6 briefly switches back to the water dataset.
Question 5#
`.groupby` also allows you to group on multiple fields. So instead of asking "what is the mean value of CO in each state?", we can be more specific and ask, "what is the mean value of CO in each state at every wind speed?"
We do this by passing a list of column names to `.groupby()` instead of a single column string:
.groupby(["State", "Wind Spd (mph)"])
A) Try using this syntax to find the mean value of CO in each state at every wind speed
was_2020.groupby(["State", "Wind Spd (mph)"])['CO (ppbv)'].mean()['Illinois']
Wind Spd (mph)
3 128.0
4 124.5
5 229.5
6 122.5
7 151.0
8 153.0
9 79.0
10 97.0
11 115.0
15-25 94.0
15-26 108.0
15-27 116.0
17-26 109.0
18-26 111.0
Name: CO (ppbv), dtype: float64
B) What was the mean value of CO in Georgia when the wind speed was 8 mph?
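The grouped result has a MultiIndex, so one way to pull out a single combination is `.loc[('Georgia', 8)]` (this assumes the wind speed is stored as the number 8; the real column mixes numbers and range strings like '15-25'). A toy sketch:

```python
import pandas as pd

# Toy stand-in frame; with the real data:
# was_2020.groupby(["State", "Wind Spd (mph)"])['CO (ppbv)'].mean().loc[('Georgia', 8)]
demo = pd.DataFrame({'State': ['Georgia', 'Georgia', 'Illinois'],
                     'Wind Spd (mph)': [8, 8, 8],
                     'CO (ppbv)': [150.0, 160.0, 153.0]})
means = demo.groupby(['State', 'Wind Spd (mph)'])['CO (ppbv)'].mean()
print(means.loc[('Georgia', 8)])  # 155.0 for this toy data
```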
Question 6#
The `.groupby` pattern we have used so far applies a single aggregation function at once – we can see either the mean of every column, or the max of every column, and so on.
To use different aggregation functions for different columns, we drop the aggregation function and instead use `.agg()`.
Here is an example using the water dataset:
water_vars.groupby("dam release").agg(
max_discharge=pd.NamedAgg(column='discharge', aggfunc='max'),
mean_doxy=pd.NamedAgg(column="dissolved oxygen", aggfunc='mean')
)
max_discharge | mean_doxy | |
---|---|---|
dam release | ||
False | 48.5 | 9.466667 |
True | 48.5 | 9.343243 |
Use the `.agg()` function to find the maximum values of `'CH4 (ppmv height)'`, the mean values of `'Toluene (E/B/MS)'`, and the total number of samples for each state.
was_2020.groupby("State").agg(
    max_ch4=pd.NamedAgg(column='CH4 (ppmv height)', aggfunc='max'),
    mean_toluene=pd.NamedAgg(column='Toluene (E/B/MS)', aggfunc='mean'),
    n_samples=pd.NamedAgg(column='Can #', aggfunc='count'),
)