Groupby#
Lesson Content
groupby
groupby bins
Context#
Yesterday we began to explore working with data in xarray. Today we are going to dig into that even deeper with a concept called groupby. Groupby is going to allow us to split our data up into different categories and analyze them based on those categories. It sounds a bit abstract right now, but just wait - it’s powerful!
import xarray as xr

# Open one day (Oct 7, 1982) of NOAA OISST v2.1 sea surface temperature from the NCEI THREDDS server
sst = xr.open_dataset("https://www.ncei.noaa.gov/thredds/dodsC/OisstBase/NetCDF/V2.1/AVHRR/198210/oisst-avhrr-v02r01.19821007.nc")
# Pull out the 'sst' variable and drop the single-valued depth (zlev) dimension
sst = sst['sst'].squeeze(dim='zlev', drop=True)
Groupby#
While we have lots of individual gridpoints in our dataset, sometimes we don’t care about each individual reading. Instead we probably care about the aggregate of a specific group of readings.
For example:
Given the average temperature of every county in the US, what is the average temperature in each state?
Given a list of the opening dates of every Chuck E Cheese store, how many Chuck E Cheeses were opened each year? 🧀
In xarray we answer questions like these with groupby.
Breaking groupby into conceptual parts#
In addition to the DataArray itself, there are three main parts to a groupby:
Which variable we want to group together
How we want to group
The variable we want to see in the end
Without getting into syntax yet, we can start by identifying these in our two example questions.
Given the average temperature of every county in the US, what is the average temperature in each state?
Which variable to group together? -> We want to group counties into states
How do we want to group? -> Take the average
What variable do we want to look at? -> Temperature
Given a list of the opening dates of every Chuck E Cheese store, how many Chuck E Cheeses were opened each year?
Which variable to group together? -> We want to group individual days into years
How do we want to group? -> Count them
What variable do we want to look at? -> Number of stores
📝 Check your understanding
Identify each of three main groupby parts in the following scenario:
Given the hourly temperatures for a location over the course of a month, what were the daily highs?
Which variable to group together?
How do we want to group?
What variable do we want to look at?
groupby syntax#
We can take these groupby concepts and translate them into syntax. The first two parts (which variable to group & how do we want to group) are required in xarray. The third one is optional.
Starting with just the two required parts, the general syntax is:
DATAARRAY.groupby(WHICH_GROUP).AGGREGATION()
Words in all capitals are variables. We'll go into each part a little more below.
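Before applying this to our SST data, here is a minimal sketch of the pattern using a small made-up DataArray (the temps variable and its values are invented purely for illustration):
import xarray as xr

# Four made-up temperature readings, each tagged with the state it came from
temps = xr.DataArray(
    [20.0, 22.0, 30.0, 32.0],
    dims='site',
    coords={'state': ('site', ['WA', 'WA', 'AZ', 'AZ'])},
)

# DATAARRAY -> temps, WHICH_GROUP -> 'state', AGGREGATION -> mean
temps.groupby('state').mean()
# One value per state: AZ -> 31.0, WA -> 21.0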
'WHICH_GROUP'#
This can be any of the dimensions of your dataset. In physical oceanography, for example, it is common to group by latitude, so that you can see how a variable changes as you move closer to or further away from the equator.
sst.groupby('lat').mean(...)
<xarray.DataArray 'sst' (lat: 720)>
array([        nan,         nan,         nan, ..., -1.19061089e+00,
       -1.18999982e+00, -1.18999982e+00], dtype=float32)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
Attributes:
    long_name:    Daily sea surface temperature
    units:        Celsius
    valid_min:    -300
    valid_max:    4500
    _ChunkSizes:  [  1   1 720 1440]
AGGREGATION#
The goal with each group of data is to end up with a single value for that group. To tell xarray how to combine the datapoints, we specify which function we would like it to use. Any of the aggregation functions we talked about at the beginning of the lesson work for this!
sst.groupby('lat').mean(...).plot()
(Figure: line plot of mean sea surface temperature as a function of latitude.)
What do we see? Hot water near the equator and chilly water near the poles.
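Any other aggregation works the same way. A quick sketch, just swapping the function at the end of the pattern:
sst.groupby('lat').max(...)  # warmest reading at each latitude
sst.groupby('lat').std(...)  # how variable the temperature is at each latitude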
Note
The ellipsis ... inside the .mean() tells xarray to take the mean over all of the remaining axes. You don't have to do that - you may instead want to take the mean over just the longitude and keep the time resolution. It's quite common, though, to want to aggregate over all remaining axes.
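To make that distinction concrete, here is a sketch of both options (the dimension names assume the OISST file loaded above):
# Reduce over every remaining dimension - one value per latitude
sst.groupby('lat').mean(...)

# Reduce over longitude only, keeping the time dimension
sst.groupby('lat').mean(dim='lon')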
time dimension#
If your data has a time dimension and it is formatted as a datetime object, you can take advantage of some slick grouping capabilities. For example, you can group by 'time.month', which will make 12 groups for you, putting all the data from each month into its own group.
# Our file only covers one day, so this produces just a single (October) group
sst.groupby('time.month').mean()
<xarray.DataArray 'sst' (month: 1, lat: 720, lon: 1440)>
array([[[       nan,        nan,        nan, ...,        nan,        nan,        nan],
        [       nan,        nan,        nan, ...,        nan,        nan,        nan],
        ...,
        [-1.1899999, -1.1899999, -1.1899999, ..., -1.1899999, -1.1899999, -1.1899999],
        [-1.1899999, -1.1899999, -1.1899999, ..., -1.1899999, -1.1899999, -1.1899999]]],
      dtype=float32)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * month    (month) int64 10
Attributes:
    long_name:    Daily sea surface temperature
    units:        Celsius
    valid_min:    -300
    valid_max:    4500
    _ChunkSizes:  [  1   1 720 1440]
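Other datetime groupings follow the same pattern. A sketch (these only get interesting with more than one time step, e.g. a year of daily files):
sst.groupby('time.season').mean()     # four groups: DJF, MAM, JJA, SON
sst.groupby('time.dayofyear').mean()  # up to 366 groups, one per calendar day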
groupby bins#
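Plain groupby makes one group for every unique value of the coordinate - 720 of them when we grouped by lat above. Sometimes you instead want a handful of coarser groups, and groupby_bins lets you collect a continuous coordinate into ranges. As a sketch, here is how we might group our SST data into 30-degree latitude bands (the bin edges here are just an example choice):
sst.groupby_bins('lat', bins=[-90, -60, -30, 0, 30, 60, 90]).mean(...)
Each gridpoint is assigned to the bin its latitude falls into, and the aggregation then works exactly as before, producing one value per bin.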
Breaking down the process#
There is a lot that happens in a single step with groupby, and it can be hard to take in all at once. One way to mentally situate this process is to think about split-apply-combine.
split-apply-combine breaks down the groupby process into three steps (there's a sketch of doing this by hand after the list):
SPLIT the full data set into groups. Split is related to the question Which variable to group together?
APPLY the aggregation function to the individual groups. Apply is related to the question How do we want to group?
COMBINE the aggregated data into a new DataArray
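To see the three steps explicitly, here is a sketch that does by hand what sst.groupby('lat').mean(...) does in one line (using the sst DataArray loaded above):
import xarray as xr

# SPLIT: iterating over a GroupBy object yields (label, group) pairs
labels, means = [], []
for label, group in sst.groupby('lat'):
    labels.append(label)
    # APPLY: reduce each group down to a single value
    means.append(group.mean(...))

# COMBINE: stitch the per-group results back into one DataArray,
# with the group labels as the new lat coordinate
combined = xr.concat(means, dim='lat').assign_coords(lat=labels)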