Hierarchical Data Structure

Hierarchical Data Structure#

In order to use the data in this notebook you will need to unzip the three files in this folder.

Content

Background
HDF4 example
HDF5 example
netCDF example

File Format Structure#

Concept No. 1: pros and cons of lenient file formats#

File Format Spectrum

	Pro	Con
Strict File Format	easy to know what you’re going to get	doesn’t handle all data types
Not Strict File Format	handles lots of data types	what’s inside is unpredictable

Concept No. 2: organizing variables by dimension#

A 3-dimensional dataset

A 4-dimensional dataset

Images from Fundamentals of NetCDF Storage by ESRI

When variables are defined in netCDF files they are also assigned dimensions. Dimensions tell us the axes along which our data varies. Common examples are latitude, longitude, and time. The values along each dimension are stored in special variables called coordinate variables. Together, the dimensions and coordinate variables tell us what the core data stored in each variable describes. Attributes store metadata about our variables.
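To make the vocabulary concrete, here is a minimal sketch that mimics the layout with plain numpy arrays. The variable names and values are made up for illustration; a real netCDF file stores the same pieces (dimensions, coordinate variables, a data variable, attributes) inside one file.

```python
import numpy as np

# Coordinate variables: one 1-D array per dimension (values are invented).
time = np.array([0, 1, 2])                  # e.g. days since the start
lat = np.array([-10.0, 0.0, 10.0, 20.0])    # degrees north
lon = np.array([100.0, 110.0])              # degrees east

# A data variable defined over the dimensions (time, lat, lon).
dims = ("time", "lat", "lon")
temperature = np.arange(time.size * lat.size * lon.size, dtype=float).reshape(
    time.size, lat.size, lon.size
)

# Attributes: metadata describing the variable.
attrs = {"units": "K", "long_name": "synthetic temperature"}

# Use the coordinate variables to translate real-world values into indices.
i = int(np.where(time == 1)[0][0])
j = int(np.where(lat == 10.0)[0][0])
k = int(np.where(lon == 110.0)[0][0])
value = temperature[i, j, k]

print(dict(zip(dims, temperature.shape)))
print(value)
```

The data array on its own is just numbers; it is the coordinate variables that say which row is which latitude.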

Concept No. 3: groups and datasets#

In the previous raster lessons we used data organized as a single dataset per file. HDF and netCDF differ in that they allow multiple datasets to be stored in the same file. To keep things organized, datasets can be stored together in groups. An analogy is to think of groups as folders in a folder structure and datasets as the individual files. Groups can contain further groups.
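The folder analogy can be sketched with nested Python dictionaries: a dictionary plays the role of a group and anything else plays the role of a dataset. This is only a model of the idea, not how HDF or netCDF libraries store data, and the group and dataset names below are invented.

```python
def walk(group, prefix=""):
    """Yield (path, dataset) pairs, recursing into nested groups."""
    for name, item in group.items():
        path = f"{prefix}/{name}"
        if isinstance(item, dict):
            # A dict stands in for a group, so descend into it.
            yield from walk(item, path)
        else:
            # Anything else stands in for a dataset.
            yield path, item


# Hypothetical file layout: groups as dicts, datasets as lists.
file_root = {
    "Emissivity": {"Mean": [[1, 2], [3, 4]], "SDev": [[0, 0], [1, 1]]},
    "Geolocation": {"Latitude": [10.0, 11.0], "Longitude": [83.0, 84.0]},
    "Quality": {"Flags": {"Cloud": [0, 1]}},
}

paths = [path for path, _ in walk(file_root)]
for path in paths:
    print(path)
```

Each printed path reads like a file path, which is also how nested groups are addressed in practice.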

HDF4#

Install#

In Anaconda Powershell:

conda install -c conda-forge -n lessons pyhdf

Documentation#

http://fhs.github.io/pyhdf/modules/SD.html (My opinion: pretty terrible docs)

Example dataset: CALIPSO#

CALIPSO Level 2: CAL_LID_L2_01kmCLay-Standard-V4-21.2020-07-01T07-32-43ZD.hdf

download link

Opening the Dataset#

from pyhdf.SD import SD, SDC

filepath = './CAL_LID_L2_01kmCLay-Standard-V4-21.2020-07-01T07-32-43ZD.hdf'

# Open the file in read-only mode (the default)
d = SD(filepath)

Exploring the dataset#

d.datasets()

Choose a key from the .datasets() dictionary and get the variable info using .select().

d.select('Integrated_Attenuated_Backscatter_532')

Getting your data#

The data can be accessed using [:] and is output as a numpy array. Any of the methods we practiced with numpy arrays in lecture can be used on the dataset.

d.select('Integrated_Attenuated_Backscatter_532')[:]

backscatter = d.select('Integrated_Attenuated_Backscatter_532')[:]

type(backscatter)

backscatter.shape

backscatter.max()

Attributes#

Get metadata with the .attributes() method.

d.select('Integrated_Attenuated_Backscatter_532').attributes()

Masking a no data value#

Some readers (xarray, for example) mask nodata values automatically, but pyhdf returns the array exactly as stored, so HDF4 fill values often remain in the data. To mask them yourself, look up the fill value and use it to build a masked array.

import numpy.ma as ma

# Mask every element equal to the nodata value
masked_backscatter = ma.masked_where(backscatter == -9999, backscatter)
# Record the nodata value as the fill value of the masked array
ma.set_fill_value(masked_backscatter, -9999)

masked_backscatter
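As a small variation, the fill value can be read from the metadata instead of being typed by hand. The sketch below uses a synthetic array and a hypothetical attributes dictionary; the key name `fillvalue` is an assumption, so check the output of .attributes() for the key your file actually uses.

```python
import numpy as np
import numpy.ma as ma

# Synthetic stand-ins for the array and its metadata (not real CALIPSO values).
backscatter = np.array([0.02, -9999.0, 0.05, -9999.0, 0.01])
attributes = {"fillvalue": -9999.0}  # hypothetical key name

# Look up the fill value rather than hard-coding it.
fill = attributes["fillvalue"]

# masked_equal masks matching elements and records the fill value.
masked = ma.masked_equal(backscatter, fill)

print(backscatter.min())   # the raw minimum is the fill value itself
print(masked.min())        # the masked minimum ignores fill values
print(masked.count())      # number of valid (unmasked) elements
```

Statistics such as min, max, and mean skip the masked elements, which is the main reason to mask before analysis.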

HDF5#

Install#

In Anaconda Powershell:

conda install -c conda-forge -n lessons h5py

Documentation#

https://docs.h5py.org/en/stable/quick.html

Example Dataset: ASTER Emissivity#

AG100.v003.83.-013.0001.h5

Attempting open with xarray#

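import xarray as xr

filepath = './AG100.v003.83.-013.0001.h5'

# Without a group, only the root of the file is read, so the result is empty
xr.open_dataset(filepath)

# Specify the group. For a nested group, pass the full path, e.g. '/Emissivity/group2'
xr.open_dataset(filepath, group='Emissivity')
# The group argument also works for netCDF files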

<xarray.Dataset>
Dimensions:  (phony_dim_2: 5, phony_dim_3: 1000, phony_dim_4: 1000)
Dimensions without coordinates: phony_dim_2, phony_dim_3, phony_dim_4
Data variables:
    Mean     (phony_dim_2, phony_dim_3, phony_dim_4) int16 ...
    SDev     (phony_dim_2, phony_dim_3, phony_dim_4) int16 ...

Opening a Dataset#

import h5py

filepath = './AG100.v003.83.-013.0001.h5'

f = h5py.File(filepath, 'r')

<HDF5 file "AG100.v003.83.-013.0001.h5" (mode r)>

Exploring Groups#

f.keys()

<KeysViewHDF5 ['ASTER GDEM', 'Emissivity', 'Geolocation', 'Land Water Map', 'NDVI', 'Observations', 'Temperature']>

f['Emissivity']

f['Emissivity'].keys()

f['Emissivity']['Mean']

You can check where you are in the file hierarchy with the .name attribute.

f.name

f['Emissivity'].name

f['Emissivity']['Mean'].name
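When a file has many groups, it can be quicker to list every dataset path in one pass than to call .keys() level by level. The following is a minimal sketch using h5py's `visititems` on a small in-memory file; the file name, group names, and shapes are made up for the demo, and `list_datasets` is a helper defined here rather than part of h5py.

```python
import h5py
import numpy as np


def list_datasets(h5file):
    """Return {full_path: shape} for every dataset in an open h5py file."""
    found = {}

    def visitor(name, obj):
        # visititems calls this for every group and dataset below the root.
        if isinstance(obj, h5py.Dataset):
            found[obj.name] = obj.shape

    h5file.visititems(visitor)
    return found


# Build a throwaway file in memory (nothing is written to disk).
with h5py.File("demo_in_memory.h5", "w", driver="core", backing_store=False) as demo:
    emissivity = demo.create_group("Emissivity")
    emissivity.create_dataset("Mean", data=np.zeros((5, 4, 4), dtype="int16"))
    emissivity.create_dataset("SDev", data=np.zeros((5, 4, 4), dtype="int16"))
    demo.create_group("NDVI").create_dataset("Mean", data=np.zeros((4, 4), dtype="int16"))
    dataset_shapes = list_datasets(demo)

for path, shape in sorted(dataset_shapes.items()):
    print(path, shape)
```

The same helper should work on a file opened from disk with h5py.File(filepath, 'r').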

Getting your data#

The datasets inside each group behave much like numpy arrays, and slicing one with [:] reads its values into an actual numpy array, so you can use any of the methods we learned about in other lectures with them.

mean_emissivity = f['Emissivity']['Mean'][:]

type(mean_emissivity)

mean_emissivity.shape

mean_emissivity.max()

from matplotlib import pyplot

pyplot.imshow(mean_emissivity[0])
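Because slicing is what triggers the read, you can also pull out only the part of a dataset you need instead of loading everything with [:]. Here is a minimal sketch on a small in-memory file; the file name, dataset path, and values are synthetic and chosen only to make the slices easy to check.

```python
import h5py
import numpy as np

# Build a throwaway file in memory (nothing is written to disk).
with h5py.File("demo_slices.h5", "w", driver="core", backing_store=False) as demo:
    data = np.arange(5 * 4 * 4, dtype="int16").reshape(5, 4, 4)
    # Intermediate groups in the path are created automatically.
    dset = demo.create_dataset("Emissivity/Mean", data=data)

    # The handle itself is an h5py Dataset, not a numpy array.
    is_ndarray_before = isinstance(dset, np.ndarray)

    first_band = dset[0]          # read only the first band
    window = dset[0, 1:3, 1:3]    # read a 2x2 window of the first band
    everything = dset[:]          # read the whole dataset

print(type(first_band), first_band.shape)
print(window)
```

For large files, reading a single band or window this way keeps memory use down.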

Attributes#

Metadata in HDF files are called attributes and are accessed with .attrs

f['Emissivity']['Mean'].attrs.keys()

f['Emissivity']['Mean'].attrs['Description']

If a group or dataset has no attributes, you will just get back an empty set of keys.

# No attributes on the Emissivity group
f['Emissivity'].attrs.keys()

# No attributes on the root group
f.attrs.keys()
