Time Series Statistics

Time Series Statistics#

Content

Load in your data
Plotting a time series
Cumulative Sum
Kendall’s Tau
Theil Sen Estimator

1) Load in your data with `pandas`#

import pandas as pd

gauge_data = pd.read_csv('englewood_3_21_21_usgs_modified.csv')

# If your file doesn't read in that easily you might also try using additional flags such as
pd.read_csv('englewood_3_21_21_usgs_modified.csv', header=1) # specify header row
pd.read_csv('englewood_3_21_21_usgs_modified.csv', skiprows=10) # ignore some rows at the top of the file

	USGS	6711565	2021-03-12 02:15	MST	44.5	8	8.2	8.1
0	USGS	6711565	2021-03-12 02:30	MST	44.5	7.9	8.0	8.1
1	USGS	6711565	2021-03-12 02:45	MST	44.5	7.9	7.9	8.1
2	USGS	6711565	2021-03-12 03:00	MST	44.5	7.8	8.0	8.1
3	USGS	6711565	2021-03-12 03:15	MST	44.5	7.8	8.0	8.1
4	USGS	6711565	2021-03-12 03:30	MST	44.5	7.7	8.0	8.1
...	...	...	...	...	...	...	...	...
131	USGS	6711565	2021-03-13 11:15	MST	42.6	6.7	9.8	7.9
132	USGS	6711565	2021-03-13 11:30	MST	42.6	6.7	9.9	7.9
133	USGS	6711565	2021-03-13 11:45	MST	42.6	6.7	10.2	7.9
134	USGS	6711565	2021-03-13 12:00	MST	46.5	6.7	10.3	7.9
135	USGS	6711565	2021-03-13 12:15	MST	NaN	6.6	10.3	7.9

136 rows × 8 columns

We see we have a pandas DataFrame with rows corresponding to 146 times and several data columns, the interesting ones being Discharge, Temperature, Dissolved oxygen and pH.

gauge_data

	agency_cd	site_no	datetime	tz_cd	Discharge	Temperature	Dissolved oxygen	pH
0	USGS	6711565	2021-03-12 00:00	MST	44.5	8.1	8.3	8.1
1	USGS	6711565	2021-03-12 00:15	MST	44.5	8.1	8.2	8.1
2	USGS	6711565	2021-03-12 00:30	MST	44.5	8.1	8.2	8.1
3	USGS	6711565	2021-03-12 00:45	MST	44.5	8.1	8.1	8.1
4	USGS	6711565	2021-03-12 01:00	MST	44.5	8.1	8.1	8.1
...	...	...	...	...	...	...	...	...
141	USGS	6711565	2021-03-13 11:15	MST	42.6	6.7	9.8	7.9
142	USGS	6711565	2021-03-13 11:30	MST	42.6	6.7	9.9	7.9
143	USGS	6711565	2021-03-13 11:45	MST	42.6	6.7	10.2	7.9
144	USGS	6711565	2021-03-13 12:00	MST	46.5	6.7	10.3	7.9
145	USGS	6711565	2021-03-13 12:15	MST	NaN	6.6	10.3	7.9

146 rows × 8 columns

Note

A Note about using pandas

You get data from a single column using the following notation:

gauge_data['Discharge']

This will be helpful while plotting.

2) Plotting a time series#

import matplotlib.pyplot as plt

There are two plots below. Plot 1 is the simpliest way to make a plot. Because with my data I have a lot of x axis labels and they are over lapping eachother in plot 2 I add some code to format my x labels to make them nicer. For your data I would start with the technique in Plot 1 and only use the Plot 2 axis labels code if your x axis labels are hard to read.

The plotting methods will work for lots of types of python data (i.e. anything that is an numpy array).

# Plot 1
# inputs: x values, y values
plt.plot(gauge_data['datetime'], gauge_data['Discharge'])

[<matplotlib.lines.Line2D at 0x1592f8400>]

../../_images/5724ae1f50a184d80213d10de02db2e7e36058071cb0bd4a8bf1e76f333f2c96.png

# Plot 2
# inputs: x values, y values
plt.plot(gauge_data['datetime'], gauge_data['Discharge'])

ax = plt.gca()
ax.xaxis.set_major_locator(plt.MaxNLocator(5))  # Use only 4 labels
ax.tick_params(axis='x', labelrotation = 80)  # Rotate the labels so the are more vertical

../../_images/4b687ed2213ab119d458acb8bff77bd518df095dcde146b39177b9b5342cbb1b.png

Optional Extra: A pandas-specific way to plot#

This plotting method is faster but only works for pandas dataframes. The benefit is that you get the lagend automatically.

# Plot 1
# inputs: x column name, y column name
gauge_data.plot('datetime', 'Discharge')

<Axes: xlabel='datetime'>

../../_images/c616f8e18fa873ceb578325dbbbca08d06b0d775ad173aa4e2e366ac14da8fcc.png

# Plot 2
# inputs: x column name, y column name
gauge_data.plot('datetime', 'Discharge')

ax = plt.gca()
ax.xaxis.set_major_locator(plt.MaxNLocator(5))  # Use only 4 labels
ax.tick_params(axis='x', labelrotation = 80)  # Rotate the labels so the are more vertical

../../_images/b0e04f2e2e7adac2a8ea49ec86ed67242b9eb27d5a476cf783f20eaa1dbb0ab0.png

3) Cumulative sum#

Subtracting the mean value as a baseline#

In Raphe’s examples the mean value of a large value was used as the baseline and then used to compute the anomaly of each value. Here we use the mean of our dataset (even though the record isn’t that long, hopefully it conveys the idea).

baseline_discharge = gauge_data['Discharge'].mean()

gauge_data['discharge_anomaly'] = gauge_data['Discharge'] - baseline_discharge

gauge_data['discharge_anomaly']

   -0.304138
   -0.304138
   -0.304138
   -0.304138
   -0.304138
         ...   
 -2.204138
 -2.204138
 -2.204138
  1.695862
       NaN
Name: discharge_anomaly, Length: 146, dtype: float64

Compute the cumulative sum#

import numpy as np

# inputs: the data variable you want to sum
discharge_csum = np.cumsum(gauge_data['discharge_anomaly'])

discharge_csum

   -3.041379e-01
   -6.082759e-01
   -9.124138e-01
   -1.216552e+00
   -1.520690e+00
           ...     
  2.712414e+00
  5.082759e-01
 -1.695862e+00
  5.684342e-14
           NaN
Name: discharge_anomaly, Length: 146, dtype: float64

Plot the cumulative sum#

import matplotlib.pyplot as plt

plt.title('Cumulative Sum Discharge')
# inputs: x values, y values
plt.plot(gauge_data['datetime'], discharge_csum)

ax = plt.gca()
ax.xaxis.set_major_locator(plt.MaxNLocator(5))  # Use only 4 labels
ax.tick_params(axis='x', labelrotation = 80)  # Rotate the labels so the are more vertical

../../_images/73c84143f4eeca18258d571f155c44b35e3ceaaa2036f28a57ed56ba546e904c.png

Kendall’s Tau#

docs page

from scipy import stats

# inputs: variable 1, variable 2, (OPTIONAL nan_policy if your dataset has nan values)
tau, p_value = stats.kendalltau(gauge_data['Discharge'], gauge_data['pH'], nan_policy='omit')
print('tau: ', tau, ' p_value: ', p_value)

tau:  -0.5386593699749752  p_value:  1.1943988281013273e-14

Scatter plot of the Discharge and pH#

gauge_data.plot.scatter('Discharge', 'pH')
plt.title('pH vs. Discharge')

Text(0.5, 1.0, 'pH vs. Discharge')

../../_images/3b9a7154d1bb004c435c7c94b810c01995313d2a1c6da200573131c54d7b0de1.png

Theil Sen Estimator#

docs page

from scipy import stats

If needed drop rows with nans

gauge_data = gauge_data.dropna(subset=['Discharge', 'pH'])

Compute the slope values for the best fit line and confidence interval lines.

Output res returns four values:

slope of the Theil line
intercept of the Theil line
slope of lower bound of confidence on the Theil line
slope of upper bound of confidence on the Theil line

# inputs are: y variable, x variable, confidence interval (value between 0 and 1)
res = stats.theilslopes(gauge_data['pH'], gauge_data['Discharge'], 0.90)
# Here I used the index as a replacement for time
print(res)

TheilslopesResult(slope=-0.02702702702702691, intercept=9.202702702702698, low_slope=-0.03636363636363624, high_slope=-0.02564102564102556)

Plot the lines#

import matplotlib.pyplot as plt

Use the slope and y intercept values from the stats.theilslopes() method and plot the lines using the form y = b + m*x

# Inputs for all plots: x values, y values, (optional color preferences)
# base scatter plot
plt.scatter(gauge_data['Discharge'], gauge_data['pH'])
# Theil slope
plt.plot(gauge_data['Discharge'], res[1] + res[0]*gauge_data['Discharge'], color='red')
# lower confidence line
plt.plot(gauge_data['Discharge'], res[1] + res[2]*gauge_data['Discharge'], '--k')
# upper confidence line
plt.plot(gauge_data['Discharge'], res[1] + res[3]*gauge_data['Discharge'], '--k')

[<matplotlib.lines.Line2D at 0x15c5f5870>]

../../_images/bbade386a49e6cb8b7e480c3bcc471ab5651bae11c157cba4101a539dacd3644.png

More color preferences for the plot are listed in the “Notes” section of the plt.plot docs.