Time Series Statistics#

1) Load in your data with pandas#

import pandas as pd
gauge_data = pd.read_csv('englewood_3_21_21_usgs_modified.csv')
# If your file doesn't read in that easily you might also try using additional flags such as
pd.read_csv('englewood_3_21_21_usgs_modified.csv', header=1) # specify header row
pd.read_csv('englewood_3_21_21_usgs_modified.csv', skiprows=10) # ignore some rows at the top of the file
USGS 6711565 2021-03-12 02:15 MST 44.5 8 8.2 8.1
0 USGS 6711565 2021-03-12 02:30 MST 44.5 7.9 8.0 8.1
1 USGS 6711565 2021-03-12 02:45 MST 44.5 7.9 7.9 8.1
2 USGS 6711565 2021-03-12 03:00 MST 44.5 7.8 8.0 8.1
3 USGS 6711565 2021-03-12 03:15 MST 44.5 7.8 8.0 8.1
4 USGS 6711565 2021-03-12 03:30 MST 44.5 7.7 8.0 8.1
... ... ... ... ... ... ... ... ...
131 USGS 6711565 2021-03-13 11:15 MST 42.6 6.7 9.8 7.9
132 USGS 6711565 2021-03-13 11:30 MST 42.6 6.7 9.9 7.9
133 USGS 6711565 2021-03-13 11:45 MST 42.6 6.7 10.2 7.9
134 USGS 6711565 2021-03-13 12:00 MST 46.5 6.7 10.3 7.9
135 USGS 6711565 2021-03-13 12:15 MST NaN 6.6 10.3 7.9

136 rows × 8 columns

We see we have a pandas DataFrame with rows corresponding to 146 times and several data columns, the interesting ones being Discharge, Temperature, Dissolved oxygen and pH.

gauge_data
agency_cd site_no datetime tz_cd Discharge Temperature Dissolved oxygen pH
0 USGS 6711565 2021-03-12 00:00 MST 44.5 8.1 8.3 8.1
1 USGS 6711565 2021-03-12 00:15 MST 44.5 8.1 8.2 8.1
2 USGS 6711565 2021-03-12 00:30 MST 44.5 8.1 8.2 8.1
3 USGS 6711565 2021-03-12 00:45 MST 44.5 8.1 8.1 8.1
4 USGS 6711565 2021-03-12 01:00 MST 44.5 8.1 8.1 8.1
... ... ... ... ... ... ... ... ...
141 USGS 6711565 2021-03-13 11:15 MST 42.6 6.7 9.8 7.9
142 USGS 6711565 2021-03-13 11:30 MST 42.6 6.7 9.9 7.9
143 USGS 6711565 2021-03-13 11:45 MST 42.6 6.7 10.2 7.9
144 USGS 6711565 2021-03-13 12:00 MST 46.5 6.7 10.3 7.9
145 USGS 6711565 2021-03-13 12:15 MST NaN 6.6 10.3 7.9

146 rows × 8 columns

Note

A Note about using pandas

You get data from a single column using the following notation:

gauge_data['Discharge']

This will be helpful while plotting.

2) Plotting a time series#

import matplotlib.pyplot as plt

There are two plots below. Plot 1 is the simpliest way to make a plot. Because with my data I have a lot of x axis labels and they are over lapping eachother in plot 2 I add some code to format my x labels to make them nicer. For your data I would start with the technique in Plot 1 and only use the Plot 2 axis labels code if your x axis labels are hard to read.

The plotting methods will work for lots of types of python data (i.e. anything that is an numpy array).

# Plot 1
# inputs: x values, y values
plt.plot(gauge_data['datetime'], gauge_data['Discharge'])
[<matplotlib.lines.Line2D at 0x1592f8400>]
../../_images/5724ae1f50a184d80213d10de02db2e7e36058071cb0bd4a8bf1e76f333f2c96.png
# Plot 2
# inputs: x values, y values
plt.plot(gauge_data['datetime'], gauge_data['Discharge'])

ax = plt.gca()
ax.xaxis.set_major_locator(plt.MaxNLocator(5))  # Use only 4 labels
ax.tick_params(axis='x', labelrotation = 80)  # Rotate the labels so the are more vertical
../../_images/4b687ed2213ab119d458acb8bff77bd518df095dcde146b39177b9b5342cbb1b.png

Optional Extra: A pandas-specific way to plot#

This plotting method is faster but only works for pandas dataframes. The benefit is that you get the lagend automatically.

# Plot 1
# inputs: x column name, y column name
gauge_data.plot('datetime', 'Discharge')
<Axes: xlabel='datetime'>
../../_images/c616f8e18fa873ceb578325dbbbca08d06b0d775ad173aa4e2e366ac14da8fcc.png
# Plot 2
# inputs: x column name, y column name
gauge_data.plot('datetime', 'Discharge')

ax = plt.gca()
ax.xaxis.set_major_locator(plt.MaxNLocator(5))  # Use only 4 labels
ax.tick_params(axis='x', labelrotation = 80)  # Rotate the labels so the are more vertical
../../_images/b0e04f2e2e7adac2a8ea49ec86ed67242b9eb27d5a476cf783f20eaa1dbb0ab0.png

3) Cumulative sum#

Subtracting the mean value as a baseline#

In Raphe’s examples the mean value of a large value was used as the baseline and then used to compute the anomaly of each value. Here we use the mean of our dataset (even though the record isn’t that long, hopefully it conveys the idea).

baseline_discharge = gauge_data['Discharge'].mean()
gauge_data['discharge_anomaly'] = gauge_data['Discharge'] - baseline_discharge
gauge_data['discharge_anomaly']
0     -0.304138
1     -0.304138
2     -0.304138
3     -0.304138
4     -0.304138
         ...   
141   -2.204138
142   -2.204138
143   -2.204138
144    1.695862
145         NaN
Name: discharge_anomaly, Length: 146, dtype: float64

Compute the cumulative sum#

import numpy as np
# inputs: the data variable you want to sum
discharge_csum = np.cumsum(gauge_data['discharge_anomaly'])
discharge_csum
0     -3.041379e-01
1     -6.082759e-01
2     -9.124138e-01
3     -1.216552e+00
4     -1.520690e+00
           ...     
141    2.712414e+00
142    5.082759e-01
143   -1.695862e+00
144    5.684342e-14
145             NaN
Name: discharge_anomaly, Length: 146, dtype: float64

Plot the cumulative sum#

import matplotlib.pyplot as plt
plt.title('Cumulative Sum Discharge')
# inputs: x values, y values
plt.plot(gauge_data['datetime'], discharge_csum)

ax = plt.gca()
ax.xaxis.set_major_locator(plt.MaxNLocator(5))  # Use only 4 labels
ax.tick_params(axis='x', labelrotation = 80)  # Rotate the labels so the are more vertical
../../_images/73c84143f4eeca18258d571f155c44b35e3ceaaa2036f28a57ed56ba546e904c.png

Kendall’s Tau#

docs page

from scipy import stats
# inputs: variable 1, variable 2, (OPTIONAL nan_policy if your dataset has nan values)
tau, p_value = stats.kendalltau(gauge_data['Discharge'], gauge_data['pH'], nan_policy='omit')
print('tau: ', tau, ' p_value: ', p_value)
tau:  -0.5386593699749752  p_value:  1.1943988281013273e-14

Scatter plot of the Discharge and pH#

gauge_data.plot.scatter('Discharge', 'pH')
plt.title('pH vs. Discharge')
Text(0.5, 1.0, 'pH vs. Discharge')
../../_images/3b9a7154d1bb004c435c7c94b810c01995313d2a1c6da200573131c54d7b0de1.png

Theil Sen Estimator#

docs page

from scipy import stats

If needed drop rows with nans

gauge_data = gauge_data.dropna(subset=['Discharge', 'pH'])

Compute the slope values for the best fit line and confidence interval lines.

Output res returns four values:

  1. slope of the Theil line

  2. intercept of the Theil line

  3. slope of lower bound of confidence on the Theil line

  4. slope of upper bound of confidence on the Theil line

# inputs are: y variable, x variable, confidence interval (value between 0 and 1)
res = stats.theilslopes(gauge_data['pH'], gauge_data['Discharge'], 0.90)
# Here I used the index as a replacement for time
print(res)
TheilslopesResult(slope=-0.02702702702702691, intercept=9.202702702702698, low_slope=-0.03636363636363624, high_slope=-0.02564102564102556)

Plot the lines#

import matplotlib.pyplot as plt

Use the slope and y intercept values from the stats.theilslopes() method and plot the lines using the form y = b + m*x

# Inputs for all plots: x values, y values, (optional color preferences)
# base scatter plot
plt.scatter(gauge_data['Discharge'], gauge_data['pH'])
# Theil slope
plt.plot(gauge_data['Discharge'], res[1] + res[0]*gauge_data['Discharge'], color='red')
# lower confidence line
plt.plot(gauge_data['Discharge'], res[1] + res[2]*gauge_data['Discharge'], '--k')
# upper confidence line
plt.plot(gauge_data['Discharge'], res[1] + res[3]*gauge_data['Discharge'], '--k')
[<matplotlib.lines.Line2D at 0x15c5f5870>]
../../_images/bbade386a49e6cb8b7e480c3bcc471ab5651bae11c157cba4101a539dacd3644.png

More color preferences for the plot are listed in the “Notes” section of the plt.plot docs.