In last week’s example notebook, we created a dummy dataset to demonstrate how one could calculate a long-term averaged seasonal cycle. This week, let’s replace the dummy data with some real data from NDBC Buoy 44025 in the Mid-Atlantic.

Personally, I love using NDBC buoy data for educational applications. It has a number of advantages:

- NDBC data is free and (relatively) easy to use. A lot of data is free, but NDBC also provides it in easy-to-use standard formats, like text files, NetCDF, and OPeNDAP, which are perfect for use in Python.
- Datasets from hundreds of buoys and stations around the world are available, which allows students to investigate location-related questions.
- The datasets feature a lot of meteorological parameters, which students are generally more familiar with. Familiarity helps when students are also trying to develop programming and data analysis skills.
- But it’s not all meteorological data. You can also find waves, sea surface temperatures, and tides for a lot of stations, and a smaller subset also includes salinity, dissolved oxygen (DO), and pH data. Plus, wind and barometric pressure data are helpful for identifying storms and understanding current movements, which impact the ocean.

So let’s dive into this dataset and grab some data to calculate a seasonal sea surface temperature (SST) average. Then, just for fun, we’ll also calculate the recent SST anomaly to see where we are today relative to the last 10 years.

By Sage Lichtenwalner, March 31, 2020

In this notebook, we will demonstrate how to easily retrieve meteorological data from the National Data Buoy Center and calculate a daily average and anomaly.

In [1]:

```
# Notebook setup
!pip install netcdf4  # xarray needs the netcdf4 library to read OPeNDAP URLs
import xarray as xr
import matplotlib.pyplot as plt

# This makes the plots prettier
import seaborn as sns
sns.set()
sns.set(rc={'figure.figsize': (8, 6)})

# This removes an annoying warning
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
```

NDBC's chief mission is to collect weather data to support the National Weather Service and its operational forecast models. NDBC maintains a fleet of buoys and shore stations that collect a variety of atmospheric and oceanographic measurements. In addition, they also serve as a data repository for a number of other partner observing systems, like IOOS. With over 1000 stations all over the world, NDBC is a great resource for those looking to play with some ocean data.

The NDBC website provides some basic displays of recent data. It also includes downloadable text files of archived data by year, which can be fun to play with in Excel (though not if you want to aggregate a number of different years or stations).
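If you'd rather skip Excel, pandas can parse those whitespace-delimited text files directly. Here's a minimal sketch using a small inline sample that mimics the standard meteorological file layout; the values are made up for illustration, and the real files (which use the same two-line header, with the second line giving the units) can be downloaded from each station's page:

```python
from io import StringIO
import pandas as pd

# A small inline sample mimicking NDBC's standard meteorological text format
# (illustrative values only, not real observations)
sample = """#YY  MM DD hh mm WDIR WSPD  GST WVHT  DPD  APD MWD   PRES ATMP WTMP
#yr  mo dy hr mn degT  m/s  m/s    m  sec  sec degT    hPa degC degC
2020 03 01 00 00  220  5.0  6.0  1.2  8.0  5.5  180 1015.0  8.0  6.5
2020 03 01 01 00  225  5.5  6.5  1.3  8.0  5.6  185 1014.5  8.1  6.5
"""

# Skip the units line, then combine the date/time columns into a DatetimeIndex
df_txt = pd.read_csv(StringIO(sample), sep=r'\s+', skiprows=[1])
df_txt = df_txt.rename(columns={'#YY': 'year', 'MM': 'month', 'DD': 'day',
                                'hh': 'hour', 'mm': 'minute'})
df_txt['time'] = pd.to_datetime(df_txt[['year', 'month', 'day', 'hour', 'minute']])
df_txt = df_txt.set_index('time').drop(columns=['year', 'month', 'day', 'hour', 'minute'])
print(df_txt['WTMP'])  # water temperature column
```

Parsing an actual downloaded file works the same way; just pass the filename instead of the StringIO object.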

Thankfully, NDBC also provides a DODS service (aka THREDDS or OPeNDAP) that makes it easy to access their full archive of NetCDF data files, which is perfect for programming.

In general, the *Standard Meteorological data* files are the best place to start, as they include air and sea surface temperatures, winds, barometric pressure, and waves. The *Oceanographic data* stations are more relevant to oceanographers, but only a few stations are available.

- From the NDBC DODS page, select the link for *stdmet*.
- Click on the station you are interested in. The NDBC homepage has a station map and search feature to help you find the right ID.
- Next you will see a list of files for each year the buoy/station was deployed. If you only want one year of data, you can select that year. If you want real-time data, select the 9999 file. If you would like the full aggregated archive, select the .ncml file.
- Click on the OPENDAP link.
- Copy the "Data URL" link, and paste it into your notebook.

For this example, I'll use my favorite buoy 44025. (Doesn't everyone have a favorite buoy?)

In [0]:

```
# The OPENDAP URL for the station we want
url = 'https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025.ncml'
```

In [0]:

```
# Open the dataset using xarray, look how simple!
ds = xr.open_dataset(url)
```

In [4]:

```
# Quick list of available variables
ds.data_vars
```

Out[4]:

In [0]:

```
# Limit to last decade
ds = ds.sel(time=slice('2010-01-01', '2020-01-01'))
```

In [6]:

```
# Quickplot of SST
ds.sea_surface_temperature.plot()
plt.title('NDBC Buoy 44025');
```

In [7]:

```
# Calculate and Plot Daily Average
daily_sst = ds.sea_surface_temperature.load().resample(time='1D').mean()
daily_sst.plot()
plt.title('Daily Average SST at NDBC Buoy 44025')
plt.xlabel('')
plt.ylabel('Sea Surface Temperature (C)');
```

Now that we have a decade of data, let's calculate a seasonal cycle which we can then use to calculate a daily anomaly measurement.

While xarray Datasets are great for accessing data, pandas DataFrames provide a bit more functionality for this kind of analysis, so our first step will be to convert our Dataset to a DataFrame.

In [8]:

```
# Convert Dataset to Dataframe
df = ds.to_dataframe()
df = df.droplevel(['latitude','longitude']) # Drop extra indices that we don't need
df.head()
```

Out[8]:

In [0]:

```
# Add a yearday column
df['yearday'] = df.index.dayofyear
```

In [0]:

```
# Calculate Annual Cycle
avg_sst = df.sea_surface_temperature.groupby(df.yearday).mean()
```

In [11]:

```
# Plot data by Yearday
plt.plot(df.yearday,df.sea_surface_temperature,'.',markersize=1,label='Raw Hourly Data');
avg_sst.plot(linewidth=3,label='10 Year Average')
plt.legend()
plt.xlabel('Day of Year')
plt.ylabel('Sea Surface Temperature (C)')
plt.title('Seasonal Cycle of Sea Surface Temperature at NDBC 44025 from 2010-2019');
plt.savefig('NDBC_44025_Seasonal_SST.png');
```

Now that we have calculated the 10-year average, we can use it to calculate a daily anomaly.

In [12]:

```
# Calculate daily average
df_daily = df.resample('1D').mean()
df_daily['yearday'] = df_daily.index.dayofyear
df_daily.head()
```

Out[12]:

In [0]:

```
# Calculate the SST anomaly based on the 10-year average
df_daily['sst_climate'] = avg_sst[df_daily.yearday].values
df_daily['sst_anomaly'] = df_daily['sea_surface_temperature'] - df_daily['sst_climate']
```

In [14]:

```
# Anomaly plot
df_daily['sea_surface_temperature'].plot(label='Raw Data')
df_daily['sst_climate'].plot(label='Climatic Prediction')
df_daily['sst_anomaly'].plot(label='Anomaly')
plt.legend(loc='upper left')
plt.xlabel('')
plt.ylabel('Sea Surface Temperature (C)')
plt.title('Sea Surface Temperature and Anomaly at NDBC 44025');
```

Finally, let's go back to the beginning, and instead of pulling 10 years of data, we will pull the last year and use the 10-year seasonal cycle to see what the anomaly looks like.

In [0]:

```
# Open the dataset and subset to the recent period (2019 through early 2020)
realtime = xr.open_dataset(url)
realtime = realtime.sel(time=slice('2019-01-01', '2020-04-01'))
# Convert to a DataFrame
realtime = realtime.to_dataframe()
realtime = realtime.droplevel(['latitude','longitude']) # Drop extra indices
# Calculate the daily average
realtime = realtime.resample('1D').mean()
realtime['yearday'] = realtime.index.dayofyear
# Add the SST anomaly based on the 10-year average
realtime['sst_climate'] = avg_sst[realtime.yearday].values
realtime['sst_anomaly'] = realtime['sea_surface_temperature'] - realtime['sst_climate']
```

In [16]:

```
# Finally, let's plot it!
fig = plt.figure(constrained_layout=True)
# Arrange the subplots so we have 1 big and 1 small graph
gs = plt.GridSpec(nrows=3, ncols=1, figure=fig)
ax1 = fig.add_subplot(gs[0:2])
ax2 = fig.add_subplot(gs[2:3])
ax1.plot(realtime['sea_surface_temperature'], linewidth=2, label='Measured Data')
ax1.plot(realtime['sst_climate'], linewidth=2, label='Climatic Average')
ax2.plot(realtime['sst_anomaly'], linewidth=2, label='Anomaly')
ax1.legend(loc='upper left')
ax1.set_ylabel('Sea Surface Temperature (C)')
ax1.set_title('Sea Surface Temperature and Anomaly at NDBC 44025')
ax2.set_ylabel('Temperature (C)')
ax2.set_title('Anomaly')
plt.savefig('NDBC_44025_SST_Anomaly.png');
```

Based on this, it looks like the ocean in the Mid-Atlantic was quite a bit warmer than average (around 1.5 degrees!) in February and March of 2020. Last year at this time, temperatures were about 0.5 to 1 degree below the recent normal, though there was a lot of variability the rest of the year.

In this new series of posts, I hope to bring you a number of Python examples that can help you and your students learn some of the ins and outs of using Python for oceanographic data analysis, especially when it comes to working with OOI data. Many of the examples will come from the work we’ve done to develop datasets for our latest Data Explorations, but we’ll save those for another day.

For our first example, we’re going to create a relatively simple **seasonal dataset**, and then use some basic data analysis techniques (**detrending**, and **annual cycle**) to model the dataset. There’s nothing too profound here, and indeed, there are probably many ways to do this better (in fact, most of the other ways are probably better). But our goal here is to keep things simple, and as you’ll see, even simple gets complicated very fast!

So, please check out the tutorial below, and I encourage you to download this notebook or open it up in Google Colab directly so you can play with it yourself. There are a few variables at the top you can easily tweak.

By Sage Lichtenwalner - March 24, 2020

In this notebook, we will first create a dummy dataset that includes:

- a simple seasonal cycle (generated using the cosine function),
- a multi-year trend,
- and random noise.

Then, we will use this dataset to demonstrate how one can pull out (that is to say, model) a linear trend and seasonal cycle.

Modeling seasonal cycles and trends is a fundamental part of data analysis, in oceanography or many other fields. If you have multiple years of data, and that data varies on a regularly recurring cycle (annually, seasonally, monthly, daily or any other period), knowing how to model those cycles is important for understanding the processes that affect the measured data and ultimately making forecasts.

Note, this is a simple example. There are more comprehensive libraries (including statsmodels and Prophet) one can use for more robust time series analysis.

In [0]:

```
# Notebook setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# This makes the plots prettier
import seaborn as sns
sns.set()
# This removes an annoying warning
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
```

In [0]:

```
# Setup a 5-year timeseries dataset
df = pd.DataFrame(index=pd.date_range('2015-01-01','2020-01-01'))
```

In [0]:

```
# Use the cosine function to calculate an annual seasonal cycle
amplitude = 5   # amplitude of the seasonal cycle
offset = 10     # long-term mean of the signal
df['y1'] = np.cos(df.index.dayofyear/365*2*np.pi - np.pi)*amplitude + offset
```

In [4]:

```
df['y1'].plot();
```

In [0]:

```
# Add a trend to the signal
trend = 5 # Increase over the full range
df['y2'] = df['y1'] + trend * (df.index - df.index[0])/np.timedelta64(5,'Y')
# This is an alternative approach
# df['y2'] = df['y1'] + trend*np.arange(0,df.shape[0])/df.shape[0]
```

In [6]:

```
df.plot();
```

In [0]:

```
# Finally, let's add some random noise
noise_mean = 0
noise_std = 2  # np.random.normal takes a standard deviation, not a variance
df['y3'] = df['y2'] + np.random.normal(noise_mean, noise_std, df.shape[0])
```

In [8]:

```
df['y3'].plot(marker='.',markersize=2,linestyle='');
plt.title('Example Seasonal Dataset with Trend');
```

Now that we have an example dataset, let's reverse the process with some good old-fashioned data analysis techniques.

One of the most common first steps is to detrend your dataset. This removes large-scale, long-term changes that may limit the effectiveness of other analysis techniques.

But your first question should be, is there a noticeable trend? (You did plot the raw data first, right?)

In the final graph of the example dataset above, you can see a trend, but it is somewhat obscured by the variability in the data. One way to make this clearer is to smooth out the data. A rolling average is one quick way to do this.

In [9]:

```
# Is there a trend? Let's plot a rolling average to see
df['y3'].rolling(30).mean().plot();
```

In this plot, we can see the long-term trend much more clearly. We can also see a clear seasonal cycle, and the fact that the noise occurs on a much smaller time scale than the seasonal cycle. In fact, the 30-day averaging seems to have mostly removed the impact of the noise.
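Why does averaging work so well here? If the noise is independent from day to day, averaging N points shrinks its standard deviation by roughly a factor of √N, so a 30-day window reduces noise with a standard deviation of 2 to about 0.37. A quick sanity check on synthetic noise:

```python
import numpy as np
import pandas as pd

# Independent Gaussian noise with standard deviation 2, as in our dataset
rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(0, 2, 2000))

smoothed = noise.rolling(30).mean()

# The rolling mean's spread should be close to 2/sqrt(30), about 0.37
print(noise.std(), smoothed.std(), 2 / np.sqrt(30))
```

The seasonal cycle, by contrast, varies slowly enough that a 30-day window barely touches it, which is why the smoothed curve keeps the cycle while dropping the noise.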

Detrending a dataset like this can easily be accomplished by fitting it to a linear model of the form

`y = ax + b`

There are a number of functions that can do this, including numpy.polyfit().

In [10]:

```
# Fit to a linear trend
df2 = df.reset_index() # polyfit needs numeric x-values, so reset the time index to integers
coeff = np.polyfit(df2.index,df2.y3,deg=1)
print('Slope (a): %f' % coeff[0])
print('Offset (b): %f' % coeff[1])
```

The polyfit function returns the coefficients of the equation above. The first element is the slope and the second is the intercept.

But remember, the slope is per day because our x-values are daily. So let's calculate the annual change by multiplying by 365.

In [11]:

```
coeff[0]*365
```

Out[11]:

From this, we see the increase is about 1 per year, which matches the trend we specified above of 5 over 5 years.

Now that we have a model for the trend, we can use it to back out the trend from our original dataset.

In [12]:

```
# Now let's remove the trend
model = np.poly1d(coeff)
df['trend'] = model(df2.index) # Remember, this model uses the integer x-axis, not time
df['y3'].plot(marker='.',linestyle='',markersize=3,label='Raw Data')
df['trend'].plot(label='Linear Trend');
plt.legend();
```

In [13]:

```
# Calculate and plot the residuals
df['residual'] = df['y3'] - df['trend']
df['residual'].plot(marker='.',linestyle='',markersize=3);
```

There are a number of ways to model cyclic processes, like a seasonal cycle. If you know the underlying process behind your data (like the harmonic tides which can be predicted using astronomical harmonics), it's always best to use that.

But in our case, let's say we don't know what that process is, only that it repeats each year. So, let's develop a simple seasonal model that has an expected value for each yearday (the day of the year, from 1 to 366), which we will calculate from the mean of each yearday across the 5-year dataset.
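As an aside, if we preferred a parametric seasonal model instead, a single annual harmonic can be fit with ordinary least squares. This sketch isn't used in the rest of the notebook; it just demonstrates the idea on a noise-free cosine like the one that generated our data:

```python
import numpy as np

# A noise-free annual cosine with amplitude 5, like the cycle in our dataset
doy = np.arange(1, 366)
y = 5 * np.cos(doy / 365 * 2 * np.pi - np.pi)

# Least-squares fit of a constant plus one annual cosine/sine pair
omega = 2 * np.pi * doy / 365
A = np.column_stack([np.ones_like(omega), np.cos(omega), np.sin(omega)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# The fitted amplitude recovers the true value of 5
amplitude = np.hypot(coef[1], coef[2])
print(amplitude)
```

Adding more harmonics (semi-annual, and so on) extends the design matrix with more cosine/sine columns, which is essentially how harmonic tidal analysis works.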

In [0]:

```
# Add a yearday column
df['yearday'] = df.index.dayofyear
```

In [15]:

```
# Plot data by yearday
plt.plot(df.yearday,df.residual,'.',markersize=3);
plt.xlabel('Day of Year');
```

In [16]:

```
# Calculate seasonal cycles
yr = df.residual.groupby(df.yearday).mean()
yr.plot(label='1-day average')
yr7 = yr.rolling(7,center=True,min_periods=4).mean()
yr7.plot(label='7-day average')
plt.legend();
plt.xlabel('Day of Year');
```

Because we only have 5 values for each yearday (remember, we set up our dataset as daily data for 5 years), our model would be a bit noisy if we only used 1-day averages.

In order to smooth our model a bit more, we will use the 7-day rolling average.

In [17]:

```
# Create and plot our seasonal prediction
df['seasonal'] = yr7[df.yearday].values
df['seasonal'].plot();
```

In [18]:

```
# Now let's compare our model to our original dataset
df['y3'].plot(label='Dataset',marker='.',linestyle='',markersize=3)
(df['trend'] + df['seasonal']).plot(label='Model')
plt.title('Measured and Modelled Seasonal Datasets')
plt.legend();
plt.savefig('Seasonal Model.png');
```

In [19]:

```
# Calculate the residuals from the full model (trend + seasonal cycle)
df['residual2'] = df['y3'] - df['trend'] - df['seasonal']
# Plot final residuals
df['residual2'].plot();
plt.title('Model Residuals');
```

From this plot, it looks like the residuals are a bit noisy, that is, there doesn't seem to be a discernible process beyond random noise (though there are many different kinds of random processes one could investigate), so our model looks like it might be a good fit.

If we look at the statistics of the noise...

In [20]:

```
df['residual2'].describe()
```

Out[20]:

We note that the mean is very close to 0, and the standard deviation is very close to 2. If you recall above, the random (that is to say, normally distributed) noise we added to our dataset used these values.

You can try modifying those values above to see if you get the same results here. Of course, you are unlikely to get exactly the same values because our model isn't a perfect fit. Calculating how good our fit (model) is, is a topic for another day.
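That said, two quick goodness-of-fit numbers are easy to compute: the root mean square error and the coefficient of determination (R²). The helper functions below are illustrative, shown on a tiny made-up array rather than our DataFrame:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean square error of a model prediction."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def r_squared(y, yhat):
    """Fraction of the variance in y explained by the model."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# A perfect model has RMSE 0 and R^2 1; a model off by a constant 0.5 has RMSE 0.5
y = np.array([1.0, 2.0, 3.0, 4.0])
print(rmse(y, y), r_squared(y, y))   # 0.0 1.0
print(rmse(y, y + 0.5))              # 0.5
```

In the notebook above, you could apply these to `df['y3']` and `df['trend'] + df['seasonal']` to put a number on the fit.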

Finally, let's take a look at the histogram of the residuals.

In [21]:

```
df['residual'].plot.hist(bins=50, label='Trend removed')
df['residual2'].plot.hist(bins=50, label='Trend + seasonal removed')
plt.legend();
```

The blue bars, which show the residuals after the linear trend was removed but before the seasonal cycle was removed, do not appear to be normally distributed.

But the orange bars (the residuals after both the trend and the seasonal cycle were removed) look a lot like a Gaussian curve.

So for our final model, the variance looks to be normally distributed (that is to say Gaussian), as we would expect since that's how we set up our initial dataset.

But it's important to note that the variance could just as easily have come from another distribution (e.g. Poisson or Weibull); we would have to run statistical tests to confirm which one.
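One such check is the D'Agostino-Pearson normality test in scipy.stats. A sketch on synthetic residuals (not the notebook's actual residuals) shows how it behaves for Gaussian versus clearly non-Gaussian data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
gaussian = rng.normal(0, 2, 1826)    # shaped like our final residuals
skewed = rng.exponential(2, 1826)    # clearly non-Gaussian, for contrast

# D'Agostino-Pearson test: a small p-value lets us reject normality
_, p_gauss = stats.normaltest(gaussian)
_, p_skew = stats.normaltest(skewed)
print(p_gauss, p_skew)
```

Applied to our notebook, `stats.normaltest(df['residual2'].dropna())` would give a formal version of the eyeball test above.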

If we are able to identify the underlying distribution of the residuals, we might then be able to describe the underlying process that controls them.

Ultimately, the variance in the residuals to our model could be from natural processes we haven't yet deciphered, or they could come from instrument error or sensitivity (or lack thereof).

Figuring that out usually isn't easy, but that's the fun of data analysis and oceanography.

As noted at the beginning, thanks to the boom in Data Science, there are now a number of timeseries analysis libraries available. As an example, let's use the Facebook Prophet tool to generate a model and make a prediction for the next year.

In [22]:

```
# Prophet requires a DataFrame with the columns ds and y
df2 = df['y3'].reset_index()
df2 = df2.rename(columns={'index':'ds', 'y3':'y'})
df2.head()
```

Out[22]:

In [23]:

```
# Model setup, using the default settings
from fbprophet import Prophet
m = Prophet()
m.fit(df2)
```

Out[23]:

In [24]:

```
# Create a time series that goes 1-year into the future
future = m.make_future_dataframe(periods=365)
future.tail()
```

Out[24]:

In [25]:

```
# Make the prediction
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
```

Out[25]:

In [26]:

```
# Plot the model fit and forecast
fig1 = m.plot(forecast)
plt.title('Measured and Modelled Seasonal Datasets');
plt.savefig('Seasonal Model2.png')
```