Weather is something we all experience. Which is why you’ll often find weather-related data used in data analysis courses. Of course, as oceanographers, weather data is far more relevant to our research goals, but it’s also useful to start with more accessible weather or “ocean weather” related examples, as those will be more familiar with students, before diving into more niche oceanographic datasets.
That’s why I love the NDBC dataset, because it makes weather data easily accessible. This allows students to visualize data and look for patterns they are hopefully familiar with (i.e. the weather near them), while they are also learning new data analysis techniques and developing their programming skills.
Here is just one example of showing the annual cycle of Sea Surface Temperature in the Mid Atlantic at NDBC Station 44025 (my favorite station – everyone should have one ;).
With 10-years of data plotted at once, you can quickly see what the mean and variability look like over the course of the year, as well as the impact from the occasional extreme event (read: storm).
With larger datasets like NDBC, which has stations all over the world, students can compare the patterns they’ve identified and are familiar with, with patterns they may not be as familiar with. This could include weather in other areas of the world, or new processes that they are learning about, such as wind/wave correlations, sea breezes (land/air interactions), or heat capacity (air/sea temp relationships).
As part of our 2020 Virtual REU, I created the following notebook to demonstrate some basic data analysis techniques using a few years of data from NDBC Station 44025. This notebook is not comprehensive (that would require a longer course or a textbook to cover), but you can consider it a hodgepodge sampling of some of the most common techniques one might see for this kind of timeseries dataset.
Plus, what I think makes this notebook so cool, is that it demonstrates that it doesn’t take much code to make these sorts of graphs, thanks to the awesome pandas library in python.
Here are a few of the data analysis techniques I highlight in the notebook:
- Basic Statistics – Including count, mean, std, min, max and percentile calculations, as well as identifying extreme values.
- Histograms – An important first step in understanding the shape of a dataset. Along with the basic statistics, it provides a pictorial representation that is quick to interpret.
- Running Averages – Often “raw” data is too “noisy” for how you want to use it. Thankfully, you can easily use .resample() in pandas to calculate hourly, daily or monthly averages (or indeed, any interval you like) to smooth things out.
- Daily Cycle – Many processes repeat regularly over the course of a day. (Hello sunlight!) You can use .groupby() in pandas to average data by the hour of the day to see if there is a diurnal cycle. However, you need to be careful to look at both the mean and the variability about that mean to see if it’s meaningful. In addition, this cycle may differ over the course of a year, so you may also want to look at these averages by season or month or year as well.
- Annual Cycle by Month or Day – Is there an annual pattern? You can also use .groupby() to average your dataset by month or yearday. This will calculate an annual cycle for one year or many years.
- Inter-annual Variation and Variability – Understanding the daily or annual pattern is a great first step, but if you have a longer dataset, you will probably want to investigate how this pattern changes from year to year. Using more complex .groupby() commands or box plots you can analyze: How much variability is there? Is there a seasonal trend that is increasing or decreasing? Is the mean or variability dependent on the year, month, or season? This is when things start to get fun.
- Plot the Raw Data! – When calculating averages, it’s always important to plot the raw data as well as a check, so you can see the averages (and standard deviations or other percentiles about them) in context. It’s important to make sure your data makes sense, without any outliers biasing the averages.
- Wind Roses – If your data is 2-dimenstional, as wind data is, wind roses are a great way to show the directional relationship in your data. Basically they’re a 2-dimensional histogram, common in meteorology, but a bit tricky to create. Thankfully, there’s a library for that too!
- Anomaly Plots – Once you have a long-term timeseries dataset (which NDBC has at many locations) you can calculate a climatology, which is long-term average. You can then use the climatology to calculate anomalies, which are differences from the long-term mean. This is a concept that is often not as familiar with students, and so going through the calculation is a great way to develop their comfort in interpreting anomaly datasets.
This summer, we didn’t spend too much time on this notebook in our mini-research workshop. But having this set of examples handy, made it easy for students to refer to when developing their own data analysis projects. Most students incorporated one or two of the examples, depending on their goals.
I’m also happy I had a chance (excuse) to put these examples together, because as you can see, there is so much you can do with just a few lines of code. And there is so much more you can do in the wide-ranging field of timeseries data analysis, that we’ve barely scratched the surface, even after 3 notebooks.
I hope you find this notebook helpful as you develop your own activities.
Activity 3 - Data Analysis¶
2020 Data Labs REU
Written by Sage Lichtenwalner, Rutgers University, June 12, 2020
In this notebook we will demonstrate a few data analysis strategies you can use to analyze a timeseries dataset. This is by no means an exhaustive list, but hopefully enough to get you started with some common strategies.
The examples today will continue to use the mooring and shore station timeseries data available from NDBC.
# Notebook setup import xarray as xr !pip install netcdf4 import matplotlib.pyplot as plt import numpy as np # Let's make our plots pretty import seaborn as sns sns.set()
Requirement already satisfied: netcdf4 in /usr/local/lib/python3.6/dist-packages (1.5.3) Requirement already satisfied: numpy>=1.7 in /usr/local/lib/python3.6/dist-packages (from netcdf4) (1.18.5) Requirement already satisfied: cftime in /usr/local/lib/python3.6/dist-packages (from netcdf4) (1.1.3)
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm
Load a NDBC Dataset using Pandas¶
Following our previous examples, let's load some data from NDBC and convert it to a Pandas Dataframe.
# Open dataset ds = xr.open_dataset('https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025.ncml') # Subset the dataset to 1 year ds = ds.sel(time=slice('2019-01-01','2020-01-01')) # Convert to Pandas Dataframe df = ds.to_dataframe().reset_index().set_index('time')
How big is our dataset?
Once we have data loaded in Pandas, we can use
.sample() to see what that dataset looks like. You can pass a number to any of these functions to print out a specific number of rows.
This also works on specific columns.
# One column df['air_temperature'].head(3)
time 2019-01-01 00:50:00 8.1 2019-01-01 01:50:00 8.3 2019-01-01 02:50:00 8.7 Name: air_temperature, dtype: float32
# Multiple columns df[['longitude','latitude']].head(3)
Basic Bulk Statistics¶
Thanks to the magic of pandas, we can quickly calculate a bunch of statistics on our dataset using the
|mean||40.250999||-73.164001||195.684679||6.693528||8.168374||1.274927||0 days 00:00:07.330060||0 days 00:00:04.976726||162.821548||1016.819458||12.502096||13.550716||NaN||NaN||NaN|
|std||0.000000||0.000000||100.992567||3.437403||4.219831||0.736393||0 days 00:00:02.633673||0 days 00:00:00.979788||70.678750||8.206482||8.092100||6.862940||NaN||NaN||NaN|
|min||40.250999||-73.164001||1.000000||0.000000||0.000000||0.220000||0 days 00:00:02.470000||0 days 00:00:02.950000||1.000000||985.599976||-11.700000||3.100000||NaN||NaN||NaN|
|25%||40.250999||-73.164001||106.250000||4.200000||5.100000||0.760000||0 days 00:00:05.260000||0 days 00:00:04.260000||110.000000||1011.500000||6.100000||6.900000||NaN||NaN||NaN|
|50%||40.250999||-73.164001||213.000000||6.200000||7.400000||1.080000||0 days 00:00:07.139999||0 days 00:00:04.829999||152.000000||1016.900024||12.000000||13.100000||NaN||NaN||NaN|
|75%||40.250999||-73.164001||282.000000||8.700000||10.600000||1.570000||0 days 00:00:09.090000||0 days 00:00:05.559999||194.000000||1022.299988||20.100000||20.299999||NaN||NaN||NaN|
|max||40.250999||-73.164001||360.000000||20.299999||26.900000||4.600000||0 days 00:00:17.389999||0 days 00:00:09.739999||360.000000||1040.599976||27.299999||26.600000||NaN||NaN||NaN|
# For specific columns df[['air_temperature','sea_surface_temperature','wind_spd']].describe()
As with a regular Pandas dataframe, these statistics are also a dataframe that can be exported to CSV, in case we want to use it elsewhere.
Note, just because it saves a lot of digits, that doesn't mean they're significant!
# Saving statistics df[['air_temperature','sea_surface_temperature','wind_spd']].describe().to_csv()
We can also request these calculations separately.
- count() - Number of values
- std() - Standard Deviation
- quantile() - Percentile, defaults to 0.5 = 50%
# Example statistical calc df.mean()
latitude 40.251 longitude -73.164 wind_dir 195.685 wind_spd 6.69353 gust 8.16837 wave_height 1.27493 dominant_wpd 0 days 00:00:07.330060 average_wpd 0 days 00:00:04.976726 mean_wave_dir 162.822 air_pressure 1016.82 air_temperature 12.5021 sea_surface_temperature 13.5507 dewpt_temperature NaN visibility NaN water_level NaN dtype: object
# Or we could output these individually print(round(df['wind_spd'].mean(),2)) print(round(df['wind_spd'].std(),2))
# Quickplot to compare df.wind_spd.plot();
# Quickplot to compare df.wind_spd.hist(bins=25);
# Your turn - Calculate and compare the mean and std for air & water temps
We can also grab the largest or smallest values in at dataset using
time 2019-10-17 03:50:00 20.299999 2019-10-17 00:50:00 19.799999 2019-10-17 04:50:00 18.900000 2019-10-17 05:50:00 18.900000 2019-01-31 00:50:00 18.400000 Name: wind_spd, dtype: float32
fig, ax = plt.subplots() fig.set_size_inches(12, 6) df['sea_surface_temperature'].plot(ax=ax,label='Raw', linestyle='None', marker='.', markersize=1) df['sea_surface_temperature'].resample('D').mean().plot(ax=ax,label='Daily') df['sea_surface_temperature'].resample('5D').mean().plot(ax=ax,label='5 Day') df['sea_surface_temperature'].resample('MS').mean().plot(ax=ax,label='Monthly',marker='d') #MS=Month Start plt.legend();
Note that by default, pandas only provides Month Start (MS) or Month End (M) periods. That might be find for business, but in oceanography we want our graphs to have physical meaning. One option, that gets us a little closer to a centered average is to use the loffset parameter.
# Let's try this again, but add a 15-day offset to the monthly average from datetime import timedelta fig, ax = plt.subplots() fig.set_size_inches(12, 6) df['sea_surface_temperature'].plot(ax=ax,label='Raw', linestyle='None', marker='.', markersize=1) df['sea_surface_temperature'].resample('MS').mean().plot(ax=ax,label='Month Start',marker='d') #MS=Month Start df['sea_surface_temperature'].resample('MS',loffset=timedelta(days=15)).mean().plot(ax=ax,label='15th of the Month',marker='d') plt.legend();
# Your Turn - Plot daily or monthly averages (mean) of Air Temp or Wind Speed
# Your Turn - Plot daily or monthly standard deviation (std) for Air Temp or Wind Speed
Daily & Annual Variation (groupby)¶
Similarly, we can look at process that vary of the course of a day or a year by averaging all the hours of a day or days of a year together. (Days of a month probably wouldn't make much sense in oceanography.) To do this, we can use pandas.groupby.
The following example was inspired by this Seasonal Cycle and Anomaly at NDBC 44025 post.
To see some other cool examples on how groupby() can be used, check out this lesson from Ryan Abernathey.
# Load 10 years of data # Open dataset ds = xr.open_dataset('https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025.ncml') # Subset the dataset to 10 years ds = ds.sel(time=slice('2010-01-01','2020-01-01')) # Convert to Pandas Dataframe df = ds.to_dataframe().reset_index().set_index('time')
# Quickplot of temperature df['sea_surface_temperature'].plot();
We can look at this dataset a number of different ways:
- Hour of the day - Is there a diurnal cycle?
- We could also investigate how this changes seasonally, but we won't get into that here.
- Day of the week - Not really useful here
- Day of the month - Also not useful
- Month of the year - Is there an annual pattern?
- Day of the year - This is basically a higher resolution view of the monthly averages.
For ease of use, let's add some additional columns to our dataset with the hour, month and yearday. This isn't strictly needed, we could just use the references directly, e.g. df.index.hour, instead of the simpler df.hour once it's defined. But sometimes it can come in handy.
# Add column for hour of day df['hour'] = df.index.hour # And month df['month'] = df.index.month # And day of year df['yearday'] = df.index.dayofyear
# Calculate the hourly mean hourly_sst = df.sea_surface_temperature.groupby(df.hour).mean()
# Quick plot hourly_sst.plot();
That's a nice looking plot with a noticeable diurnal signal, but it's important to notice how small it is (about a half a degree). Remember, we're averaging 10-years of data here. So on average, the hour-by-hour mean is not that far off from the total dataset mean.
We can get a clearer look at this by visualizing the range of the dataset. Let's calculate the quartiles (25, 50, 75%) to see what they look like.
# Quantile Function Definitions def q1(x): return x.quantile(0.25) def q2(x): return x.median() def q3(x): return x.quantile(0.75)
# Calculate the hourly quartiles and mean hourly_sst = df.sea_surface_temperature.groupby(df.hour).agg([q1, q2, q3, np.mean])
# Quick plot hourly_sst.plot();
So, yeah, this doesn't tell us much, other than the fact there is a wide spread of the data.
This might be a useful analysis if we were looking at a shorter time period (perhaps a week or a month of data), or if we were looking at this question seasonally (say by looking at just Januaries).
# Calculate the average monthly_sst = df.sea_surface_temperature.groupby(df.month).agg([q1, q2, q3, np.mean]) # Quick Plot monthly_sst.plot();
With a long enough dataset, this approach can give us a Monthly Climatology. Here's a quick example which we could also export and save to use elsewhere.
# Monthly climatology monthly_climatology = df.groupby(df.index.month).mean() monthly_climatology
Annual Cycle by Day¶
Finally, let's repeat what we did above, but rather than taking the average for each month, we will take the average for each day, which should result in a "higher resolution" visualization.
# Calculate the average daily_sst = df.sea_surface_temperature.groupby(df.yearday).agg([q1, q2, q3, np.mean]) # Quick Plot daily_sst.plot();
# Pretty plot of data by Yearday plt.figure(figsize=(10,8)) plt.plot(df.yearday,df.sea_surface_temperature,'.',markersize=1,label='Raw Hourly Data'); plt.plot(daily_sst.index,daily_sst['mean'], linewidth=3,label='10 Year Average') plt.legend() plt.xlabel('Day of Year') plt.ylabel('Sea Surface Temperature (C)') plt.title('Seasonal Cycle of Sea Surface Temperature at NDBC 44025 from 2010-2019'); plt.savefig('NDBC_44025_Seasonal_SST.png');
Monthly Averages Plotted by Year¶
Before we use the resample function to calculate a monthly average.
But what if we want to plot all the years on top of each other to compare the annual cycles? For this, we can use a 2-level groupby.
a = df.groupby([df.index.year,df.index.month]).mean() a
106 rows × 16 columns
# A crude plot a.sea_surface_temperature.unstack(0).plot(cmap='Spectral'); plt.xlabel('Month') plt.ylabel('Average Temperature (C)') plt.title('Monthly Average Sea Surface Temperatures at 44025') plt.legend(loc='upper left');
Climatic Summary Boxplots¶
We can use the seaborne library to easily create monthly or yearly boxplots of our data.
# Monthly sns.boxplot(data = df, x='month', y='sea_surface_temperature');
# Yearly sns.boxplot(data = df, x=df.index.year, y='sea_surface_temperature');
If we wanted to do something fancier, like comparing box plots of Januaries in each year, that would require some additional tricks beyond the scope of this introductory notebook.
And now for something completely different, let's create a Wind rose plot, which is a commonly used to show wind or wave speed averages by direction. It's basically a bar chart that's been wrapped around a compass rose.
Note, by convention, winds in meteorology are usually defined as the direction the wind is blowing from.
In oceanography, direction is typically defined as the direction a current or wave is moving towards.
It's very confusing, so always check your dataset's documentation.
!pip install windrose from windrose import WindroseAxes
Requirement already satisfied: windrose in /usr/local/lib/python3.6/dist-packages (1.6.7) Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from windrose) (1.18.5) Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from windrose) (3.2.1) Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->windrose) (1.2.0) Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->windrose) (0.10.0) Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->windrose) (2.8.1) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->windrose) (2.4.7) Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from cycler>=0.10->matplotlib->windrose) (1.12.0)
/usr/local/lib/python3.6/dist-packages/windrose/windrose.py:29: MatplotlibDeprecationWarning: The Appender class was deprecated in Matplotlib 3.1 and will be removed in 3.3. addendum = docstring.Appender(msg, "\n\n") /usr/local/lib/python3.6/dist-packages/windrose/windrose.py:30: MatplotlibDeprecationWarning: The copy_dedent function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use docstring.copy() and cbook.dedent() instead. return lambda func: addendum(docstring.copy_dedent(base)(func)) /usr/local/lib/python3.6/dist-packages/windrose/windrose.py:30: MatplotlibDeprecationWarning: The dedent function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use inspect.getdoc() instead. return lambda func: addendum(docstring.copy_dedent(base)(func)) /usr/local/lib/python3.6/dist-packages/windrose/windrose.py:30: MatplotlibDeprecationWarning: The dedent function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use inspect.cleandoc instead. return lambda func: addendum(docstring.copy_dedent(base)(func))
# We need to reset the default plot style for this to work sns.set_style("white") fig = plt.figure(figsize=(6,6)) # Plot the windrose ax = WindroseAxes.from_ax(fig=fig) ax.bar(df.wind_dir, df.wind_spd, normed=True, opening=1, edgecolor='white') ax.set_legend();
We can also look at just one month to see how it changes over the course of a year. For example, compare January with July.
# Define a month to plot month = 1 fig = plt.figure(figsize=(6,6)) # Plot the windrose ax = WindroseAxes.from_ax(fig=fig) ax.bar(df.wind_dir[df.index.month==month], df.wind_spd[df.index.month==month], normed=True, opening=1, edgecolor='white') ax.set_legend();
Finally, let's show a quick example of an anomaly plot, which depicts the difference in a measured value from a long-term mean. For this example, we'll load approximately 21 years of data and calculate an average temperature for each yearday over the full record. Then we will subtract that average from each day's average to plot the anomaly.
Note, in this example, we also demonstrate how to merge two datasets together, including the archive and recent real-time datasets.
# Open datasets ds1 = xr.open_dataset('https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025.ncml') #Archive ds2 = xr.open_dataset('https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025h9999.nc') #Realtime # Subset the dataset # ds1 = ds1.sel(time=slice('2000-01-01','2020-05-01')) # Set specific date ranges # ds2 = ds2.sel(time=slice('2020-05-01','2021-01-01')) ds1 = ds1.where(ds1.time > np.datetime64('2000-01-01'), drop=True) # Grab all data since 1/1/2000 ds2 = ds2.where(ds2.time > ds1.time.max(), drop=True) # Grab all data since the end of the archive # Convert to Pandas Dataframe df1 = ds1.to_dataframe().reset_index().set_index('time') df2 = ds2.to_dataframe().reset_index().set_index('time') # Merge these datasets df = df1.combine_first(df2)
# Change plot style again sns.set_style("darkgrid") # Plot all data ds1.sea_surface_temperature.plot() ds2.sea_surface_temperature.plot();
# Anomaly claculation def standardize(x): return (x - x.mean())/x.std() # Calculate a daily average df_daily = df.resample('1D').mean() # Calculate the anomaly for each yearday anomaly = df_daily['sea_surface_temperature'].groupby(df_daily.index.dayofyear).transform(standardize)
# Plot the full record anomaly.plot(); plt.ylabel('SST Anomaly (C)');
# Plot just this year (so far) anomaly.plot(); import datetime plt.xlim(datetime.date(2020,1,1),datetime.date(2020,7,1)) plt.ylabel('SST Anomaly (C)');
Note, the code above was adapted from this example from Ryan Abernathey, which calculates the anomaly for the full record that you input into the function.
If you wanted to calculate a climatic average first (say for 1980-2010), and then use that average to calculate an anomaly from a different dataset (say from 2010-2020), you can try the approach used in this example by Sage.