Introduction to Python – Data Analysis
Weather is something we all experience, which is why you’ll often find weather-related data used in data analysis courses. Of course, as oceanographers, we find weather data far more relevant to our research goals, but it’s also useful to start with more accessible weather or “ocean weather” examples, as those will be more familiar to students, before diving into more niche oceanographic datasets.
That’s one reason I love the NDBC dataset: it makes weather data easily accessible. This allows students to visualize data and look for patterns they are hopefully familiar with (i.e. the weather near them), while they are also learning new data analysis techniques and developing their programming skills.
Here is just one example, showing the annual cycle of Sea Surface Temperature in the Mid-Atlantic at NDBC Station 44025 (my favorite station – everyone should have one ;).
With 10 years of data plotted at once, you can quickly see what the mean and variability look like over the course of the year, as well as the impact of the occasional extreme event (read: storm).
With larger datasets like NDBC, which has stations all over the world, students can compare the patterns they’ve identified and are familiar with against patterns they may not know as well. These could include weather in other parts of the world, or new processes they are learning about, such as wind/wave correlations, sea breezes (land/air interactions), or heat capacity (air/sea temperature relationships).
As part of our 2020 Virtual REU, I created the following notebook to demonstrate some basic data analysis techniques using a few years of data from NDBC Station 44025. This notebook is not comprehensive (covering everything would require a longer course or a textbook), but you can consider it a hodgepodge sampling of some of the most common techniques one might see for this kind of timeseries dataset.
Plus, what I think makes this notebook so cool is that it demonstrates how little code it takes to make these sorts of graphs, thanks to the awesome pandas library in Python.
Here are a few of the data analysis techniques I highlight in the notebook:
- Basic Statistics – Including count, mean, std, min, max and percentile calculations, as well as identifying extreme values.
- Histograms – An important first step in understanding the shape of a dataset. Along with the basic statistics, it provides a pictorial representation that is quick to interpret.
- Running Averages – Often “raw” data is too “noisy” for how you want to use it. Thankfully, you can easily use .resample() in pandas to calculate hourly, daily or monthly averages (or indeed, any interval you like) to smooth things out.
- Daily Cycle – Many processes repeat regularly over the course of a day. (Hello sunlight!) You can use .groupby() in pandas to average data by the hour of the day to see if there is a diurnal cycle. However, you need to be careful to look at both the mean and the variability about that mean to see if it’s meaningful. In addition, this cycle may differ over the course of a year, so you may also want to look at these averages by season or month or year as well.
- Annual Cycle by Month or Day – Is there an annual pattern? You can also use .groupby() to average your dataset by month or yearday. This will calculate an annual cycle for one year or many years.
- Inter-annual Variation and Variability – Understanding the daily or annual pattern is a great first step, but if you have a longer dataset, you will probably want to investigate how this pattern changes from year to year. Using more complex .groupby() commands or box plots you can analyze: How much variability is there? Is there a seasonal trend that is increasing or decreasing? Is the mean or variability dependent on the year, month, or season? This is when things start to get fun.
- Plot the Raw Data! – When calculating averages, it’s always important to plot the raw data as well as a check, so you can see the averages (and standard deviations or other percentiles about them) in context. It’s important to make sure your data makes sense, without any outliers biasing the averages.
- Wind Roses – If your data is 2-dimensional, as wind data is, wind roses are a great way to show the directional relationships in your data. Basically, they’re a 2-dimensional histogram, common in meteorology, but a bit tricky to create. Thankfully, there’s a library for that too!
- Anomaly Plots – Once you have a long-term timeseries dataset (which NDBC has at many locations) you can calculate a climatology, which is a long-term average. You can then use the climatology to calculate anomalies, which are differences from the long-term mean. This concept is often less familiar to students, so going through the calculation is a great way to develop their comfort in interpreting anomaly datasets.
This summer, we didn’t spend too much time on this notebook in our mini-research workshop, but having this set of examples handy made it easy for students to refer to when developing their own data analysis projects. Most students incorporated one or two of the examples, depending on their goals.
I’m also happy I had a chance (excuse) to put these examples together, because as you can see, there is so much you can do with just a few lines of code. And there is so much more to explore in the wide-ranging field of timeseries data analysis; we’ve barely scratched the surface, even after 3 notebooks.
I hope you find this notebook helpful as you develop your own activities.
This post is part of our 2020 Summer REU Intro to Python series. See also Part 1, Part 2, and Part 4.
Activity 3 - Data Analysis
2020 Data Labs REU
Written by Sage Lichtenwalner, Rutgers University, June 12, 2020
In this notebook we will demonstrate a few data analysis strategies you can use to analyze a timeseries dataset. This is by no means an exhaustive list, but hopefully enough to get you started with some common strategies.
The examples today will continue to use the mooring and shore station timeseries data available from NDBC.
# Notebook setup
!pip install netcdf4
import xarray as xr
import matplotlib.pyplot as plt
import numpy as np

# Let's make our plots pretty
import seaborn as sns
sns.set()
Load an NDBC Dataset using Pandas
Following our previous examples, let's load some data from NDBC and convert it to a Pandas Dataframe.
# Open dataset
ds = xr.open_dataset('https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025.ncml')
# Subset the dataset to 1 year
ds = ds.sel(time=slice('2019-01-01','2020-01-01'))
# Convert to Pandas Dataframe
df = ds.to_dataframe().reset_index().set_index('time')
How big is our dataset?
df.shape
Once we have data loaded in Pandas, we can use .head(), .tail(), and .sample() to see what the dataset looks like. You can pass a number to any of these functions to print out a specific number of rows.
df.head(3)
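The .sample() function is similar, but returns random rows from anywhere in the dataset, which can be a nice sanity check. For example:
# Grab a few random rows
df.sample(3)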
This also works on specific columns.
# One column
df['air_temperature'].head(3)
# Multiple columns
df[['longitude','latitude']].head(3)
Basic Bulk Statistics
Thanks to the magic of pandas, we can quickly calculate a bunch of statistics on our dataset using the .describe() function.
df.describe()
# For specific columns
df[['air_temperature','sea_surface_temperature','wind_spd']].describe()
As with a regular Pandas dataframe, these statistics are also a dataframe that can be exported to CSV, in case we want to use it elsewhere.
Note, just because it saves a lot of digits doesn't mean they're significant!
# Saving statistics
df[['air_temperature','sea_surface_temperature','wind_spd']].describe().to_csv()
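If you want to limit the digits, one option is to round the results before exporting. Here's a quick sketch (the filename is just a placeholder):
# Round to 2 decimal places before saving (example filename)
df[['air_temperature','sea_surface_temperature','wind_spd']].describe().round(2).to_csv('ndbc_stats.csv')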
Individual statistics
We can also request these calculations separately.
- count() - Number of values
- mean()
- std() - Standard Deviation
- min()
- max()
- quantile() - Percentile, defaults to 0.5 = 50%
# Example statistical calc
df.mean()
# Or we could output these individually
print(round(df['wind_spd'].mean(),2))
print(round(df['wind_spd'].std(),2))
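Quantiles work the same way. Here's a quick sketch of the median and 90th-percentile wind speeds:
# quantile() defaults to 0.5 (the median)
print(round(df['wind_spd'].quantile(0.5),2))
print(round(df['wind_spd'].quantile(0.9),2))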
# Quickplot to compare
df.wind_spd.plot();
# Quickplot to compare
df.wind_spd.hist(bins=25);
# Your turn - Calculate and compare the mean and std for air & water temps
Finding Extremes
We can also grab the largest or smallest values in a dataset using nlargest() or nsmallest().
df.wind_spd.nlargest()
Running Averages (resample)
We can also easily calculate hourly, daily and monthly averages using pandas.resample. A number of offset options are available for specifying the interval.
Here's a quick example that shows a few different intervals.
fig, ax = plt.subplots()
fig.set_size_inches(12, 6)
df['sea_surface_temperature'].plot(ax=ax,label='Raw', linestyle='None', marker='.', markersize=1)
df['sea_surface_temperature'].resample('D').mean().plot(ax=ax,label='Daily')
df['sea_surface_temperature'].resample('5D').mean().plot(ax=ax,label='5 Day')
df['sea_surface_temperature'].resample('MS').mean().plot(ax=ax,label='Monthly',marker='d') #MS=Month Start
plt.legend();
Note that by default, pandas only provides Month Start (MS) or Month End (M) periods. That might be fine for business, but in oceanography we want our graphs to have physical meaning. One option that gets us a little closer to a centered average is to use the loffset parameter.
# Let's try this again, but add a 15-day offset to the monthly average
from datetime import timedelta
fig, ax = plt.subplots()
fig.set_size_inches(12, 6)
df['sea_surface_temperature'].plot(ax=ax,label='Raw', linestyle='None', marker='.', markersize=1)
df['sea_surface_temperature'].resample('MS').mean().plot(ax=ax,label='Month Start',marker='d') #MS=Month Start
df['sea_surface_temperature'].resample('MS',loffset=timedelta(days=15)).mean().plot(ax=ax,label='15th of the Month',marker='d')
plt.legend();
# Your Turn - Plot daily or monthly averages (mean) of Air Temp or Wind Speed
# Your Turn - Plot daily or monthly standard deviation (std) for Air Temp or Wind Speed
Daily & Annual Variation (groupby)¶
Similarly, we can look at process that vary of the course of a day or a year by averaging all the hours of a day or days of a year together. (Days of a month probably wouldn't make much sense in oceanography.) To do this, we can use pandas.groupby.
The following example was inspired by this Seasonal Cycle and Anomaly at NDBC 44025 post.
To see some other cool examples on how groupby() can be used, check out this lesson from Ryan Abernathey.
# Load 10 years of data
# Open dataset
ds = xr.open_dataset('https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025.ncml')
# Subset the dataset to 10 years
ds = ds.sel(time=slice('2010-01-01','2020-01-01'))
# Convert to Pandas Dataframe
df = ds.to_dataframe().reset_index().set_index('time')
# Quickplot of temperature
df['sea_surface_temperature'].plot();
We can look at this dataset a number of different ways:
- Hour of the day - Is there a diurnal cycle?
- We could also investigate how this changes seasonally, but we won't get into that here.
- Day of the week - Not really useful here
- Day of the month - Also not useful
- Month of the year - Is there an annual pattern?
- Day of the year - This is basically a higher resolution view of the monthly averages.
For ease of use, let's add some additional columns to our dataset with the hour, month and yearday. This isn't strictly needed; we could just use the references directly, e.g. df.index.hour, instead of the simpler df.hour once it's defined. But sometimes it can come in handy.
# Add column for hour of day
df['hour'] = df.index.hour
# And month
df['month'] = df.index.month
# And day of year
df['yearday'] = df.index.dayofyear
Daily Cycle
# Calculate the hourly mean
hourly_sst = df.sea_surface_temperature.groupby(df.hour).mean()
# Quick plot
hourly_sst.plot();
That's a nice looking plot with a noticeable diurnal signal, but it's important to notice how small it is (about half a degree). Remember, we're averaging 10 years of data here. So on average, the hour-by-hour mean is not that far off from the total dataset mean.
We can get a clearer look at this by visualizing the range of the dataset. Let's calculate the quartiles (25, 50, 75%) to see what they look like.
# Quantile Function Definitions
def q1(x):
    return x.quantile(0.25)

def q2(x):
    return x.median()

def q3(x):
    return x.quantile(0.75)
# Calculate the hourly quartiles and mean
hourly_sst = df.sea_surface_temperature.groupby(df.hour).agg([q1, q2, q3, np.mean])
# Quick plot
hourly_sst.plot();
So, yeah, this doesn't tell us much, other than the fact that there is a wide spread in the data.
This might be a useful analysis if we were looking at a shorter time period (perhaps a week or a month of data), or if we were looking at this question seasonally (say by looking at just Januaries).
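For example, here's a minimal sketch of the same hourly breakdown restricted to Januaries only, reusing the quantile functions defined above:
# Hourly quartiles and mean for January data only
jan = df[df.month==1]
jan_hourly = jan.sea_surface_temperature.groupby(jan.hour).agg([q1, q2, q3, np.mean])
jan_hourly.plot();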
Annual Cycle by Month
# Calculate the average
monthly_sst = df.sea_surface_temperature.groupby(df.month).agg([q1, q2, q3, np.mean])
# Quick Plot
monthly_sst.plot();
With a long enough dataset, this approach can give us a Monthly Climatology. Here's a quick example, which we could also export and save to use elsewhere.
# Monthly climatology
monthly_climatology = df.groupby(df.index.month).mean()
monthly_climatology
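To reuse this climatology elsewhere, we can export it like any other dataframe (the filename here is just an example):
# Save the monthly climatology to a CSV file
monthly_climatology.to_csv('NDBC_44025_monthly_climatology.csv')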
Annual Cycle by Day
Finally, let's repeat what we did above, but rather than taking the average for each month, we will take the average for each day, which should result in a "higher resolution" visualization.
# Calculate the average
daily_sst = df.sea_surface_temperature.groupby(df.yearday).agg([q1, q2, q3, np.mean])
# Quick Plot
daily_sst.plot();
# Pretty plot of data by Yearday
plt.figure(figsize=(10,8))
plt.plot(df.yearday,df.sea_surface_temperature,'.',markersize=1,label='Raw Hourly Data');
plt.plot(daily_sst.index,daily_sst['mean'], linewidth=3,label='10 Year Average')
plt.legend()
plt.xlabel('Day of Year')
plt.ylabel('Sea Surface Temperature (C)')
plt.title('Seasonal Cycle of Sea Surface Temperature at NDBC 44025 from 2010-2019');
plt.savefig('NDBC_44025_Seasonal_SST.png');
Monthly Averages Plotted by Year
As before, we can use the resample function to calculate a monthly average.
df['sea_surface_temperature'].resample('MS').mean().plot();
But what if we want to plot all the years on top of each other to compare the annual cycles? For this, we can use a 2-level groupby.
a = df.groupby([df.index.year,df.index.month]).mean()
a
# A crude plot
a.sea_surface_temperature.unstack(0).plot(cmap='Spectral');
plt.xlabel('Month')
plt.ylabel('Average Temperature (C)')
plt.title('Monthly Average Sea Surface Temperatures at 44025')
plt.legend(loc='upper left');
Climatic Summary Boxplots
We can use the seaborn library to easily create monthly or yearly boxplots of our data.
# Monthly
sns.boxplot(data = df, x='month', y='sea_surface_temperature');
# Yearly
sns.boxplot(data = df, x=df.index.year, y='sea_surface_temperature');
If we wanted to do something fancier, like comparing box plots of Januaries in each year, that would require some additional tricks beyond the scope of this introductory notebook.
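That said, if you want a head start, here's one possible sketch: subset the dataframe to Januaries first, then box-plot by year.
# Box plots of January sea surface temperatures, grouped by year
januaries = df[df.month==1]
sns.boxplot(x=januaries.index.year, y=januaries.sea_surface_temperature);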
Wind Rose
And now for something completely different, let's create a wind rose plot, which is commonly used to show wind or wave speed averages by direction. It's basically a bar chart that's been wrapped around a compass rose.
Note, by convention, winds in meteorology are usually defined as the direction the wind is blowing from.
In oceanography, direction is typically defined as the direction a current or wave is moving towards.
It's very confusing, so always check your dataset's documentation.
!pip install windrose
from windrose import WindroseAxes
# We need to reset the default plot style for this to work
sns.set_style("white")
fig = plt.figure(figsize=(6,6))
# Plot the windrose
ax = WindroseAxes.from_ax(fig=fig)
ax.bar(df.wind_dir, df.wind_spd, normed=True, opening=1, edgecolor='white')
ax.set_legend();
We can also look at just one month to see how it changes over the course of a year. For example, compare January with July.
# Define a month to plot
month = 1
fig = plt.figure(figsize=(6,6))
# Plot the windrose
ax = WindroseAxes.from_ax(fig=fig)
ax.bar(df.wind_dir[df.index.month==month], df.wind_spd[df.index.month==month], normed=True, opening=1, edgecolor='white')
ax.set_legend();
Anomaly
Finally, let's show a quick example of an anomaly plot, which depicts the difference in a measured value from a long-term mean. For this example, we'll load approximately 21 years of data and calculate an average temperature for each yearday over the full record. Then we will subtract that average from each day's average to plot the anomaly.
Note, in this example, we also demonstrate how to merge two datasets together, including the archive and recent real-time datasets.
# Open datasets
ds1 = xr.open_dataset('https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025.ncml') #Archive
ds2 = xr.open_dataset('https://dods.ndbc.noaa.gov/thredds/dodsC/data/stdmet/44025/44025h9999.nc') #Realtime
# Subset the dataset
# ds1 = ds1.sel(time=slice('2000-01-01','2020-05-01')) # Set specific date ranges
# ds2 = ds2.sel(time=slice('2020-05-01','2021-01-01'))
ds1 = ds1.where(ds1.time > np.datetime64('2000-01-01'), drop=True) # Grab all data since 1/1/2000
ds2 = ds2.where(ds2.time > ds1.time.max(), drop=True) # Grab all data since the end of the archive
# Convert to Pandas Dataframe
df1 = ds1.to_dataframe().reset_index().set_index('time')
df2 = ds2.to_dataframe().reset_index().set_index('time')
# Merge these datasets
df = df1.combine_first(df2)
# Change plot style again
sns.set_style("darkgrid")
# Plot all data
ds1.sea_surface_temperature.plot()
ds2.sea_surface_temperature.plot();
# Anomaly calculation
# Note: this returns a standardized anomaly (the difference from the mean, divided by the standard deviation)
def standardize(x):
    return (x - x.mean())/x.std()
# Calculate a daily average
df_daily = df.resample('1D').mean()
# Calculate the anomaly for each yearday
anomaly = df_daily['sea_surface_temperature'].groupby(df_daily.index.dayofyear).transform(standardize)
# Plot the full record
anomaly.plot();
plt.ylabel('SST Anomaly (C)');
# Plot just this year (so far)
anomaly.plot();
import datetime
plt.xlim(datetime.date(2020,1,1),datetime.date(2020,7,1))
plt.ylabel('SST Anomaly (C)');
Note, the code above was adapted from this example from Ryan Abernathey, which calculates the anomaly for the full record that you input into the function.
If you wanted to calculate a climatic average first (say for 1980-2010), and then use that average to calculate an anomaly from a different dataset (say from 2010-2020), you can try the approach used in this example by Sage.
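For reference, here's a rough sketch of that two-step approach, using an arbitrarily chosen baseline period for illustration:
# Calculate a yearday climatology from a baseline period (2000-2009, as an example)
baseline = df_daily['sea_surface_temperature']['2000':'2009']
climatology = baseline.groupby(baseline.index.dayofyear).mean()
# Subtract that climatology from a later period's daily averages
recent = df_daily['sea_surface_temperature']['2010':]
anomaly2 = recent - climatology.reindex(recent.index.dayofyear).values
anomaly2.plot();
plt.ylabel('SST Anomaly (C)');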