Introduction to Python – Data Analysis

Weather is something we all experience, which is why you’ll often find weather-related data used in data analysis courses. Of course, as oceanographers, we find weather data far more relevant to our research goals, but it’s also useful to start with more accessible weather or “ocean weather” related examples, as those will be more familiar to students, before diving into more niche oceanographic datasets.

That’s why I love the NDBC dataset, because it makes weather data easily accessible. This allows students to visualize data and look for patterns they are hopefully familiar with (i.e. the weather near them), while they are also learning new data analysis techniques and developing their programming skills.

Here is just one example, showing the annual cycle of Sea Surface Temperature in the Mid-Atlantic at NDBC Station 44025 (my favorite station – everyone should have one ;).

A graph showing the seasonal cycle of Sea Surface Temperature at NDBC Station 44025

With 10 years of data plotted at once, you can quickly see what the mean and variability look like over the course of the year, as well as the impact from the occasional extreme event (read: storm).
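To give a feel for how a plot like this comes together, here is a minimal sketch. It is not the code behind the figure above: it uses synthetic daily temperatures in place of real Station 44025 data, and the names `sst`, `cycle` and `per_year` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for 10 years of daily sea surface temperature (deg C).
# Real NDBC data would be loaded from a file or data server instead.
rng = np.random.default_rng(0)
times = pd.date_range("2010-01-01", "2019-12-31", freq="D")
yearday = times.dayofyear.to_numpy()
seasonal = 14 + 9 * np.sin(2 * np.pi * (yearday - 120) / 365.25)
sst = pd.Series(seasonal + rng.normal(0, 1.0, len(times)), index=times, name="sst")

# Pool all years by day of year to get the mean seasonal cycle
# and the spread (standard deviation) about it.
by_day = sst.groupby(sst.index.dayofyear)
cycle = pd.DataFrame({"mean": by_day.mean(), "std": by_day.std()})

# Reshape to one column per year, indexed by day of year,
# which makes it easy to overlay every year on the same axes.
per_year = (
    sst.to_frame()
    .assign(year=sst.index.year, yearday=sst.index.dayofyear)
    .pivot(index="yearday", columns="year", values="sst")
)
```

From here, something like `per_year.plot(legend=False)` followed by `cycle["mean"].plot()` on the same axes should overlay the individual years and the mean cycle, assuming matplotlib is installed.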

With larger datasets like NDBC, which has stations all over the world, students can compare the patterns they’ve identified and already know with patterns that are less familiar. This could include weather in other areas of the world, or new processes that they are learning about, such as wind/wave correlations, sea breezes (land/air interactions), or heat capacity (air/sea temp relationships).

As part of our 2020 Virtual REU, I created the following notebook to demonstrate some basic data analysis techniques using a few years of data from NDBC Station 44025. This notebook is not comprehensive (that would require a longer course or a textbook to cover), but you can consider it a hodgepodge sampling of some of the most common techniques one might see for this kind of timeseries dataset.

Plus, what I think makes this notebook so cool is that it demonstrates how little code it takes to make these sorts of graphs, thanks to the awesome pandas library in Python.

Here are a few of the data analysis techniques I highlight in the notebook:

  • Basic Statistics – Including count, mean, std, min, max and percentile calculations, as well as identifying extreme values.
  • Histograms – An important first step in understanding the shape of a dataset. Along with the basic statistics, it provides a pictorial representation that is quick to interpret.
  • Running Averages – Often “raw” data is too “noisy” for how you want to use it. Thankfully, you can easily use .resample() in pandas to calculate hourly, daily or monthly averages (or indeed, any interval you like) to smooth things out. (Strictly speaking these are binned averages; for a moving-window average, pandas also has .rolling().)
  • Daily Cycle – Many processes repeat regularly over the course of a day. (Hello sunlight!) You can use .groupby() in pandas to average data by the hour of the day to see if there is a diurnal cycle. However, you need to be careful to look at both the mean and the variability about that mean to see if it’s meaningful. In addition, this cycle may differ over the course of a year, so you may also want to look at these averages by season or month or year as well.
  • Annual Cycle by Month or Day – Is there an annual pattern? You can also use .groupby() to average your dataset by month or yearday. This will calculate an annual cycle for one year or many years.
  • Inter-annual Variation and Variability – Understanding the daily or annual pattern is a great first step, but if you have a longer dataset, you will probably want to investigate how this pattern changes from year to year. Using more complex .groupby() commands or box plots you can analyze: How much variability is there? Is there a seasonal trend that is increasing or decreasing? Is the mean or variability dependent on the year, month, or season? This is when things start to get fun.
  • Plot the Raw Data! – When calculating averages, it’s always important to plot the raw data as well, as a check, so you can see the averages (and standard deviations or other percentiles about them) in context. It’s important to make sure your data makes sense, without any outliers biasing the averages.
  • Wind Roses – If your data is 2-dimensional, as wind data is, wind roses are a great way to show the directional relationship in your data. Basically they’re a 2-dimensional histogram, common in meteorology, but a bit tricky to create. Thankfully, there’s a library for that too!
  • Anomaly Plots – Once you have a long-term timeseries dataset (which NDBC has at many locations) you can calculate a climatology, which is a long-term average. You can then use the climatology to calculate anomalies, which are differences from the long-term mean. This is a concept that is often not as familiar to students, and so going through the calculation is a great way to develop their comfort in interpreting anomaly datasets.
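
Several of the techniques above can be sketched in a few lines of pandas. This is not the notebook’s actual code, just a rough illustration under some assumptions: the data is synthetic hourly air temperature rather than real NDBC observations, and the names `air_temp`, `diurnal`, `climatology` and `anomaly` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three years of hourly air temperature (deg C),
# built from an annual cycle, a daily cycle and random noise.
rng = np.random.default_rng(1)
times = pd.date_range("2017-01-01", "2019-12-31 23:00", freq="h")
hours = times.hour.to_numpy()
yearday = times.dayofyear.to_numpy()
signal = (
    12
    + 10 * np.sin(2 * np.pi * (yearday - 110) / 365.25)  # annual cycle
    + 2 * np.sin(2 * np.pi * (hours - 9) / 24)           # daily cycle
)
temp = pd.Series(signal + rng.normal(0, 1.5, len(times)), index=times, name="air_temp")

# Basic statistics: count, mean, std, min, max and quartiles in one call.
stats = temp.describe()

# Histogram: bin counts and bin edges (temp.plot.hist() would draw it).
counts, edges = np.histogram(temp, bins=20)

# Averaging with resample: daily and monthly means of the hourly data.
daily = temp.resample("D").mean()
monthly = temp.resample("MS").mean()  # "MS" labels each bin by month start

# Daily cycle: mean and spread for each hour of the day.
diurnal = temp.groupby(temp.index.hour).agg(["mean", "std"])

# Annual cycle by month: the average of each calendar month across all years.
climatology = monthly.groupby(monthly.index.month).mean()

# Anomalies: each monthly mean minus the long-term mean for that calendar month.
anomaly = monthly - monthly.groupby(monthly.index.month).transform("mean")
```

Three years is far too short for a real climatology, of course; the point is only to show the shape of the calculation. With a longer record, the same `groupby` pattern applies unchanged.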

This summer, we didn’t spend too much time on this notebook in our mini-research workshop. But having this set of examples handy made it easy for students to refer to them when developing their own data analysis projects. Most students incorporated one or two of the examples, depending on their goals.

I’m also happy I had a chance (excuse) to put these examples together because, as you can see, there is so much you can do with just a few lines of code. And there is so much more you can do in the wide-ranging field of timeseries data analysis; we’ve barely scratched the surface, even after 3 notebooks.

I hope you find this notebook helpful as you develop your own activities.

This post is part of our 2020 Summer REU Intro to Python series. See also Part 1, Part 2, and Part 4.