The Upside and Downside of Basic Statistics

When it comes to analyzing and interpreting data, one of the first tools a scientist will reach for is a set of basic statistics. These include calculations like the mean, median, standard deviation and range, though there are certainly many others. The great thing about these descriptive statistics is that they can reduce many data points into a single number. The bad thing about them is that they reduce many data points into a single number, which can obscure features in the data that might be important.

As an example, let’s plot a year-long time-series of seawater temperature data recorded at the 30m CTD on the OOI’s Global Station Papa Flanking Mooring B.

In this plot, I’ve also included a dashed line that shows the arithmetic mean of the entire dataset. It’s technically not a timeseries line, but because it represents the mean over the full range, it’s often depicted as such.

Means and standard deviations are only one set of statistics. We can also use percentiles, the most common of which is the median, or the 50th percentile. Other common percentiles include the quartiles (the 25th and 75th percentiles) and the minimum and maximum of the data (the 0th and 100th percentiles). These are commonly shown on a box plot, developed by John Tukey at Bell Labs here in NJ in the 1970s. Here’s an example using the same dataset.

If you’re analyzing data using the pandas Python library, you can call the describe() method on a DataFrame or Series to quickly obtain a number of basic descriptive statistics. For the dataset above, the output looks like this:

count    35040.000000
mean         7.125207
std          1.416507
min          5.345440
25%          5.958201
50%          6.741027
75%          7.865782
max         13.013896
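Here is a minimal sketch of that call. The real CTD record isn't reproduced here, so the series below is a synthetic stand-in: the values, the placeholder dates, and the 15-minute sampling interval (inferred from 35,040 points in a year) are all assumptions, and the numbers it prints will not match the table above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CTD record (invented values, not OOI data).
# 35,040 samples at an assumed 15-minute interval span exactly 365 days.
rng = np.random.default_rng(0)
time = pd.date_range("2015-01-01", periods=35040, freq="15min")
phase = np.arange(time.size) / time.size
values = 7.0 - 1.5 * np.cos(2 * np.pi * phase) + rng.normal(0, 0.3, time.size)
temperature = pd.Series(values, index=time, name="temperature")

# One call returns count, mean, std, min, the quartiles and max.
summary = temperature.describe()
print(summary)
```

The result is itself a Series, so individual values can be pulled out by label, for example summary["50%"] for the median.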

In one line of code, we can get a quick summary of a given dataset, but what does this tell us?

Descriptive Statistics are Only the Beginning

If we just take a mean value by itself, it isn’t always very helpful. Descriptive statistics can provide us with some insights on a dataset, like helping us characterize the data generally, but they are often only really helpful when making comparisons, either to other datasets, other time periods, or both.

In the CTD dataset above, the mean temperature, 7.1°C, tells us that this location is generally cold (at least compared to, say, the Mid-Atlantic). But it tells us nothing about the seasonal, daily or hourly variability and nuance that we can clearly see in the original dataset. The standard deviation (1.4°C) gives us a measure of how far the temperature typically varies from the mean over the course of the dataset. But again, on its own it doesn’t tell us very much without any context. If we had other standard deviations to compare this value to (like from other CTD stations or from different time periods), we could make judgements on how much more or less the temperature at this location varies over time or space.

In my post on air/sea temperatures, you can see that we can calculate daily or monthly mean temperatures. This allows us to easily compare the two datasets with each other, since much of the day-to-day “noise” (aka variability) has been averaged out. It also helps us more clearly see how each time series evolves over the course of the year.
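Daily and monthly means like these can be sketched with pandas resampling. This is only an illustration on a synthetic series (invented values, placeholder dates, an assumed 15-minute interval), not the code behind the air/sea post.

```python
import numpy as np
import pandas as pd

# Synthetic one-year temperature series (invented values, not real data).
rng = np.random.default_rng(1)
time = pd.date_range("2015-01-01", periods=35040, freq="15min")
phase = np.arange(time.size) / time.size
values = 7.0 - 1.5 * np.cos(2 * np.pi * phase) + rng.normal(0, 0.3, time.size)
temperature = pd.Series(values, index=time, name="temperature")

# Average all samples that fall in each calendar day and each calendar month.
daily_mean = temperature.resample("D").mean()
monthly_mean = temperature.resample("MS").mean()  # "MS" labels bins by month start

# Averaging removes short-term scatter, so the spread of the daily means
# is smaller than the spread of the raw samples.
print("raw std:  ", round(temperature.std(), 3))
print("daily std:", round(daily_mean.std(), 3))
print(monthly_mean.round(2))
```

Two series resampled onto the same daily or monthly bins share an index, which makes it straightforward to plot them together or subtract one from the other.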

Basic statistical measurements can be helpful, but one has to be careful when interpreting what they mean. Ultimately, you lose fidelity when reducing hundreds or thousands of data points to a single number.

Here is another version of the plot above, only this time I’ve included the mean line, a box that shows the mean plus or minus 1 standard deviation, the median line, and a box that encloses the 25th to 75th percentile range of the data.
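The numbers behind those lines and boxes are quick to compute. The sketch below uses a synthetic series (invented values, placeholder dates, an assumed 15-minute interval) and only calculates the bounds; it is not the plotting code from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic one-year temperature series (invented values, not real data).
rng = np.random.default_rng(2)
time = pd.date_range("2015-01-01", periods=35040, freq="15min")
phase = np.arange(time.size) / time.size
values = 7.0 - 1.5 * np.cos(2 * np.pi * phase) + rng.normal(0, 0.3, time.size)
temperature = pd.Series(values, index=time, name="temperature")

# Mean line and the mean +/- 1 standard deviation box.
mean = temperature.mean()
std = temperature.std()
std_low, std_high = mean - std, mean + std

# Median line and the 25th to 75th percentile box.
q25, median, q75 = temperature.quantile([0.25, 0.50, 0.75])

# Fraction of samples that fall inside each box.
in_std_box = temperature.between(std_low, std_high).mean()
in_iqr_box = temperature.between(q25, q75).mean()

print(f"mean {mean:.2f}, box {std_low:.2f} to {std_high:.2f}, holds {in_std_box:.0%}")
print(f"median {median:.2f}, box {q25:.2f} to {q75:.2f}, holds {in_iqr_box:.0%}")
```

By construction the quartile box holds about half of the samples. In matplotlib, horizontal lines and shaded bands like these could be drawn with axhline and axhspan.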

Details Matter

As we can see, none of these statistical calculations tell us anything at all about some of the key features in this dataset.

For example, we can observe…

Temperatures are pretty consistent during the winter, but during the summer months the temperatures vary a lot.
In the first part of the dataset (late summer), the temperatures vary between 6 and 7.5°C, and then suddenly in late October they jump up to around 12°C.
After that jump, the temperatures become very consistent again (i.e. there is little variability), and the temperatures start gradually decreasing over the course of the next few months back down to around 5.5°C.
In April, the temperature and the variability both start to gradually increase again, repeating what seems to be a seasonal cycle.
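One way to put numbers on observations like these is to compute the statistics per month instead of once for the whole record. The sketch below is a hedged illustration on synthetic data: the series is built so that its scatter is larger in mid-year than at the ends, purely to mimic the kind of seasonal change in variability described above. The values, dates and 15-minute interval are assumptions, not the CTD record.

```python
import numpy as np
import pandas as pd

# Synthetic series whose scatter changes through the year (invented values).
rng = np.random.default_rng(3)
time = pd.date_range("2015-01-01", periods=35040, freq="15min")
phase = np.arange(time.size) / time.size
seasonal = 7.0 - 1.5 * np.cos(2 * np.pi * phase)       # slow seasonal cycle
noise_scale = 0.1 + 0.4 * np.sin(np.pi * phase) ** 2   # quiet at the ends, noisy mid-year
values = seasonal + rng.normal(0, 1, time.size) * noise_scale
temperature = pd.Series(values, index=time, name="temperature")

# A single number for the whole year hides the change in variability...
print("whole-year std:", round(temperature.std(), 2))

# ...while per-month statistics show when the series is calm or noisy.
monthly = temperature.resample("MS").agg(["mean", "std"])
print(monthly.round(2))
```

The per-month table is still a summary, so it complements the time-series plot rather than replacing it.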

Of course, this is a cool story. There’s so much going on in this seemingly simple dataset. In one year of temperature data from one CTD we can see a number of physical oceanographic processes at play. We have seasonal heating and cooling, a warmer summer surface layer that can mix down to the 30m level only to cool off as winter comes (at least that’s the hypothesis), and perhaps even some internal waves spicing things up. If we only looked at the descriptive statistics we wouldn’t see any of this, and in fact, those generalized statistics don’t tell us much on their own.

So, as we train future scientists, we need to make sure that they know when descriptive statistics are helpful and when they are not. If all you do is calculate some general statistics, and neglect to compare those statistics with other datasets, or worse, forget to look at the underlying data, you’re probably going to miss the really cool story hiding in your data.

If you’d like to continue playing with this dataset, you can download the Python notebook.