Scattering Data to See Correlations

The most popular data visualizations in oceanography are probably timeseries plots and maps.  But I suspect a strong third is the scatterplot. While a timeseries plot can show how a variable changes in time, and maps can show variation in space, a scatterplot clearly visualizes how two variables vary with each other.

In an earlier post, we saw that air and sea temperatures seemed to be well correlated over the course of a year. In the timeseries plot, the two lines for air and sea temps moved up and down in tandem, though there was a lot more variability (you could call it squiggliness) in the air temperature measurements. But that apparent correlation was really just my observation based on the graph, and my years of experience reading timeseries plots. If I were a novice observer, as many of our students are, I might not have recognized the correlation, or I might have assumed the variables were more correlated than they really are. To help with this, we can turn to the scatterplot (and some math).

The graph above shows the correlation between the hourly air and seawater temperatures over the course of 2018. As we may have inferred from the timeseries plot, the two temperature measurements are somewhat linearly correlated: when the seawater temps are high, so are the air temps, and vice versa.

The dotted line on the chart represents what we would see if the air and water temps were equal each hour. Of course, this is not the case. Air temperatures can sometimes be warmer than the ocean, but more often they are actually cooler. And we can also see that when ocean waters are cooler, there is a lot more variability in the air temperatures, visible as a wider spread of points.

To keep this graph simple, I did not include a regression line. But if you are familiar with these graphs, you might expect the regression line to sit a bit lower than the dotted line, with a slightly larger slope and a y-axis crossing a few degrees below zero. (You can grab the notebook below to try it for yourself.) The data has an R² value of 0.77, which means that about 77% of the variance in one temperature can be explained by a linear relationship with the other. That’s a lot, but it’s not a perfect fit.
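If you want to see how a regression line and R² value like these are calculated, here is a minimal sketch. It does not use the real 2018 measurements: the `sea_temp` and `air_temp` arrays are synthetic stand-ins I made up for illustration, and I am assuming seawater temperature goes on the x-axis and air temperature on the y-axis. The numbers it prints will not match the 0.77 quoted above.

```python
import numpy as np

# Synthetic stand-in for one year of hourly data (NOT the real 2018 measurements).
rng = np.random.default_rng(0)
hours = np.arange(365 * 24)
day_of_year = hours / 24.0

# Hypothetical seasonal cycle: seawater temperature peaks around late July.
sea_temp = 14.0 + 8.0 * np.sin(2 * np.pi * (day_of_year - 120) / 365)
# Hypothetical air temperature: follows the sea, runs cooler, and is noisier.
air_temp = 1.1 * sea_temp - 3.0 + rng.normal(0.0, 3.5, size=hours.size)


def linear_fit(x, y):
    """Return slope, intercept and R^2 of the least-squares line y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    return slope, intercept, r ** 2


slope, intercept, r_squared = linear_fit(sea_temp, air_temp)
print(f"air = {slope:.2f} * sea + {intercept:.2f}, R^2 = {r_squared:.2f}")
```

For a straight-line fit with a single predictor, R² is simply the square of the correlation coefficient, which is why `np.corrcoef` is enough here.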

For fun, and to help the viewer dive a little deeper, I added a third dimension to the chart, using color to show the time of each measurement. It shouldn’t be a surprise here, but the coldest air and seawater temperatures are seen in January and December, with the warmest occurring in July and August.
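A chart in this style, with a dotted 1:1 line and points colored by time, could be sketched roughly as follows. This is not the code behind the figure above. It uses made-up synthetic data, the function name `scatter_with_reference` is my own, and the colormap and marker settings are just one reasonable choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for one year of hourly data (NOT the real 2018 measurements).
rng = np.random.default_rng(1)
day_of_year = np.arange(365 * 24) / 24.0
sea_temp = 14.0 + 8.0 * np.sin(2 * np.pi * (day_of_year - 120) / 365)
air_temp = 1.1 * sea_temp - 3.0 + rng.normal(0.0, 3.5, size=day_of_year.size)


def scatter_with_reference(x, y, color_values):
    """Scatter y against x, colored by color_values, with a dotted 1:1 line."""
    fig, ax = plt.subplots(figsize=(6, 6))
    # A cyclic colormap gives late December and early January similar colors.
    points = ax.scatter(x, y, c=color_values, cmap="twilight", s=4, alpha=0.6)
    # The 1:1 line shows where points would fall if both temperatures were equal.
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    ax.plot([lo, hi], [lo, hi], linestyle=":", color="black", label="1:1 line")
    ax.set_xlabel("Seawater temperature (°C)")
    ax.set_ylabel("Air temperature (°C)")
    ax.set_aspect("equal")  # same scale on both axes, so the 1:1 line sits at 45 degrees
    fig.colorbar(points, ax=ax, label="Day of year")
    ax.legend()
    return fig, ax


fig, ax = scatter_with_reference(sea_temp, air_temp, day_of_year)
# In a script, call plt.show() to display the figure; a notebook shows it automatically.
```

Keeping both axes on the same scale matters here: otherwise the 1:1 line no longer sits at 45 degrees, and it gets harder to judge at a glance how far the points fall from it.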

This might seem obvious, especially if you’ve already seen the corresponding timeseries plot or are very familiar with the data or oceanographic process being represented.  But we need to remember that many viewers of our graphs may not have the same knowledge or experience as we do.  More than likely they won’t, because you’re the expert on the dataset you’re visualizing.

While I might have a good understanding of how temperature should vary and correlate over the course of a year, I might not know how dissolved oxygen, chlorophyll or pCO2 would typically behave.  More importantly, if I wanted to understand how a dataset behaves differently from the norm, a norm I might not even know, I would need to see both the result one might typically expect (like the dotted line in the chart above) and the actual result all on the same graph for context. 

So it’s important to remember that when we visualize our data, we need to show everything we want others to understand about the dataset. And if we want to show whether two datasets are correlated, a scatterplot is a good place to start.

If you’d like to continue playing with this dataset, you can download the Python notebook.