Introduction to Python – Part 2
Teaching students how to visualize ocean data is a challenge. But before you get into cognitive theory, choosing colors, or the principles of (good) visualization design, you really just need to get your students’ feet wet plotting some data.
This summer, as part of our Virtual REU 2-week mini-workshop, we challenged students to work in groups to find interesting stories in the NDBC dataset. But before setting them off on their mini-research projects, we spent three sessions introducing students to python, NDBC data, and basic plotting.
As described in Part 1, in our first session we introduced students to python and the Google Colab environment. In the second session, students worked in groups to practice these basics while making quick plots of the NDBC buoy data. Finally, in our third session, we demonstrated some python plotting basics using the following notebook. If you’re curious, here are a few slides I also used as a 30-second introduction to Data Vis.
Data Visualization Basics
The notebook below features some, but by no means all, of the ways you can create and customize plots in python, focusing primarily on timeseries plots.
The notebook briefly touches on customizing marks and lines, changing axes labels and limits, adding legends, and using subplots. Almost as an afterthought, I included a brief mention of scatterplots, histograms, and box plots, because those plots are so easy to make in python with just one line of code – assuming you know what you’re doing. I felt it was good to introduce students to a few additional plot types that are relevant to time series datasets, while also demonstrating the power of the tool to encourage students to explore more in the future.
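To give a flavor of those basics, here is a minimal sketch (not the notebook itself) that customizes lines and markers, sets axis labels and limits, adds legends, and uses subplots. The data is synthetic, and the column names sea_temp and wind_speed are made up for illustration; they are not real NDBC variable names.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab or Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for hourly buoy data (assumed values, NOT real NDBC data)
rng = np.random.default_rng(0)
hours = np.arange(240)
time = pd.Timestamp("2020-07-01") + pd.to_timedelta(hours, unit="h")
df = pd.DataFrame(
    {
        "sea_temp": 20 + 2 * np.sin(hours * 2 * np.pi / 24) + rng.normal(0, 0.3, 240),
        "wind_speed": np.abs(rng.normal(6, 2, 240)),
    },
    index=time,
)

# Two stacked subplots that share the same time axis
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))

# Top panel: a solid line with a custom color, y-limits, and a legend
ax1.plot(df.index, df["sea_temp"], color="tab:red", linestyle="-",
         linewidth=1, label="Sea temperature")
ax1.set_ylabel("Sea temperature (deg C)")
ax1.set_ylim(15, 25)
ax1.legend(loc="upper right")

# Bottom panel: markers only, no connecting line
ax2.plot(df.index, df["wind_speed"], color="tab:blue", linestyle="",
         marker=".", label="Wind speed")
ax2.set_ylabel("Wind speed (m/s)")
ax2.set_xlabel("Date")
ax2.legend(loc="upper right")

fig.autofmt_xdate()  # rotate the date labels so they don't overlap
fig.savefig("timeseries_example.png")
```

In a notebook you would typically end with plt.show() instead of saving the figure to a file.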
It turns out, even simple isn’t that simple. As this was my first time teaching this skill to novice learners, I discovered quite a few hangups students can run into. Here are just a few:
- xarray vs. pandas – Both of these libraries are wonderful. While pandas is designed for “tabular” datasets, like those you find in Excel, xarray is a bit more complicated, as it is designed to support multi-dimensional datasets. If your data is simple, like a CSV file with a timeseries or discrete data points, pandas is all you need. It has a far simpler data model and is a bit more intuitive for data analysis or visualization. Unfortunately, the NDBC dataset is served over THREDDS, which in practice meant using the xarray library to access it. You’ll see in the notebook that I jumped through some hoops to convert the xarray Dataset to a pandas DataFrame, in the hope that it might make the rest of the notebook easier to follow. But this was definitely not easy to get across to students at the very beginning. If you can, I’d suggest sticking with pandas-friendly datasets to start.
- Internal plotting functions – Both xarray and pandas have internal plotting methods that allow you to quickly create plots without having to call matplotlib explicitly, e.g. with plt.plot(). These methods are also pretty “smart” in that they will label and title your graphs using information in your data, such as column names and units if available. That’s a great feature if you know how the black boxes work, but it’s yet another confusing point for students. For example, the syntax for plt.plot(), xr.Dataset.plot() and pd.DataFrame.plot() can differ, depending on what you’re trying to do. While I love the internal methods, in the future, for a basic intro to plotting, I think I’ll start with matplotlib and have students create graphs/axes/labels manually, allowing them to discover the benefits of the internal methods later.
- plot vs. scatterplot – This one is on me. For some reason, in this notebook, when I wanted to create a scatter plot, I used plt.plot(x, y, linestyle='', marker='.') instead of just calling plt.scatter(). While it’s good to remind students that there are often many ways to solve a problem, especially in python, it’s also good to keep things simple to start.
- datetime – Dates and times are the bane of every programmer. Python has a lot of great tools for working with dates (some might argue too many), but you still have to figure them out. Sometimes that beautiful-looking date-time array in your DataFrame isn’t actually datetime.datetime() friendly, but rather just an array of strings. To a human they look the same, but to the computer they’re totally different. And that takes some time to learn. That said, I have no idea how to teach this quickly or well, other than suggesting you make sure your datasets and examples are clear.
- NDBC Data – Finally, just a note on NDBC… I absolutely love this dataset, because the variables are familiar to students, it is (relatively) easy to access, and it has global coverage, allowing students to explore a variety of questions. However, as with every other data portal out there, it’s easy to get confused about which datasets are actually available. On the homepage, NDBC lists over 1,400 stations, but only about 150 are meteorological buoys in the ocean. We had quite a few students interested in data from estuarine stations, tide gauges, and the TAO array. Sadly, not all of that data was as easy to access as the blue water buoys.
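A few of these hangups are easier to see in code. The sketch below is only an illustration, not something taken from the notebook: the values are made up, and the column names (time, wind_speed, wave_height) are hypothetical. It shows a date column that is really just strings being converted with pd.to_datetime(), and then the same kind of data plotted three ways: with explicit matplotlib, with the pandas internal method, and with plt.scatter()-style plotting via ax.scatter().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab or Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values shaped like a small CSV export (NOT real NDBC data)
raw = pd.DataFrame({
    "time": ["2020-07-01 00:00", "2020-07-01 01:00",
             "2020-07-01 02:00", "2020-07-01 03:00"],
    "wind_speed": [5.1, 6.3, 4.8, 7.2],
    "wave_height": [0.9, 1.2, 0.8, 1.5],
})

# The time column looks like dates to a human, but it is stored as strings
looks_like_dates = pd.api.types.is_datetime64_any_dtype(raw["time"])
print("time column is datetime before conversion:", looks_like_dates)

# Convert the strings to real timestamps and use them as the index
df = raw.copy()
df["time"] = pd.to_datetime(df["time"])
df = df.set_index("time")
print("index is datetime after conversion:",
      pd.api.types.is_datetime64_any_dtype(df.index))

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(13, 3.5))

# 1. Explicit matplotlib: you supply x, y, and every label yourself
ax1.plot(df.index, df["wind_speed"])
ax1.set_xlabel("Time")
ax1.set_ylabel("Wind speed (m/s)")
ax1.set_title("matplotlib")

# 2. pandas internal method: x comes from the index automatically
df["wind_speed"].plot(ax=ax2)
ax2.set_title("pandas .plot()")

# 3. A scatter plot of one variable against another
ax3.scatter(df["wind_speed"], df["wave_height"])
ax3.set_xlabel("Wind speed (m/s)")
ax3.set_ylabel("Wave height (m)")
ax3.set_title("scatter")

fig.tight_layout()
fig.savefig("hangups_example.png")
```

If your data starts out as an xarray Dataset, its to_dataframe() method is one way to get to a pandas DataFrame like the one above.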
If I have a chance to teach an introduction to data visualization course in the future (ideally with more time), I hope to figure out some new approaches to these challenges. And if you have any ideas, I’d love to hear them too!
But in the end, I think students in our REU really appreciated having this notebook, with its collection of examples to refer to, as they worked on their own plots.
In fact, I hadn’t originally intended to include the two-axes example in the notebook, but it was something students really wanted to know how to do. (In the data vis community, many have made arguments against dual-axis charts, but they’re pretty common in oceanography.)
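For reference, one common way to build a two-axes chart in matplotlib is with twinx(), which adds a second y-axis that shares the same x-axis. This is a minimal sketch with synthetic data, and it may not match how the notebook does it; the variable names and values are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Colab or Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic hourly data (assumed values, NOT real NDBC data)
hours = np.arange(72)
time = pd.Timestamp("2020-07-01") + pd.to_timedelta(hours, unit="h")
air_temp = 22 + 3 * np.sin(hours * 2 * np.pi / 24)
pressure = 1015 - 0.1 * hours

fig, ax_left = plt.subplots(figsize=(8, 4))

# Left y-axis: air temperature
ax_left.plot(time, air_temp, color="tab:red")
ax_left.set_ylabel("Air temperature (deg C)", color="tab:red")
ax_left.tick_params(axis="y", labelcolor="tab:red")

# Right y-axis: pressure, drawn on a twin axis that shares the x-axis
ax_right = ax_left.twinx()
ax_right.plot(time, pressure, color="tab:blue")
ax_right.set_ylabel("Pressure (hPa)", color="tab:blue")
ax_right.tick_params(axis="y", labelcolor="tab:blue")

fig.autofmt_xdate()
fig.savefig("two_axes_example.png")
```

Coloring each y-axis label to match its line helps readers tell which scale belongs to which variable.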
At any rate, if you are working on your own ocean/python/datavis introduction, I hope you find this notebook helpful!
This post is part of our 2020 Summer REU Intro to Python series. See also Part 1, Part 3, and Part 4.