Physics to data science: Data visualization using Seaborn in Python

Last time, I presented an analysis of some education data available from IPEDS. For the visualization, I used a Python package called Seaborn. After about eight years of using MATLAB and Mathematica for plotting, I was astounded by the quality of the plots. Here, I want to talk a bit about Seaborn, and the learning curve I ascended.

If you'd like to follow along, here's a link to the .csv files I'm using for this post.

Seaborn

Seaborn specializes in plotting categorical data, visualizing linear relationships [something else?]. It handles uncertainties very well, plotting standard deviation bars, and linear regressions by default. It runs off the back of Matplotlib, another plotting package.

The Seaborn website includes a "tutorial" and a gallery, but the tutorial is very limited, and frankly, not basic enough for me. Here, I'll show a couple of examples in more detail.

Here's an example of what Seaborn can do:

Input data

You can feed Seaborn a variety of data formats, but it's convenient to use DataFrames, since (1) it's used in Pandas, and (2) it's a damned elegant way to represent data. I didn't know anything about Pandas when I started this project, and it took me a few false starts to get the .csv files into a form that Seaborn liked. Here's how I did it.

.csv to DataFrame

A comma separated value file looks like this:

while a dataframe of the same simple data looks like this:

Some differences are:
1) The column headers have become labels, and are no longer part of the columns
2) A new index column was added to act as an identifying key
3) The data type of each column is stored in memory

Pandas gives us a way to import data from a .csv directly into a dataframe using the read_csv function:

Note that we loaded some packages here, and called them by shorter aliases so we don't have to type long names later.

Plotting

I used the function regplot to generate the above plot. It generates a scatter plot, and it automatically does a regression and plots the best fit along with the 95% confidence predictions [Note: It causes me physical pain to plot a linear regression and confidence interval when I have no reliable information about the random process generating my data. In my defense, I'm ignoring the results of the regression entirely]. As inputs, it takes the DataFrame containing the data, and it takes references (in the form of the column label strings) to the columns we want to plot. The above plot just needs two columns - one for each axis.

Calling regplot returns an "axis" object. Next, we have to tell Python to put that object into a plot and show it. We use the srs.plt.show() command:

We have dozens of options to tweak the appearance of this graph, but the raw output already looks better than about 95% of the graphs I've published. The plot window has a save option, and you can export the figure as a .pdf and then edit it in any vector graphic program (I use both Adobe Illustrator and Inkscape). But if you're a baller or a masochist, you might prefer to modify it in Python.
The plots take something like a style sheet, where you can choose a theme based on what you're using the graphic for. It changes line thicknesses and font sizes, among other things, for slides, papers, or posters. Change it with the set_context command.

Color schemes

What would we sink our copious free time into if it weren't for color scheme choices?

You're free to define whatever colors you want in Seaborn plots, but as I'm learning, nobody does original work in data science (I kid!). Seaborn can tap into colorbrewer, whose color schemes are illustrated here. As an example, here's a horizontal bar chart using some of the data I provided:

which was generated with the following code:

I turned the chart sideways by providing the categories to the y-axis. Clever! If you feed Seaborn numerical data on one axis, it plots the bars on that axis. If both axes are non-numerical, it throws an error.

One of my .csv files had commas marking the thousands place for some reason, and Python imported these numbers as strings. Seaborn was very unhappy. If this happens, you can convert the strings back into numbers in Python, or you can fix your .csv manually.

Multiple columns

I had problems when I wanted to plot more than one category of data. The documentation on data structure for Seaborn is hard to find or doesn't exist, and I had to suss out what it was looking for. I first tried feeding it the following:

which generated this incorrect bar chart:

What happened here is that Seaborn thought I wanted the bars to correspond to the average of the columns, and the black lines to be the standard deviations. In the tutorial, the columns are supplied as values of a single category, which was not a feature of my data set. The solution was to "massage" the dataframe from the raw input:

into something that looked like this:

Here, I've used the "melt" function in Pandas to map the column names into values of the second column, effectively adding a new variable called "variable" whose values are in (degrees_per_100k, phys_deg_100k). I can now tell Seaborn that the "hue" of the data set is controlled by "variable" and that the bar heights are controlled by "value". The code now looks like this:

which results in this plot:

That's the extent of my limited experience with Seaborn, but I will surely continue using it. I'm pretty impressed so far.

Physics to data science

Saturday, August 22, 2015

Data visualization using Seaborn in Python