If you'd like to follow along, here's a link to the .csv files I'm using for this post.
Seaborn
Seaborn specializes in plotting categorical data and visualizing linear relationships. It handles uncertainty very well, drawing error bars and linear regression fits by default. It runs on top of Matplotlib, another plotting package. The Seaborn website includes a "tutorial" and a gallery, but the tutorial is very limited, and frankly, not basic enough for me. Here, I'll show a couple of examples in more detail.
Here's an example of what Seaborn can do:
Input data
You can feed Seaborn a variety of data formats, but it's convenient to use DataFrames, since (1) they're what Pandas uses, and (2) they're a damned elegant way to represent data. I didn't know anything about Pandas when I started this project, and it took me a few false starts to get the .csv files into a form that Seaborn liked. Here's how I did it.
.csv to DataFrame
A comma separated value file looks like this:
"st_abbr","pop","degrees"
"CA",38802500,550509
"NY",19746227,333690
"FL",19893297,324106
"TX",26956958,320099
"IL",12880580,229623
"PA",12787209,204882
"AZ",6731484,192699
"OH",11594163,167144
"MI",9909877,150684
"MA",6745408,135380
"VA",8326289,134263
...
Loaded into a DataFrame, the same data looks like this:
   st_abbr       pop  degrees
0       CA  38802500   550509
1       NY  19746227   333690
2       FL  19893297   324106
3       TX  26956958   320099
4       IL  12880580   229623
5       PA  12787209   204882
6       AZ   6731484   192699
7       OH  11594163   167144
8       MI   9909877   150684
9       MA   6745408   135380
10      VA   8326289   134263
...
The DataFrame differs from the raw .csv in three ways:
1) The column headers have become labels, and are no longer part of the columns
2) A new index column was added to act as an identifying key
3) The data type of each column is stored in memory
Pandas gives us a way to import data from a .csv directly into a DataFrame using the read_csv function:
__author__ = 'bdeutsch'
# Import packages
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Import data
data1 = pd.read_csv('datasets/pop_vs_degrees.csv')
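To check the three points about DataFrames without needing the file on disk, here is a minimal sketch that reads the first few rows of the table from an in-memory string. The use of io.StringIO and the variable names are my own choices for illustration, not part of the original workflow.

```python
import io

import pandas as pd

# First three rows of the .csv shown earlier, held in a string
# instead of a file so the example is self-contained.
csv_text = '''"st_abbr","pop","degrees"
"CA",38802500,550509
"NY",19746227,333690
"FL",19893297,324106
'''

df = pd.read_csv(io.StringIO(csv_text))

# 1) The header row became the column labels.
print(list(df.columns))

# 2) An integer index was added as an identifying key.
print(list(df.index))

# 3) Each column carries its own data type.
print(df.dtypes)
```

Printing df.dtypes is a quick way to confirm that the numeric columns were actually read as numbers before handing the DataFrame to Seaborn.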
Plotting
I used the function regplot to generate the above plot. It draws a scatter plot, automatically fits a linear regression, and plots the best-fit line along with a 95% confidence interval for the fit [Note: It causes me physical pain to plot a linear regression and confidence interval when I have no reliable information about the random process generating my data. In my defense, I'm ignoring the results of the regression entirely]. As inputs, it takes the DataFrame containing the data and references (in the form of column label strings) to the columns we want to plot. The above plot needs just two columns, one for each axis.
__author__ = 'bdeutsch'
# Import packages
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Import data
data1 = pd.read_csv('datasets/pop_vs_degrees.csv')
print(data1)
# Rescale the data: degrees in units of 10k, population in millions
data1['degrees'] = data1['degrees'] / 10000
data1['pop'] = data1['pop'] / 1000000
# Set the context and font size
sns.set_context("poster", font_scale=1)
# Create a figure and set its size
plt.figure(figsize=(10, 8))
# Create the regression plot, where we supply x and y columns. "g" = green.
g = sns.regplot(x='pop', y='degrees', data=data1, color="g")
# Start both axes at zero, and set the axis labels and chart title.
g.set(xlim=(0, None), ylim=(0, None),
      xlabel="Population [millions]",
      ylabel="Total degrees/awards in 2013 [x10k]",
      title="Total awards vs. population by state")
# Show the plot (newer Seaborn releases no longer expose sns.plt, so call plt directly)
plt.show()
# Once the plot comes up, save as a .pdf for post-processing.
We have dozens of options to tweak the appearance of this graph, but the raw output already looks better than about 95% of the graphs I've published. The plot window has a save option, and you can export the figure as a .pdf and then edit it in any vector graphic program (I use both Adobe Illustrator and Inkscape). But if you're a baller or a masochist, you might prefer to modify it in Python.
The plots take something like a style sheet, where you can choose a theme based on what you're using the graphic for. It changes line thicknesses and font sizes, among other things, for slides, papers, or posters. Change it with the set_context command.
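Here is a minimal sketch of how the contexts differ, which also writes each figure straight to a .pdf with Matplotlib's savefig rather than going through the plot window. It uses only the first five rows of the population table. The temporary output folder, the file names, and the off-screen "Agg" backend are assumptions made for the example, and plotting_context is used here in place of set_context so that each context applies only inside its own block.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw off-screen so no plot window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First five rows of the population/degrees table.
data = pd.DataFrame({
    "pop": [38802500, 19746227, 19893297, 26956958, 12880580],
    "degrees": [550509, 333690, 324106, 320099, 229623],
})
data["pop"] = data["pop"] / 1000000      # millions of people
data["degrees"] = data["degrees"] / 10000  # tens of thousands of degrees

out_dir = tempfile.mkdtemp()  # hypothetical output location
font_sizes = {}
saved_files = []

for context in ["paper", "notebook", "talk", "poster"]:
    # plotting_context returns the settings for a context; used in a
    # "with" block, it applies them only to the plots drawn inside it.
    settings = sns.plotting_context(context)
    font_sizes[context] = settings["font.size"]
    with settings:
        fig, ax = plt.subplots(figsize=(10, 8))
        sns.regplot(x="pop", y="degrees", data=data, color="g", ax=ax)
        ax.set_title("Context: " + context)
        path = os.path.join(out_dir, "regplot_" + context + ".pdf")
        fig.savefig(path)  # the .pdf extension selects the output format
        plt.close(fig)
        saved_files.append(path)

print(font_sizes)
print(saved_files)
```

Opening the four files side by side shows how much the fonts and line widths grow from "paper" to "poster".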
Color schemes
What would we sink our copious free time into if it weren't for color scheme choices? You're free to define whatever colors you want in Seaborn plots, but as I'm learning, nobody does original work in data science (I kid!). Seaborn can tap into ColorBrewer, whose color schemes are illustrated here. As an example, here's a horizontal bar chart using some of the data I provided:
which was generated with the following code:
__author__ = 'bdeutsch'
# Import packages
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Import data
data1 = pd.read_csv('datasets/opt_by_inst.csv')
# Set context, increase font size
sns.set_context("poster", font_scale=1.5)
# Create a figure
plt.figure(figsize=(15, 8))
# Define the axis object
ax = sns.barplot(x='optics graduates', y='inst_nm', data=data1, palette="Blues_d")
# Set parameters
ax.set(xlabel='Optics PhDs awarded', ylabel='Institution name', title="Top optics-granting institutions")
# Show the plot (newer Seaborn releases no longer expose sns.plt, so call plt directly)
plt.show()
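To see what a palette name such as "Blues_d" actually contains before committing to it, a small sketch like the following can help. It asks Seaborn for five colors from two palettes and converts them to hex strings. The choice of "Set2" as the second palette and of five colors is arbitrary, picked only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # no plot window is needed for this example
from matplotlib.colors import to_hex
import seaborn as sns

# "Blues_d" is the palette used in the bar chart; "Set2" is one of the
# qualitative ColorBrewer palettes (an arbitrary pick for comparison).
blues = sns.color_palette("Blues_d", 5)
set2 = sns.color_palette("Set2", 5)

# Each palette is a list of (red, green, blue) tuples with values from 0 to 1.
blues_hex = [to_hex(color) for color in blues]
set2_hex = [to_hex(color) for color in set2]

print(blues_hex)
print(set2_hex)
```

The hex strings can be pasted into a vector graphics program if you want the post-processed figure to match the Seaborn colors.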
One of my .csv files had commas marking the thousands place for some reason, and Python imported these numbers as strings. Seaborn was very unhappy. If this happens, you can convert the strings back into numbers in Python, or you can fix your .csv manually.
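A minimal sketch of both Python-side fixes, using a made-up two-row table (the institution names, the column name "graduates", and the numbers are hypothetical, not taken from my data). The first option uses the thousands argument of read_csv; the second strips the commas after the fact.

```python
import io

import pandas as pd

# Hypothetical .csv content with commas marking the thousands place.
csv_text = '''"inst_nm","graduates"
"Example University A","1,234"
"Example University B","56,789"
'''

# Default import: the quoted numbers come in as strings.
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw.loc[0, "graduates"]))

# Option 1: tell read_csv which character is the thousands separator.
fixed_on_read = pd.read_csv(io.StringIO(csv_text), thousands=",")

# Option 2: strip the commas from the strings and convert to integers.
fixed_after = raw.copy()
fixed_after["graduates"] = (
    fixed_after["graduates"].str.replace(",", "", regex=False).astype(int)
)

print(fixed_on_read)
print(fixed_after)
```

Option 1 is the shorter route when you know about the problem before importing; option 2 is handy when the DataFrame is already loaded.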
Multiple columns
I had problems when I wanted to plot more than one category of data. The documentation on data structure for Seaborn is hard to find or doesn't exist, and I had to suss out what it was looking for. I first tried feeding it the following:
__author__ = 'bdeutsch'
# Import packages
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Import data
data1 = pd.read_csv('datasets/phys_phd_by_state.csv')
# Scale the data
data1['phys_phd_100k'] = data1['phys_phd_100k'] * 50
# Set context
sns.set_context("poster", font_scale=1)
# Create figure
plt.figure(figsize=(8, 8))
# Create axis object as a bar plot
ax = sns.barplot(data=data1, palette="Blues_d")
# Set axis labels, title
ax.set(xlabel='Awards per 100k people', ylabel='State/district', title="Degrees and physics PhDs by state")
# Show the plot (newer Seaborn releases no longer expose sns.plt, so call plt directly)
plt.show()
That wasn't the structure Seaborn was looking for. I had to reshape the DataFrame from this:
                  state  degrees_per_100k  phys_phd_100k
0  District of Columbia         4175.7918       1434.220
1               Arizona         2862.6525       1061.430
2                  Iowa         2846.8430        706.440
3                  Utah         2130.5160        948.040
4          Rhode Island         2032.8420       1113.560
...
into something that looked like this:
                    state          variable      value
0    District of Columbia  degrees_per_100k  4175.7918
1                 Arizona  degrees_per_100k  2862.6525
2                    Iowa  degrees_per_100k  2846.8430
3                    Utah  degrees_per_100k  2130.5160
4            Rhode Island  degrees_per_100k  2032.8420
5           Massachusetts  degrees_per_100k  2006.9949
..                    ...               ...        ...
97             New Jersey     phys_phd_100k   564.4300
98                Montana     phys_phd_100k   874.3800
99                 Hawaii     phys_phd_100k   405.0500
100                Alaska     phys_phd_100k   481.8550
101                Nevada     phys_phd_100k   345.1750
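The reshaping between the two tables above can be tried on a small scale. This is a minimal sketch using only the first two rows of the wide table, typed in by hand rather than read from the .csv, and the melt function from Pandas. The variable names wide and long_form are my own.

```python
import pandas as pd

# First two rows of the wide table, as printed above.
wide = pd.DataFrame({
    "state": ["District of Columbia", "Arizona"],
    "degrees_per_100k": [4175.7918, 2862.6525],
    "phys_phd_100k": [1434.220, 1061.430],
})

# Keep "state" as the identifier and stack the two measurement columns.
# By default the old column names land in a column called "variable"
# and the numbers land in a column called "value".
long_form = pd.melt(
    wide,
    id_vars=["state"],
    value_vars=["degrees_per_100k", "phys_phd_100k"],
)

print(long_form)
```

Two states and two measurement columns give four rows in the long form, one per state and measurement pair.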
Here, I've used the "melt" function in Pandas to map the column names into values of the second column, effectively adding a new variable called "variable" whose values are in (degrees_per_100k, phys_phd_100k). I can now tell Seaborn that the "hue" of the data set is controlled by "variable" and that the bar lengths are controlled by "value". The code now looks like this:
__author__ = 'bdeutsch'
# Import packages
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Import data
data1 = pd.read_csv('datasets/phys_phd_by_state.csv')
# Rescale data
data1['phys_phd_100k'] = data1['phys_phd_100k'] * 50
# Reshape into correct form for Seaborn bar chart
data2 = pd.melt(data1, id_vars=['state'], value_vars=['degrees_per_100k', 'phys_phd_100k'])
# Made font smaller to fit into one frame
sns.set_context("poster", font_scale=.5)
plt.figure(figsize=(15, 15))
ax = sns.barplot(y='state', x='value', hue='variable', data=data2, palette="Blues_d")
ax.set(xlabel='Awards per 100k people', ylabel='State/district')
# To get the legend to show custom labels, we need the following code.
patches, labels = ax.get_legend_handles_labels()
# Here we pass the axis object information to the legend object through "patches."
ax.legend(patches, ["All degrees", "Physics PhDs (x50)"], ncol=1, loc="lower right", frameon=True)
# Show the plot (newer Seaborn releases no longer expose sns.plt, so call plt directly)
plt.show()
which results in this plot:
That's the extent of my limited experience with Seaborn, but I will surely continue using it. I'm pretty impressed so far.