Physics to data science: August 2015

Saturday, August 22, 2015

Data visualization using Seaborn in Python

Last time, I presented an analysis of some education data available from IPEDS. For the visualization, I used a Python package called Seaborn. After about eight years of using MATLAB and Mathematica for plotting, I was astounded by the quality of the plots. Here, I want to talk a bit about Seaborn, and the learning curve I ascended.

If you'd like to follow along, here's a link to the .csv files I'm using for this post.

Seaborn

Seaborn specializes in plotting categorical data, visualizing linear relationships [something else?]. It handles uncertainties very well, plotting standard deviation bars, and linear regressions by default. It runs off the back of Matplotlib, another plotting package.

The Seaborn website includes a "tutorial" and a gallery, but the tutorial is very limited, and frankly, not basic enough for me. Here, I'll show a couple of examples in more detail.

Here's an example of what Seaborn can do:

Input data

You can feed Seaborn a variety of data formats, but it's convenient to use DataFrames, since (1) it's used in Pandas, and (2) it's a damned elegant way to represent data. I didn't know anything about Pandas when I started this project, and it took me a few false starts to get the .csv files into a form that Seaborn liked. Here's how I did it.

.csv to DataFrame

A comma separated value file looks like this:

while a dataframe of the same simple data looks like this:

Some differences are:
1) The column headers have become labels, and are no longer part of the columns
2) A new index column was added to act as an identifying key
3) The data type of each column is stored in memory

Pandas gives us a way to import data from a .csv directly into a dataframe using the read_csv function:

Note that we loaded some packages here, and called them by shorter aliases so we don't have to type long names later.

Plotting

I used the function regplot to generate the above plot. It generates a scatter plot, and it automatically does a regression and plots the best fit along with the 95% confidence predictions [Note: It causes me physical pain to plot a linear regression and confidence interval when I have no reliable information about the random process generating my data. In my defense, I'm ignoring the results of the regression entirely]. As inputs, it takes the DataFrame containing the data, and it takes references (in the form of the column label strings) to the columns we want to plot. The above plot just needs two columns - one for each axis.

Calling regplot returns an "axis" object. Next, we have to tell Python to put that object into a plot and show it. We use the srs.plt.show() command:

We have dozens of options to tweak the appearance of this graph, but the raw output already looks better than about 95% of the graphs I've published. The plot window has a save option, and you can export the figure as a .pdf and then edit it in any vector graphic program (I use both Adobe Illustrator and Inkscape). But if you're a baller or a masochist, you might prefer to modify it in Python.
The plots take something like a style sheet, where you can choose a theme based on what you're using the graphic for. It changes line thicknesses and font sizes, among other things, for slides, papers, or posters. Change it with the set_context command.

Color schemes

What would we sink our copious free time into if it weren't for color scheme choices?

You're free to define whatever colors you want in Seaborn plots, but as I'm learning, nobody does original work in data science (I kid!). Seaborn can tap into colorbrewer, whose color schemes are illustrated here. As an example, here's a horizontal bar chart using some of the data I provided:

which was generated with the following code:

I turned the chart sideways by providing the categories to the y-axis. Clever! If you feed Seaborn numerical data on one axis, it plots the bars on that axis. If both axes are non-numerical, it throws an error.

One of my .csv files had commas marking the thousands place for some reason, and Python imported these numbers as strings. Seaborn was very unhappy. If this happens, you can convert the strings back into numbers in Python, or you can fix your .csv manually.

Multiple columns

I had problems when I wanted to plot more than one category of data. The documentation on data structure for Seaborn is hard to find or doesn't exist, and I had to suss out what it was looking for. I first tried feeding it the following:

which generated this incorrect bar chart:

What happened here is that Seaborn thought I wanted the bars to correspond to the average of the columns, and the black lines to be the standard deviations. In the tutorial, the columns are supplied as values of a single category, which was not a feature of my data set. The solution was to "massage" the dataframe from the raw input:

into something that looked like this:

Here, I've used the "melt" function in Pandas to map the column names into values of the second column, effectively adding a new variable called "variable" whose values are in (degrees_per_100k, phys_deg_100k). I can now tell Seaborn that the "hue" of the data set is controlled by "variable" and that the bar heights are controlled by "value". The code now looks like this:

which results in this plot:

That's the extent of my limited experience with Seaborn, but I will surely continue using it. I'm pretty impressed so far.

Thursday, August 20, 2015

Secondary education data: analysis with SQL and Python/Seaborn

I recently looked at a collection of data from 2013 published by the Integrated Postsecondary Education Data System. Every year they take a census of postsecondary educational institutions in the United States, who report all kinds of information about their enrollment, costs, staff, and degrees awarded. My goal was to learn how to use SQL to interact with databases, and how to visualize data using Python.

Let's start with what I found out first, and then I'll talk about where I got the data, and how I imported it and used MySQL to analyze it. I used Python/Seaborn/Matplotlib to visualize it, but I'll go over that in a future post.

What I found:

Each state awards degrees roughly in proportion to its population. The largest in each category is California, and there are two sort-of-outliers. Arizona awards disproportionatly more degrees due to the presence of the University of Phoenix online, and Texas awards proportionally fewer for some reason I don't know (suggestions?). This chart uses the total state population, including those under 18 years.
Degrees vs. population for each state
Women greatly outnumber men in postsecondary education, representing about 59% of all awardees and 58% of awardees of master's degrees or higher. Here are the general fields of study that have the highest percentage of women and men.

Greatest representation among women and men
Some states are highly under- or over-represented among Physical Science PhD's, proportionally. After scaling to the population of each state, the trend is linear-ish. Outliers include MA, home of MIT, Harvard, and others, along with Florida, where apparently physics isn't so popular.

Physical Science Phd's vs all degrees, scaled to population

Here is another representation of the same data, where it becomes pretty clear that total degrees per population is not a great indicator for proportion of people with Physics PhD's. Also, this list reveals that D.C. has crazy amounts of degree holders and Physics PhD's.

Physical Science Phd's and all degrees, scaled to population
Out of curiosity, I wanted to see who was awarding Optics PhD's. I went to #3, The University of Rochester, for mine.

Institutions ranked by number of Optics PhD's awarded in 2012-13

Analysis methods

If you're more interested in education than data science, this is probably a good stopping point. But I did this project to sharpen my teeth on database interaction and Python visualization, so here we go!

The data

I made a database containing five tables regarding the 2013 IPEDS surveys and supplementary information. Here are links to the data files, along with descriptions.

Directory of institutions: Contains a unique ID number for each postsecondary institution, its name, address, state (in abbreviated form), and contact information.
Completions: For each institution ID, this table contains the number of degrees or certificates awarded in each subject area to men and women of various ethnic groups at all levels up through professional/research doctorates. Subject areas are catalogued by a "CIP code," which is a taxonomy system.
CIP codes dictionary: For each CIP code, this table contains a title ("Underwater Basket Weaving", e.g.) and a short description.
Population by state: I wanted to compare the number of degrees to the total population of each state, so I pulled in this chart. Note that each state is indexed by its full name.
State abbreviations: I could have manually changed all the state names in the population table to their postal abbreviations... but I'm lazy. So instead I found this table of abbreviations and let MySQL do it for me.

For the IPEDS data, there are corresponding dictionaries that describe the column names. I ended up needing only a few of them.

The setup

I wanted to use the language SQL to interact with my data. The steps to get there were roughly:

Choose and install a program to host a database server on my local machine. This server will take instructions in SQL, either from the command line or from some kind of GUI. I chose MySQL Community Server, but Microsoft's SQL server is a viable alternative.
(Optional) Find a program to interact with the database server. At first I just worked from the command line, but I ended up installing Sequel Pro, a free program that's quite easy to use. You can also use Python to interact with MySQL, which is convenient for sophisticated analyses. I'm currently set up to do that, but I didn't use it for this project. This tutorial shows you how to do it.
Import the data. Here's how.

Importing the data from the command prompt

You've downloaded the data sets as .csv files, and it's time to create a database where each data file will end up being a table. Assuming you're working from the command prompt, log into MySQL by typing "mysql -u root -p" (where I use the root user to make sure I have full privileges. You can also create a new user with a password, but I'm not going to have anyone else accessing my server).

Next, create a database using "create database db_name_here"

This database is empty, and we need to fill it with tables corresponding to our .csv files. From the command line, you can create a table like this:

where "...." stands for whatever other columns you need. If we want to make a table for the directory data, for example, column_name1 is the unit ID with type INT. Once the database is created, you can use the BULK INSERT command to import the .csv, but I didn't do that. Instead, I used Sequel Pro to do it, as shown below.

Importing the data using Sequel Pro

After connecting to the server with Sequel Pro, you can select a database to use, and then click on the + in the bottom-left corner to create a new table. The column names and data types get added one by one, and you can feel free to only add the ones you really need to use. Here's what the table structure looks like for the "directory" information:

Clicking File->Import brings up a file dialog where you can choose your .csv file. Next, an import window shows up:

The left column shows the column names in the CSV, and the middle column shows the columns you just created. You can have it match the fields in order, or by name if the names are identical. You can also manually select them by clicking on the CSV field entry. Make sure "First line contains field names" is checked before clicking Import. Switching over to the content tab, we see that the .csv imported correctly.

This process has to be repeated for each table you'd like to import.

Notes: I had to add a row to the cipcode table for "99 = Total," since the completions table uses that notation. I also chose to convert the codes in this table to floats instead of strings, since I wanted to round them. Rounded CIP codes give the broader subject field.

The analysis

I spent some time just exporing and poking at the data, iteratively refining my searches until I found a few interesting things to report on. Here are some of examples of the kind of analysis I did.

Most popular areas of study across all award levels (not shown in charts, but healthcare crushes everything else)

"ctotalt" is the total number of degrees of a particular subject and level awarded at a particular institution. I use INNER JOIN to retrieve the name of the area of study from the cips table. I called the CIP code "ccode" in that table so it had a unique name. Otherwise, you can call it by "table.column_name" to avoid ambiguity.

Total degrees and physics degrees by state

Here I use a CASE to count degrees for two different CIP codes. This may not be best practice, but it works. I wanted to scale by the population of each state, so I had to join three databases. "abbrev" gives me the full state name so I can look it up in "population". I'm using a pre-filtered version of "cips" where I've eliminated all of the specific fields of study (any classification beyond the decimal place) to cut down on query time. I didn't actually need to join directory. Note that CIP code = 99 is used in "completions" to indicate the total among all CIP codes, while CIP = 40 indicates Physics-related studies.

Representation of women across all degrees, and for English language and literature degrees

This is a straightforward adaptation of the above. I want the proportion of degrees awarded to women, so I take the ratio ctotalw/ctotalt, and only count it when my conditions are met.

Representation of women by general area of study

Here, I just need to avoid counting the "total" amounts for each institution (indicated by cipcode = 99). The symbol "<>" means "not equal to."

Population vs. degrees awarded for each state

Here I need four tables. The population requires "population" and "abbrev" to get state names, and total degrees by state requires "completions," and "directory" gives me the state in which each institution resides. I include the filtered version of "cips" here because it was easier than deleting it.

Population vs. professional doctorates for each state

Same, but with the requirement that "award_lvl = 17." The award levels can be found here. 17 excludes professional degrees like MD's.

Optics PhD's by institution

A subject near to my heart, and an easy query. Instead of searching for the CIP codes corresponding to optics, I search for the word "optical" in the CIP title. This is in general a terrible strategy, but I checked ahead of time that this gets everything I want and nothing I don't.

Saturday, August 15, 2015

Code snippets in Blogger

I'm going to be including some code blocks soon, and I found this quick method. You just create a document on gist.github.com with the appropriate extension, and it generates a link to embed the code block. Here are some examples:

Python:

SQL:

C++:

Hopefully next time I'll have some SQL results to share.

Wednesday, August 12, 2015

Setting up a development environment

I've mostly used my work machine for data science training, since it had many of the utilities I needed already, and also it was a computation beast. But I wanted to get my Macbook Pro into a condition to be my main development machine, since (1) it's portable, and (2) I don't work there no more.

For the most part, I followed steps 2-8 of the tutorial here to set up the environment, skipping the system preferences and stopping before installing C++, Ruby, and other developer software. I'll get there eventually, but I'm taking baby steps. I also already have LaTeX running on the Macbook, so I didn't need that.

Step 6 is the installation of Sublime Text, which is a pretty rad text editor. You can use the "evaluation version" for free forever, or you can pay the $70 to register it. It does a lot of cool stuff, including automatically setting the syntax for whatever programming language you're using. For example, if I create a new file and save it with the extension *.py, I can create a function by typing "def", and Sublime Text fills in the following:

Everything has been marked up in color, and the proper function syntax has been filled in. The tutorial includes the installation of various packages that set the color scheme, as well as mark up possible errors in the program. The red block indicates a perceived error, where it's worried that there's a tab in the line (there is, but it's equivalent to four spaces, which is the correct syntax).

You can also run programs from Sublime Text using its console:

Note the bit at the bottom. Once I wrote the function and called it, I used Sublime Text's "build" command with Python.

Step 8 in the tutorial involves the installation of MySQL, which I'll talk more about in a future post. Today I went through a couple of tutorials, and to be honest, it took me a while to get it working. In short, SQL is a language for interacting with databases. To use it, you need some databases, which are typically stored on a "server," which can just be your computer. There are several programs that can be used to maintain and interact with these database servers, and MySQL is one of them. It's open source, and the community edition is free. Here is an SQL tutorial, and here is another one.

I think I have the essentials set up now. My original goal today was to learn the basics of SQL and start a mini-project. Hopefully I'll get to the mini-project this weekend.

Sunday, August 9, 2015

Insight

I'm excited to announce that I recently accepted an invitation to be an Insight Data Science Fellow in New York City. I want to tell you a bit about the application and interview process in case you find yourself in a similar situation. But first, here's a bit about the program.

The Insight Fellowship

The Insight program is intended for people with PhD's in any field, and who intend to become data scientists in New York City or the SF Bay Area. Over the course of seven weeks, fellows put together an individual project that showcases their data science abilities, and the Insight crew helps prepare them for interviews. Starting in week eight, interviews begin. As of right now, with 300 alumni, the insight program has a 100% placement rate.

It's free of charge, since they work on something like a recruiter model. They have corporate partners that come to them for recruits, and Insight gets paid if and when a fellow accepts a job offer. I don't know the details, but if it's similar to other recruiters, the amount of the payment is a percentage of the base salary accepted, so they have an incentive to help the fellows negotiate well.

The program is selective, and involves an application and an interview. I heard from one of the Insight directors (from the SF Bay area), Kathy Copic, that the acceptance rate is something like 7%. They're looking for people who have an excellent chance of getting placed at a company, since that's the only way they get paid. Here's how the process works.

The application

First you complete a web form with basic biographical information and a few short answer blocks. They want to know what programming languages you know, and how familiar you are with each. They ask about what kinds of data you have worked with, and what types of analyses you've done, including side projects. Finally, there's a space to explain why you want to be a data scientist. Each of these is limited to around one paragraph.

The interview

If you make the first cut, a Skype interview is scheduled, and lasts about a half hour. They request that you prepare a short "demo" showing they types of data analysis you typically do. My work is mostly done in MATLAB, but I wanted to show that I was familiar with Python, so I did a short project the weekend before the interview, which resulted in the previous blog posts on Twitter data mining.

The interview is extremely relaxed. They say specifically that you're not required to dress up for it, and it's really just a conversation with one of the directors. I spoke with Genevieve, who told me about her career trajectory (she was accepted into the Insight program after her PhD, and was then invited to stay as a director). I talked a bit about my background in nano optics and computational/experimental optics, and about the experimental data I tend to collect.

For the demo, I shared my desktop and showed Genevieve some example MATLAB code that performed some basic noise analysis. I had written it a couple of weeks before to characterize an experimental system. Then I showed the Python code and the results of the Twitter experiment. We talked a little bit about what companies I would be interested in working with, and why I wanted to be a data scientist. At the end she answered a couple of my questions before we wrapped up.

Interviews spanned 11 days. I heard back about my acceptance on the 12th day. I don't know what the acceptance rate was this year, or even how many colleagues will join me as Insight Fellows. I also don't know what percentage of applicants made the first cut but not the second. But hopefully this post will help an applicant for a future session.

The interim

I just moved to NYC with my family, and will begin the program in about a month. In the mean time I'm going to continue familiarizing myself with the basic tools of the data scientist, and getting used to the area. I can tell you so far that the bagels are excellent, and even the "bad" Chinese food is pretty good compared to that in Central Illinois.