Thursday, August 20, 2015

Postsecondary education data: analysis with SQL and Python/Seaborn

I recently looked at a collection of data from 2013 published by the Integrated Postsecondary Education Data System (IPEDS). Every year IPEDS takes a census of postsecondary educational institutions in the United States, which report all kinds of information about their enrollment, costs, staff, and degrees awarded. My goal was to learn how to use SQL to interact with databases, and how to visualize data using Python.

Let's start with what I found, and then I'll talk about where I got the data, how I imported it, and how I used MySQL to analyze it. I used Python/Seaborn/Matplotlib to visualize it, but I'll go over that in a future post.

What I found:

  1. Each state awards degrees roughly in proportion to its population. California is the largest in both categories, and there are two sort-of outliers: Arizona awards disproportionately more degrees, thanks to the University of Phoenix's online programs, and Texas awards proportionally fewer for some reason I don't know (suggestions?). This chart uses the total state population, including those under 18.

    [Chart: Degrees vs. population for each state]

  2. Women greatly outnumber men in postsecondary education, representing about 59% of all awardees and 58% of awardees of master's degrees or higher. Here are the general fields of study with the highest percentages of women and of men.

    [Chart: Greatest representation among women and men]

  3. Some states are highly under- or over-represented among Physical Science PhDs, proportionally. After scaling to each state's population, the trend is linear-ish. Outliers include Massachusetts, home of MIT, Harvard, and others, along with Florida, where physics apparently isn't so popular.

    [Chart: Physical Science PhDs vs. all degrees, scaled to population]

    Here is another representation of the same data, which makes it pretty clear that total degrees per population is not a great indicator of the proportion of people with Physical Science PhDs. This list also reveals that D.C. has remarkably high shares of both degree holders and Physical Science PhDs.

    [Chart: Physical Science PhDs and all degrees, scaled to population]

  4. Out of curiosity, I wanted to see who was awarding Optics PhDs. I went to #3, the University of Rochester, for mine.

    [Chart: Institutions ranked by number of Optics PhDs awarded in 2012-13]

Analysis methods

If you're more interested in education than data science, this is probably a good stopping point. But I did this project to cut my teeth on database interaction and Python visualization, so here we go!

The data

I made a database containing five tables based on the 2013 IPEDS surveys and supplementary information. Here are links to the data files, along with descriptions.
  1. Directory of institutions: Contains a unique ID number for each postsecondary institution, its name, address, state (in abbreviated form), and contact information.
  2. Completions: For each institution ID, this table contains the number of degrees or certificates awarded in each subject area to men and women of various ethnic groups at all levels up through professional/research doctorates. Subject areas are catalogued by a "CIP code" (Classification of Instructional Programs), a standard taxonomy of fields of study.
  3. CIP codes dictionary: For each CIP code, this table contains a title ("Underwater Basket Weaving", e.g.) and a short description.
  4. Population by state: I wanted to compare the number of degrees to the total population of each state, so I pulled in this chart. Note that each state is indexed by its full name.
  5. State abbreviations: I could have manually changed all the state names in the population table to their postal abbreviations... but I'm lazy. So instead I found this table of abbreviations and let MySQL do it for me.
For the IPEDS data, there are corresponding dictionaries that describe the column names. I ended up needing only a few of them.

The setup

I wanted to use the language SQL to interact with my data. The steps to get there were roughly:
  1. Choose and install a program to host a database server on my local machine. This server takes instructions in SQL, either from the command line or from some kind of GUI. I chose MySQL Community Server, but Microsoft SQL Server is a viable alternative.
  2. (Optional) Find a program to interact with the database server. At first I just worked from the command line, but I ended up installing Sequel Pro, a free program that's quite easy to use. You can also use Python to interact with MySQL, which is convenient for sophisticated analyses. I'm currently set up to do that, but I didn't use it for this project. This tutorial shows you how to do it.
  3. Import the data. Here's how.

Importing the data from the command prompt

You've downloaded the data sets as .csv files, and it's time to create a database where each data file will end up being a table. Assuming you're working from the command prompt, log in to MySQL by typing "mysql -u root -p" (I use the root user to make sure I have full privileges; you could also create a new user with a password, but I'm not going to have anyone else accessing my server).


Next, create a database and tell MySQL to use it. A minimal example (the database name is yours to choose; "ipeds" here is hypothetical):
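    CREATE DATABASE ipeds;   -- "ipeds" is just an example name
    USE ipeds;               -- make it the active database for what follows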

This database is empty, and we need to fill it with tables corresponding to our .csv files. From the command line, you can create a table with a CREATE TABLE statement. Here's a minimal sketch for the directory data, using a few of the column names from the IPEDS dictionary:
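    CREATE TABLE directory (
        unitid INT,            -- unique institution ID
        instnm VARCHAR(255),   -- institution name
        stabbr CHAR(2)         -- state postal abbreviation
        -- ...add whatever other columns you need
    );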

where "...." stands for whatever other columns you need. If we want to make a table for the directory data, for example, column_name1 is the unit ID with type INT. Once the database is created, you can use the BULK INSERT command to import the .csv, but I didn't do that. Instead, I used Sequel Pro to do it, as shown below.


Importing the data using Sequel Pro

After connecting to the server with Sequel Pro, you can select a database to use, and then click on the + in the bottom-left corner to create a new table. The column names and data types get added one by one, and you can feel free to add only the ones you really need to use. Here's what the table structure looks like for the "directory" information:

    [Screenshot: column structure of the directory table in Sequel Pro]

Clicking File->Import brings up a file dialog where you can choose your .csv file. Next, an import window shows up:

    [Screenshot: the Sequel Pro import window, matching CSV fields to table fields]
The left column shows the column names in the CSV, and the middle column shows the columns you just created. You can have it match the fields in order, or by name if the names are identical. You can also manually select them by clicking on the CSV field entry. Make sure "First line contains field names" is checked before clicking Import. Switching over to the content tab, we see that the .csv imported correctly.

This process has to be repeated for each table you'd like to import.

Notes: I had to add a row to the CIP code table for "99 = Total," since the completions table uses that notation. I also chose to store the codes in this table as floats instead of strings, since I wanted to round them: rounding a CIP code gives the broader subject field.
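That extra row was a one-line insert, something like this (a sketch, assuming the table is named cips with columns ccode and ctitle):

    INSERT INTO cips (ccode, ctitle) VALUES (99, 'Total');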

The analysis

I spent some time just exploring and poking at the data, iteratively refining my queries until I found a few interesting things to report on. Here are some examples of the kinds of analysis I did.
  • Most popular areas of study across all award levels (not shown in charts, but healthcare crushes everything else)

"ctotalt" is the total number of degrees of a particular subject and level awarded at a particular institution. I use INNER JOIN to retrieve the name of the area of study from the cips table. I called the CIP code "ccode" in that table so it had a unique name. Otherwise, you can call it by "table.column_name" to avoid ambiguity.

  • Total degrees and physical science degrees by state


Here I use a CASE to count degrees for two different CIP codes. This may not be best practice, but it works. I wanted to scale by the population of each state, so I had to join three more tables: "abbrev" gives me the full state name so I can look it up in "population". I'm using a pre-filtered version of "cips" where I've eliminated all of the specific fields of study (any classification beyond the decimal point) to cut down on query time. I didn't actually need to join "directory". Note that CIP code 99 is used in "completions" to indicate the total across all CIP codes, while CIP 40 covers the Physical Sciences.
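In outline, something like this (a sketch of the join logic; the exact table routing and the column names in population and abbrev are assumptions):

    SELECT population.state,
           SUM(CASE WHEN completions.cipcode = 99 THEN completions.ctotalt ELSE 0 END)
               / population.pop AS degrees_per_capita,
           SUM(CASE WHEN completions.cipcode = 40 THEN completions.ctotalt ELSE 0 END)
               / population.pop AS physsci_per_capita
    FROM completions
    INNER JOIN directory ON completions.unitid = directory.unitid   -- institution to state
    INNER JOIN abbrev ON directory.stabbr = abbrev.abbr             -- abbreviation to full name
    INNER JOIN population ON abbrev.state = population.state        -- full name to population
    GROUP BY population.state, population.pop;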

  • Representation of women across all degrees, and for English language and literature degrees


This is a straightforward adaptation of the above. I want the proportion of degrees awarded to women, so I take the ratio ctotalw/ctotalt, and only count it when my conditions are met.
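Something along these lines (a sketch; CIP 23 is the English language and literature family):

    SELECT SUM(ctotalw) / SUM(ctotalt) AS frac_women_all,
           SUM(CASE WHEN cipcode = 23 THEN ctotalw ELSE 0 END)
               / SUM(CASE WHEN cipcode = 23 THEN ctotalt ELSE 0 END) AS frac_women_english
    FROM completions
    WHERE cipcode <> 99;   -- don't double-count the per-institution totals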

  • Representation of women by general area of study


Here, I just need to avoid counting the "total" amounts for each institution (indicated by cipcode = 99). The symbol "<>" means "not equal to."
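For example (a sketch; joining through my filtered cips table keeps only the general fields):

    SELECT cips.ctitle,
           SUM(completions.ctotalw) / SUM(completions.ctotalt) AS frac_women
    FROM completions
    INNER JOIN cips ON completions.cipcode = cips.ccode
    WHERE completions.cipcode <> 99
    GROUP BY cips.ctitle
    ORDER BY frac_women DESC;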

  • Population vs. degrees awarded for each state


Here I need four tables: "population" and "abbrev" connect state names to populations, "completions" supplies the degree counts, and "directory" gives the state in which each institution resides. I also include the filtered version of "cips" because that was easier than deleting it from the query.
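A sketch of the join (the population and abbrev column names are assumptions):

    SELECT population.state, population.pop,
           SUM(completions.ctotalt) AS total_degrees
    FROM completions
    INNER JOIN directory ON completions.unitid = directory.unitid
    INNER JOIN abbrev ON directory.stabbr = abbrev.abbr
    INNER JOIN population ON abbrev.state = population.state
    WHERE completions.cipcode = 99   -- the per-institution grand totals
    GROUP BY population.state, population.pop;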

  • Population vs. research doctorates for each state


Same as above, but with the added requirement that "award_lvl = 17." The award levels can be found here; level 17 is the research/scholarship doctorate, which excludes professional degrees like MDs.
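That is (a sketch, reusing the join above):

    SELECT population.state, population.pop,
           SUM(completions.ctotalt) AS research_doctorates
    FROM completions
    INNER JOIN directory ON completions.unitid = directory.unitid
    INNER JOIN abbrev ON directory.stabbr = abbrev.abbr
    INNER JOIN population ON abbrev.state = population.state
    WHERE completions.cipcode = 99
      AND completions.award_lvl = 17   -- research/scholarship doctorates only
    GROUP BY population.state, population.pop;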

  • Optics PhDs by institution

A subject near to my heart, and an easy query. Instead of searching for the CIP codes corresponding to optics, I search for the word "optical" in the CIP title. In general this is a terrible strategy, but I checked ahead of time that it gets everything I want and nothing I don't.
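Roughly (a sketch; the instnm column name and the award-level filter are assumptions about my exact query):

    SELECT directory.instnm, SUM(completions.ctotalt) AS optics_phds
    FROM completions
    INNER JOIN cips ON completions.cipcode = cips.ccode
    INNER JOIN directory ON completions.unitid = directory.unitid
    WHERE cips.ctitle LIKE '%optical%'   -- catches the optics-related CIP titles
      AND completions.award_lvl = 17     -- research doctorates only
    GROUP BY directory.instnm
    ORDER BY optics_phds DESC;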
