Saturday, August 22, 2015

Data visualization using Seaborn in Python

Last time, I presented an analysis of some education data available from IPEDS. For the visualization, I used a Python package called Seaborn. After about eight years of using MATLAB and Mathematica for plotting, I was astounded by the quality of the plots. Here, I want to talk a bit about Seaborn, and the learning curve I ascended.

If you'd like to follow along, here's a link to the .csv files I'm using for this post.

Seaborn

Seaborn specializes in visualizing categorical data, linear relationships, and statistical distributions. It handles uncertainty well, plotting standard deviation bars and linear regressions by default. It's built on top of Matplotlib, another plotting package.

The Seaborn website includes a "tutorial" and a gallery, but the tutorial is very limited, and frankly, not basic enough for me. Here, I'll show a couple of examples in more detail.

Here's an example of what Seaborn can do:


Input data

You can feed Seaborn a variety of data formats, but it's convenient to use DataFrames, since (1) they're the core data structure in Pandas, and (2) they're a damned elegant way to represent data. I didn't know anything about Pandas when I started this project, and it took me a few false starts to get the .csv files into a form that Seaborn liked. Here's how I did it.

.csv to DataFrame

A comma separated value file looks like this:

while a dataframe of the same simple data looks like this:

Some differences are:
1) The column headers have become labels, and are no longer part of the columns
2) A new index column was added to act as an identifying key
3) The data type of each column is stored in memory

Pandas gives us a way to import data from a .csv directly into a dataframe using the read_csv function:
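
Something along these lines does the job (the file name is a hypothetical stand-in for one of the .csv files linked above):

    import pandas as pd
    import seaborn as sns

    # Read the .csv straight into a DataFrame. Pandas infers the data type
    # of each column and adds an integer index automatically.
    df = pd.read_csv('state_degrees.csv')

    print(df.head())    # first few rows, with column labels and the new index
    print(df.dtypes)    # the data type stored for each column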

Note that we loaded some packages here, and called them by shorter aliases so we don't have to type long names later.

Plotting

I used the function regplot to generate the above plot. It generates a scatter plot, automatically performs a regression, and plots the best fit along with a 95% confidence interval [Note: It causes me physical pain to plot a linear regression and confidence interval when I have no reliable information about the random process generating my data. In my defense, I'm ignoring the results of the regression entirely]. As inputs, it takes the DataFrame containing the data and the labels (as strings) of the columns we want to plot. The above plot just needs two columns - one for each axis.

Calling regplot returns an "axis" object. Next, we have to tell Python to put that object into a plot and show it. We use the sns.plt.show() command:
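
Putting the two steps together, a sketch of the whole plot looks something like this (the column names are hypothetical stand-ins for the two columns being plotted):

    import pandas as pd
    import seaborn as sns

    df = pd.read_csv('state_degrees.csv')

    # Scatter plot plus an automatic linear fit and 95% confidence band.
    ax = sns.regplot(x='degrees', y='phys_degrees', data=df)

    # 2015-era Seaborn exposed matplotlib.pyplot as sns.plt; in newer versions,
    # import matplotlib.pyplot as plt and call plt.show() instead.
    sns.plt.show()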

We have dozens of options to tweak the appearance of this graph, but the raw output already looks better than about 95% of the graphs I've published. The plot window has a save option, and you can export the figure as a .pdf and then edit it in any vector graphic program (I use both Adobe Illustrator and Inkscape). But if you're a baller or a masochist, you might prefer to modify it in Python.
The plots take something like a style sheet, where you can choose a theme based on what you're using the graphic for. It changes line thicknesses and font sizes, among other things, for slides, papers, or posters. Change it with the set_context command.
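
For example, switching to the presentation-friendly settings is a one-liner:

    import seaborn as sns

    # Context presets rescale line thicknesses and font sizes for the target
    # medium. The built-in contexts are "paper", "notebook" (the default),
    # "talk", and "poster".
    sns.set_context("talk")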

Color schemes

What would we sink our copious free time into if it weren't for color scheme choices?

You're free to define whatever colors you want in Seaborn plots, but as I'm learning, nobody does original work in data science (I kid!). Seaborn can tap into colorbrewer, whose color schemes are illustrated here. As an example, here's a horizontal bar chart using some of the data I provided:


which was generated with the following code:
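
In outline, the call looks like this (the palette name and column labels here are illustrative placeholders rather than the exact ones in the gist):

    import pandas as pd
    import seaborn as sns

    df = pd.read_csv('state_degrees.csv')

    # Categories on the y-axis give horizontal bars; "Blues_d" is one of the
    # ColorBrewer-style palettes Seaborn understands.
    ax = sns.barplot(x='degrees_per_100k', y='state', data=df, palette='Blues_d')
    sns.plt.show()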


I turned the chart sideways by providing the categories to the y-axis. Clever! If you feed Seaborn numerical data on one axis, it plots the bars on that axis. If both axes are non-numerical, it throws an error.

One of my .csv files had commas marking the thousands place for some reason, and Python imported these numbers as strings. Seaborn was very unhappy. If this happens, you can convert the strings back into numbers in Python, or you can fix your .csv manually.
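
A quick fix in Pandas, sketched on a toy column:

    import pandas as pd

    df = pd.DataFrame({'state': ['AA', 'BB'], 'degrees': ['1,234', '56,789']})

    # Strip the thousands separators and convert the strings back to integers.
    df['degrees'] = df['degrees'].str.replace(',', '').astype(int)

    # Alternatively, pd.read_csv(path, thousands=',') handles it at import time.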

Multiple columns

I had problems when I wanted to plot more than one category of data. The documentation on data structure for Seaborn is hard to find or doesn't exist, and I had to suss out what it was looking for. I first tried feeding it the following:

which generated this incorrect bar chart:


What happened here is that Seaborn thought I wanted the bars to correspond to the average of the columns, and the black lines to be the standard deviations. In the tutorial, the columns are supplied as values of a single category, which was not a feature of my data set. The solution was to "massage" the dataframe from the raw input:


into something that looked like this:

Here, I've used the "melt" function in Pandas to map the column names into values of the second column, effectively adding a new variable called "variable" whose values are in (degrees_per_100k, phys_deg_100k). I can now tell Seaborn that the "hue" of the data set is controlled by "variable" and that the bar heights are controlled by "value". The code now looks like this:
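
Sketched out, the massaging plus the plot call look roughly like this (the id column name and file name are placeholders):

    import pandas as pd
    import seaborn as sns

    wide = pd.read_csv('state_degrees.csv')

    # melt folds the two numeric columns into long form: their names end up
    # in a "variable" column and their numbers in a "value" column.
    tidy = pd.melt(wide, id_vars=['state'],
                   value_vars=['degrees_per_100k', 'phys_deg_100k'])

    # Grouped bars: one hue per original column.
    ax = sns.barplot(x='value', y='state', hue='variable', data=tidy)
    sns.plt.show()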


which results in this plot:

That's the extent of my limited experience with Seaborn, but I will surely continue using it. I'm pretty impressed so far.

Thursday, August 20, 2015

Secondary education data: analysis with SQL and Python/Seaborn

I recently looked at a collection of data from 2013 published by the Integrated Postsecondary Education Data System. Every year they take a census of postsecondary educational institutions in the United States, who report all kinds of information about their enrollment, costs, staff, and degrees awarded. My goal was to learn how to use SQL to interact with databases, and how to visualize data using Python.

Let's start with what I found out first, and then I'll talk about where I got the data, and how I imported it and used MySQL to analyze it. I used Python/Seaborn/Matplotlib to visualize it, but I'll go over that in a future post.

What I found:

  1. Each state awards degrees roughly in proportion to its population. The largest in each category is California, and there are two sort-of-outliers. Arizona awards disproportionately more degrees due to the presence of the University of Phoenix online, and Texas awards proportionally fewer for some reason I don't know (suggestions?). This chart uses the total state population, including those under 18 years.

    Degrees vs. population for each state
     

  2. Women greatly outnumber men in postsecondary education, representing about 59% of all awardees and 58% of awardees of master's degrees or higher. Here are the general fields of study that have the highest percentage of women and men.

    Greatest representation among women and men
     

  3. Some states are highly under- or over-represented among Physical Science PhD's, proportionally. After scaling to the population of each state, the trend is linear-ish. Outliers include MA, home of MIT, Harvard, and others, along with Florida, where apparently physics isn't so popular.

    Physical Science Phd's vs all degrees, scaled to population

    Here is another representation of the same data, where it becomes pretty clear that total degrees per population is not a great indicator for proportion of people with Physics PhD's. Also, this list reveals that D.C. has crazy amounts of degree holders and Physics PhD's.
     

    Physical Science Phd's and all degrees, scaled to population


     
  4. Out of curiosity, I wanted to see who was awarding Optics PhD's. I went to #3, The University of Rochester, for mine.

    Institutions ranked by number of Optics PhD's awarded in 2012-13

Analysis methods

If you're more interested in education than data science, this is probably a good stopping point. But I did this project to cut my teeth on database interaction and Python visualization, so here we go!

The data

I made a database containing five tables regarding the 2013 IPEDS surveys and supplementary information. Here are links to the data files, along with descriptions.
  1. Directory of institutions: Contains a unique ID number for each postsecondary institution, its name, address, state (in abbreviated form), and contact information.
  2. Completions: For each institution ID, this table contains the number of degrees or certificates awarded in each subject area to men and women of various ethnic groups at all levels up through professional/research doctorates. Subject areas are catalogued by a "CIP code," which is a taxonomy system.
  3. CIP codes dictionary: For each CIP code, this table contains a title ("Underwater Basket Weaving", e.g.) and a short description.
  4. Population by state: I wanted to compare the number of degrees to the total population of each state, so I pulled in this chart. Note that each state is indexed by its full name.
  5. State abbreviations: I could have manually changed all the state names in the population table to their postal abbreviations... but I'm lazy. So instead I found this table of abbreviations and let MySQL do it for me.
For the IPEDS data, there are corresponding dictionaries that describe the column names. I ended up needing only a few of them.

The setup

I wanted to use the language SQL to interact with my data. The steps to get there were roughly:
  1. Choose and install a program to host a database server on my local machine. This server will take instructions in SQL, either from the command line or from some kind of GUI. I chose MySQL Community Server, but Microsoft's SQL server is a viable alternative.
  2. (Optional) Find a program to interact with the database server. At first I just worked from the command line, but I ended up installing Sequel Pro, a free program that's quite easy to use. You can also use Python to interact with MySQL, which is convenient for sophisticated analyses. I'm currently set up to do that, but I didn't use it for this project. This tutorial shows you how to do it.
  3. Import the data. Here's how.

Importing the data from the command prompt

You've downloaded the data sets as .csv files, and it's time to create a database where each data file will end up being a table. Assuming you're working from the command prompt, log into MySQL by typing "mysql -u root -p" (I use the root user to make sure I have full privileges; you could also create a new user with a password, but I'm not going to have anyone else accessing my server).


Next, create a database using "create database db_name_here"

This database is empty, and we need to fill it with tables corresponding to our .csv files. From the command line, you can create a table like this:

where "...." stands for whatever other columns you need. If we want to make a table for the directory data, for example, column_name1 is the unit ID with type INT. Once the database is created, you can use the BULK INSERT command to import the .csv, but I didn't do that. Instead, I used Sequel Pro to do it, as shown below.


Importing the data using Sequel Pro

After connecting to the server with Sequel Pro, you can select a database to use, and then click on the + in the bottom-left corner to create a new table. The column names and data types get added one by one, and you can feel free to only add the ones you really need to use. Here's what the table structure looks like for the "directory" information:


Clicking File->Import brings up a file dialog where you can choose your .csv file. Next, an import window shows up:

The left column shows the column names in the CSV, and the middle column shows the columns you just created. You can have it match the fields in order, or by name if the names are identical. You can also manually select them by clicking on the CSV field entry. Make sure "First line contains field names" is checked before clicking Import. Switching over to the content tab, we see that the .csv imported correctly.

This process has to be repeated for each table you'd like to import.

Notes: I had to add a row to the cipcode table for "99 = Total," since the completions table uses that notation. I also chose to convert the codes in this table to floats instead of strings, since I wanted to round them. Rounded CIP codes give the broader subject field.

The analysis

I spent some time just exploring and poking at the data, iteratively refining my searches until I found a few interesting things to report on. Here are some examples of the kind of analysis I did.
  • Most popular areas of study across all award levels (not shown in charts, but healthcare crushes everything else)

"ctotalt" is the total number of degrees of a particular subject and level awarded at a particular institution. I use INNER JOIN to retrieve the name of the area of study from the cips table. I called the CIP code "ccode" in that table so it had a unique name. Otherwise, you can call it by "table.column_name" to avoid ambiguity.

  • Total degrees and physics degrees by state


Here I use a CASE to count degrees for two different CIP codes. This may not be best practice, but it works. I wanted to scale by the population of each state, so I had to join three tables. "abbrev" gives me the full state name so I can look it up in "population". I'm using a pre-filtered version of "cips" where I've eliminated all of the specific fields of study (any classification beyond the decimal place) to cut down on query time. I didn't actually need to join directory. Note that CIP code = 99 is used in "completions" to indicate the total among all CIP codes, while CIP = 40 indicates Physics-related studies.
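
The CASE part of the query looks something like this (query string only - the scaling by population and the joins to "abbrev" and "population" are omitted for brevity, and "unitid", "stabbr", and the output aliases are assumed names):

    # Run with cur.execute(query) as in the previous sketch.
    query = """
        SELECT directory.stabbr AS state,
               SUM(CASE WHEN completions.cipcode = 99
                        THEN completions.ctotalt ELSE 0 END) AS total_degrees,
               SUM(CASE WHEN completions.cipcode = 40
                        THEN completions.ctotalt ELSE 0 END) AS physics_degrees
        FROM completions
        INNER JOIN directory ON completions.unitid = directory.unitid
        GROUP BY directory.stabbr;
    """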

  • Representation of women across all degrees, and for English language and literature degrees


This is a straightforward adaptation of the above. I want the proportion of degrees awarded to women, so I take the ratio ctotalw/ctotalt, and only count it when my conditions are met.

  • Representation of women by general area of study


Here, I just need to avoid counting the "total" amounts for each institution (indicated by cipcode = 99). The symbol "<>" means "not equal to."
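
As a sketch, with the same caveats about assumed column names as above:

    # Fraction of degrees awarded to women, by broad field of study.
    query = """
        SELECT cips.ctitle,
               SUM(completions.ctotalw) / SUM(completions.ctotalt) AS frac_women
        FROM completions
        INNER JOIN cips ON completions.cipcode = cips.ccode
        WHERE completions.cipcode <> 99
        GROUP BY cips.ctitle
        ORDER BY frac_women DESC;
    """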

  • Population vs. degrees awarded for each state


Here I need four tables: "population" and "abbrev" together give me the population and full name of each state, "completions" gives the total degrees, and "directory" gives the state in which each institution resides. I include the filtered version of "cips" here because it was easier than deleting it.

  • Population vs. professional doctorates for each state


Same, but with the requirement that "award_lvl = 17." The award levels can be found here. 17 excludes professional degrees like MD's.

  • Optics PhD's by institution

A subject near to my heart, and an easy query. Instead of searching for the CIP codes corresponding to optics, I search for the word "optical" in the CIP title. This is in general a terrible strategy, but I checked ahead of time that this gets everything I want and nothing I don't.
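
The shape of that query, with the same assumed column names as before plus an assumed institution-name column "instnm":

    # Optics PhD's (award level 17) by institution, matching "optical"
    # in the CIP title.
    query = """
        SELECT directory.instnm, SUM(completions.ctotalt) AS optics_phds
        FROM completions
        INNER JOIN cips ON completions.cipcode = cips.ccode
        INNER JOIN directory ON completions.unitid = directory.unitid
        WHERE cips.ctitle LIKE '%optical%'
          AND completions.award_lvl = 17
        GROUP BY directory.instnm
        ORDER BY optics_phds DESC;
    """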

Saturday, August 15, 2015

Code snippets in Blogger

I'm going to be including some code blocks soon, and I found this quick method. You just create a document on gist.github.com with the appropriate extension, and it generates a link to embed the code block. Here are some examples:

Python:

SQL:

C++:

Hopefully next time I'll have some SQL results to share.

Wednesday, August 12, 2015

Setting up a development environment

I've mostly used my work machine for data science training, since it had many of the utilities I needed already, and also it was a computation beast. But I wanted to get my Macbook Pro into a condition to be my main development machine, since (1) it's portable, and (2) I don't work there no more.

For the most part, I followed steps 2-8 of the tutorial here to set up the environment, skipping the system preferences and stopping before installing C++, Ruby, and other developer software. I'll get there eventually, but I'm taking baby steps. I also already have LaTeX running on the Macbook, so I didn't need that.

Step 6 is the installation of Sublime Text, which is a pretty rad text editor. You can use the "evaluation version" for free forever, or you can pay the $70 to register it. It does a lot of cool stuff, including automatically setting the syntax for whatever programming language you're using. For example, if I create a new file and save it with the extension *.py, I can create a function by typing "def", and Sublime Text fills in the following:



Everything has been marked up in color, and the proper function syntax has been filled in. The tutorial includes the installation of various packages that set the color scheme, as well as mark up possible errors in the program. The red block indicates a perceived error, where it's worried that there's a tab in the line (there is, but it's equivalent to four spaces, which is the correct syntax).

You can also run programs from Sublime Text using its console:

Note the bit at the bottom. Once I wrote the function and called it, I used Sublime Text's "build" command with Python.

Step 8 in the tutorial involves the installation of MySQL, which I'll talk more about in a future post. Today I went through a couple of tutorials, and to be honest, it took me a while to get it working. In short, SQL is a language for interacting with databases. To use it, you need some databases, which are typically stored on a "server," which can just be your computer. There are several programs that can be used to maintain and interact with these database servers, and MySQL is one of them. It's open source, and the community edition is free. Here is an SQL tutorial, and here is another one.

I think I have the essentials set up now. My original goal today was to learn the basics of SQL and start a mini-project. Hopefully I'll get to the mini-project this weekend.

Sunday, August 9, 2015

Insight

I'm excited to announce that I recently accepted an invitation to be an Insight Data Science Fellow in New York City. I want to tell you a bit about the application and interview process in case you find yourself in a similar situation. But first, here's a bit about the program.

The Insight Fellowship

The Insight program is intended for people with PhD's in any field who intend to become data scientists in New York City or the SF Bay Area. Over the course of seven weeks, fellows put together an individual project that showcases their data science abilities, and the Insight crew helps prepare them for interviews. Starting in week eight, interviews begin. As of right now, with 300 alumni, the Insight program has a 100% placement rate.

It's free of charge, since they work on something like a recruiter model. They have corporate partners that come to them for recruits, and Insight gets paid if and when a fellow accepts a job offer. I don't know the details, but if it's similar to other recruiters, the amount of the payment is a percentage of the base salary accepted, so they have an incentive to help the fellows negotiate well.

The program is selective, and involves an application and an interview. I heard from one of the Insight directors (from the SF Bay area), Kathy Copic, that the acceptance rate is something like 7%. They're looking for people who have an excellent chance of getting placed at a company, since that's the only way they get paid. Here's how the process works.

The application

First you complete a web form with basic biographical information and a few short answer blocks. They want to know what programming languages you know, and how familiar you are with each. They ask about what kinds of data you have worked with, and what types of analyses you've done, including side projects. Finally, there's a space to explain why you want to be a data scientist. Each of these is limited to around one paragraph.

The interview

If you make the first cut, a Skype interview is scheduled, and lasts about a half hour. They request that you prepare a short "demo" showing the types of data analysis you typically do. My work is mostly done in MATLAB, but I wanted to show that I was familiar with Python, so I did a short project the weekend before the interview, which resulted in the previous blog posts on Twitter data mining.

The interview is extremely relaxed. They say specifically that you're not required to dress up for it, and it's really just a conversation with one of the directors. I spoke with Genevieve, who told me about her career trajectory (she was accepted into the Insight program after her PhD, and was then invited to stay as a director). I talked a bit about my background in nano optics and computational/experimental optics, and about the experimental data I tend to collect.

For the demo, I shared my desktop and showed Genevieve some example MATLAB code that performed some basic noise analysis. I had written it a couple of weeks before to characterize an experimental system. Then I showed the Python code and the results of the Twitter experiment. We talked a little bit about what companies I would be interested in working with, and why I wanted to be a data scientist. At the end she answered a couple of my questions before we wrapped up.

Interviews spanned 11 days. I heard back about my acceptance on the 12th day. I don't know what the acceptance rate was this year, or even how many colleagues will join me as Insight Fellows. I also don't know what percentage of applicants made the first cut but not the second. But hopefully this post will help an applicant for a future session.

The interim

I just moved to NYC with my family, and will begin the program in about a month. In the mean time I'm going to continue familiarizing myself with the basic tools of the data scientist, and getting used to the area. I can tell you so far that the bagels are excellent, and even the "bad" Chinese food is pretty good compared to that in Central Illinois.

Thursday, July 23, 2015

10 tips for 10-minute presentations

One of the things I love most in life is helping people improve their academic talks. My advice is pretty consistent, and I thought it would be helpful to gather it here. Specifically, I've been thinking about ten-minute presentations, which are often misunderstood. Below is some advice, and a video of an example talk.

The problem with short talks

There’s a lot of room for error in an hour-long talk. You have time to build a rapport with the audience, reacting to confused or sleeping faces by changing your pace or giving some extra background information. According to the peak-end principle, the audience will mostly remember the best part of the presentation and the end of it, so if you nail those you’ll be fine. People may also be there specifically to see you, so they start out primed to listen carefully and make an effort to understand. 

Not so for a ten-minute presentation: you get one chance to grab and keep the attention of an audience who might rather be playing with their phones.

In other words, you have to be better than Candy Crush.

If you only have ten minutes to speak, it’s probably because the organizers had to squeeze too many talks into too short a time period. People will be tired of listening, will be thinking about their own talk, and will probably not have expertise in your field. So all of the decisions we make will revolve around three principles:
  • Give the audience an incentive to pay attention by being interesting or entertaining.
  • Don't make the audience work hard to follow your talk, or require prior specialized knowledge.
  • Make the most of your ten minutes by trimming anything unnecessary.
These principles apply in longer talks as well, but they're absolutely critical for short ones.

Here are some tips for putting these principles into practice:

  1. A 10-minute talk is not a shorter or faster version of a long talk. You can tell exactly one short story in ten minutes. Decide on the one thing you want your audience to take home with them, and write it down in a sentence. Don’t be afraid to repeat this sentence more than once in your talk. Your job is to provide just barely enough context to understand that story, and then tell it well. And no matter how slowly you think you're talking, you're talking faster than that. Slow down.
  2. The audience can either listen or read, but not both at the same time. Mostly you want them listening, so eliminate as much text as you can. If you’re afraid you’ll forget to say something, put it in the presenter notes. It can help to think of the slides as being there for the presenter’s benefit: they jog your memory so you can remember what comes next in the story.
  3. You will have the audience’s full attention when the talk starts. You will lose it immediately if you have a bad cover slide. Put the effort into making it tasteful but eye-catching, and spend time talking with that slide shown. That will buy you an extra minute of attention while you set up your story. Also, shorter titles are better. They're easy to understand, and the audience can listen to you instead of reading it.
  4. Don't use an outline slide. Outlines are useful when you’re going to talk about several topics and you don’t want your audience to get lost. In a short talk, you won't need to organize a complicated story, so outlines just eat the time you could be using to tell a simple one.
  5. Include animation only when it helps your audience follow the message. Make text come up a line at a time when you don’t want people reading ahead. That goes for figures too. You don't want the audience trying to digest your slide instead of listening to you.
  6. If you ask your audience to read text, make it easy on them. Using no font smaller than size 30 will not only guarantee readability, but will force you to limit the total amount of text on each slide (this includes chart labels!).  If you’re still using Comic Sans… don't.

    6.1: Related note on equations: include them only if they make the talk easier to follow. Equations are compact, efficient sentences that can be read in English. Being compact, they're difficult to unpack in real time. Don’t make the audience work that hard. Help them understand the symbols, and make it crystal clear why they lead to better understanding of the material.
  7. Spend the time and effort to make beautiful figures, and emphasize them instead of text. Help your audience understand them by pointing, and literally telling them "look over here." If you have graphs, always identify the axes out loud and teach people how to read them. Point out the important features. As usual, we want people listening instead of parsing data.
  8. End by saying “That concludes my talk. Thank you for your attention.” Don’t read your acknowledgements out loud (it eats time and gives the audience a chance to forget what they were going to ask), and don’t ask for questions (only the moderator knows how much time there is for questions).
  9. Make extensive use of supplementary slides. Paste entire presentations into the supplemental section, and load it up with equations and text. It doesn’t have to be pretty. This is your security blanket. If someone asks a question you can’t answer easily, you want to be able to find the answer in your supplementary slides. You’ll look like a genius for having thought ahead.
  10. Practice, but not to memorize your lines. Practice so you can find out what works and what doesn't, what's clear and what's not, and what you can safely trim away from the talk. Find the clunky transitions or the graphs that take too long to explain. There's no reason for a ten-minute talk not to sound smooth, relaxed, and well-oiled. If you can't get it to that point, you're trying to say too much.

Example talk

I made this talk as an example of how to implement some of the above suggestions, using a recent Python/Twitter project as subject material. It's not perfect (there was no easy way to record a pointer, for example), but hopefully you'll find it useful anyway.




Sunday, July 19, 2015

Data mining Twitter [with code]

I recently applied to the Insight Data Science Fellowship, and was invited to do a short Skype interview. The interview includes a short demo, which is supposed to show them what kinds of data I work with, and highlight some of the skills I bring to the data science table. Since I mostly work with MATLAB, I wanted to do a mini-project emphasizing some more relevant skills.

I got the email on Thursday with the interview scheduled the following Monday morning, so I needed something I could do over the weekend. Data mining Twitter seemed like a good option, since I could do it in Python (highly relevant for data science), it’s “real” data (as opposed to experimental data, I guess), and it lends itself to varied analysis including statistics and language processing. I just had to pick something to track.

Tracking #NorthFire

Friday night, there was a wildfire near Los Angeles, in the Cajon Pass. The fire jumped the highway, and about 20 cars burned. The hashtag #Northfire started trending, and I tracked it for about 14 hours, creating a database of about 5500 tweets, taking up about 32 megabytes of space.

First I’ll show the results, and then I’ll go into detail about the analysis. I’m also going to include the code I used, in case it's helpful to anyone. I made this graphic summarizing the analysis:

Summary of analysis for #NorthFire tracking over 14 hours. Thanks to my talented friend and freelance graphic artist Bethany Beams for helping me with this. She looked at my first draft and gave me some tips that improved readability substantially.

What did I find?

There’s not too much that’s surprising here. People were using the word “fire” a lot to describe the fire. Popular tweets include comparisons to Armageddon and references to exploding trucks. Standard, IMO. Tweets became less frequent as the fire raged on and people went to bed, and then picked up again when people started waking up and reading the news. One interesting thing is the popularity of the word “drone.” It turns out that some hobbyists had flown some drones in to get a closer look at the fire, which prevented the helicopters from dropping water. That’s why it’s important not to have a hobby.

Details on collection and analysis

I followed this wonderful tutorial to collect the tweets and perform some of the analysis. Collecting tweets basically involves the following steps (a rough sketch appears after the list):
  1. Registering an app with Twitter, which gives you access to their API
  2. Using Python to log on with your authentication details
  3. Using a package called “Tweepy” to open a stream and filter for a particular hashtag
  4. Saving tweets to file in the right format
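
Here's a rough sketch of those steps, written against the 2015-era Tweepy StreamListener interface (newer Tweepy versions replaced it with StreamingClient); the credentials and output file name are placeholders:

    import tweepy

    # Fill in the keys and tokens you get when you register an app with Twitter.
    consumer_key = "..."
    consumer_secret = "..."
    access_token = "..."
    access_secret = "..."

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    class FileListener(tweepy.StreamListener):
        """Append each incoming tweet, as raw JSON, to a file (one per line)."""
        def on_data(self, data):
            with open('northfire.json', 'a') as f:
                f.write(data)
            return True

        def on_error(self, status):
            print(status)
            return True

    # Open a stream and filter for the hashtag.
    stream = tweepy.Stream(auth, FileListener())
    stream.filter(track=['#NorthFire'])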

Anatomy of a Tweet

A tweet is an ugly object. If you want to know how the sausage is made, look here:



It’s a database entry with the text of the tweet, the time it was created, a list of everyone involved, and about 30 other things I didn’t care about. The saved file is in JSON format, which is convenient for data science. Funny story, Twitter automatically supplies the tweets in this format, but Tweepy reformats them, so you have to manually change them back. Thanks, Tweepy.

Counting Tweets and Retweets

The next step was to count the number of originals and retweets, and save a new data file containing only the originals. This was important for the language analysis I wanted to do: having 600 retweets would seriously throw off the statistics. To find the retweets, I just looked in the text of the tweet, where retweets always begin with “RT”.
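
A sketch of that filtering step (file names are placeholders):

    import json

    originals, retweets = 0, 0
    with open('northfire.json') as infile, open('originals.json', 'w') as outfile:
        for line in infile:
            tweet = json.loads(line)
            # Retweets begin with "RT" in the text field.
            if tweet.get('text', '').startswith('RT'):
                retweets += 1
            else:
                originals += 1
                outfile.write(line)

    print(originals, 'originals,', retweets, 'retweets')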

Most Common Words

I then used the file with the original tweets to track the most common words and bigrams. First, the text of each tweet has to be tokenized, where we parse the string of text into words and symbols. It’s also prudent to ignore “stop words” like “the”  and “a,” and punctuation. Python has a natural language toolkit that makes all of this pretty easy to do. Again, I used this tutorial.
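
A sketch of the counting step with the NLTK (the tutorial uses a Twitter-aware tokenizer; word_tokenize is a simpler stand-in, and it needs the "punkt" and "stopwords" data downloaded first):

    import json
    import string
    from collections import Counter

    from nltk import bigrams
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Ignore punctuation, common stop words, and a few Twitter-specific tokens.
    stop = set(stopwords.words('english')) | set(string.punctuation) | {'rt', 'via'}

    words, pairs = Counter(), Counter()
    with open('originals.json') as f:
        for line in f:
            tokens = [t.lower() for t in word_tokenize(json.loads(line)['text'])]
            tokens = [t for t in tokens if t not in stop]
            words.update(tokens)
            pairs.update(bigrams(tokens))

    print(words.most_common(10))
    print(pairs.most_common(10))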

Most Retweeted

Finding the most-retweeted tweets (getting tired of typing “tweets”) is similarly straightforward. I found some code here, but basically it just looks up the number of retweets for each tweet, puts them in order, and prints a list. You can set the minimum number of retweets and the number it displays.
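
Something like this does it (again, the file name is a placeholder):

    import json

    with open('northfire.json') as f:
        tweets = [json.loads(line) for line in f]

    # Keep tweets above a minimum retweet count, sort, and print the top few.
    min_retweets = 10
    popular = [t for t in tweets if t.get('retweet_count', 0) >= min_retweets]
    for t in sorted(popular, key=lambda t: t['retweet_count'], reverse=True)[:5]:
        print(t['retweet_count'], t['text'])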

Tweet Frequency Chart

The final thing I wanted to do was track the tweet frequency as a function of time. Each tweet contains a timestamp that reports the year, month, day, hour, minute, and second. I converted that to seconds using very straightforward code adapted from this page, and then saved all of the timestamps to a text file. I used MATLAB’s “histcounts” function to make a histogram, and plotted the counts as a line using the area plot function. In Adobe Illustrator, I recolored the histogram using the gradient tool.
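
The Python half of that, as a sketch (the MATLAB histogram step isn't shown, and the epoch offset from mktime is fine here since only relative times matter for binning):

    import json
    import time

    # Twitter timestamps look like "Sat Jul 18 03:12:45 +0000 2015".
    FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

    with open('northfire.json') as infile, open('timestamps.txt', 'w') as outfile:
        for line in infile:
            created = json.loads(line)['created_at']
            seconds = time.mktime(time.strptime(created, FORMAT))
            outfile.write('%d\n' % seconds)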

Code

Here is the code on Github. Everything is a separate file in the interest of coding time. I may go turn each of the files into a function at some point. The important things are:

listen_tweets.py: Stream tweets from Twitter, filtering for a certain string. You have to put in your authentication details, like the consumer key and secret. You get those when you register an app.

discard_RT.py: For each tweet, check if it's a retweet. If not, save it to a file. Count the number of original tweets and retweets.

count_frequencies.py: Tokenize the text from all tweets in a file and find the most common words or bigrams using the natural language toolkit.

retweet_stats.py: List the most common retweets in order.

get_timestamps.py: Convert the "created_at" value from each tweet into seconds, and store all of the values to a .txt file.


That’s it for now. I hope this was helpful. Next time I’ll talk a bit about the interview.