Sunday, July 19, 2015

Data mining Twitter [with code]

I recently applied to the Insight Data Science Fellowship and was invited to do a short Skype interview. The interview includes a short demo, which is supposed to show them what kinds of data I work with and highlight some of the skills I bring to the data science table. Since I mostly work in MATLAB, I wanted to do a mini-project emphasizing skills more relevant to data science.

I got the email on Thursday, with the interview scheduled for the following Monday morning, so I needed something I could do over the weekend. Data mining Twitter seemed like a good option: I could do it in Python (highly relevant for data science), it’s “real” data (as opposed to experimental data, I guess), and it lends itself to varied analyses, including statistics and natural language processing. I just had to pick something to track.

Tracking #NorthFire

Friday night, there was a wildfire near Los Angeles, in the Cajon Pass. The fire jumped the highway, and about 20 cars burned. The hashtag #NorthFire started trending, and I tracked it for about 14 hours, building a database of about 5,500 tweets taking up about 32 megabytes of space.

First I’ll show the results, and then I’ll go into detail about the analysis. I’m also going to include the code I used, in case it's helpful to anyone. I made this graphic summarizing the analysis:

Summary of analysis for #NorthFire tracking over 14 hours. Thanks to my talented friend and freelance graphic artist Bethany Beams for helping me with this. She looked at my first draft and gave me some tips that improved readability substantially.

What did I find?

There’s not too much that’s surprising here. People were using the word “fire” a lot to describe the fire. Popular tweets include comparisons to Armageddon and references to exploding trucks. Standard, IMO. Tweets became less frequent as the fire raged on and people went to bed, and then picked up again when people started waking up and reading the news. One interesting thing is the popularity of the word “drone.” It turns out that some hobbyists had flown some drones in to get a closer look at the fire, which prevented the helicopters from dropping water. That’s why it’s important not to have a hobby.

Details on collection and analysis

I followed this wonderful tutorial to collect the tweets and perform some of the analysis. Collecting tweets basically involves the following (there’s a rough code sketch after the list):
  1. Registering an app with Twitter, which gives you access to their API
  2. Using Python to log on with your authentication details
  3. Using a package called “Tweepy” to open a stream and filter for a particular hashtag
  4. Saving tweets to file in the right format
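
Roughly, steps 2 through 4 look something like the sketch below. This uses the Tweepy 3.x-era streaming classes (newer Tweepy versions have a different streaming API), and the credential strings and output filename are placeholders you’d fill in from your registered app:

```python
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# Placeholder credentials from your registered Twitter app
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_secret = "YOUR_ACCESS_SECRET"

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)


class FileListener(StreamListener):
    """Append each incoming tweet's raw JSON to a file, one tweet per line."""

    def on_data(self, data):
        with open("northfire_tweets.json", "a") as f:
            f.write(data)
        return True  # keep the stream open

    def on_error(self, status):
        print(status)
        return True


# Open a stream and filter for the hashtag
stream = Stream(auth, FileListener())
stream.filter(track=["#NorthFire"])
```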

Anatomy of a Tweet

A tweet is an ugly object. If you want to know how the sausage is made, look here:



It’s a database entry with the text of the tweet, the time it was created, a list of everyone involved, and about 30 other things I didn’t care about. The saved file is in JSON format, which is convenient for data science. Funny story: Twitter supplies the tweets in this format automatically, but Tweepy reformats them, so you have to manually change them back. Thanks, Tweepy.
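
As a quick illustration, here’s a sketch of pulling a few of the useful fields back out of one saved tweet. The field names are standard Twitter API keys; the filename is the placeholder from the sketch above:

```python
import json

# One tweet per line, as written by the stream listener above
with open("northfire_tweets.json", "r") as f:
    tweet = json.loads(f.readline())

print(tweet["text"])                 # the tweet text itself
print(tweet["created_at"])           # e.g. "Sat Jul 18 06:12:45 +0000 2015"
print(tweet["user"]["screen_name"])  # who sent it
print(tweet["retweet_count"])        # how many times it has been retweeted
print([h["text"] for h in tweet["entities"]["hashtags"]])  # hashtags used
```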

Counting Tweets and Retweets

The next step was to count the number of originals and retweets, and save a new data file containing only the originals. This was important for the language analysis I wanted to do: a single popular tweet retweeted 600 times would seriously throw off the statistics. To find the retweets, I just looked at the text of each tweet, since retweets begin with “RT”.
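
A sketch of that filtering step, using the same placeholder filenames and the “text starts with RT” test described above, might look like this:

```python
import json

originals, retweets = 0, 0

with open("northfire_tweets.json", "r") as infile, \
     open("northfire_originals.json", "w") as outfile:
    for line in infile:
        if not line.strip():
            continue  # skip blank keep-alive lines from the stream
        tweet = json.loads(line)
        if tweet.get("text", "").startswith("RT"):
            retweets += 1
        else:
            originals += 1
            outfile.write(line)

print("originals:", originals, "retweets:", retweets)
```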

Most Common Words

I then used the file with the original tweets to track the most common words and bigrams. First, the text of each tweet has to be tokenized, meaning the string of text is parsed into words and symbols. It’s also prudent to ignore punctuation and “stop words” like “the” and “a.” Python’s Natural Language Toolkit (NLTK) makes all of this pretty easy to do. Again, I used this tutorial.
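
The tutorial builds its own regex tokenizer; as a rough sketch of the same idea using NLTK’s built-in pieces (the TweetTokenizer and the stopwords corpus, which has to be downloaded once), the counting might look like this:

```python
import json
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

# Requires the stopwords corpus: nltk.download('stopwords')
stop = set(stopwords.words("english")) | set(string.punctuation) | {"rt", "via"}
tokenizer = TweetTokenizer(preserve_case=False)

word_counts = Counter()
bigram_counts = Counter()

with open("northfire_originals.json", "r") as f:
    for line in f:
        if not line.strip():
            continue
        text = json.loads(line)["text"]
        tokens = [t for t in tokenizer.tokenize(text) if t not in stop]
        word_counts.update(tokens)
        bigram_counts.update(nltk.bigrams(tokens))

print(word_counts.most_common(10))
print(bigram_counts.most_common(10))
```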

Most Retweeted

Finding the most-retweeted tweets (getting tired of typing “tweets”) is similarly straightforward. I found some code here, but basically it just looks up the number of retweets for each tweet, sorts them, and prints a list. You can set the minimum number of retweets and how many to display.
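
Here’s a sketch of that idea. For retweets in the stream, the embedded "retweeted_status" field holds the original tweet and its retweet count at that moment, so keeping the largest count seen per original gives a reasonable ranking (the threshold and filename are placeholders):

```python
import json

MIN_RETWEETS = 50   # minimum number of retweets to report
TOP_N = 10          # how many to display

best = {}  # tweet id -> (largest retweet_count seen, text)

with open("northfire_tweets.json", "r") as f:
    for line in f:
        if not line.strip():
            continue
        tweet = json.loads(line)
        # For retweets, look at the embedded original tweet
        original = tweet.get("retweeted_status", tweet)
        count = original.get("retweet_count", 0)
        tid = original["id"]
        if tid not in best or count > best[tid][0]:
            best[tid] = (count, original["text"])

for count, text in sorted(best.values(), reverse=True)[:TOP_N]:
    if count >= MIN_RETWEETS:
        print(count, text)
```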

Tweet Frequency Chart

The final thing I wanted to do was track the tweet frequency as a function of time. Each tweet contains a timestamp that reports the year, month, day, hour, minute, and second. I converted that to seconds using very straightforward code adapted from this page, and then saved all of the timestamps to a text file. I used MATLAB’s “histcounts” function to make a histogram, and plotted the counts as a line using the area plot function. In Adobe Illustrator, I recolored the histogram using the gradient tool.
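
The conversion itself is short. Assuming Python 3 (whose strptime understands the %z timezone offset in Twitter’s "created_at" string), it might look something like this, again with placeholder filenames:

```python
import json
from datetime import datetime

# Twitter's created_at format, e.g. "Sat Jul 18 06:12:45 +0000 2015"
FORMAT = "%a %b %d %H:%M:%S %z %Y"

times = []
with open("northfire_tweets.json", "r") as f:
    for line in f:
        if not line.strip():
            continue
        created = json.loads(line)["created_at"]
        times.append(datetime.strptime(created, FORMAT))

# Write seconds elapsed since the first tweet, one value per line, for MATLAB
t0 = min(times)
with open("timestamps.txt", "w") as out:
    for t in times:
        out.write("%d\n" % int((t - t0).total_seconds()))
```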

Code

Here is the code on GitHub. Everything is a separate script in the interest of coding time; I may turn each of the files into a function at some point. The important files are:

listen_tweets.py: Stream tweets from Twitter, filtering for a certain string. You have to put in your authentication details, like the consumer key and secret. You get those when you register an app.

discard_RT.py: For each tweet, check if it's a retweet. If not, save it to a file. Count the number of original tweets and retweets.

count_frequencies.py: Tokenize the text from all tweets in a file and find the most common words or bigrams using the natural language toolkit.

retweet_stats.py: List the most common retweets in order.

get_timestamps.py: Convert the "created_at" value from each tweet into seconds, and store all of the values to a .txt file.


That’s it for now. I hope this was helpful. Next time I’ll talk a bit about the interview.

Comments:

  1. (the first comment didn't go through for some reason, let me try again)

    Hi, nice writeup and thanks for the nice reference -- and all the best for your career transition.

    One suggestion would be to have a look into numpy and friends (pandas, scipy, scikit-learn, etc.), since the solution I've published doesn't scale very well beyond toy data sets, especially the co-occurrence matrix based on Python dictionary... (it's good as an introductory tutorial - I'll write about numpy at some point). Numpy arrays offer great performance and together with the pydata stack are an essential skill if you're going for the Python path.

    Feel free to ping me on linkedin if you'd like to connect ;)

    Replies
    1. Wonderful, I look forward to reading it! Thanks for all your work on the tutorial.
