

Twitter Firehose – Allows tracking of all tweets past and future, with no limits on the search results returned.

These different approaches have different trade-offs. The REST API can only search past tweets, and is limited in how far back you can search, as Twitter only keeps the last couple of weeks of data. The Streaming API tracks tweets as they happen, but Twitter only guarantees that a sample of all current tweets will be collected. This means that if your search term is very generic and matches a lot of tweets, not all of these tweets will be returned. The Twitter Firehose addresses the shortcomings of the previous two APIs, but at a substantial cost, whereas the other two are free to use. There is also a growing number of third-party intermediaries that have access to the Twitter Firehose and resell the Twitter data they collect. We chose to use the Streaming API to collect tweets containing the hashtags “python” and/or “rstats” and/or “datascience” over a 10-day period.
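To make the "and/or" hashtag matching concrete, here is a minimal Python sketch of the selection logic applied locally to already-collected tweets. It makes no call to Twitter and is not the collection code used for this post. The function names, the `TRACKED` set, and the assumption that each tweet is a dictionary with a `text` field are all hypothetical, chosen only for illustration.

```python
import re

# Hashtags of interest, lower-cased (taken from the post; the name TRACKED is hypothetical).
TRACKED = {"python", "rstats", "datascience"}

# A simple hashtag pattern: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def matched_hashtags(text, tracked=TRACKED):
    """Return the tracked hashtags found in a tweet's text, ignoring case."""
    found = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    return found & set(tracked)


def filter_tweets(tweets, tracked=TRACKED):
    """Yield (tweet, matched_tags) for tweets containing at least one tracked hashtag."""
    for tweet in tweets:
        tags = matched_hashtags(tweet.get("text", ""), tracked)
        if tags:
            yield tweet, tags


# Made-up example tweets, assumed to be dicts with "id" and "text" keys.
sample_tweets = [
    {"id": 1, "text": "Plotting with ggplot2 #rstats"},
    {"id": 2, "text": "Lunch time!"},
    {"id": 3, "text": "Pandas tips #Python #DataScience"},
]

matches = list(filter_tweets(sample_tweets))
```

A tweet is kept if it carries any one of the tracked hashtags, and the set of matched tags is returned alongside it so that counts per hashtag can be built up later.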


At Mango we use a variety of tools in-house to address our clients’ business needs, and when these fall within the data science arena, the main candidates we turn to are the R and Python programming languages. The question as to which is the “best” language for doing data science is a hotly debated topic, with both languages having their pros and cons. However, the capabilities of each are expanding all the time thanks to continuous open source development in both areas. With both languages becoming increasingly popular for data analysis, we thought it would be interesting to track current trends and see what people are saying about these and other tools for data science on Twitter.
