Massive scrape of the Twitter friend graph

markbao · on Dec 22, 2008

  username: 'theinfo.org' 
  ... the password is the ramanujan taxicab number followed by the word 
  'kennedy', all one word.

Wait, what?

Anyway, it's 1729.
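
For anyone who wants to check that number rather than take it on faith, here is a minimal brute-force sketch. The function name and the search limit of 20 are my own choices, not anything from the dataset page. It looks for the smallest integer that can be written as a sum of two positive cubes in two different ways, then builds the password as described in the quote.

```python
from collections import defaultdict


def smallest_two_way_cube_sum(limit=20):
    """Return (n, pairs) for the smallest n that is a^3 + b^3 in two ways.

    Only pairs with 1 <= a <= b <= limit are searched; limit=20 is an
    arbitrary bound chosen for this sketch, large enough to find the answer.
    """
    ways = defaultdict(list)
    for a in range(1, limit + 1):
        for b in range(a, limit + 1):
            ways[a**3 + b**3].append((a, b))
    # Keep only the sums that have at least two distinct representations.
    smallest = min(n for n, pairs in ways.items() if len(pairs) >= 2)
    return smallest, ways[smallest]


n, pairs = smallest_two_way_cube_sum()
# Password as described in the quote: the number followed by 'kennedy'.
password = f"{n}kennedy"
print(n, pairs, password)
```

The two representations it finds are 1^3 + 12^3 and 9^3 + 10^3.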

mechanical_fish · on Dec 22, 2008

http://mathforum.org/library/drmath/view/52600.html

reconbot · on Dec 22, 2008

    Authorization Required

    This server could not verify that you are authorized to access the document requested. Either you supplied the 
    wrong credentials (e.g., bad password), or your browser doesn't understand how to supply the credentials required.

    Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

Something doesn't understand how to supply the credentials... it's probably me.

kirubakaran · on Dec 23, 2008

If your browser doesn't prompt you for username and password, you can supply them in the address bar:

http://username:password@website.com/

So, in this case:

http://theinfo.org:1729kennedy@infochimp.info/ics/data/arch/...
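
If you are scripting the download instead of using a browser, the same credentials end up in an HTTP Basic `Authorization` header. Here is a rough sketch of that translation using only the standard library. The helper name `basic_auth_header` is made up for illustration, and the example URL stops at the host because the real path is truncated above.

```python
import base64
from urllib.parse import urlsplit


def basic_auth_header(url):
    """Split user:password out of a URL and return (clean_url, headers).

    Hypothetical helper for illustration: it assumes the URL actually
    contains both a username and a password.
    """
    parts = urlsplit(url)
    userinfo = f"{parts.username}:{parts.password}".encode("utf-8")
    token = base64.b64encode(userinfo).decode("ascii")
    # Rebuild the URL without the credentials in the host part.
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    clean_url = parts._replace(netloc=host).geturl()
    return clean_url, {"Authorization": f"Basic {token}"}


# Host and credentials are the ones from this thread; the path is left off.
clean_url, headers = basic_auth_header("http://theinfo.org:1729kennedy@infochimp.info/")
print(clean_url)
print(headers)
```

The returned headers can be passed to whatever HTTP client you use, for example `urllib.request.Request(clean_url, headers=headers)`.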

symptic · on Dec 22, 2008

Excellent! I've met with Philip (I'm redesigning his division's website here at the University of Texas) and he told me about this project. I didn't expect it to be mobilized this soon.

Sounds promising. :)

mattjaynes · on Dec 22, 2008

Awesome - I've been playing with CouchDB and since the raw data is in JSON - gonna try loading this into it and running some experimental map/reduce views for the data. Thanks!
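
To give a feel for what a map/reduce view over this kind of data does, here is a plain-Python imitation. It is not CouchDB's API (CouchDB views are normally written in JavaScript), and the record fields `user` and `follows` are invented for the sketch; the real files may be laid out differently.

```python
import json
from collections import defaultdict

# Made-up sample records, one JSON document per line (hypothetical schema).
raw_lines = [
    '{"user": "alice", "follows": "carol"}',
    '{"user": "bob", "follows": "carol"}',
    '{"user": "carol", "follows": "alice"}',
]


def map_edge(doc):
    """Map step: emit (followed_user, 1) for each follow edge."""
    yield doc["follows"], 1


def reduce_count(values):
    """Reduce step: add up the emitted values for one key."""
    return sum(values)


def run_view(lines, map_fn, reduce_fn):
    """Group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(json.loads(line)):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


follower_counts = run_view(raw_lines, map_edge, reduce_count)
print(follower_counts)
```

The result is a follower count per user, which is about the simplest view you could build on the graph.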

tlrobinson · on Dec 22, 2008

Nice. But 10 million tweets? That's a few days' worth. What's the point?

symptic · on Dec 22, 2008

The point is to be a sort of Google algorithm for Twitter. This is plenty of data to at least get a very solid idea of who the top Tweeters are based on their connections, influence, and popularity.

Also, keep in mind he scraped the TOP Twitter users (those with X+ followers). A lot of Twitter's tweets likely come from users under that threshold, so leaving them out saves time, storage space, and effort.
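
To make the "Google algorithm for Twitter" idea concrete, here is a small PageRank-style sketch over a follow graph. I am assuming PageRank is the kind of ranking meant here; I don't know what the project actually uses. The toy graph, the user names, and the parameter values are all invented.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Simple power-iteration PageRank.

    `follows` maps each user to the list of users they follow, so rank
    flows from a follower to the accounts they follow.
    """
    users = set(follows)
    for targets in follows.values():
        users.update(targets)
    users = sorted(users)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        # Every user starts each round with the "random jump" share.
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = follows.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A user who follows nobody spreads rank evenly over everyone.
                share = damping * rank[u] / n
                for t in users:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy follow graph with made-up users.
toy_graph = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol", "alice"],
}
ranks = pagerank(toy_graph)
top_user = max(ranks, key=ranks.get)
print(top_user, ranks)
```

In this toy graph carol comes out on top because three of the four users follow her, and alice is close behind because carol follows her back.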

InfochimpsFlip · on Dec 23, 2008

There's another batch coming of tweets off the data mining feed. But yeah: the focus here was on the graph structure more than the text. We're also hoping someone pipes up with "oh gee I have 750m tweets archived do you think anyone else wants to look at them?"

petercooper · on Dec 23, 2008

I downloaded it, and the tweets date back to 2006. I guess users who only posted a few times have all their old tweets indexed, whereas those with many tweets only have the latest ones in there (i.e. at most X tweets collected per user).