That really blows my mind. I mean, how can they say that's any kind of "agreement"?

If someone writes a curl/wget script wrapper and points it at the top 10 websites, they don't enter into any kind of written contract or agreement.
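A minimal sketch of the kind of wrapper being described (the site list is purely illustrative): nowhere in this code path is there a contract or a click-through.

    # Minimal sketch of a naive fetch wrapper; site list is illustrative.
    import urllib.request

    TOP_SITES = [
        "https://www.google.com/",
        "https://www.facebook.com/",
        # ...and so on down the list
    ]

    for url in TOP_SITES:
        try:
            # No agreement signed anywhere: just an HTTP GET.
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(url, resp.status, len(resp.read()))
        except OSError as exc:
            print(url, "failed:", exc)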


The operator of a crawler doesn't need to sign an agreement for the prohibition to be enforceable. See eBay v. Bidder's Edge. This was 14 years ago, folks.


You're quoting "agreement" as if it's literally in their robots.txt file. It's not.

They're telling the public that it does not have permission to crawl the site, which they have the right to do. What is the problem with that?


What is the point of such a silly prohibition? It's silly because anyone can crawl the site if they want. Facebook can block a crawler that hammers them the way it would block a DDoS attack, but why would they bother to put up such a sign when they know it's useless?


They have such a large network that anything they can do to prevent unwanted crawling is probably helpful.


My guess: lawyers.


It's probably just their way of explaining how those user-agents that do not get the catchall Disallow: / treatment got into that robots.txt file. Also, including some lawyerisms might be quite effective at reminding upstart scrapers that faking the googlebot UA would be even less cool than simply ignoring robots.txt.
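In other words, the structure is roughly this (the user agents and rules here are illustrative, not Facebook's actual entries):

    # Notice: Crawling is prohibited unless you have express written permission.
    User-agent: Googlebot
    Allow: /

    User-agent: bingbot
    Allow: /

    User-agent: *
    Disallow: /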


> Notice: Crawling Facebook is prohibited unless you have express written

Wow, really? Who put up this sign?


Someone with 80 character lines enforced in their editor? Here's the second line...

> permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php


Bing is allowed. But it's obvious why.


I frequently wonder: is Facebook allowed to say, "Bing, you can crawl us. NewCompetitor, you cannot"?

I feel like once a company allows public access by posting stuff on the web, they can specify terms, but not include/exclude groups specifically. (In a legal sense; I understand blocking systems that hammer servers but that will respect robots.txt. IME Bing is the worst offender: they hammer my sites, send no traffic, but will stop if I tell them to in robots.txt.)
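For reference, reining Bing in looks roughly like this (directives illustrative; Bing honors Crawl-delay, though it isn't part of the original robots.txt standard):

    User-agent: bingbot
    Crawl-delay: 10

...or, to shut it out entirely:

    User-agent: bingbot
    Disallow: /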

Does anyone have an opinion about "once public, I can crawl"?


I can think of no reason why there would be any such restriction.

Suppose Facebook is getting paid by Bing, and won't offer crawling to those that aren't paying it? Suppose Facebook considers Baidu's crawler to be evil and chooses to prohibit it for that reason? Suppose Facebook just kind of likes the guys at Bing and decides to allow them special access? If you agree in the first place that Facebook should have the right to put ANY sort of restrictions on who can crawl their site, then why should ANY of these be prohibited? This is not a "common carrier" kind of situation.


Do you get a different file than me? No-one but ia_archiver is allowed, see https://pastee.org/zpjsa
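You can sanity-check what you're served with Python's stdlib parser (a quick sketch; the agents checked are just examples, and the file you receive may depend on the requesting user agent):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse Facebook's robots.txt.
    rp = RobotFileParser("https://www.facebook.com/robots.txt")
    rp.read()

    # Example user agents; only ia_archiver should come back allowed
    # if the pasted file above is what you get.
    for agent in ("Googlebot", "bingbot", "ia_archiver", "SomeNewBot"):
        print(agent, rp.can_fetch(agent, "https://www.facebook.com/"))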


That's not even a text file.


I really can't believe Facebook is written in PHP...


That's because it isn't. Or at least not in the traditional sense. It's not just some old bullshit, scripted in PHP, running on an array of scrappy LAMP boxen.

They run their PHP on HHVM for one:

https://en.m.wikipedia.org/wiki/HipHop_for_PHP

https://github.com/facebook/hhvm

...and yeah, it executes PHP code, for sure. But right there, things are already different, and the reality is that they've written a substantial code base in C/C++.

And, two, I'm sure they guard some seriously proprietary trade secrets about their server infrastructure, meaning that while the web front-end might render out HTML like a souped-up CDN, behind the scenes there is a shit ton of other stuff going down.

Honestly, I think they just leave the file name extensions in the URL for the sake of nostalgia.


HipHop is still PHP.

Also, their frontend is the only thing written in PHP. When you hit the site, you're hitting PHP pages, not just an extension kept around for nostalgic reasons.



