% curl www.google.com/humans.txt
Google is built by a large team of engineers, designers, researchers,
robots, and others in many different sites across the globe. It is
updated continuously, and built with more tools and technologies than
we can shake a stick at. If you'd like to help us out, see
google.com/jobs.
>Unless your website is written in Russian or Chinese, you probably don't get any traffic from them. They mostly just waste bandwidth and consume resources.
THIS is evil. You could use this argument for banning any new search engine.
The problem is that both Yandex and Baidu are rather poorly behaved: they hit your website far too fast, downloading large, bandwidth-heavy files in quick succession. That's actually what led me to the bad-bot-blocker project in the first place. Baidu has also been accused of not respecting robots.txt, though I have not personally observed that.
This is the reason they're blocked, not because they're new or non-English.
What "new search engine" has actually generated actual revenue for any webmaster in the past ten years? You could argue DDG but that's the only one I can think of.
It's blocking by user agent and source IP. You should be able to port the list to Nginx easily; I'd even say you could write a simple awk script in a few minutes to convert from Apache's format to Nginx's, something like the sketch below.
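For the IP half, a conversion along these lines could do as a rough sketch, assuming the Apache side is a plain list of "Deny from <ip>" lines (the file names are made up, and the real bad-bot-blocker files may be laid out differently):

  # rewrite "Deny from 1.2.3.4" as "deny 1.2.3.4;" for an nginx include file
  awk '/^Deny from/ { print "deny " $3 ";" }' apache-blacklist.conf > nginx-ip-block.conf

The user-agent half would map onto an nginx map block keyed on $http_user_agent, which takes a little more care with regex escaping.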
I don't know if this is universal, but Yandex used to be quite bad at stumbling around generated links (e.g. calendars that go on to infinity), and it wasn't at all unusual to see 400+ links crawled in a day or so. Baidu was the same, but I think they're better behaved these days.
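The usual band-aid for that kind of trap is to fence the generated URLs off in robots.txt, something like the following (the paths are hypothetical, and the * wildcard in paths is an extension honoured by Google, Bing, and Yandex rather than part of the original spec):

  User-agent: *
  Disallow: /calendar/
  Disallow: /*?date=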
I IP-banned the Baidu and Yandex crawlers (Chinese and Russian search engines, respectively) because they did not respect the Crawl-delay directive in my robots.txt, causing my low-end, crappy PHP server running vBulletin to bork out. Basically, their bots are too aggressive.
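For reference, the directive they ignored looks like this (the ten-second value is only an example; Bing and Yandex have historically honoured it, Google never has, and Baidu reportedly ignores it too):

  User-agent: *
  Crawl-delay: 10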
'baraza' is Swahili for forum/meeting place. It was very much Yahoo Answers, targeted at the African market - here in Kenya, most people only have internet access via mobile connections, often using feature phones, hence the minimalistic stylesheet.
To oversimplify: because they do not want to indicate that their site is about those pages, for whatever reason.
In practice, I'm not entirely sure, but it looks like the file is quite old: those pages don't seem to exist anymore, even as folders (which wouldn't be pages in themselves), and they aren't 301-redirected to the current relevant pages.
In fact, they're all 404s. So perhaps they used to be pages, were deleted, and kept being crawled, which made the site look bad (because of the 404s). They could use 301s now; I assume they didn't because they might want to restructure the site in the future and reuse those URLs. They don't use 302s because 302s are unreliable and freaky.
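If they ever did want to wire up the 301s, it's a one-liner per retired URL in most servers; a hypothetical nginx sketch, inside the relevant server block:

  # permanently redirect a retired page to its current equivalent
  location = /old-section/page.html { return 301 /new-section/; }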
Yes, it comes from the server mishandling the .txt extension. The server probably sets the Content-Type header to text/html instead of text/plain, and the browser renders the page accordingly.
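That's easy to confirm from the command line; the header to look at is Content-Type, which should be text/plain for a .txt file (example.com standing in for the site in question):

  curl -sI http://example.com/robots.txt | grep -i '^content-type'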
It just issues a "HTTP/1.1 302 Moved Temporarily" directed to their homepage. Requesting an invalid file such as "robots.txtsdfa32r523" has the same effect, so they probably don't have a robots file at all.
The operator of a crawler doesn't need to sign an agreement for the prohibition to be enforceable. See eBay v. Bidder's Edge. This was 14 years ago, folks.
What is the point of having such a silly prohibition? It's silly because anyone can crawl the site if they want to; Facebook can block a crawler that hammers them like a DDoS, but why would they bother putting up a sign like that when they know it's useless?
It's probably just their way of explaining how those user-agents that do not get the catchall Disallow: / treatment got into that robots.txt file. Also, including some lawyerisms might be quite effective at reminding upstart scrapers that faking the googlebot UA would be even less cool than simply ignoring robots.txt.
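In other words, the file reads like a short whitelist followed by a catch-all, roughly along these lines (a simplified sketch, not Facebook's actual rules):

  User-agent: Googlebot
  Disallow:

  User-agent: bingbot
  Disallow:

  User-agent: *
  Disallow: /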
I frequently wonder: is Facebook allowed to say, "Bing, you can crawl us. NewCompetitor, you cannot"?
I feel like once a company allows public access by posting stuff on the web, they can specify terms, but not include/exclude specific groups. (In a legal sense; I understand blocking systems that hammer servers but will respect robots.txt. IME Bing is the worst offender: they hammer my sites and send no traffic, but they will stop if I ask in robots.txt.)
Does anyone have an opinion about "once public, I can crawl"?
I can think of no reason why there would be any such restriction.
Suppose Facebook is getting paid by Bing, and won't offer crawling to those that aren't paying it? Suppose Facebook considers Baidu's crawler to be evil and chooses to prohibit it for that reason? Suppose Facebook just kind of likes the guys at Bing and decides to allow them special access? If you agree in the first place that Facebook should have the right to put ANY sort of restrictions on who can crawl their site, then why should ANY of these be prohibited? This is not a "common carrier" kind of situation.
That's because it isn't. Or at least not in the traditional sense. It's not just some old bullshit, scripted in PHP, running on an array of scrappy LAMP boxen.
...and yeah, it executes PHP code, for sure. But right there, things are already different, and the reality is that they've written a substantial code base in C/C++.
And, two, I'm sure they retain some serious business proprietary trade secrets about their server infrastructure, meaning that while the web front-end might render out HTML like a souped-up CDN, behind the scenes, there is a shit ton of other stuff going down.
Honestly, I think they just leave the file name extensions in the URL for the sake of nostalgia.
Google will still index pages blocked by robots.txt; it just won't crawl them (so it can't get a description or preview snippet). It indexes them based on the URL and how people link to them.
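Which is why, somewhat counter-intuitively, the way to keep a page out of the index is to let the bot crawl it and serve a noindex signal, rather than block it in robots.txt (a blocked bot never sees the noindex):

  <meta name="robots" content="noindex">

or, for non-HTML resources, the equivalent response header:

  X-Robots-Tag: noindex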
If you're big enough, doesn't this just make sense? Why waste time maintaining a robots.txt policy when crawler traffic must represent a tiny fraction of your total traffic, which your servers can surely handle? The really 'bad' guys are going to ignore it anyway, and if you really care, you'll have some much more sophisticated bandwidth throttling in place.
For the smaller guys, sure it makes sense to have some kind of simple robots.txt policy.
That's not quite true. Apple has multiple subdomains, each for some part of their site, and robots.txt is per-host, so each subdomain serves its own file. www.apple.com hosts most of the marketing stuff, but have a look at the others, e.g.: http://store.apple.com/robots.txt