% curl www.google.com/humans.txt
Google is built by a large team of engineers, designers, researchers,
robots, and others in many different sites across the globe. It is
updated continuously, and built with more tools and technologies than
we can shake a stick at. If you'd like to help us out, see
google.com/jobs.
>Unless your website is written in Russian or Chinese, you probably don't get any traffic from them. They mostly just waste bandwidth and consume resources.
THIS is evil. You could use this argument for banning any new search engine.
The problem is that both Yandex and Baidu are rather poorly behaved: they hit your website far too fast, downloading large, bandwidth-heavy files in quick succession. That's actually what led me to the bad-bot-blocker project in the first place. Baidu has also been accused of not respecting robots.txt, though I have not personally observed that.
This is the reason they're blocked, not because they're new or non-English.
What "new search engine" has actually generated actual revenue for any webmaster in the past ten years? You could argue DDG but that's the only one I can think of.
It's blocking by user agent and source IP. You should be able to port the list to Nginx easily; I'd even say you could write a simple awk script in a few minutes to convert from Apache's format to Nginx's, something like the sketch below.
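For the IP half, a conversion along these lines could do as a rough sketch, assuming the Apache side is a plain list of "Deny from <ip>" lines (the file names are made up, and the real bad-bot-blocker files may be laid out differently):

  # rewrite "Deny from 1.2.3.4" as "deny 1.2.3.4;" for an nginx include file
  awk '/^Deny from/ { print "deny " $3 ";" }' apache-blacklist.conf > nginx-ip-block.conf

The user-agent half would map onto an nginx map block keyed on $http_user_agent, which takes a little more care with regex escaping.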
I don't know if this is universal, but Yandex used to be quite bad at stumbling around generated links (e.g. calendars that go on to infinity), and it wasn't at all unusual to see 400+ links crawled in a day or so. Baidu was the same, but I think they're better behaved these days.
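The usual band-aid for that kind of trap is to fence the generated URLs off in robots.txt, something like the following (the paths are hypothetical, and the * wildcard in paths is an extension honoured by Google, Bing, and Yandex rather than part of the original spec):

  User-agent: *
  Disallow: /calendar/
  Disallow: /*?date=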
I IP-banned the Baidu and Yandex crawlers (Chinese and Russian search engines, respectively) because they did not respect the Crawl-delay directive in my robots.txt, causing my low-end, crappy PHP server running vBulletin to bork out. Basically, their bots are too aggressive.
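For reference, the directive they ignored looks like this (the ten-second value is only an example; Bing and Yandex have historically honoured it, Google never has, and Baidu reportedly ignores it too):

  User-agent: *
  Crawl-delay: 10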
'baraza' is Swahili for forum/meeting place. It was very much Yahoo Answers, targeted at the African market - here in Kenya, most people only have internet access via mobile connections, often using feature phones, hence the minimalistic stylesheet.
To oversimplify: because they do not want to indicate that their site is about those pages, for whatever reason.
In practice, I'm not entirely sure, but it looks like the file is quite old: those pages don't seem to exist anymore, even as folders (which wouldn't be pages in themselves), and they aren't 301-redirected to the current relevant pages.
In fact, they're all 404s. So perhaps they used to be pages, were deleted, and kept being crawled, which made the site look bad (because of the 404s). They could use 301s now; I assume they didn't because they might want to restructure the site in the future and reuse those URLs. They don't use 302s because 302s are unreliable and freaky.
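If they ever did want to wire up the 301s, it's a one-liner per retired URL in most servers; a hypothetical nginx sketch, inside the relevant server block:

  # permanently redirect a retired page to its current equivalent
  location = /old-section/page.html { return 301 /new-section/; }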
Yes, it comes from the server mishandling the .txt extension. The server probably sets the Content-Type header to text/html instead of text/plain, and the browser renders the page accordingly.
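That's easy to confirm from the command line; the header to look at is Content-Type, which should be text/plain for a .txt file (example.com standing in for the site in question):

  curl -sI http://example.com/robots.txt | grep -i '^content-type'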
It just issues a "HTTP/1.1 302 Moved Temporarily" directed to their homepage. Requesting an invalid file such as "robots.txtsdfa32r523" has the same effect, so they probably don't have a robots file at all.
The operator of a crawler doesn't need to sign an agreement for the prohibition to be enforceable. See eBay v. Bidder's Edge. This was 14 years ago, folks.
What is the point of having such a silly prohibition? It's silly because anyone can crawl the site if they want to; Facebook can block a crawler that hammers them like a DDoS, but why would they bother putting up a sign like that when they know it's useless?
It's probably just their way of explaining how those user-agents that do not get the catchall Disallow: / treatment got into that robots.txt file. Also, including some lawyerisms might be quite effective at reminding upstart scrapers that faking the googlebot UA would be even less cool than simply ignoring robots.txt.
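In other words, the file reads like a short whitelist followed by a catch-all, roughly along these lines (a simplified sketch, not Facebook's actual rules):

  User-agent: Googlebot
  Disallow:

  User-agent: bingbot
  Disallow:

  User-agent: *
  Disallow: /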
I frequently wonder: is Facebook allowed to say, "Bing, you can crawl us. NewCompetitor, you cannot"?
I feel like once a company allows public access by posting stuff on the web, they can specify terms, but not include/exclude specific groups. (In a legal sense; I understand blocking systems that hammer servers but will respect robots.txt. IME Bing is the worst offender: they hammer my sites and send no traffic, but they will stop if I ask in robots.txt.)
Does anyone have an opinion about "once public, I can crawl"?
I can think of no reason why there would be any such restriction.
Suppose Facebook is getting paid by Bing, and won't offer crawling to those that aren't paying it? Suppose Facebook considers Baidu's crawler to be evil and chooses to prohibit it for that reason? Suppose Facebook just kind of likes the guys at Bing and decides to allow them special access? If you agree in the first place that Facebook should have the right to put ANY sort of restrictions on who can crawl their site, then why should ANY of these be prohibited? This is not a "common carrier" kind of situation.
That's because it isn't. Or at least not in the traditional sense. It's not just some old bullshit, scripted in PHP, running on an array of scrappy LAMP boxen.
...and yeah, it executes PHP code, for sure. But right there, things are already different, and the reality is that they've written a substantial code base in C/C++.
And, two, I'm sure they retain some serious business proprietary trade secrets about their server infrastructure, meaning that while the web front-end might render out HTML like a souped-up CDN, behind the scenes, there is a shit ton of other stuff going down.
Honestly, I think they just leave the file name extensions in the URL for the sake of nostalgia.
Google will still index pages blocked by robots.txt; it just won't crawl them (so it can't get a description or preview snippet). It indexes them based on the URL and how people link to them.
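Which is why, somewhat counter-intuitively, the way to keep a page out of the index is to let the bot crawl it and serve a noindex signal, rather than block it in robots.txt (a blocked bot never sees the noindex):

  <meta name="robots" content="noindex">

or, for non-HTML resources, the equivalent response header:

  X-Robots-Tag: noindex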
If you're big enough, doesn't this just make sense? Why waste time maintaining a robots.txt policy when crawler traffic must represent a tiny fraction of your total traffic, which your servers can surely handle? The really 'bad' guys are going to ignore it anyway, and if you really care, you'll have some much more sophisticated bandwidth throttling in place.
For the smaller guys, sure it makes sense to have some kind of simple robots.txt policy.
That's not quite true. Apple has multiple subdomains, each for some part of their site, and robots.txt is per-host, so each subdomain serves its own file. www.apple.com hosts most of the marketing stuff, but have a look at the others, e.g.: http://store.apple.com/robots.txt