They do have a robots.txt [1] that disallows robot access to the spigot tree (as...

bstsb · 2025-07-12T09:25:08 1752312308

previously the author wrote in a comment reply about not configuring robots.txt at all:

> I've not configured anything in my robots.txt and yes, this is an extreme position to take. But I don't much like the concept that it's my responsibility to configure my web site so that crawlers don't DOS it. In my opinion, a legitimate crawler ought not to be hitting a single web site at a sustained rate of > 15 requests per second.

yorwba · 2025-07-12T11:31:13 1752319873

The spigot doesn't seem to distinguish between crawlers that make more than 15 requests per second and those that make less. I think it would be nicer to throw up a "429 Too Many Requests" page when you think the load is too much and only poison crawlers that don't back off afterwards.

evgpbfhnr · 2025-07-12T11:50:12 1752321012

when crawlers use a botnet to only make one request per ip per long duration that's not realistic to implement though..

DamonHD · 2025-07-12T19:57:33 1752350253

Almost no bot responds usefully to 429 that I have seen, and a few respond to it like 500 and 503 to speed up / retry / poll more.

dhosek · 2025-07-13T04:26:41 1752380801

Reminds me of a service I led the development on where we had to provide mocks for the front end to develop against as well as develop against mocks of an external service which wasn’t ready for us to use.

When we finally were able to do an end-to-end test, everything worked perfectly on the first try.

Except, the front end REST library, when given a 401 error when an incorrect auth code was sent, retried the request rather than reporting to the user that there was an error which meant that entering an incorrect auth code would lock the user out of their account immediately.

We ended up having to return all results with a 200 response regardless of the contents because of that broken library.

josephg · 2025-07-12T09:16:36 1752311796

> even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

So? What duty do web site operators have to be "nice" to people scraping your website?

gary_0 · 2025-07-12T09:25:24 1752312324

The Marginalia search engine or archive.org probably don't deserve such treatment--they're performing a public service that benefits everyone, for free. And it's generally not in one's best interests to serve a bunch of garbage to Google or Bing's crawlers, either.

marginalia_nu · 2025-07-12T23:57:47 1752364667

It's not really too big of a problem for a well-implemented crawler. You basically need to define an upper bound both in terms of document count and time for your crawls, since crawler traps are pretty common and have been around since the cretaceous.

darkwater · 2025-07-12T19:07:08 1752347228

If you have such a website, then you will just serve normal data. But it seems perfectly legit to serve fake random gibberish from your website if you want to. A human would just stop reading it.

suspended_state · 2025-07-12T09:46:11 1752313571

The point is that not every web crawler is out there to scrape websites.

andybak · 2025-07-12T12:06:06 1752321966

Unless you define "scrape" to be inherently nefarious - then surely they are? Isn't the definition of a web crawler based on scraping websites?

suspended_state · 2025-07-12T18:03:59 1752343439

I think that web scraping is usually understood as the act of extracting information of a website for ulterior self-centered motives. However, it is clear that this ulterior motive cannot be assessed by a website owner. Only the observable behaviour of a data collecting process can be categorized as morally good or bad. While the bad behaving people are usually also the ones with morally wrong motives, one doesn't entail the other. I chose to qualify the bad behaving ones as scrapers, and the good behaving ones as crawlers.

That being said, the author is perhaps concerned by the growing amount of collecting process, which carries a toll on his server, and thus chose to simply penalize them all.