
> Couldn't this be handled in robots.txt?

That would require knowing every user agent that might scrape your content, assuming you want to exclude only AI scrapers and not search engines in general.

OpenAI's user agent is GPTBot [1]; I am not sure about the others.

[1] https://news.ycombinator.com/item?id=37030568
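
For illustration, here is a rough sketch of what a per-agent exclusion could look like, checked with Python's standard `urllib.robotparser`. The robots.txt rules, the `can_fetch` helper, and the example.com URL are my own assumptions for the example; only the GPTBot name comes from the comment above, and Googlebot just stands in for "some search engine". Keep in mind that robots.txt is advisory, so this only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot, allow every other agent.
# Any other AI scraper would need its own User-agent entry.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, can_fetch(agent, page))
```

This also shows the limitation being discussed: the file can only name agents you already know about, so an unlisted scraper falls through to the `*` rule and is allowed.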



How are you proposing to distinguish between "AI" and "search engines"? Most of the search engines now have a summarizer at the top which is presumably LLM output, and search engines operate on the basis of ML in general.


> Most of the search engines now have a summarizer at the top which is presumably LLM output, and search engines operate on the basis of ML in general.

There is still a difference between scraping content to make it searchable and scraping it to train on it.


A search engine is an AI model that outputs search results. Creating the index is training it. There is no obvious principled way to distinguish them.



