That would require knowing all the user agents that would scrape content, assuming that you want to only exclude AI scrapers and not search engines in general.
OpenAI's user agent is GPTBot [1]; I am not sure about the others.
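For what it's worth, here is a minimal sketch of per-user-agent exclusion, assuming robots.txt is the mechanism in question and that the crawler actually honors it. GPTBot is the only AI user agent named here; Googlebot stands in for a search crawler, and example.com and the `is_allowed` helper are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere, allow everyone else.
# Any other AI crawler would need its own User-agent entry.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, only used to evaluate the rules above.
gptbot_allowed = is_allowed("GPTBot", "https://example.com/post/1")
search_allowed = is_allowed("Googlebot", "https://example.com/post/1")
print("GPTBot:", gptbot_allowed, "Googlebot:", search_allowed)
```

This only works for scrapers that identify themselves and respect robots.txt, which is why the list of user agents matters.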
How are you proposing to distinguish between "AI" and "search engines"? Most of the search engines now have a summarizer at the top which is presumably LLM output, and search engines operate on the basis of ML in general.
> Most of the search engines now have a summarizer at the top which is presumably LLM output, and search engines operate on the basis of ML in general.
There is still a difference between scraping content to make it searchable and scraping it to train a model on it.
[1] https://news.ycombinator.com/item?id=37030568