https://www.google.com/search?q=inurl%3A%22viewerframe%3Fmod...
10 600 results
when you click on one result
i.e.: 202.212.193.26:555/CgiStart?page=Single&Mode=Motion&Language=0
then you see in the head of the frameset (and similar in every framed html document)
<meta name="robots" content="noindex,nofollow">
so basically, these HTML abominations should not get indexed if google follows these indexing directives (which, basically, google invented themselves).
google is evil? nope - they really do follow these directives.
so why is this indexed? take a look at
http://202.212.193.26:555/robots.txt
the robots.txt is a crawling directive: google can't crawl the (current) version of these pages, so google never sees the indexing directive. but as crawling is optional for indexing URLs, the URL gets indexed anyway.
how could this be solved? well, either get rid of the robots.txt or use
User-Agent: *
Disallow: /
Noindex: /
the noindex robots.txt directive is specified nowhere, but it works nonetheless.
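to see what that Disallow rule means to a well-behaved crawler, here is a quick sketch with python's standard urllib.robotparser (the URL and user agent are just the ones from the example above; this is an illustration, not google's actual pipeline):

import urllib.robotparser

# sketch only: parse the kind of robots.txt quoted above and ask whether a
# polite crawler may fetch the page. "Disallow: /" blocks the fetch, so the
# crawler never gets to see the <meta name="robots"> tag inside the page;
# only the bare URL, learned from links elsewhere, can still be indexed.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /",
])

url = "http://202.212.193.26:555/CgiStart?page=Single&Mode=Motion&Language=0"
print(rp.can_fetch("Googlebot", url))  # False -> the page body is off-limits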
Can you elaborate on how crawling is optional for indexing? Isn't crawling a prerequisite to indexing?
The only exceptions I can think of are scary, like operating a caching proxy and scraping the cached data. Or scraping data from browsers that have loaded pages by user request.
Ohh, got it, thank you. So Google is aware that the URL exists, even though they know nothing about the content served at that URL.
I am just surprised that a URL with no associated content would be included in the index.
But now that I think about it more, why not? It will not show up except in extremely specific searches, and in those cases it is useful to the searcher.
The 4th result down (wbpreview.com) is shown in the search results despite blocking crawling/indexing with robots.txt. The result displays "A description for this result is not available because of this site's robots.txt – learn more" and the title seems to be auto-generated. The goal was to de-index the listing, but apparently that's not an option.
As franze pointed out, you can specify not to index in robots.txt (I have not confirmed this). The intent of disallowing crawling is ambiguous: maybe they do not want their content cached, or the extra load on their server, or any number of reasons.
If you need to de-index a site, you should use the Noindex robots.txt directive. If it has already been indexed and you need it de-indexed quickly, google offers tools to do so [1]
The way to prevent a site from being indexed at all is through a <meta name="robots" content="noindex,nofollow"> tag on the page or X-Robots-Tag HTTP header (both of which, ironically, require that you not robots.txt it out, because otherwise the page content will never be crawled), or through a Noindex directive in robots.txt (which is unspecified by the spec - Google supports it, but Yahoo and Bing don't).
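If it helps, here's a rough way to check which of those two signals a page actually sends (a sketch in Python using only the standard library; the URL is a placeholder and the regex is deliberately naive, not how Google parses pages):

import re
import urllib.request

# rough check for the two noindex signals mentioned above: the X-Robots-Tag
# response header and a <meta name="robots"> tag in the page body.
url = "http://example.com/"  # placeholder URL
with urllib.request.urlopen(url) as resp:
    x_robots = resp.headers.get("X-Robots-Tag") or ""
    body = resp.read().decode("utf-8", errors="replace")

header_noindex = "noindex" in x_robots.lower()
meta_noindex = bool(re.search(
    r'<meta[^>]+name=["\']robots["\'][^>]*content=["\'][^"\']*noindex',
    body, re.IGNORECASE))

print("X-Robots-Tag noindex:", header_noindex)
print("meta robots noindex:", meta_noindex)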
Most of which, to be fair, seem to be descriptions of this exploit, or pages listing open cams. The number of cams actually accessible is a fraction of those, and the number unintentionally left open a smaller fraction again.