ok, i just googled https://www.google.com/search?q=inurl%3A%22viewerframe%3Fmode...

MiguelHudnandez · on Jan 25, 2013

Can you elaborate on how crawling is optional for indexing? Isn't crawling a prerequisite to indexing?

The only exceptions I can think of are scary, like operating a caching proxy and scraping the cached data. Or scraping data from browsers that have loaded pages by user request.

nostrademons · on Jan 25, 2013

You can discover a URL through finding a link to it on a publicly-accessible web page, even if crawling that link itself is not possible.
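To make that concrete, here is a rough sketch (not how any real search engine does it) of a crawler noticing URLs it is not allowed to fetch. The page HTML, the robots.txt rules, the `ExampleBot` agent name, and the `classify_links` helper are all made up for illustration; only Python's standard library is used, and the sketch assumes every link points at the same site the robots.txt belongs to.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def classify_links(page_url, page_html, robots_txt, agent="ExampleBot"):
    """Return (url, may_crawl) pairs for every link found on the page.

    A URL with may_crawl == False is still *known* to the crawler:
    it was discovered via the link, even though its content is off limits.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    collector = LinkCollector(page_url)
    collector.feed(page_html)
    return [(url, rules.can_fetch(agent, url)) for url in collector.links]


# Hypothetical sample data, not taken from any real site.
PAGE = '<a href="/private/cam.html">cam</a> <a href="/about.html">about</a>'
ROBOTS = "User-agent: *\nDisallow: /private/\n"

for url, may_crawl in classify_links("http://example.com/index.html", PAGE, ROBOTS):
    print(url, "crawlable" if may_crawl else "known, but not crawlable")
```

The disallowed URL still ends up in the list, which is all an index needs in order to show a bare, description-less result for it.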

MiguelHudnandez · on Jan 25, 2013

Ohh, got it, thank you. So Google is aware that the URL exists, even though they know nothing about the content served at that URL.

I am just surprised that a URL with no associated content would be included in the index.

But now that I think about it more, why not? It will not show up except in extremely specific searches, and in those cases it is useful to the searcher.

coderdude · on Jan 26, 2013

I find this behavior annoying. Here's why:

https://www.google.com/search?q=unicorn+admin

4th result down (wbpreview.com) is shown in search results despite blocking crawling/indexing with robots.txt. The result displays "A description for this result is not available because of this site's robots.txt – learn more" and the title seems to be auto-generated. The goal was to de-index the listing but apparently that's not an option.

gizmo686 · on Jan 26, 2013

As franze pointed out, you can specify not to index in robots.txt (I have not confirmed this). The intent of disallowing crawling is ambiguous. Maybe they do not want their content cached, or the extra load on their server, or any number of reasons. If you need to de-index a site, you should use the robots.txt directive. If it has already been indexed and you need it de-indexed quickly, Google offers tools to do so [1].

[1] http://support.google.com/webmasters/bin/answer.py?hl=en&...

coderdude · on Jan 26, 2013

Thank you for pointing that out to me.

nostrademons · on Jan 26, 2013

The way to prevent a site from being indexed at all is through a <meta name="robots" content="noindex,nofollow"> tag on the page or an X-Robots-Tag HTTP header (both of which, ironically, require that you not block the page in robots.txt, because otherwise the page content will never be crawled), or through a Noindex directive in robots.txt (which is not part of the robots.txt spec - Google supports it, but Yahoo and Bing don't).
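For anyone who wants to check their own pages, a minimal sketch of the first two mechanisms might look like this. The `is_noindex` helper and the sample inputs are hypothetical, it only sees what a crawler would see after successfully fetching the page, and it ignores the per-crawler form of the header (for example a value prefixed with a bot name).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gathers directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def is_noindex(headers, html):
    """True if the response asks not to be indexed via header or meta tag."""
    header_directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            for directive in value.split(","):
                if directive.strip():
                    header_directives.add(directive.strip().lower())
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in (header_directives | parser.directives)


# Hypothetical responses: one uses the meta tag, one the HTTP header.
print(is_noindex({}, '<meta name="robots" content="noindex,nofollow">'))
print(is_noindex({"X-Robots-Tag": "noindex"}, "<p>hello</p>"))
```

Both signals live in the response itself, which is why they do nothing if robots.txt stops the crawler from ever requesting the page.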

scoot · on Jan 26, 2013

"ok, i just googled [...] 10 600 results"

Most of which, to be fair, seem to be descriptions of this exploit, or pages listing open cams. The number of cams actually accessible is a fraction of those, and the number unintentionally left open a smaller fraction again.