Hacker News

ok, i just googled

https://www.google.com/search?q=inurl%3A%22viewerframe%3Fmod...

10 600 results

when you click on one result

e.g.: 202.212.193.26:555/CgiStart?page=Single&Mode=Motion&Language=0

then you see in the head of the frameset (and similar in every framed html document)

  <META NAME="robots" CONTENT="none">
  <META NAME="robots" CONTENT="noindex,nofollow">
  <META NAME="robots" CONTENT="noarchive">
so basically, these HTML abominations should not get indexed if google followed these indexing directives (google itself introduced some of them, e.g. noarchive)

google is evil? nope - they really follow these directives.

so why is this indexed?

take a look at

http://202.212.193.26:555/robots.txt

  User-Agent: * 
  Disallow: /
the robots.txt is a crawling directive, google can't crawl the (current) version of these pages, so google doesn't see the indexing directive. but as crawling is optional for indexing URLs, this gets indexed.
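A minimal sketch of the crawl side of this, using Python's standard `urllib.robotparser` on the robots.txt quoted above. Nothing is fetched over the network; the user agent name `ExampleBot` and the helper `may_crawl` are made up for illustration. A crawler that honors these rules never downloads the page, so it never reads the `<META NAME="robots">` tags inside it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the comment above, parsed locally (no network request).
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "http://202.212.193.26:555/CgiStart?page=Single&Mode=Motion&Language=0"

# "ExampleBot" is a hypothetical crawler name; the "*" group applies to it.
crawl_allowed = may_crawl(ROBOTS_TXT, "ExampleBot", url)
print(crawl_allowed)
```

Since the fetch is refused, the noindex meta tag is invisible to the crawler, and the URL can still end up indexed from links found elsewhere.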

how could this be solved? either get rid of the robots.txt, or use

  User-Agent: * 
  Disallow: /
  Noindex: /
the noindex robots.txt directive is specified nowhere, but it works nonetheless.


Can you elaborate on how crawling is optional for indexing? Isn't crawling a prerequisite to indexing?

The only exceptions I can think of are scary, like operating a caching proxy and scraping the cached data. Or scraping data from browsers that have loaded pages by user request.


You can discover a URL through finding a link to it on a publicly-accessible web page, even if crawling that link itself is not possible.
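As a rough illustration of that discovery step, here is a sketch that collects link targets from a page the crawler is allowed to read. The page content, the `example.com` address, and the `LinkCollector` class are all hypothetical; the point is only that the target URL becomes known without ever being requested.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical public page that links to a robots.txt-blocked URL.
PAGE = (
    "<html><body>"
    '<a href="http://202.212.193.26:555/CgiStart?page=Single">cam</a>'
    "</body></html>"
)

collector = LinkCollector("http://example.com/list.html")
collector.feed(PAGE)
discovered = collector.links
print(discovered)
```

The linked URL is now known to the crawler, together with its anchor text, even though the linked document itself was never fetched.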


Ohh, got it, thank you. So Google is aware that the URL exists, even though they know nothing about the content served at that URL.

I am just surprised that a URL with no associated content would be included in the index.

But now that I think about it more, why not? It will not show up except in extremely specific searches, and in those cases it is useful to the searcher.


I find this behavior annoying. Here's why:

https://www.google.com/search?q=unicorn+admin

4th result down (wbpreview.com) is shown in search results despite blocking crawling/indexing with robots.txt. The result displays "A description for this result is not available because of this site's robots.txt – learn more" and the title seems to be auto-generated. The goal was to de-index the listing but apparently that's not an option.


As franze pointed out, you can specify not to index in robots.txt (I have not confirmed this). The intent of disallowing crawling is ambiguous. Maybe they do not want their content cached, or the extra load on their server, or any number of other reasons. If you need to de-index a site, you should use the Noindex robots.txt directive. If it has already been indexed and you need it de-indexed quickly, Google offers tools to do so [1].

[1] http://support.google.com/webmasters/bin/answer.py?hl=en&...


Thank you for pointing that out to me.


The way to prevent a site from being indexed at all is through a <meta name="robots" content="noindex,nofollow"> tag on the page or an X-Robots-Tag HTTP header (both of which, ironically, require that you not block the page in robots.txt, because otherwise the page content will never be crawled), or through a Noindex directive in robots.txt (which is not part of any spec: Google supports it, but Yahoo and Bing don't).
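For the header variant, a minimal sketch of a WSGI app that attaches `X-Robots-Tag` to every response. The app, the page body, and the `fake_start_response` helper are made up for illustration; a real camera's embedded server would need the equivalent setting in its own configuration, and the page must stay crawlable for the header to be seen.

```python
def app(environ, start_response):
    """Tiny WSGI app that asks crawlers not to index or follow this page."""
    body = b"<html><body>camera viewer</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Same meaning as <meta name="robots" content="noindex,nofollow">.
        ("X-Robots-Tag", "noindex, nofollow"),
    ]
    start_response("200 OK", headers)
    return [body]


# Call the app directly, without a server, to inspect what it would send.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


response_body = b"".join(app({}, fake_start_response))
print(captured["status"], captured["headers"]["X-Robots-Tag"])
```

The same callable could be served with `wsgiref.simple_server.make_server` from the standard library for a quick local check.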


"ok, i just googled [...] 10 600 results"

Most of which, to be fair, seem to be descriptions of this exploit, or pages listing open cams. The number of cams actually accessible is a fraction of those, and the number unintentionally left open a smaller fraction again.



