
Google is amazingly good at digging up sites out of nowhere. I wonder if it is a combination of URLs passing through Chrome, Gmail, any Android phone, and so on. It's always a hassle keeping staging/dev sites out of the index if you're not careful with all the right noindex meta tags and robots.txt rules. (A robots.txt that disallows everything won't, on its own, keep sites/URLs from showing up in the results; at best it hides the cached body-text summary below the link.)


> I wonder if it is a combination of URLs passing through Chrome, GMail, any android phone, and so on.

That would be incredibly alarming, and quite possibly the largest breach of trust perpetrated by a company so far this decade.


With the Bing toolbar installed in IE, any URL you type or visit is submitted to Microsoft, and they actively use this data to tune Bing's search results.

http://www.wired.com/business/2011/02/bing-copies-google/

I agree it'd be alarming and terrible, but hardly a new development.

Edit: it's doubtful that an e-mail provider would automatically fetch links from e-mails -- imagine them effectively clicking 'unsubscribe' links, or links that reject the transfer of domain names. It would break in very obvious ways. IMs and texts, on the other hand, might be more opaque to that kind of meddling.


It'd be interesting to set up a wildcard DNS entry, *.some-experiment.example.com, send various links like http://links-via-gmail.some-experiment.example.com/somepath , http://links-via-skype.some-experiment.example.com/anotherpa... through a bunch of services, and see which domain names and which full URLs show up in the logs!
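
A minimal sketch of how those links could be minted, assuming the wildcard already points at a server you control (the domain, channel names, and path are placeholders):

    # sketch: mint a unique hostname per channel so a later DNS hit or
    # HTTP request identifies which service leaked the link
    import uuid

    BASE = "some-experiment.example.com"   # *.BASE wildcards to your server

    def make_link(channel):
        token = uuid.uuid4().hex[:8]       # unique per message sent
        return "http://%s-%s.%s/somepath" % (channel, token, BASE)

    print(make_link("links-via-gmail"))
    print(make_link("links-via-skype"))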


That's a really good idea.

I see it getting quite complicated, though! The dimensions I see are: user's OS, user agent, ISP/cell carrier, transmission protocol (SMTP, XMPP, HTTP), and service provider (Google, Microsoft/Skype, Microsoft/MSN).

    android.att.xmpp-gtalk.example.com
    android.verizon.http-gtalk.example.com
    win8.verizon-fios.https-gtalk.example.com
    ios.sprint.skype.example.com
Then you might also have to include the sender AND receiver information in the domain, so that a single request would reveal all the implicated parties.

I also thought about putting the sender in the path of the URI, but I think it should go in the domain name too: a crawler might only hit /robots.txt, and in that case the domain name is all the route information you'd get.

Finally, including everything in the DNS lets you evaluate whether the name was even resolved, and potentially by whom. Getting a hit that the name was resolved but not fetched over HTTP gives you information about which services might be analyzing links in order to queue them for further investigation.
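
Something like this sketch could assemble those names (all the values and the base domain are illustrative):

    # sketch: pack every dimension, plus sender and receiver, into the
    # hostname, so even a bare DNS lookup or /robots.txt fetch reveals
    # the full route
    DOMAIN = "example.com"   # placeholder; *.DOMAIN points at the test server

    def tracking_host(os_, carrier, proto, service, sender, receiver):
        labels = [os_, carrier, "%s-%s" % (proto, service),
                  "s-" + sender, "r-" + receiver]
        return ".".join(labels + [DOMAIN])

    print(tracking_host("android", "att", "xmpp", "gtalk", "alice", "bob"))
    # -> android.att.xmpp-gtalk.s-alice.r-bob.example.com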


Good call on logging DNS, that'd be a very nice early indicator even if no HTTP requests are sent!

I think maybe the domain should be of the format "www.encodedonlywithatoz.yourdomain.com" to maximize hits from whatever regexes parsers use to pick URLs out of text (i.e., a www. prefix, a .com suffix, and no special chars). You could encode the dimensions via a lookup table to make it less verbose and slightly more obfuscated ("aa" = AT&T, "ab" = Verizon, etc.).
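
A sketch of that encoding, with an invented code table:

    # sketch: map each dimension value to two lowercase letters so the
    # final hostname is plain a-z (the codes here are made up)
    CODES = {"att": "aa", "verizon": "ab", "android": "ba", "ios": "bb",
             "gtalk": "ca", "skype": "cb"}

    def encode(*values):
        return "www.%s.yourdomain.com" % "".join(CODES[v] for v in values)

    print(encode("android", "att", "gtalk"))   # -> www.baaaca.yourdomain.com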

You shouldn't expect data in the path info to be preserved, but it'd be a nice bonus, as you say.

Even more interesting would be some custom DNS software that replies with perhaps a CNAME or something, where you could encode a unique serial number per request. If you had a huge IP range available, you could even resolve to unique IP addresses for every lookup, so you could correlate DNS requests with any HTTP requests that show up later on. A low/near-zero DNS TTL would come in handy.
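
A rough sketch of that responder, assuming the third-party dnslib package (any programmable DNS server would work just as well):

    # sketch: answer every lookup with a CNAME embedding a unique serial,
    # so later HTTP hits can be tied to the exact DNS query that preceded
    # them; TTL 0 discourages caching so every resolution reaches us
    import itertools
    from dnslib import RR, QTYPE, CNAME
    from dnslib.server import DNSServer, BaseResolver

    serial = itertools.count(1)

    class SerialResolver(BaseResolver):
        def resolve(self, request, handler):
            reply = request.reply()
            target = "r%d.collector.example.com." % next(serial)
            reply.add_answer(RR(request.q.qname, QTYPE.CNAME,
                                rdata=CNAME(target), ttl=0))
            return reply

    DNSServer(SerialResolver(), port=53, address="0.0.0.0").start()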


I like the idea of encoding the data. Or it could work like a URL shortener: the metadata gets recorded and a short hash is generated. That complicates the back end, but it allows for more comprehensive data storage and eventual reporting.
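
A sketch of that shortener-style approach (the storage and the metadata fields are illustrative):

    # sketch: record full metadata server-side, hand out only a short hash
    import hashlib, json

    store = {}   # stand-in for a real database

    def shorten(meta):
        blob = json.dumps(meta, sort_keys=True).encode()
        key = hashlib.sha1(blob).hexdigest()[:10]
        store[key] = meta
        return "http://%s.yourdomain.com/" % key

    print(shorten({"os": "android", "carrier": "att", "service": "gtalk",
                   "sender": "alice", "receiver": "bob"}))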

Regarding custom DNS software, I might draw from this excellent write-up featured on HN recently:

http://5f5.org/ruminations/dns-debugging-over-http.html


Nice find!

Also, it'd be interesting to just crank the log level to maximum on a normal piece of DNS software, post some links around in IM clients and elsewhere, and see if anything anywhere kicks in. The experiment could be repeated (on different subdomains) with more clever implementation tricks later.


I ended up just setting up BIND with a wildcard entry and turning its query logging up to debug. It is working now, but I still need to build a little web app to generate the unique links. Also, only one DNS server is running at the moment.
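
That little web app could be as small as this sketch (assuming Flask and the hnNNNN naming scheme below):

    # sketch: tiny Flask app that mints sequential honeypot links
    # (assumes *.hnypot.info already wildcards to the logging DNS server)
    import itertools
    from flask import Flask

    app = Flask(__name__)
    counter = itertools.count(2)   # hn0001 was minted by hand

    @app.route("/new")
    def new_link():
        return "http://hn%04d.hnypot.info/" % next(counter)

    app.run(port=8080)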

I can't wait to send some around in Facebook messages and IMs.

Here's a maiden honeypot link: http://hn0001.hnypot.info/Welcome-Internets!

...Though posting it publicly nearly guarantees I will see a hit, I can at least see if code running on HN resolves it immediately.

Edit: There is activity coming in on that name, but mostly it is from browsers pre-fetching DNS to prepare for the next potential pageview. My browser did this (Chrome on Mac). I suppose that is a form of information disclosure we often overlook: on any page you can inject a link into, you can get some very basic analytics.

In the 15 minutes following the posting of that link, there have been zero clicks, 36 IPv4 lookups, and 6 IPv6 lookups.
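
Tallying those lookups from the BIND query log could look roughly like this (the exact log line format varies by BIND version, so the parsing here is an assumption):

    # sketch: count A vs AAAA lookups per honeypot name in a BIND query log
    import re
    from collections import Counter

    counts = Counter()
    pattern = re.compile(r"query: (\S+) IN (A|AAAA)\b")

    with open("queries.log") as log:       # log path is an assumption
        for line in log:
            m = pattern.search(line)
            if m and m.group(1).endswith("hnypot.info"):
                counts[(m.group(1), m.group(2))] += 1

    for (name, qtype), n in counts.most_common():
        print(name, qtype, n)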


Go for it! That sounds like a great idea.

Make sure your results can be tracked and provide as much information as possible, and you've got a nice project here.


Maybe I will, but if anyone else feels like putting in the effort, go ahead, too :)


I just registered hnypot [dot] info for a few bucks and will see if I can get wildcard DNS running with some tracking. Haha, I don't want to type the name as a link until I get the tracking going...

If anybody wants to collaborate or you just want an NS delegation off that name to try to roll your own, just let me know!


It's possible (even likely) that what you typed is enough for a crawler to try that site.


It is a short name, so it's likely that it'll be found. But the real honeypot would be the large hashed subdomains that you would use as bait.

I don't think the main site or its www subdomain would need to be secret. Of course, if it uncovers some huge invasion of privacy, we might have to set up an army of different domains running similar software on separate IPs to keep it effective.


When you type into the Chrome omnibox, it sends your keystrokes to Google to provide search suggestions. Using just those would be sufficient, and I don't think they really hide that your input is being sent along. If someone does a Google search for a URL, we expect it to get added to the index; why should it be different when the search happens in the omnibox rather than the web interface?


URLs don't go through Google when you type them into Chrome's URL box; they go directly to the address. Chrome only sends Google the input it can't interpret as a URL.


Actually, it looks like URLs do go to Google's suggestion service as you type into Chrome's URL box.

I tried typing "http://then" and it suggested a UPS package URL and "thenicestplaceontheinter.net". I've never been to either of those pages before (I use Chrome for testing, so I'm not signed into it, etc.).


I can say with confidence that if you capture packets while typing URLs into Google Chrome's address bar, you'll see that it does not send the data to Google.


Good point. Come to think of it, the last time I had to deal with this, it could have been caused by any number of entry points into Google, such as embedding the Google Analytics .js even on the staging site, or running a test with the Google PageSpeed tools, etc. But they certainly have a HUGE number of opportunities to snap up new URLs across all their services.



