Google is amazingly good at digging up sites out of nowhere. I wonder if it's a combination of URLs passing through Chrome, Gmail, any Android phone, and so on. It's always a hassle keeping staging/dev sites out of the index if you're not careful with all the right meta noindex tags and robots.txt rules. (A robots.txt that disallows everything, on its own, won't keep sites/URLs from showing up in the results; at best it hides the cached body-text summary below the link.)
With the Bing toolbar installed in IE, any URL you type or visit is submitted to Microsoft, and they actively use that data to tune Bing's search results.
I agree it'd be alarming and terrible, but hardly a new development.
Edit: it's doubtful that an e-mail provider would automatically fetch links from e-mails -- think about what would happen if they clicked 'unsubscribe' links or links that reject domain-name transfers. It would break in very obvious ways. With IMs and texts, on the other hand, that kind of meddling might be much harder to notice.
I see it getting quite complicated, though! The dimensions I can think of are: user's OS, user agent, ISP/cell carrier, transmission protocol (SMTP, XMPP, HTTP), and service provider (Google, Microsoft/Skype, Microsoft/MSN).
Then you might also have to include both the sender AND receiver information in the domain, so that a single request reveals all of the implicated parties.
I also thought about putting the sender in the path of the URI, but I think it should be in the domain name too: you might only get a hit on robots.txt, and in that case you'd see just one half of the route if the other half only lived in the path.
Finally, including everything in the DNS lets you evaluate whether the name was even resolved, and potentially by whom. Getting a hit that the name was resolved but not fetched over HTTP gives you information about which services might be analyzing links in order to queue them for further investigation.
Good call on logging DNS, that'd be a very nice early indicator even if no HTTP requests are sent!
I think maybe the domain should be of the format "www.encodedonlywithatoz.yourdomain.com" to maximize the chance of matching whatever regexes parsers use to pick out URLs (i.e. a www. prefix, a .com suffix, and no special characters). You could encode the dimensions via a lookup table to make it less verbose and slightly more obfuscated ("aa" = AT&T, "ab" = Verizon, etc.).
You shouldn't expect data in the path info to be preserved, but it'd be a nice bonus, as you say.
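For what it's worth, here's a rough Python sketch of how that lookup-table encoding could work, with the sender and receiver folded into the label as well; the two-letter codes and the yourdomain.com base are just placeholders:

    # Rough sketch: pack a few of the dimensions, plus sender and receiver,
    # into an a-z-only hostname label. Codes and the base domain are made up.
    CARRIERS  = {"att": "aa", "verizon": "ab", "tmobile": "ac"}
    PROTOCOLS = {"smtp": "ba", "xmpp": "bb", "http": "bc"}
    PROVIDERS = {"google": "ca", "skype": "cb", "msn": "cc"}

    def only_atoz(s):
        # Strip anything that isn't a-z so naive URL regexes still match.
        return "".join(c for c in s.lower() if c.isalpha())

    def make_hostname(carrier, protocol, provider, sender, receiver):
        label = (CARRIERS[carrier] + PROTOCOLS[protocol] + PROVIDERS[provider]
                 + only_atoz(sender) + only_atoz(receiver))
        return "www." + label + ".yourdomain.com"

    # -> www.aabccajoebob.yourdomain.com
    print(make_hostname("att", "http", "google", "joe", "bob"))

Decoding is just the reverse lookup, and keeping the codes fixed-width makes parsing the label on the back end trivial.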
Even more interesting would be some custom DNS software that replies with perhaps a CNAME or something, where you could encode a unique serial number per request. If you had a huge IP range available, you could even resolve to unique IP addresses for every lookup, so you could correlate DNS requests with any HTTP requests that show up later on. A low/near-zero DNS TTL would come in handy.
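Something like this might do for a first cut, using the third-party dnslib package (my assumption; a hand-rolled responder or any other DNS library would work just as well). It logs a per-request serial you could later correlate with HTTP hits, and answers A queries with a placeholder address at TTL 0:

    # Sketch of a wildcard responder: log every query with a serial number and
    # answer A queries with a placeholder address at TTL 0. Assumes "dnslib".
    import itertools, time
    from dnslib import RR, QTYPE, A
    from dnslib.server import DNSServer, BaseResolver

    class HoneypotResolver(BaseResolver):
        def __init__(self):
            self.serial = itertools.count(1)

        def resolve(self, request, handler):
            qname = str(request.q.qname)
            n = next(self.serial)
            # The serial is what you'd later correlate with any HTTP hit.
            print("%.0f  #%d  %s  %s  from %s" % (
                time.time(), n, QTYPE[request.q.qtype], qname,
                handler.client_address[0]))
            reply = request.reply()
            if request.q.qtype == QTYPE.A:
                # 192.0.2.1 is a documentation address; swap in your web server.
                reply.add_answer(RR(qname, QTYPE.A, rdata=A("192.0.2.1"), ttl=0))
            return reply

    if __name__ == "__main__":
        # Port 5353 avoids needing root while testing; real traffic wants 53.
        DNSServer(HoneypotResolver(), address="0.0.0.0", port=5353).start()

If you did have a big IP range, you'd swap the fixed placeholder for a distinct address derived from the serial.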
I like the idea of encoding the data. Or it could work like a URL shortener: the metadata gets recorded and a short hash is generated. That complicates the back end, but it allows for more comprehensive data storage and eventual reporting.
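A toy version of the shortener approach, with an in-memory dict standing in for the real back end (all the names here are made up):

    # Toy shortener-style back end: store the metadata, hand back a short hash.
    import hashlib, json

    STORE = {}  # stand-in for a real database

    def make_link(metadata):
        blob = json.dumps(metadata, sort_keys=True)
        token = hashlib.sha256(blob.encode()).hexdigest()[:12]
        STORE[token] = metadata
        # hex is digits plus a-f, which is still hostname-safe
        return "http://www." + token + ".yourdomain.com/"

    link = make_link({"protocol": "xmpp", "sender": "joe", "receiver": "bob"})
    print(link)    # hand this out
    print(STORE)   # everything you need for reporting later

You'd probably want to mix a timestamp or nonce into the hash so every message gets its own token, and a real database instead of the dict.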
Regarding custom DNS software, I might draw from an excellent write-up that was featured on HN recently.
Also, it'd be interesting to just crank the log level to maximum on a normal piece of DNS software, post some links around in IM clients and elsewhere, and see if anything anywhere kicks in. The experiment could be repeated (on different subdomains) with cleverer implementation tricks later.
I ended up just setting up BIND with a wildcard entry and setting its query log level to debug. It's working now, but I still need to build a little web app to generate the unique links. Also, only one DNS server is running at the moment.
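The web app really only needs one endpoint; it could be as simple as this sketch (Flask is just one option, and yourdomain.com is a placeholder for the real zone):

    # Minimal link generator: one endpoint that mints a unique honeypot URL.
    # Flask is assumed; yourdomain.com is a placeholder for the real zone.
    import uuid
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/new")
    def new_link():
        channel = request.args.get("channel", "unknown")
        label = channel + uuid.uuid4().hex   # unique and unguessable
        # A real version would also record the label and metadata somewhere.
        return "http://www." + label + ".yourdomain.com/\n"

    if __name__ == "__main__":
        app.run(port=8080)

Then curl 'http://localhost:8080/new?channel=facebook' spits out a fresh link to paste into whichever channel you're testing.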
I can't wait to send some around in Facebook messages and IMs.
...Though posting it publicly nearly guarantees I will see a hit, I can at least see if code running on HN resolves it immediately.
Edit: There is activity coming in on that name, but mostly it's from browsers prefetching DNS to prepare for the next potential pageview. My own browser did this (Chrome on Mac). I suppose that's a form of information disclosure we often overlook: on any page you can inject a link into, you can get some very basic analytics.
In the 15 minutes following the posting of that link, there have been zero clicks, 36 IPv4 lookups, 6 IPv6 lookups.
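In case anyone wants to do the same tally, something like this over BIND's query log should work (the "query: <name> IN <type>" line format is an assumption based on my debug output, so the regex may need tweaking):

    # Tally lookups per name/type from a BIND query log. The "query: <name> IN
    # <type>" pattern is an assumption about the log format; adjust as needed.
    import re, sys
    from collections import Counter

    pattern = re.compile(r"query: (\S+) IN (\S+)")
    counts = Counter()

    for line in open(sys.argv[1]):
        m = pattern.search(line)
        if m:
            counts[(m.group(1).lower(), m.group(2))] += 1

    for (name, qtype), n in counts.most_common():
        print("%5d  %-5s %s" % (n, qtype, name))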
I just registered hnypot [dot] info for a few bucks and will see if I can get wildcard DNS running with some tracking. Haha, I don't want to type the name as a link until I get the tracking going...
If anybody wants to collaborate, or just wants an NS delegation off that name to roll your own, let me know!
It is a short name, so it's likely that it'll be found. But the real honeypot would be the large hashed subdomains that you would use as bait.
I don't think the main site or its www subdomain would need to be secret. Of course, if it uncovers some huge invasion of privacy, we might have to set up an army of different domains running similar software on separate IPs to keep it effective.
When you type into the Chrome omnibox, it sends your keystrokes to Google to give you search suggestions. Just using those would be enough to discover URLs, and I don't think they really hide that it's sending your input along. If someone does a Google search for a URL, we expect it to get added to the index; why would it be different when the search happens in the omnibox rather than their web interface?
URLs don't go through Google when you type them in Chrome's URL box; they go directly to the address. Chrome only sends things to Google that it can't interpret as a URL.
Actually, it looks like URLs do go to Google's suggestion service as you type into Chrome's URL box.
I tried typing "http://then" and it suggested a UPS package URL and "thenicestplaceontheinter.net". I've never been to either of those pages before (I use this copy of Chrome for testing, so I'm not signed into it, etc.).
Good point. Come to think of it, the last time I had to deal with this, it could have been caused by any number of paths into Google: embedding the Google Analytics .js even on the staging site, running a test with the Google PageSpeed tools, and so on. But they certainly have a HUGE amount of opportunity to snap up new URLs across all their services.