Google is amazingly good at digging up sites out of nowhere. I wonder if it's a combination of URLs passing through Chrome, Gmail, any Android phone, and so on. It's always a hassle keeping staging/dev sites out of the index if you're not careful with all the right meta noindex tags and robots.txt rules. (A robots.txt that disallows everything, on its own, won't keep sites/URLs from showing up in the results; at best it hides the cached body-text summary below the link.)
With the Bing toolbar installed in IE, any URL you type or visit is submitted to Microsoft, and they actively use that data to tune Bing's search results.
I agree it'd be alarming and terrible, but hardly a new development.
Edit: it's doubtful that an e-mail provider would automatically fetch links from e-mails -- think about what would happen if they clicked 'unsubscribe' links or links that reject domain-name transfers. It would break in very obvious ways. With IMs and texts, on the other hand, that kind of meddling might be much harder to notice.
I see it getting quite complicated, though! The dimensions I can think of are: user's OS, user agent, ISP/cell carrier, transmission protocol (SMTP, XMPP, HTTP), and service provider (Google, Microsoft/Skype, Microsoft/MSN).
Then you might also have to include both the sender AND receiver information in the domain, so that a single request reveals all of the implicated parties.
I also thought about putting the sender in the path of the URI, but I think it should be in the domain name too: you might only get a hit on robots.txt, and in that case you'd see just one half of the route if the other half only lived in the path.
Finally, including everything in the DNS lets you evaluate whether the name was even resolved, and potentially by whom. Getting a hit that the name was resolved but not fetched over HTTP gives you information about which services might be analyzing links in order to queue them for further investigation.
Good call on logging DNS, that'd be a very nice early indicator even if no HTTP requests are sent!
I think maybe the domain should be of the format "www.encodedonlywithatoz.yourdomain.com" to maximize the chance of matching whatever regexes parsers use to pick out URLs (i.e. a www. prefix, a .com suffix, and no special characters). You could encode the dimensions via a lookup table to make it less verbose and slightly more obfuscated ("aa" = AT&T, "ab" = Verizon, etc.).
You shouldn't expect data in the path info to be preserved, but it'd be a nice bonus, as you say.
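For what it's worth, here's a rough Python sketch of how that lookup-table encoding could work, with the sender and receiver folded into the label as well; the two-letter codes and the yourdomain.com base are just placeholders:

    # Rough sketch: pack a few of the dimensions, plus sender and receiver,
    # into an a-z-only hostname label. Codes and the base domain are made up.
    CARRIERS  = {"att": "aa", "verizon": "ab", "tmobile": "ac"}
    PROTOCOLS = {"smtp": "ba", "xmpp": "bb", "http": "bc"}
    PROVIDERS = {"google": "ca", "skype": "cb", "msn": "cc"}

    def only_atoz(s):
        # Strip anything that isn't a-z so naive URL regexes still match.
        return "".join(c for c in s.lower() if c.isalpha())

    def make_hostname(carrier, protocol, provider, sender, receiver):
        label = (CARRIERS[carrier] + PROTOCOLS[protocol] + PROVIDERS[provider]
                 + only_atoz(sender) + only_atoz(receiver))
        return "www." + label + ".yourdomain.com"

    # -> www.aabccajoebob.yourdomain.com
    print(make_hostname("att", "http", "google", "joe", "bob"))

Decoding is just the reverse lookup, and keeping the codes fixed-width makes parsing the label on the back end trivial.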
Even more interesting would be some custom DNS software that replies with perhaps a CNAME or something, where you could encode a unique serial number per request. If you had a huge IP range available, you could even resolve to unique IP addresses for every lookup, so you could correlate DNS requests with any HTTP requests that show up later on. A low/near-zero DNS TTL would come in handy.
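Something like this might do for a first cut, using the third-party dnslib package (my assumption; a hand-rolled responder or any other DNS library would work just as well). It logs a per-request serial you could later correlate with HTTP hits, and answers A queries with a placeholder address at TTL 0:

    # Sketch of a wildcard responder: log every query with a serial number and
    # answer A queries with a placeholder address at TTL 0. Assumes "dnslib".
    import itertools, time
    from dnslib import RR, QTYPE, A
    from dnslib.server import DNSServer, BaseResolver

    class HoneypotResolver(BaseResolver):
        def __init__(self):
            self.serial = itertools.count(1)

        def resolve(self, request, handler):
            qname = str(request.q.qname)
            n = next(self.serial)
            # The serial is what you'd later correlate with any HTTP hit.
            print("%.0f  #%d  %s  %s  from %s" % (
                time.time(), n, QTYPE[request.q.qtype], qname,
                handler.client_address[0]))
            reply = request.reply()
            if request.q.qtype == QTYPE.A:
                # 192.0.2.1 is a documentation address; swap in your web server.
                reply.add_answer(RR(qname, QTYPE.A, rdata=A("192.0.2.1"), ttl=0))
            return reply

    if __name__ == "__main__":
        # Port 5353 avoids needing root while testing; real traffic wants 53.
        DNSServer(HoneypotResolver(), address="0.0.0.0", port=5353).start()

If you did have a big IP range, you'd swap the fixed placeholder for a distinct address derived from the serial.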
I like the idea of encoding the data. Or it could work like a URL shortener: the metadata gets recorded and a short hash is generated. That complicates the back end, but it allows for more comprehensive data storage and eventual reporting.
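A toy version of the shortener approach, with an in-memory dict standing in for the real back end (all the names here are made up):

    # Toy shortener-style back end: store the metadata, hand back a short hash.
    import hashlib, json

    STORE = {}  # stand-in for a real database

    def make_link(metadata):
        blob = json.dumps(metadata, sort_keys=True)
        token = hashlib.sha256(blob.encode()).hexdigest()[:12]
        STORE[token] = metadata
        # hex is digits plus a-f, which is still hostname-safe
        return "http://www." + token + ".yourdomain.com/"

    link = make_link({"protocol": "xmpp", "sender": "joe", "receiver": "bob"})
    print(link)    # hand this out
    print(STORE)   # everything you need for reporting later

You'd probably want to mix a timestamp or nonce into the hash so every message gets its own token, and a real database instead of the dict.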
Regarding custom DNS software, I might draw from an excellent write-up that was featured on HN recently.
Also, it'd be interesting to just crank the log level to maximum on a normal piece of DNS software, post some links around in IM clients and elsewhere, and see if anything anywhere kicks in. The experiment could be repeated (on different subdomains) with cleverer implementation tricks later.
I ended up just setting up BIND with a wildcard entry and setting its query log level to debug. It's working now, but I still need to build a little web app to generate the unique links. Also, only one DNS server is running at the moment.
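The web app really only needs one endpoint; it could be as simple as this sketch (Flask is just one option, and yourdomain.com is a placeholder for the real zone):

    # Minimal link generator: one endpoint that mints a unique honeypot URL.
    # Flask is assumed; yourdomain.com is a placeholder for the real zone.
    import uuid
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/new")
    def new_link():
        channel = request.args.get("channel", "unknown")
        label = channel + uuid.uuid4().hex   # unique and unguessable
        # A real version would also record the label and metadata somewhere.
        return "http://www." + label + ".yourdomain.com/\n"

    if __name__ == "__main__":
        app.run(port=8080)

Then curl 'http://localhost:8080/new?channel=facebook' spits out a fresh link to paste into whichever channel you're testing.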
I can't wait to send some around in Facebook messages and IMs.
...Though posting it publicly nearly guarantees I will see a hit, I can at least see if code running on HN resolves it immediately.
Edit: There is activity coming in on that name, but mostly it's from browsers prefetching DNS to prepare for the next potential pageview. My own browser did this (Chrome on Mac). I suppose that's a form of information disclosure we often overlook: on any page you can inject a link into, you can get some very basic analytics.
In the 15 minutes following the posting of that link, there have been zero clicks, 36 IPv4 lookups, 6 IPv6 lookups.
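In case anyone wants to do the same tally, something like this over BIND's query log should work (the "query: <name> IN <type>" line format is an assumption based on my debug output, so the regex may need tweaking):

    # Tally lookups per name/type from a BIND query log. The "query: <name> IN
    # <type>" pattern is an assumption about the log format; adjust as needed.
    import re, sys
    from collections import Counter

    pattern = re.compile(r"query: (\S+) IN (\S+)")
    counts = Counter()

    for line in open(sys.argv[1]):
        m = pattern.search(line)
        if m:
            counts[(m.group(1).lower(), m.group(2))] += 1

    for (name, qtype), n in counts.most_common():
        print("%5d  %-5s %s" % (n, qtype, name))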
I just registered hnypot [dot] info for a few bucks and will see if I can get wildcard DNS running with some tracking. Haha, I don't want to type the name as a link until I get the tracking going...
If anybody wants to collaborate, or just wants an NS delegation off that name to roll your own, let me know!
It is a short name, so it's likely that it'll be found. But the real honeypot would be the large hashed subdomains that you would use as bait.
I don't think the main site or its www subdomain would need to be secret. Of course, if it uncovers some huge invasion of privacy, we might have to set up an army of different domains running similar software on separate IPs to keep it effective.
When you type into the Chrome omnibox, it sends your keystrokes to Google to give you search suggestions. Just using those would be enough to discover URLs, and I don't think they really hide that it's sending your input along. If someone does a Google search for a URL, we expect it to get added to the index; why would it be different when the search happens in the omnibox rather than their web interface?
URLs don't go through Google when you type them in Chrome's URL box; they go directly to the address. Chrome only sends things to Google that it can't interpret as a URL.
Actually, it looks like URLs do go to Google's suggestion service as you type into Chrome's URL box.
I tried typing "http://then" and it suggested a UPS package URL and "thenicestplaceontheinter.net". I've never been to either of those pages before (I use this copy of Chrome for testing, so I'm not signed into it, etc.).
Good point. Come to think of it, the last time I had to deal with this, it could have been caused by any number of paths into Google: embedding the Google Analytics .js even on the staging site, running a test with the Google PageSpeed tools, and so on. But they certainly have a HUGE amount of opportunity to snap up new URLs across all their services.