Hacker News "Submit" functionality needs a Digg-like duplicate alert

brk · on March 24, 2008

Usually when I've submitted a dupe story, if it already exists it just adds a vote to the pre-existing story.

Where the logic either breaks (or allows people to subvert it) is where different URL's can get you to the same story. Often times URL's contain some superfluous flags that don't change the content served, but just serve to log some referrer or layout type data (I'm sure everyone reading this site gets how this works). Adding, removing, or changing any of this data seems to pretty much break or confuse whatever dupe-detector logic exists.

jharper · on March 24, 2008

Human dupe-detection would be an excellent extension to this process.

Xichekolas · on March 24, 2008

Why not just compare the content at the other end of the link with the contents of existing links.

It wouldn't be that hard. Whenever a link is submitted, YC's server would visit the link, get the response, strip all html tags and white space from it, then hash whatever is left. It would then store this hash value with the link. Whenever a new story is submitted, it is likewise hashed and then a check is made for an existing link with the same hash value. If it exists, it's a dupe, if not, allow it.

This would be an extra check to the existing dupe URL string of course. It still wouldn't catch every single thing, but it should eliminate quite a few easy dupes.

If that turns to have a low success rate, try hashing the page title or maybe the http headers.

joshwa · on March 24, 2008

A single comment or timestamp would change the hash.

Maybe the <title>, or the contents of the first <h1> or something would be a better proxy.

Xichekolas · on March 24, 2008

Yeah that is what I was thinking when I added that last line.

For some reason I initially wasn't thinking about comments... so the title would be a much better proxy.

brk · on March 24, 2008

Are you suggesting that new submissions route through Mechanical Turk?

LOL.

Once approach might be that when humans detect dupes, they could be reported. Click the "dupe" link, specify the URL(s) of the dupe(s), and submit. The oldest submission "wins", and the data could be used to train a bayesian dupe detector. I imaging that you could start with a URL text match (it's the ends of the string that tend to be different), along with a check of the <title> for the supposedly dupe page, and maybe the first 128 characters of the story text or something.

It actually sounds like a fun project.

pg · on March 24, 2008

It catches most dupes, just not on sites like the NYT that have so many different urls.

far33d · on March 24, 2008

Maybe you could check if the title of the page is the same as well, not as an automatic detection, but instead as a reason to ask "are you sure this isn't the same as foo". This might prevent most of the NYTimes dupes.

derefr · on March 24, 2008

Would it be that hard to just take a fulltext index of each page that hits the hot page? From there, just show anything with some >N% similarity (probably 98 or so, as text ads can affect the site a little bit.)

ph0rque · on March 24, 2008

With your comment, I saw for the first time how the semantic web might be useful.

brlewis · on March 24, 2008

Sounds like a nice-to-have feature. I don't think it's that bad to delete your own post if you notice the dup soon enough. No skin off your nose if the moderators kill it, either. That's happened to me.

jakewolf · on March 24, 2008

Hah, I beat you by 1/2 an hour http://news.ycombinator.com/item?id=144730

manvsmachine · on March 24, 2008

If I notice it early enough, I usually do what you just did and post the link to the original submission as a comment.

chaostheory · on March 24, 2008

the only thing I noticed before (not sure if it's still valid) is that www.website.com is a dif submission than website.com

jey · on March 24, 2008

Those aren't the same URLs; the dupe detector currently only rejects exactly duplicated URLs.

tim2 · on March 24, 2008

The way they have it executed is incredibly annoying.