Humanhash: Human-readable representations of digests (github.com/zacharyvoase)
87 points by zacharyvoase on Dec 12, 2011 | 22 comments


By far the best solution to this problem that I've seen or used is https://github.com/singpolyma/mnemonicode ; original docs here: http://web.archive.org/web/20101031205747/http://www.tothink...


Anyone doing something about the UX of security is a friend.

I agree with various commenters' would-be-nice features (larger word list for more bits represented per word, better phonetic distance between items in the word list), and have one of my own (an implementation in C with no library dependencies would be nice).

Solid effort overall.


It might be useful for confirming over the phone that you have the right file, or for other crude applications, but you're checking such a small subset of the digest that crafting a fake payload with the same humanhash becomes downright trivial.

With that in mind, the code could be much simpler: assign each possible pair of hex digits to one of 256 words, and make the humanhash the words for the first n pairs of the digest.
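
A minimal sketch of that simpler scheme (the word list here is a four-entry placeholder standing in for a real 256-word list):

    WORDS = ["abraham", "bacon", "carbon", "delta"]  # placeholder; imagine 256 entries

    def simple_humanhash(hexdigest, n=4):
        # Take the first n hex-digit pairs and look each one up directly.
        pairs = [hexdigest[i:i + 2] for i in range(0, 2 * n, 2)]
        # The modulo is only needed because this placeholder list is short.
        return "-".join(WORDS[int(p, 16) % len(WORDS)] for p in pairs)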

Or am I missing something big?


I wanted the humanhash to depend on the whole of the input, so I use a compression function (similar to a checksum) rather than simply selecting the first `n` bytes of input.

The humanhash is not intended to guarantee any level of collision protection, it's simply supposed to be a human-memorable representation of a much longer underlying digest. When computers exchange representations, they should certainly use the full digest.

However, even with only four bytes, you're looking at a 1 in 4.3 billion chance of collision. So for user interfaces, where a user merely has to distinguish between different values, or even a situation where two people want to check that they're talking about (probably) the same thing, it remains useful.
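
Roughly, the compression does something like this (a simplified sketch of the idea, not the exact code):

    from functools import reduce

    def compress(digest_bytes, target=4):
        # Fold the whole digest into `target` bytes by XOR-ing the bytes of
        # each segment together, so every input byte affects the result.
        # Assumes len(digest_bytes) >= target (e.g. an MD5 or SHA digest).
        seg_len = len(digest_bytes) // target
        segments = [bytearray(digest_bytes[i * seg_len:(i + 1) * seg_len])
                    for i in range(target)]
        segments[-1].extend(digest_bytes[target * seg_len:])  # leftover bytes
        return bytes(reduce(lambda a, b: a ^ b, seg) for seg in segments)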


Your choice of words seems poor. Many of the words sound similar, even words next to each other in the list such as "winner"/"winter", "march"/"mars" and "six"+"ten"/"sixteen". You might want to look at e.g. http://en.wikipedia.org/wiki/PGP_word_list to see how others have solved the same problem.

It doesn't appear that you've made any attempt to secure this humanhash against a malicious third-party.

If the input to humanhash really is the result of a cryptographically strong hash function, then at first glance "XOR everything into an n-byte buffer" looks no stronger (and possibly weaker) than the naive "just use the first n bytes". (Disclaimer: not my field, although I did take a couple of formal crypto classes once.)

If the input to humanhash is arbitrary, and you aren't protecting against a malicious third party, then why not use a much-better understood CRC rather than "XOR everything into an n-byte buffer"?

EDIT: On reflection, the input to humanhash has to be short (because the compression function needs to know the length of the input and so can't operate on a stream) and of known length (because you're just xor'ing, and it's trivial to get two inputs to "compress" to the same thing by padding one with NUL bytes.)
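
To make the NUL-padding point concrete, here's a sketch under the naive cyclic-XOR reading of the compression (not necessarily the library's actual code):

    def xor_fold(data, n=4):
        # XOR every input byte into an n-byte buffer, cycling through positions.
        buf = bytearray(n)
        for i, byte in enumerate(data):
            buf[i % n] ^= byte
        return bytes(buf)

    a = b"hello world!"
    b = a + b"\x00" * 3                 # pad one input with NUL bytes
    assert xor_fold(a) == xor_fold(b)   # both "compress" to the same value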


You're correct, humanhash was never intended to be secure. I felt that, for usability, I'd have to surrender some collision protection in exchange for brevity. I was originally taking the first N bytes, but I decided to use a simple XOR-based checksum over N chunks of the input instead. This may have been cargo cult cryptography.

I hadn't seen the PGP word list; thanks a lot for that. I would integrate that list this very second if I could, but I'm concerned that PGP owns the copyright to the list.


It's generally accepted, with modern cryptographic hash functions, that it's better to just take the first N bytes, rather than run a naive checksum over the hash.


If the input is already expected to be a hash, you don't need to do compression, and if it isn't, just MD5 it first. I forget what the property is called, but a hash is supposed to change a lot for a small change in the input. Thus, any n bytes from the hash represent the hash equally well.

> So for user interfaces, where a user merely has to distinguish between different values, or even a situation where two people want to check that they're talking about (probably) the same thing, it remains useful.

Yes, absolutely, as I also wrote. I just wanted (and maybe you do too) to warn that it's not appropriate for sensitive stuff, while the relative complexity of the implementation might suggest that it is.


> I forget what the property is called, but a hash is supposed to change a lot for a small change in the input.

http://en.wikipedia.org/wiki/Avalanche_effect


I have a slightly more complex scheme that eliminates some of the drawbacks of humanhash while adding some of its own. Given a word-hash and the number of bits per word used to generate it, it gives back the original numeric hash in the obvious way: hash each word, truncate each hash to that many bits, and concatenate the bits. In the other direction, given a numeric hash, a wordlist to draw from, and a desired number of bits per word, it finds all the words in the dictionary whose truncated hashes match the appropriate hash bits, offers the user a choice among them, and concatenates the chosen words.
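
In sketch form (the helper names and the use of MD5 to hash individual words are just illustrative choices):

    import hashlib

    def word_bits(word, bits):
        # Hash a single word and keep only its top `bits` bits.
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        return h >> (128 - bits)

    def words_to_number(words, bits_per_word):
        # Word-hash -> numeric hash: truncate each word's hash and concatenate.
        n = 0
        for w in words:
            n = (n << bits_per_word) | word_bits(w, bits_per_word)
        return n

    def candidates(chunk, wordlist, bits_per_word):
        # Numeric hash chunk -> every word whose truncated hash matches it;
        # the user picks whichever word they like best for each chunk.
        return [w for w in wordlist if word_bits(w, bits_per_word) == chunk]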

This has the nice property that a given set of words uniquely translates back to a specific hash, and that the user can use their favorite wordlist as long as it's large enough. If you were to use it for recognition rather than memorization (and wanted fewer words to remember), you would have to manually decide where to truncate the hash before converting it. If you wanted to pick a specific wordlist, you could do that.

I originally came up with this as a system for helping to memorize IPv6 addresses; it also works as a system for passphrase generation (first generate a random number with the desired number of bits of entropy, then convert to words.)


This seems to work by "compressing the input digest to a fixed number of bytes, then mapping those bytes to one of 256 words." So since there are only 256 words, you end up with a lot less information than you put in, unless you use something like 16 words.

It would be better if we had a much larger dictionary, since we could map more bytes to a single word -- with a list of 65,536 words we would need only 8 words per 16 byte hash, which I think is a lot more manageable.
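
For example (BIG_WORDLIST being a hypothetical 65,536-entry list, not something the library ships):

    def humanize_16bit(digest_bytes, wordlist):
        # Each word encodes 16 bits, so a 16-byte digest maps to 8 words.
        assert len(wordlist) == 2 ** 16 and len(digest_bytes) % 2 == 0
        chunks = [(digest_bytes[i] << 8) | digest_bytes[i + 1]
                  for i in range(0, len(digest_bytes), 2)]
        return "-".join(wordlist[c] for c in chunks)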


I wouldn't mind using a list of 65,536 words, but I came up with the existing list by hand, and it took me a good half-hour. I guess it would be possible to generate a wordlist, or even combine phonemes to make fake but valid-sounding words (similar to a password generator), but for a proof of concept I feel the current list has value.

Note that I never intended this to have all the uniqueness of the original digest. It's merely supposed to be 'good enough' for UI purposes.


I wonder how many bits of information could be packed into a single (nonsensical) syllable, without being too prone to misinterpretation.

https://en.wikipedia.org/wiki/English_phonology#Phonotactics

I think this is getting a little beyond my linguistics skills...

Edit: I suppose which syllables are easily corruptible depends on the medium: written hashes would need a robust set of graphemes where spoken hashes would need robust phonemes, for example, while memorised hashes would likely be different again. I wonder if there's been any research into this.
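
A rough back-of-the-envelope, assuming a deliberately restricted inventory of onsets, vowels and codas (the counts are guesses, not real phonology):

    import math

    # Guessed counts for a restricted, hard-to-confuse inventory.
    onsets, vowels, codas = 18, 5, 12
    syllables = onsets * vowels * (codas + 1)   # +1 for open (no-coda) syllables
    print(math.log(syllables, 2))               # ~10.2 bits per syllable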


You could use nltk's wordnet to generate a bunch of words of certain kinds. For a proof of concept, UNIX systems should already have a list of words, probably in /usr/share/dict/words (OS X) or maybe /usr/dict/words.

/usr/share/dict/words is likely to be pretty big, over 200k words, but it's still human-readable. You could prune it by removing proper nouns (i.e. anything capitalized).
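
A quick sketch of the pruning (the path varies by system):

    # Load the system word list and keep short, lowercase, alphabetic entries.
    with open("/usr/share/dict/words") as f:
        words = [w.strip() for w in f]
    words = [w for w in words if w and w[0].islower() and w.isalpha() and len(w) <= 7]
    print(len(words))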


It might also be a good idea to take the Metaphone of each word and only select words from the dictionary that have unique metaphones.
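
For example, assuming the jellyfish package for its metaphone() (any Metaphone implementation would do):

    import jellyfish  # assumption: jellyfish is installed and provides metaphone()

    def unique_sounding(words):
        # Keep only the first word seen for each Metaphone key, so no two
        # surviving words share a sound-alike encoding.
        seen, keep = set(), []
        for w in words:
            key = jellyfish.metaphone(w)
            if key not in seen:
                seen.add(key)
                keep.append(w)
        return keep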


To make this useful, both parties verifying a key need to use exactly the same word list. So either there needs to be a standardized wordlist, or the words have to be generated by the algorithm itself (e.g. using syllables, as others suggested). Standardizing an algorithm is usually simpler than standardizing an arbitrary word list.


Very cool. Looking at the source, it seems to further digest the digest down to only 4 bytes by default. It would be cool to increase the size of the dictionary to, say, 12 bits per word (instead of 8 bits).


I wonder if the author has considered SSH randomart as a solution to this problem. IMHO our brains are better suited to finding differences in images than in word sequences, as verbose as they might be.


The hashes would be easier to remember if they were sentences saying something funny. For example "Green elephant riding a bicycle" or "Golden ostrich selling a train" etc.
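
Something like this toy sketch, where each byte of the digest picks a word from a part-of-speech list (the lists are tiny placeholders):

    ADJECTIVES = ["green", "golden", "sleepy", "angry"]
    ANIMALS    = ["elephant", "ostrich", "walrus", "badger"]
    VERBS      = ["riding", "selling", "painting", "juggling"]
    OBJECTS    = ["a bicycle", "a train", "a teapot", "a cactus"]

    def silly_sentence(digest_bytes):
        # One byte per slot; the modulo only matters for these short lists.
        a, n, v, o = digest_bytes[:4]
        return "%s %s %s %s" % (ADJECTIVES[a % len(ADJECTIVES)].capitalize(),
                                ANIMALS[n % len(ANIMALS)],
                                VERBS[v % len(VERBS)],
                                OBJECTS[o % len(OBJECTS)])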



PGP presents something readable like this for key fingerprint confirmation over the phone. I think it uses the entire thing though.




