Sorting and searching at the library (jorendorff.github.com)
150 points by nkurz on Aug 12, 2012 | 25 comments


Thanks for the great read!

It's my life's dream to do a real-life process vs. algorithm design comparison like this one.

Some day, I will visit hundreds of the world's best kitchens (normalized for ingredients budget) and survey how the instructions filter down from the head chefs there to the prep cooks.

I would love to see the best practices in the process from raw ingredients to completed dishes abstracted to the highest level, then compared with concurrency libraries and best practices, to see whether software design and kitchen instruction exhibit convergent evolution because the pressures they face are similar at an abstract level.


I, too, would love to see this kitchen-algorithm study done. Not just in how the instructions filter down, but also in how resources are allocated and locked.

What happens when two chefs are ready to fulfill an order at the same time? What happens if a resource, such as a stovetop, is under contention? Does the higher-up chef in the social hierarchy win whenever a resource is under contention (giving a whole new meaning to master-slave), or is there weight given to the priority of the task? Do kitchens evolve a verbal "code" to communicate hierarchy? Are there any large kitchens that don't use the strategy of tacking all orders on the same wall... and how was that strategy developed in the first place? And are any of these things taught in cooking schools?
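(If you squint, the stovetop question maps onto priority-based lock acquisition. Here is a minimal Python sketch of that idea; the burner-as-lock model and the rule that task priority, rather than the chef's rank, decides who goes next are my own assumptions, not anything from the article.)

    import heapq
    import threading

    # Toy model: a burner is a lock, and waiting tasks are served in order of
    # task priority (lower number = more urgent) instead of chef seniority.
    class Burner:
        def __init__(self):
            self._lock = threading.Lock()
            self._waiting = []      # min-heap of (priority, ticket, event)
            self._ticket = 0
            self._busy = False

        def acquire(self, priority):
            with self._lock:
                if not self._busy:
                    self._busy = True
                    return
                event = threading.Event()
                self._ticket += 1
                heapq.heappush(self._waiting, (priority, self._ticket, event))
            event.wait()            # block until release() hands us the burner

        def release(self):
            with self._lock:
                if self._waiting:
                    _, _, event = heapq.heappop(self._waiting)
                    event.set()     # most urgent waiting task proceeds
                else:
                    self._busy = False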

Heck, this could be a whole book exploring a whole bunch of industries... construction, door-to-door marketing, and more. I have no idea who the ideally-situated person would be to do something like this (maybe it's you, or the OP), but if the right person started a Kickstarter campaign, I'd definitely contribute.


> It's my life's dream to do a real-life process vs. algorithm design comparison like this one.

I've done some simple analysis of card sorting. I've timed myself vs. other people sorting cards, where I use bucket sort and they use whatever they want. I usually beat them by about 20%, but that may just be because I've practiced more. I think bucket sort would have a clear advantage over naïve sorting if the deck had about 100 cards or more.

I've also tested bucket sort against merge sort, and they're about the same speed.

For my bucket sort algorithm, I use two passes. First I sort the cards into three buckets: the first contains cards 2 through 5, the second 6 through 10, and the third face cards. Then I do insertion sort on the buckets. For merge sort, I switch to insertion sort for sufficiently short lists.
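(Roughly, in Python, the procedure looks like the sketch below; the card encoding and the exact rank cutoffs are my guesses at what the description above means, not the poster's exact rules.)

    # Cards are (rank, suit) tuples, rank 2..14 (11=J, 12=Q, 13=K, 14=A).
    def insertion_sort(cards, key=lambda c: c):
        for i in range(1, len(cards)):
            card, j = cards[i], i - 1
            while j >= 0 and key(cards[j]) > key(card):
                cards[j + 1] = cards[j]
                j -= 1
            cards[j + 1] = card
        return cards

    def bucket_sort_cards(cards):
        # Pass 1: deal into three coarse buckets, as described above.
        buckets = {"low": [], "mid": [], "face": []}
        for card in cards:
            rank = card[0]
            if rank <= 5:
                buckets["low"].append(card)
            elif rank <= 10:
                buckets["mid"].append(card)
            else:
                buckets["face"].append(card)
        # Pass 2: insertion sort each small bucket, then concatenate.
        result = []
        for name in ("low", "mid", "face"):
            result.extend(insertion_sort(buckets[name], key=lambda c: c[0]))
        return result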


Atul Gawande (surgeon and writer for The New Yorker) has done some of what you mention.

Read this. http://www.newyorker.com/reporting/2012/08/13/120813fa_fact_...


This was a fascinating article. I do have a small correction: bucket sort is more similar to quicksort than to merge sort.

    merge sort             bucket sort
    1. divide (in half)    1. divide (in bins)
    2. sort halves         2. sort bins
    3. merge               3. done!

(Edit: Sorry, the formatting isn't great. I don't know how to do fixed-width text.)

The big difference here is that for merge sort, dividing the list is trivial: all you have to do is split it into the first half and the second half. But for bucket sort, you have to actually look at each element to decide which bin it should go in. This is more like quicksort's partition step.
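(A small sketch of the point, assuming plain Python lists: merge sort's divide step never inspects an element, while bucket sort's distribute step, like quicksort's partition, must look at every one.)

    def mergesort_split(items):
        # Trivial divide: no element is examined, only the length.
        mid = len(items) // 2
        return items[:mid], items[mid:]

    def bucket_distribute(items, boundaries):
        # Data-dependent divide: each element is compared against the bucket
        # boundaries, much like comparing against a pivot in quicksort.
        buckets = [[] for _ in range(len(boundaries) + 1)]
        for x in items:
            i = 0
            while i < len(boundaries) and x > boundaries[i]:
                i += 1
            buckets[i].append(x)
        return buckets

    def quicksort_partition(items, pivot):
        # Same shape as the distribute step, with exactly two "bins".
        left = [x for x in items if x <= pivot]
        right = [x for x in items if x > pivot]
        return left, right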


Great article, thank you.

Librarians aren't as dumb as you make them out to be. Several times I've asked a librarian to locate a mis-shelved book I really needed, and they've always succeeded. I gather it's a pain, but it's considered a big victory.

Also, they've taught me, as a library patron, to take any book that I notice mis-shelved and set it on a nearby table.

It never ceases to amaze me how human intelligence and care can transform algorithms that are theoretically brittle into systems that endure for centuries.


Librarians don't cheat by posting the prefix of the key on the aisles. That's called radix sort.
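(A toy way to picture that, with made-up aisle labels: the posted prefixes act like the first pass of a most-significant-digit radix sort, jumping you straight to one bucket before any shelf-level comparison happens.)

    import bisect

    # Hypothetical aisle labels: each aisle starts at a call-number prefix.
    AISLE_STARTS = ["000", "200", "400", "600", "800"]

    def find_aisle(call_number):
        # The posted prefixes select one aisle (one radix bucket) directly,
        # instead of binary-searching across the whole library.
        i = bisect.bisect_right(AISLE_STARTS, call_number[:3]) - 1
        return max(i, 0)

    print(find_aisle("512.73 KNU"))   # -> 2, the aisle covering "400"-"599"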


The article says this about insertion sort:

> This is the fastest sorting algorithm for small arrays, up to maybe 20, 30 elements. Even in software!

No no no. Insertion sort is fastest only when the items being sorted are already in nearly sorted order. Its runtime will always depend on the number of items, n, being sorted. For a completely randomized array of items, it will run in O(n^2) regardless of how small n is. Quicksort and merge sort will do better in these situations.


Any source for that? I am pretty sure insertion sort has a lower constant than quicksort and merge sort. O(n^2) with a low constant can beat O(n log n) when n is small.

Wikipedia agrees with me too about quick sort being slower:

https://en.wikipedia.org/wiki/Quicksort#Optimizations

EDIT: Wikipedia says the same about merge sort.

https://en.wikipedia.org/wiki/Merge_sort#Comparison_with_oth...
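(One way to see the constant-factor argument concretely, a sketch using nothing beyond the standard library: a merge sort that falls back to insertion sort below a small cutoff, which is roughly the optimization both Wikipedia pages describe. The cutoff of 16 is an illustrative guess, not a tuned value.)

    def insertion_sort(a):
        # O(n^2) comparisons in the worst case, but very little overhead per
        # comparison: no recursion, no allocation, good locality on small n.
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a

    def hybrid_mergesort(a, cutoff=16):
        if len(a) <= cutoff:
            return insertion_sort(a)
        mid = len(a) // 2
        left = hybrid_mergesort(a[:mid], cutoff)
        right = hybrid_mergesort(a[mid:], cutoff)
        # Merge the two sorted halves.
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        out.extend(left[i:])
        out.extend(right[j:])
        return out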


You bring up very valid points. I was speaking in terms of asymptotic runtime analysis, which does not care about constant overhead. However, the cost per operation will depend on the architecture that runs it.


You cannot talk about asymptotic behavior ("as n grows to infinity") and "regardless if n is small" at the same time. If you're talking asymptotic, you should ignore the constant. If you're talking about small n, then you can't ignore the constant.


Yes, which is why arbitrary cutoffs like Java's sorting cutoff have varying efficiency depending on the processor architecture. These cutoffs between different algorithms are vital, though, for implementing fast multiplication of large numbers: GMP has a different set of multiplication cutoffs for every supported CPU architecture.
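(For what it's worth, such a cutoff can be found empirically. A toy sketch, not how Java or GMP actually tune theirs: time plain insertion sort against the library sort at increasing sizes and report the crossover point.)

    import random
    import timeit

    def insertion_sort(a):
        for i in range(1, len(a)):
            x, j = a[i], i - 1
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a

    def crossover(sizes=range(4, 129, 4), trials=500):
        # First input size where the built-in sort beats plain insertion sort
        # on this machine; below that size, an insertion-sort cutoff pays off.
        for n in sizes:
            data = [random.random() for _ in range(n)]
            t_ins = timeit.timeit(lambda: insertion_sort(data[:]), number=trials)
            t_lib = timeit.timeit(lambda: sorted(data), number=trials)
            if t_lib < t_ins:
                return n
        return None

    print(crossover())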


Great article! Really hits home for me because I spent a year working in the children's room of a local library. I was that guy who helped people find books, stacked books on shelves, and checked people out.

The worst of the three jobs was definitely stacking books back on shelves. I'd start with a cart filled with books. OP has it totally right — they are grouped (sometimes, but not always!) by general type — but within those types they're usually randomly arranged.

At this point in time I'd been programming for around a year and a half and was just starting to try to learn about different search algorithms. If only I'd realized that I was sorting every day! Looking back, I found the fastest method (for me) to be quicksort. Go figure!


Thanks for the article! I love to see this type of stuff on HN.

It's so interesting how we think of cost in a computer vs what really costs us in human time.


A library isn't RAM. If library books were RAM, you'd clone yourself many times and stand in front of each book, and all of you would share a mind. Then performing binary search in the library would be sort of okay.

Or it could be organized in a circle with you at the center: you don't move, but you pull/push each book with a magnet. RAM is more like the former, though. So many parallel flip-flops. No wonder it's expensive.


He says that in the post, in the part where he discusses the cost of the algorithm. For computers the cost is only the comparisons; for the library it is the comparisons plus the distance to walk.
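(A toy cost model along those lines, with made-up weights rather than anything from the article: count each comparison as 1 and each shelf walked as w, and compare a linear walk with binary search.)

    def linear_search_cost(n, target, walk_weight=1.0):
        # Walk shelf by shelf from position 0, one comparison per shelf visited.
        return (target + 1) + walk_weight * target

    def binary_search_cost(n, target, walk_weight=1.0):
        # One comparison per probe, plus the distance walked between probes.
        lo, hi, pos, cost = 0, n - 1, 0, 0.0
        while lo <= hi:
            mid = (lo + hi) // 2
            cost += 1 + walk_weight * abs(mid - pos)
            pos = mid
            if mid == target:
                return cost
            if mid < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return cost

    # When walking is free (w=0) binary search wins easily; once walking costs
    # as much as a comparison, a short linear walk to a nearby shelf can beat
    # hopping back and forth across the stacks.
    for w in (0.0, 1.0, 10.0):
        print(w, linear_search_cost(1000, 40, w), binary_search_cost(1000, 40, w))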


This is one of the few times I've read something and all the comments are overwhelmingly positive.

I really enjoyed it. It's tricky finding real-world metaphors that can be applied and then deconstructed into the technical details. Bookmarked to show to people struggling to understand algorithms.


I think most of the books at the library I visit already have RFID chips inside. I look forward to reserving a couple of books, walking in, and having the actual physical location of each book light up in an augmented reality display.

At that point, it won't even be necessary to sort.


Sorting is still necessary... but you might be interested in ShelvAR - http://shelvar.com/


That is cool. I wonder if a good library worker is faster than ShelvAR?


That's fun to imagine, but wouldn't it be even better if you could get the books sent to your e-reader/tablet/computer in seconds? Then there's no need to travel to the library.

I haven't been to a library in years (San Francisco's main library might as well be a homeless shelter), so I don't have a clue how popular e-books are in public libraries.


Ah, but then you have avoided the library's best feature... the sorted shelf!

Just the other day I was looking for "Roadside Geology of Vermont and New Hampshire," and happened across "In Suspect Terrain" by John McPhee -- an absolute gem. (Hint: don't write down the entire catalog number of the book you want; force yourself to look for a bit.)


Awesome writeup! I will definitely be pointing friends who are curious about algorithms to this.


Ok, I'm sorry to be the negative guy here, but there is something terribly wrong in the article. The library is not an array. It's more of a hash table than an array. You first calculate the hash for what you are searching for, go to that location, and search there.

Once you stop thinking of a library as an array and think of it more realistically, you can see why what the writer is talking about doesn't make much sense.
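(For the sake of argument, here is a toy version of that model; the subject-to-section mapping standing in for the "hash" is invented. Lookup becomes: compute the section, walk there, then search within it — and that within-section search is the sorted-shelf part the article focuses on.)

    # Hypothetical "hash": a subject heading maps straight to a section, the
    # way a classification number maps to a range of shelves.
    SECTIONS = {
        "geology": ["In Suspect Terrain", "Roadside Geology of Vermont"],
        "algorithms": ["Introduction to Algorithms", "The Art of Computer Programming"],
    }

    def find_book(subject, title):
        # Step 1: "hash" the query to a bucket (walk to the right section).
        shelf = SECTIONS.get(subject, [])
        # Step 2: search within the bucket; real shelves keep this part
        # sorted by call number, which is where the article's comparisons
        # and walking costs come in.
        return title if title in shelf else None

    print(find_book("geology", "In Suspect Terrain"))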


Really great. I wonder if there are other algos that are being overlooked because we are looking at them with the wrong cost model in mind.



