Hacker News
DIY Book Scanner (diybookscanner.org)
196 points by KolmogorovComp on Jan 7, 2024 | 52 comments


Remember when Google was cool and not evil and released their book scanner project for free? https://code.google.com/archive/p/linear-book-scanner/

  "Google hereby grants to you a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, transfer, and otherwise run, modify and propagate this design..."


The link that's less likely to be sunset due to someone's random whims: https://github.com/google/linear-book-scanner#readme


I made one of these during my MBA. I spent close to AUD $2000, including the two Nikon mirrorless cameras I purchased. I am not particularly handy and made it out of spare 2x4 lumber I had, so it wasn’t light.

But it worked: I scanned about two dozen short-term library books that I needed to reference frequently during my course, at a cost of about $85 per book. If I’d purchased the time-limited ebooks they would have cost $125 each and been scattered across 3 different bookstore apps.
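For anyone checking the arithmetic, here is a minimal sketch in Python. It assumes "about two dozen" means exactly 24 books and that all the quoted figures are in the same currency; the function names are made up for illustration.

```python
import math

def cost_per_book(rig_cost: float, books: int) -> float:
    """Rig cost spread evenly over the books scanned."""
    return rig_cost / books

def break_even_books(rig_cost: float, ebook_price: float) -> int:
    """Smallest number of books at which the rig costs no more than buying the ebooks."""
    return math.ceil(rig_cost / ebook_price)

RIG_COST = 2000      # total spend quoted above, cameras included
BOOKS = 24           # assumption: "about two dozen" taken as exactly 24
EBOOK_PRICE = 125    # price quoted above for each time-limited ebook

print(round(cost_per_book(RIG_COST, BOOKS), 2))    # 83.33
print(break_even_books(RIG_COST, EBOOK_PRICE))     # 16
print(BOOKS * EBOOK_PRICE - RIG_COST)              # 1000
```

Under those assumptions the quoted figure of about $85 per book is in the right ballpark, and the rig pays for itself at around 16 books.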

I would scan while watching TV and could do approx 1000 pages per hour.

I also learned that I should not do carpentry, and potentially saved tens of thousands by hiring a handyman or carpenter for home DIY…


There is an amazing one using plastic plumbing pipes still on YouTube:

https://www.youtube.com/watch?v=ns3jGFbJvXI


Someone at the time had a CNC aluminium one on eBay for $700 or so. I thought "I can do that cheaper". I was very wrong. The actual parts weren't too bad (I still had to spend $1400 on the cameras, IIRC), but the number of tools I didn't own was a lot higher than I expected.

Still no regrets, I had a fun week of arts and crafts and got to stick it to Elsevier and other academic publishers :D


It does sound like an amazing project you did there.

You are also making a very good point about the surprising effort and equipment this can take. I ran into trouble just fixing something on a heatsink the other day. Turns out that in addition to a drill, drill bit and tap set, I really need something to keep the drill straight :)


With Covid lockdowns I got a bit depressed and decided to give up my flat to travel around. When I looked at my bookshelf it broke my heart to give away/sell all my books, so I remembered the old story about Brin/Mayer (afair) at Google trying out how long it takes to photograph a book.

I did the same just with my bare hands and my smartphone with a rather short book and calculated it would take me ~2 weeks to create an imperfect (thumbs included) digitized copy of all my books. So that's what I did, eventually improving upon things:

- took a grill from the oven which stabilized the phone and relieved my arm

- created a couple of bash scripts to automate slicing and image compression

- ran tesseract-ocr (full-text search)

- ghostscript for making it a PDF

All automated and improved over time. No big bang, just trial and error and tiny steps.
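The steps in the list above could be sketched roughly as follows. This is not the commenter's actual bash; it is a Python stand-in that only builds the crop boxes and command lines. It assumes each photo shows a two-page spread split down the middle, that `tesseract` and `gs` are on the PATH, and the helper names and file names are made up for illustration.

```python
from pathlib import Path

def split_boxes(width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) for the two pages of a spread photo."""
    mid = width // 2  # assumption: the gutter sits at the horizontal centre
    return [(0, 0, mid, height), (mid, 0, width, height)]

def ocr_command(image: Path) -> list[str]:
    """tesseract <image> <output base> pdf: writes <output base>.pdf with a text layer."""
    return ["tesseract", str(image), str(image.with_suffix("")), "pdf"]

def merge_command(page_pdfs: list[Path], book_pdf: Path) -> list[str]:
    """Ghostscript call that concatenates the per-page PDFs into one file."""
    return ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
            f"-sOutputFile={book_pdf}", *map(str, page_pdfs)]

pages = [Path("page_001.png"), Path("page_002.png")]  # hypothetical file names
print(split_boxes(4000, 3000))
for page in pages:
    print(" ".join(ocr_command(page)))
print(" ".join(merge_command([p.with_suffix(".pdf") for p in pages], Path("book.pdf"))))
```

Passing each list to `subprocess.run` would actually execute it; the sketch just prints the commands so it can be tried without the tools installed.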

Meanwhile I have hundreds of books. They're not perfect but perfectly readable/searchable. Why am I telling you this? Keep it simple. Unless you're more interested in engineering the "machine" than the actual product it's supposed to create.


That grill trick is a neat discovery. Makes sense that it would do the job as a stabilizer, but it's a clever reuse of something most of us have lying around the house. I bet you could use a similar kind of setup with a baking or drying rack to mount a smartphone on motors and rotate the thing in multiple dimensions, but perhaps by that point there's a better approach.


I cannot picture how a grill would be used.


Well, you can lay the phone down on the grill without the camera lenses getting covered. Then you build some kind of pillars (anything; I just used 2 stacks of books on each side) and put the grill with the phone on top. Then with one hand I triggered the shutter and with the other hand I turned the pages. It worked surprisingly well.


Wasn't searching LibGen more efficient?


For anyone who thinks this is a facetious response, I've done exactly this - use LibGen to find far more useful (to me) formats of books that I own physically. I don't see why I should be punished because I purchased the book in an age where electronic copies were not available. The author and/or their estate have gotten their fair share from me already.


LibGen and the associated SciHub are truly some of the best projects to ever exist.


The original creator of this had to suspend all activities when he got a job at Apple.

https://news.ycombinator.com/item?id=27364737


I believe Apple says this in their contracts, but in California (idk if this worker was in CA) it’s illegal to prevent employees from working on projects on personal time using personal resources, aside from maybe carve-outs for competing things. It actually really sucks that Apple acts like workers can’t moonlight; I’ve had some friends who didn’t do it out of fear of retribution despite the right being enshrined in CA law.


Well, what's a related field? Apple has a store that sells eBooks...


This article says the exemptions include, among other things:

“The nature of moonlighting work is in direct conflict with the company. For example, working with or providing consultations to competitors, competing for an employer’s clients, or activities that harm the goodwill or reputation of your employer.”

https://www.aegislawfirm.com/blog/2023/01/california-moonlig...

I think it would be hard to argue that a book scanner competes with an online ebook store, since one is an archival tool and one is a commercial store. Someone could host illegal copies of copyrighted works and they would be competing with Apple’s legitimate work, but those people would be the ones competing with Apple, not the creator of a tool they used in the process.


There's not just the moonlighting law, there's also copyright assignment for software written. California also protects that, but under different rules.


Does the guy work in that bookstore? If he is in a different department he is probably okay.


According to his website ( https://danreetz.com/resume.htm ), his employment at Apple ended in 2017.


There is this classic – very very impressive but to my knowledge not commercially available and probably never commercialized. 250 pages/min is astounding.

https://youtu.be/03ccxwNssmo


Related:

DIY Book Scanner - https://news.ycombinator.com/item?id=27361815 - June 2021 (124 comments, and btw a great thread)

DIY book scanning - https://news.ycombinator.com/item?id=991897 - Dec 2009 (7 comments)


Writing the software for an earlier version of this was one of the first open source projects I ever did, with an early 0.1x release of React. Fond memories. So great to see the project is still alive.


"alive" is a strong word but I am keeping it online as long as possible.

miss ya old friend


Shout out to you, my boy


The submission from 2021 with quite a bit of commentary: https://news.ycombinator.com/item?id=27361815


This should be easier in software, reconstructing a 3D model of the relaxed open book from a stereo or multiple photos, then using AI to "upsample" to the scalable PDF document most likely to produce the modeled image.

I was part of a font consulting company during the Postscript / Truetype font wars, and we reconstructed fonts from scans or earlier digital formats. Most of the work was fixing bad data. This all should be easier now; think of Peter Jackson cleaning up the "Let It Be" sound, leading to the Beatles releasing one more track.

It baffles me that book images don't get this quality of attention. As a mathematician I spend a lot of time reading old journal articles that look terrible.


From the front page of the link: “The easiest way to avoid page curl in your images is to flatten the pages by pressing them against glass or acrylic. While there are some computer algorithms that can help dewarp the pages after capture, it is always more reliable to just capture flat pages in the first place.”


> If you have a healthy budget, just buy DSLR cameras and use those.

Do these scanning rigs lock the mirror and shutter of the DSLR?

If not, what MTBF are they looking at, when prosumer DSLR shutter life might be around 50K actuations?


> A "good" shutter count varies depending on the camera model. Entry-level and mid-range DSLR cameras typically have a shutter count rating between 100,000 and 200,000, while professional-grade cameras can range between 400,000 and 500,000. When purchasing a second-hand camera, it's best to choose one with shutter count well below its rating.

https://checkshuttercount.com/nikon


I don't know what's accurate.

This other site, a top search hit, has some "Average number of actuations after which shutter died" data for some older models.

Consumer (lowest, 69K): https://olegkikin.com/shutterlife/canon_eos1000d.htm

Prosumer (98K): https://olegkikin.com/shutterlife/canon_eos30d.htm

That's the average, so if that data is reasonably representative of units in the wild (I don't know), I'd think a trustworthy rating (and safe expectation) would be lower than that.

The reason I mentioned shutter life was that I wanted to know how the scanning projects using DSLRs managed that, and also to suggest to anyone dropping money on a DSLR for this that shutter life might be a cost consideration.
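To put those shutter counts in terms of books, here is a rough sketch in Python. It assumes a two-camera rig where each camera shoots one page per actuation, and a 300-page book; both numbers are assumptions, as is the function name.

```python
def books_per_shutter(rated_actuations: int, pages_per_book: int, cameras: int = 2) -> int:
    """Whole books one camera can shoot before reaching the given actuation count."""
    # Each camera photographs its share of the pages, one page per actuation.
    shots_per_camera_per_book = -(-pages_per_book // cameras)  # ceiling division
    return rated_actuations // shots_per_camera_per_book

PAGES_PER_BOOK = 300  # assumption for illustration

for label, count in [("50K guess", 50_000), ("69K average", 69_000), ("98K average", 98_000)]:
    print(label, books_per_shutter(count, PAGES_PER_BOOK))
# 50K guess 333
# 69K average 460
# 98K average 653
```

So even the pessimistic figure covers a few hundred books per camera under these assumptions, though retakes and test shots would eat into that.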


That guy posted this on here when it was new. Also he's from my town


https://diybookscanner.org/forum/viewtopic.php?f=1&p=9034

External power for the cameras is quite neat.

edit: In case anyone is curious, for battery replacements the term is (dual) "battery eliminator"


I used to use a flatbed scanner back in my undergrad, was quite a painful experience :).

Nowadays I just use https://1dollarscan.com. Turns out to be rather expensive, but still beats all that manual work.


Just slice the spine off and run it through a sheet-fed scanner. I know, I know - Sacrilege!

But you can also just set the book on a table, open it, and photograph it with your phone camera. The result is perfectly legible on your monitor.


I love the idea (as long as they are not rare out of print books). Then burn the loose pages of all the books you've scanned in a bonfire to complete the ritual. The books have now transcended the physical realm and it felt really wrong while doing it. History warns us that people will be next, are they right?


Rare out of print books are, of course, rare. The vast bulk are relatively worthless in the used market.


I once asked a special collections librarian in the US to scan a rare book. This is a perfectly normal, common practice in some libraries; for example, I've gotten scans of rare books in a library in Salzburg e-mailed to me for something like twenty euros. This one librarian, however, hadn't heard of that, and the only method he could think of to do it was the one you suggest—slicing the spine and running the pages through a multi-page scanner, and he was horrified at the idea. A long and frustrating conversation ensued as I tried to convince him that I was not trying to get him to destroy rare books in order to digitize them.


Cool to see this reposted! I see people on the forums are using dedicated compact cameras. Are they still better than smartphones?

I also wonder if using the LIDAR from iPhones for example could significantly improve the flattening process.


Yay, the forum is back online!


Lots of work behind the scenes. Welcome back.


I saw this and immediately looked to see if someone had finally posted instructions on making a GrumbleGear 3000 scanner (Robin Sloan's "Mr. Penumbra's 24-Hour Bookstore").


I was interested to see how they solved automated page turning... but none of the designs appear to address this?

Sounds like a repetitive motion injury waiting to happen after you get done scanning Fountainhead.


Fully automatic ones do exist, but it's generally not done, to avoid risking damage to the books. And without OCR of page numbers you always risk missing pages.

https://www.youtube.com/watch?v=kvM-tjrS2-U

That said, a lot of automation processes in place do destroy the book by cutting the spine and scanning it.
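On the missing-pages point, once page numbers have been OCR'd the check itself is simple. A minimal sketch in Python, assuming the OCR step already yields a list of integer page numbers and that every page in the range is numbered (the function name is made up):

```python
def missing_pages(seen: list[int]) -> list[int]:
    """Page numbers absent between the lowest and highest number recognised."""
    if not seen:
        return []
    present = set(seen)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

print(missing_pages([1, 2, 3, 5, 6, 9]))  # [4, 7, 8]
```

Unnumbered pages (plates, chapter openers) and OCR misreads would show up as false alarms, so this is a prompt to go and look rather than proof a page was skipped.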


"Ask HN: What's the best out-of-box Document OCR/Analyzing/recognition API?" (2024) https://news.ycombinator.com/item?id=38829242 ; BetterOCR :

> Better text detection by combining multiple OCR engines (EasyOCR, Tesseract, and Pororo)

FWIU there's a way to image multiple stacked sheets of ancient scrolls without unrolling them?


Is there a kit that can be bought online?

I may not have the time and energy to do a DIY, but have an immediate requirement to procure one. Any leads / directions / websites would be really appreciated.

Thank you!


Not a DIY kit, but in case LibGen is insufficient and you need a commercially available solution, ScanSnap scanners have a model for this purpose.

https://www.pfu-us.ricoh.com/scanners/scansnap/sv600

The software has automatic page-turn detection (so you don’t have to repeatedly press the scan button); has page-curve correction and deskew; and automatically removes fingers/thumbs from the image, in case you need to hold the pages down. Neat!

Like another commenter, I used 1dollarscan to digitize many books (to save space) but I agree that that process was more expensive than expected (and destroyed my physical books, which I have come to regret). If I had known about the scanner I just linked to, I probably would have invested in one instead.

Off-topic, but apparently Ricoh has acquired the ScanSnap brand from Fujitsu. (News to me, at least.) But unless Ricoh has changed something, in my experience it’s hard to go wrong with the ScanSnap brand for personal scanning needs.

(I have no affiliation with the companies or brands mentioned.)



Thanks and really appreciate it.

But none of them are available or selling it currently.

Also, the forum looks dead / defunct.

No noticeable activity on any thread! Sad to see such a vibrant community gone extinct.

I know it is not an active area unlike other DIY and the requirements are also very low. But still....!


with the methods that hopefully come out of that research project that aims to read those really old scrolls that can't be touched or opened...

maybe we'll find a method for "CT scanning" a book and using imaging techniques to reconstruct the text inside without needing to flip each page?


> maybe we'll find a method for "CT scanning" a book and using imaging techniques to reconstruct the text inside without needing to flip each page

YES! Quite literally what's happening here, albeit in a different context:

https://www.nytimes.com/2023/10/12/arts/design/herculaneum-s...


It's getting to the point that we need this again.

Ebooks are basically all EPUB, and that's completely useless for typesetting. I've had to contact authors of textbooks to try and get the LaTeX source so I can read what the damned thing says on a screen.



