Yes, the OCR problem is also very interesting to me. I've been looking into the OCR/labeling field recently, and it still seems to be an active research area. There's surya-ocr, which was recently posted to YC; it's transformer-based, but still expensive to run on a really large dataset like 100 years of newspapers. Tesseract doesn't seem to handle this kind of material well. In the paper they just mention it was an active-learning type of method.
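For a sense of the baseline, a minimal Tesseract run via pytesseract looks something like this (the scan path and page-segmentation mode here are just illustrative assumptions); dense, degraded multi-column newspaper pages are exactly where it tends to fall over:

```python
# Minimal Tesseract baseline via pytesseract; the scan path is a placeholder.
# --psm 4 tells Tesseract to assume a single column of text of variable sizes;
# real multi-column newspaper layouts usually need layout analysis on top of this.
from PIL import Image
import pytesseract

page = Image.open("newspaper_scan.png")
text = pytesseract.image_to_string(page, config="--psm 4")
print(text)
```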
Exciting work. I hope they consider releasing the 138 million scans.
The FAQ at the end is a bit confusing, because it says:
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.
All data is in the dataset.
and for each record there are "newspaper_metadata", "year", "date", and "article" fields that link back to _a_ LoC newspaper scan.
I stress _a_, singular, because much is made of their process for identifying articles with multiple reprints and multiple scans across multiple newspapers; these repetitions of content (mitigated to a degree by local sub-editors), with their varying layouts, are used to make the conversion of the scans more robust.
I haven't investigated whether every duplicate article gets a separate record pointing to a distinct scan source (a rough way to check is sketched below) ...
Makes me wonder if this is a bigger problem.
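If I wanted to check, a crude first pass might look like the sketch below: group records by normalized article text and count how many distinct scan sources each repeated article maps to. The records.jsonl path is a hypothetical export, the field names just follow the ones mentioned above, and exact-match grouping would miss reprints that differ through OCR noise or local edits, so treat it as a sanity check rather than real reprint detection:

```python
# Crude duplicate check over a hypothetical JSONL export of the dataset:
# group records by normalized article text and count how many distinct
# scan sources each repeated article maps to.
import json
import re
from collections import defaultdict

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different text still matches.
    return re.sub(r"\s+", " ", text.lower()).strip()

groups = defaultdict(list)
with open("records.jsonl") as f:  # hypothetical export, one record per line
    for line in f:
        rec = json.loads(line)
        groups[normalize(rec["article"])].append(rec)

for text, recs in groups.items():
    if len(recs) > 1:
        # Treat (newspaper metadata, date) as a proxy for "which scan this came from".
        scans = {(str(r["newspaper_metadata"]), r["date"]) for r in recs}
        print(f"{len(recs)} records -> {len(scans)} distinct scan sources: {text[:60]}...")
```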