
I just searched for a random term, and the article data appears to be full of typos and OCR mistakes in the sample I looked at.

Makes me wonder if this is a bigger problem.



Yes, the OCR problem is also very interesting to me. I have been looking into the OCR/labeling field recently, and it still seems to be an actively researched area. There is surya-ocr, a transformer-based system that was recently posted to YC, but it is still expensive to run on a really large dataset like 100 years of newspapers. Tesseract doesn't seem to handle this kind of material very well. In the paper, they just mention that they used an active-learning type of method.
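
For reference, a plain single-pass Tesseract run (via the pytesseract wrapper) on a scan looks roughly like the sketch below; the image path is a placeholder, and this is just baseline OCR, not the active-learning pipeline the paper describes:

  # Baseline single-pass OCR with Tesseract via pytesseract.
  # The image path is a placeholder; real newspaper scans generally
  # need layout analysis / column segmentation before this step.
  import pytesseract
  from PIL import Image

  scan = Image.open("newspaper_page.png").convert("L")  # grayscale
  text = pytesseract.image_to_string(scan, lang="eng")
  print(text[:500])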


Exciting work. I hope they consider releasing the 138 million scans.

The FAQ at the end is a bit confusing, because it says:

  Was the “raw” data saved in addition to the 
  preprocessed/cleaned/labeled data (e.g., to support
  unanticipated future uses)? If so, please provide a link or 
  other access point to the “raw” data.

  All data is in the dataset.


The raw scans are in the Linrary of Congress digital collection.

(^- oops, example transcription error)

The "end of described process" dataset descrption is at https://huggingface.co/datasets/dell-research-harvard/newswi...

and for each record there are "newspaper_metadata", "year", "date", and "article" fields that link back to _a_ LoC newspaper scan.

I stress _a_ (singular) because much is made of their process for identifying articles with multiple reprints and multiple scans across multiple newspapers, as these repetitions of content (mitigated to a degree by local sub-editors) with varying layouts are used to make the conversion of the scans more robust.

I haven't investigated whether every duplicate article has a separate record pointing to a distinct scan source...
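
If anyone wants to check, a minimal sketch with the `datasets` library follows; the dataset id and split name are guesses based on the truncated link above:

  # Sketch only: the dataset id and split name below are guesses
  # based on the truncated Hugging Face link above.
  from datasets import load_dataset

  ds = load_dataset("dell-research-harvard/newswire", split="train", streaming=True)
  for i, record in enumerate(ds):
      print(record["year"], record["date"])
      print(record["newspaper_metadata"])   # should link back to a LoC scan
      print(record["article"][:200])
      if i >= 2:
          break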



