Ultimately, there’s some intersection of accuracy x cost x speed that’s ideal, which can be different per use case. We’ll surface all of those metrics shortly so that you can pick the best model for the job along those axes.
We wanted to keep the focus on (1) foundation VLMs and (2) open source OCR models.
We had Mistral previously but had to remove it because their hosted OCR API was very unstable and unfortunately returned a lot of garbage results.
Paddle, Nanonets, and Chandra are being added shortly!
MistralOCR works stably for me when I first upload the file to their server and then run the OCR. I also had some issues before when giving a URL directly to the OCR API; not sure if that's what you're doing?
School transcripts are, surprisingly, one of the hardest document types to parse. What makes them tricky is (1) the multi-column tabular layouts and (2) the data ambiguity.
Transcript data usually lives in some sort of table, but these are some of the hardest tables for OCR or LLMs to interpret. There are all kinds of edge cases: tables split across pages, nested cells, side-by-side columns, etc. The tabular layout breaks every off-the-shelf OCR engine we've run across (and we've benchmarked all of them). To make it worse, there's no consistency at all: basically every school in the country has its own format.
Here's what we've seen help in these cases:
1. VLM-based review and correction of OCR errors in tables. OCR is still critical for determinism, but VLMs really excel at visually interpreting the long tail.
2. Using both HTML and Markdown as LLM input formats. For some of the edge cases, Markdown simply cannot represent certain structures (e.g. a table cell nested within a table cell). HTML is a much better representation for this, and models are trained on a lot of HTML data. A rough sketch of both ideas is below.
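To make that concrete, here's a minimal sketch of the idea (not our actual pipeline): hand the OCR'd table to a VLM as HTML alongside the page image, and ask it to fix what the OCR got wrong. The model name, prompt wording, and the `review_table` helper are just illustrative assumptions.

```python
# Sketch only: VLM review of an OCR'd table, passed as HTML so nested/merged
# cells survive. Model name and prompt are placeholders, not a real product config.
import base64
from openai import OpenAI

client = OpenAI()

def review_table(page_image_path: str, ocr_table_html: str) -> str:
    with open(page_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any capable VLM works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Here is a table extracted by OCR, as HTML:\n\n"
                    f"{ocr_table_html}\n\n"
                    "Compare it against the page image and return corrected HTML. "
                    "Preserve nested cells, merged cells, and column order exactly."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# An example of a structure Markdown tables can't express: a cell that
# contains its own table (common in transcript "term" blocks).
nested_table_html = """
<table>
  <tr><th>Term</th><th>Courses</th></tr>
  <tr>
    <td>Fall 2023</td>
    <td>
      <table>
        <tr><td>MATH 101</td><td>A</td><td>4.0</td></tr>
        <tr><td>ENG 102</td><td>B+</td><td>3.0</td></tr>
      </table>
    </td>
  </tr>
</table>
"""
```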
The data ambiguity is a whole set of problems on its own (e.g. how do you normalize what a "semester" is across all the different ways it can be written). Eval sets + automated prompt engineering can get you pretty far though.
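As a toy illustration of that ambiguity (purely hypothetical code, not how any particular product does it), even something as basic as the term label needs a normalization step before two transcripts can be compared. The `normalize_term` helper and alias table below are made up for the example.

```python
# Toy normalization: collapse the many surface forms of an academic term
# ("Fall 2023", "FA23", "Autumn Semester 2023") into one canonical value.
import re

TERM_ALIASES = {
    "fall": "FALL", "fa": "FALL", "autumn": "FALL",
    "spring": "SPRING", "sp": "SPRING",
    "summer": "SUMMER", "su": "SUMMER",
}

def normalize_term(raw: str) -> str | None:
    match = re.search(r"(fall|fa|autumn|spring|sp|summer|su)\D*?(\d{2,4})", raw, re.I)
    if not match:
        return None  # route to review / an eval set instead of guessing
    season = TERM_ALIASES[match.group(1).lower()]
    year = match.group(2)
    year = year if len(year) == 4 else "20" + year
    return f"{season} {year}"

assert normalize_term("FA23") == "FALL 2023"
assert normalize_term("Autumn Semester 2023") == "FALL 2023"
```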
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.ai/).
Thanks! Datalab is great; I've met Vik a few times and their team has done some impressive work. We also support the conversion-to-markdown use case and might be a better fit depending on your needs. Feel free to create an account to try it out!
It's very dependent on the use case. That's why we offer a native evals experience in the product, so you can directly measure the % accuracy diffs between the two modes for your exact docs.
As a rule of thumb, light processing mode is great for (1) most classification tasks, (2) splitting on smaller docs, (3) extraction on simpler documents, or (4) latency sensitive use cases.
Exactly correct! We've had users migrate over from other providers because our granular pricing enabled new use cases that weren't feasible to do before.
One interesting thing we've learned: most production pipelines end up using a combination of the two (e.g. cheap classification and splitting, paired with performance extraction).
Feedback heard. Pricing is hard, and we've iterated on this multiple times so far.
Our goal is to provide customers with as much transparency & flexibility as possible. Our pricing has 2 axes:
- the complexity of the task
- performance processing vs cost-optimized processing
Complexity matters because, for example, classification is much easier than extraction, so it should be cheaper. That unlocks a wide range of use cases, such as tagging and filtering pipelines.
A performance toggle is also important because not all use cases are created equal. Just as it's valuable to be able to choose between cheaper and best-in-class foundation models, the same applies to document tasks.
For certain use cases, you might be willing to take a slight hit to accuracy in exchange for better costs and latency. To support this, we offer a "light" processing mode (with significantly lower prices) that uses smaller models, fewer VLMs, and more heuristics under the hood.
For other use cases, you simply want the highest accuracy possible. Our "performance" processing mode is a great fit for that, which enables layout models, signature detection, handwriting VLMs, and the most performant foundation models.
In fact, most pipelines we've seen in production end up combining the two (cheap classification and splitting, paired with performance extraction).
Without this level of granularity, we'd either be overcharging some customers or undercharging others. I definitely understand how this can be confusing, though; we'll work on making our docs better!