Hacker News (Chinese edition)

Most document parsers fail on real-world challenges such as complex tables, handwritten documents, historical document scans, equations, multi-column layouts, and complex reading order. We built Unsiloed Parser to handle exactly these cases.

Our latest parser, v3.1, ranks #1 on olmOCR-Bench with a strict pass rate of 88.0. We ran the evaluation across 1,403 PDFs and 8,413 unit tests using the unmodified upstream Allen AI scorer (olmocr==0.4.27) and found that Unsiloed beats 18 other OCR services, including GPT-5.5, Claude Opus 4.7, LlamaParse, Reducto, Azure Document Intelligence, AWS Textract, and Unstructured.

When we dug deeper into the failure cases, we found that many were not OCR errors but formatting differences such as \frac vs \dfrac, whitespace differences, or equivalent LaTeX renderings. We ran a secondary LLM-as-Judge evaluation to separate real misses from semantic equivalents, which lifts the corrected score to 94.8 (explained in depth in the blog post).

Blog with full methodology and examples: https://www.unsiloed.ai/blog/unsiloed-ai-achieves-1-rank-on-olmocr-bench-2

Evaluation code for reproducibility: https://github.com/Unsiloed-AI/unsiloed-olmocr-benchmark

Feel free to post your messiest PDFs in the comments and we'll run them through Unsiloed Parser and share the output here.
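To make the \frac vs \dfrac point concrete, here is a minimal Python sketch of how a strict string comparison can fail two formulas that render the same way. It is only an illustration: it is not the olmocr scorer and not the LLM-as-Judge step described above, and the names `latex_tokens`, `strict_match`, `lenient_match`, and the `COSMETIC` table are hypothetical.

```python
import re

# Hypothetical tokenizer: control words (\frac), control symbols (\,),
# then any other single non-space character. Whitespace is discarded.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")

# Assumed list of purely cosmetic variants; a real list would be longer.
COSMETIC = {r"\dfrac": r"\frac", r"\tfrac": r"\frac"}


def latex_tokens(expr: str) -> list[str]:
    """Split a LaTeX string into tokens and fold cosmetic variants together."""
    return [COSMETIC.get(tok, tok) for tok in TOKEN.findall(expr)]


def strict_match(expected: str, actual: str) -> bool:
    """Exact string equality: any formatting difference counts as a miss."""
    return expected == actual


def lenient_match(expected: str, actual: str) -> bool:
    """Equality after tokenizing, so spacing and \\dfrac vs \\frac are ignored."""
    return latex_tokens(expected) == latex_tokens(actual)


expected = r"\frac{a}{b} + c"
actual = r"\dfrac{a}{b}+c"
print(strict_match(expected, actual))   # False
print(lenient_match(expected, actual))  # True
```

Tokenizing instead of simply deleting whitespace keeps `\alpha b` distinct from `\alphab`, so the lenient check only forgives differences that do not change the formula. Rules like these cover just the mechanical cases; telling apart genuinely equivalent LaTeX renderings is presumably where a judge model is needed.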

Show HN: Unsiloed AI – #1 on olmOCR-Bench