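Seeking advice on improving OCR for watermarked PDFs in my RAG pipeline
I've been developing a small RAG (retrieval-augmented generation) pipeline and ran into a specific technical issue involving OCR (optical character recognition). I'm using PyMuPDF for text extraction, and whenever a PDF has a centered watermark on every page, the OCR output becomes noisy: text breaks, artifacts show up, and quality degrades enough to hurt chunking and retrieval accuracy downstream.
The document is otherwise clean, so I'm trying to understand whether this is a known limitation of PyMuPDF or whether there are better ways to handle watermarked PDFs before OCR. I'm working with an RTX 4000 (8GB VRAM), so I'm also trying to stay within reasonable GPU constraints.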
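I'd really appreciate any ideas on:
- more robust OCR libraries or models that handle watermarks well
- preprocessing strategies to suppress watermark text
- better extraction pipelines for RAG use cases
- or any general advice on improving this part of the system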
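To make the preprocessing item concrete, here is a minimal sketch of the kind of step I have in mind. It assumes the watermark is rendered in a lighter gray than the body text, which may not hold for every document. The function name `suppress_light_watermark` and the threshold of 160 are illustrative assumptions and are not taken from the project.

```python
from typing import List

# A page is modeled as rows of 0-255 grayscale values (0 = black, 255 = white).
Gray = List[List[int]]


def suppress_light_watermark(page: Gray, threshold: int = 160) -> Gray:
    """Whiten every pixel at or above `threshold`; keep darker pixels as-is.

    Assumption: body text is much darker than the watermark, so one
    global threshold separates the two.
    """
    return [[px if px < threshold else 255 for px in row] for row in page]


# Toy 3x4 "page": 20 = dark body text, 200 = light watermark, 255 = paper.
page = [
    [255, 20, 200, 255],
    [200, 20, 200, 20],
    [255, 255, 200, 255],
]
cleaned = suppress_light_watermark(page)
```

In a real run the grayscale values would come from rasterizing each page first (for example with PyMuPDF's `page.get_pixmap()`), and the cleaned image would then go to the OCR step. A dark or colored watermark would need a different approach than a single brightness threshold.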
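The project is open-source, and if anyone is interested in digging deeper, finding issues, or contributing improvements, here's the repository:
GitHub: https://github.com/Hundred-Trillion/L88-Full
If you find it useful, starring the repo helps increase visibility so more people with domain expertise might notice it.
Thanks in advance for any insights.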