Hi HN,

Like many people, I was frustrated that the released Epstein/Maxwell court documents were mostly scanned images (PDFs) with no text layer. This made them impossible to Ctrl+F or analyze programmatically.

I built a pipeline to fix this using Python, Tesseract, and OpenSearch.

The Site: https://epsteinfilez.com

The Stack:

- Ingestion: Python workers using ocrmypdf (Tesseract) to perform parallel OCR on raw files.
- Search: OpenSearch for indexing the extracted text.
- Frontend: Next.js (SSR) for the UI.
- Infrastructure: Self-hosted Docker swarm.

Features:

- Sub-second full-text search across ~15,000 pages.
- Highlights search terms directly on the PDF page.
- Deep linking to specific pages/documents.

This is a transparency tool, not a political one. I wanted to make the raw primary sources accessible to researchers and journalists.

Feedback on the search relevance or indexing pipeline is welcome!
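For readers curious what a pipeline like this involves, here is a minimal sketch of the ingestion and search steps. It is not the site's actual code. The index name `court-pages`, the field names, the `doc_id-pN` ID scheme, and all function names are hypothetical, and it assumes the `ocrmypdf` command-line tool is installed and that its sidecar text file separates pages with form feed characters.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

INDEX = "court-pages"  # hypothetical index name


def ocr_pdf(src: Path, out_dir: Path) -> Path:
    """Run the ocrmypdf CLI on one PDF and return the sidecar text path."""
    out_pdf = out_dir / src.name
    sidecar = out_dir / (src.stem + ".txt")
    subprocess.run(
        # --skip-text: do not OCR pages that already carry a text layer.
        # --sidecar: also write the recognized text to a plain-text file.
        ["ocrmypdf", "--skip-text", "--sidecar", str(sidecar), str(src), str(out_pdf)],
        check=True,
    )
    return sidecar


def ocr_all(pdfs, out_dir: Path, workers: int = 4):
    """OCR many files in parallel; each thread waits on one ocrmypdf process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ocr_pdf(p, out_dir), pdfs))


def split_pages(sidecar_text: str):
    """Split sidecar text into pages (assumed to be form-feed separated)."""
    pages = sidecar_text.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty piece left by a trailing form feed
    return pages


def bulk_lines(doc_id: str, pages):
    """Yield NDJSON lines for the OpenSearch _bulk API, one document per page."""
    for number, text in enumerate(pages, start=1):
        # A per-page ID lets a search hit map back to a specific page.
        yield json.dumps({"index": {"_index": INDEX, "_id": f"{doc_id}-p{number}"}})
        yield json.dumps({"doc_id": doc_id, "page": number, "text": text})


def search_body(query: str, size: int = 10):
    """Build a full-text query that asks for highlighted snippets."""
    return {
        "size": size,
        "query": {"match": {"text": query}},
        "highlight": {"fields": {"text": {}}},
    }
```

Indexing one document per page, rather than one per file, is what would let a hit carry a page number for deep linking. The lines from `bulk_lines` would be joined with newlines, terminated with a final newline, and sent to the `_bulk` endpoint.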

Show HN: Full-text search engine for the Epstein files (OCR and OpenSearch)