Show HN: Doc2dict, a fast open-source document-to-dictionary converter - no AI

3 points by jgfriedman1999 9 months ago
doc2dict is a Python package that converts HTML and PDF documents into dictionaries that preserve the document hierarchy. It also supports table extraction from HTML files. https://github.com/john-friedman/doc2dict

Speed:

* HTML - 500 pages per second, single-threaded.

* PDF - 200 pages per second; the PDF must have an underlying text structure. Multithreading is not possible due to the limitations of PDFium.

Here's an example output from Microsoft's annual report:

  "title": "PART I",
  "standardized_title": "parti",
  "class": "part",
  "contents": {
    "38": {
      "title": "ITEM 1. BUSINESS",
      "standardized_title": "item1",
      "class": "item",
      "contents": {
        "39": {
          "title": "GENERAL",
          "standardized_title": "",
          "class": "predicted header",
          "contents": {
            "40": {
              "title": "Embracing Our Future",
              "standardized_title": "",
              "class": "predicted header",
              "contents": {
                "41": {
                  "text": "Microsoft is a technolo...

Raw: https://html-preview.github.io/?url=https://raw.githubusercontent.com/john-friedman/doc2dict/refs/heads/main/example_output/html/msft_10k_2024.html#:~:text=embracing

Parsed dictionary: https://github.com/john-friedman/doc2dict/blob/main/example_output/html/dict.json

Simple description of the algorithm:

* Take a complicated document such as a PDF or HTML file and create a simplified representation of it as a list of lists of dicts, where each dict is a text block with key features such as "bold", "font-size", etc., and each line (an inner list) represents a new HTML block or a line in a PDF.

* Convert the simplified representation to a dictionary using a set of predetermined rules, e.g. a heading with a smaller font-size should be nested under the heading with the larger font-size.

Note that I am working on making the last part more modular by creating predetermined instructions that users can tweak for their use case without rewriting the parser. I call these "mapping dicts".

doc2dict also includes visualization tools for the debugging process:

* Visualize the simplified representation: https://html-preview.github.io/?url=https://github.com/john-friedman/doc2dict/blob/main/example_output/html/instructions_visualization.html

* Visualize the output dictionary: https://html-preview.github.io/?url=https://github.com/john-friedman/doc2dict/blob/main/example_output/html/document_visualization.html

Why I made this: I'm currently working on another open-source Python package that makes it easy to use Securities and Exchange Commission data. Writing a generalized document parser that can be tweaked is easier than writing 100 or so specialized parsers, one for each document type.

Also, converting HTML and PDF files to a dictionary representation reduces document size by a factor of 10 or so. Not sure what I can do with that, but I'm planning some fun NoSQL database experiments.

Link to other package (datamule): https://github.com/john-friedman/datamule-python
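
As a rough illustration of the two-step algorithm described in the post (a simplified representation, then rule-based nesting), here is a minimal Python sketch. It is not doc2dict's actual code or API: the function names, the "text" key, the "every block is bold means heading" rule, and the sample data are all assumptions made for the example. Only the "bold" and "font-size" feature names and the "title"/"class"/"contents"/"text" output keys come from the post.

```python
# Hypothetical sketch, NOT doc2dict's implementation.
# Input: a list of lines, each line a list of text-block dicts.

def line_text(line):
    """Join the text of all blocks in a line."""
    return " ".join(block["text"] for block in line).strip()


def is_heading(line):
    # Assumed rule: a line is a heading when every block in it is bold.
    return all(block.get("bold", False) for block in line)


def nest_by_font_size(lines):
    """Nest smaller-font headings under the nearest larger-font heading."""
    root = {"contents": {}}
    # Stack of (font_size, node). The root behaves like an infinitely
    # large heading, so it is never popped.
    stack = [(float("inf"), root)]
    for index, line in enumerate(lines):
        if not line:
            continue  # skip empty lines
        key = str(index)
        if is_heading(line):
            size = max(block.get("font-size", 0) for block in line)
            # Headings of equal or smaller size cannot be parents.
            while stack[-1][0] <= size:
                stack.pop()
            node = {
                "title": line_text(line),
                "class": "predicted header",
                "contents": {},
            }
            stack[-1][1]["contents"][key] = node
            stack.append((size, node))
        else:
            # Body text attaches to the most recent heading.
            stack[-1][1]["contents"][key] = {"text": line_text(line)}
    return root["contents"]


# Made-up sample input in the "list of lists of dicts" shape.
lines = [
    [{"text": "PART I", "bold": True, "font-size": 16}],
    [{"text": "ITEM 1. BUSINESS", "bold": True, "font-size": 14}],
    [{"text": "Placeholder body text.", "bold": False, "font-size": 10}],
    [{"text": "ITEM 2. PROPERTIES", "bold": True, "font-size": 14}],
]
tree = nest_by_font_size(lines)
```

With this sample, both "ITEM" headings end up inside the "contents" of "PART I", and the body line ends up inside "ITEM 1. BUSINESS". A stack keeps the pass linear in the number of lines; a real parser would presumably need more rules than a single bold check.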