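Show HN: Doc2dict, a fast open-source document-to-dictionary converter - no AI
doc2dict is a Python package that converts HTML and PDF documents into dictionaries that preserve the document hierarchy. It also supports table extraction from HTML files.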
<a href="https://github.com/john-friedman/doc2dict">https://github.com/john-friedman/doc2dict</a>
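<p>Speed:</p>
<p>* HTML - 500 pages per second, single-threaded.</p>
<p>* PDF - 200 pages per second. The PDF must have an underlying text structure. Multithreading is not possible due to limitations of PDFium.</p>
<p>Here is an example output from Microsoft's annual report:</p>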
> "title": "PART I",
  "standardized_title": "parti",
  "class": "part",
  "contents": {
    "38": {
      "title": "ITEM 1. BUSINESS",
      "standardized_title": "item1",
      "class": "item",
      "contents": {
        "39": {
          "title": "GENERAL",
          "standardized_title": "",
          "class": "predicted header",
          "contents": {
            "40": {
              "title": "Embracing Our Future",
              "standardized_title": "",
              "class": "predicted header",
              "contents": {
                "41": {
                  "text": "Microsoft is a technolo...
<p>Raw:
<a href="https://html-preview.github.io/?url=https://raw.githubusercontent.com/john-friedman/doc2dict/refs/heads/main/example_output/html/msft_10k_2024.html#:~:text=embracing" rel="nofollow">https://html-preview.github.io/?url=https://raw.githubusercontent.com/john-friedman/doc2dict/refs/heads/main/example_output/html/msft_10k_2024.html#:~:text=embracing</a></p>
<p>Parsed dictionary:</p>
<a href="https://github.com/john-friedman/doc2dict/blob/main/example_output/html/dict.json">https://github.com/john-friedman/doc2dict/blob/main/example_output/html/dict.json</a>
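<p>A minimal sketch of how the nested output could be consumed, assuming every node keeps its children under "contents" and its body under "text", as in the excerpt above. The iter_text helper and the sample fragment (including its placeholder text) are made up for illustration and are not part of doc2dict.</p>

```python
from typing import Iterator, Tuple

# Hand-written fragment shaped like the excerpt above.
# The text value is a placeholder, not real parser output.
sample = {
    "title": "PART I",
    "standardized_title": "parti",
    "class": "part",
    "contents": {
        "38": {
            "title": "ITEM 1. BUSINESS",
            "standardized_title": "item1",
            "class": "item",
            "contents": {
                "39": {
                    "title": "GENERAL",
                    "standardized_title": "",
                    "class": "predicted header",
                    "contents": {"40": {"text": "placeholder paragraph"}},
                }
            },
        }
    },
}


def iter_text(node: dict, path: Tuple[str, ...] = ()) -> Iterator[tuple]:
    """Yield (heading_path, text) for every text leaf under `node`."""
    if "title" in node:
        path = path + (node["title"],)
    if "text" in node:
        yield path, node["text"]
    # Recurse into children; nodes without "contents" are leaves.
    for child in node.get("contents", {}).values():
        yield from iter_text(child, path)


for headings, text in iter_text(sample):
    print(" > ".join(headings), "|", text)
```

<p>Carrying the heading path along with each text leaf keeps the hierarchy available even after the tree is flattened, for example when loading rows into a database.</p>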
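<p>Simple description of the algorithm:</p>
<p>* Take a complicated document, such as a PDF or HTML file, and create a simplified representation of it as a list of lists of dicts, where each dict is a text block with key features such as "bold" and "font-size", and each inner list represents a new HTML block or a line of a PDF.</p>
<p>* Convert the simplified representation to a dictionary using a set of predetermined rules, e.g. a heading with a smaller font size is nested under the heading with a larger font size.</p>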
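<p>The two steps above can be sketched roughly as follows. This is an illustration of the font-size nesting rule only, not doc2dict's actual code: the nest function, the "all blocks bold means heading" test, and the sample lines with their font sizes are assumptions made for the example.</p>

```python
# Simplified representation: a list of lines, each line a list of text blocks.
# Text and font sizes are made up for illustration.
lines = [
    [{"text": "PART I", "bold": True, "font-size": 14}],
    [{"text": "ITEM 1. BUSINESS", "bold": True, "font-size": 12}],
    [{"text": "GENERAL", "bold": True, "font-size": 10}],
    [{"text": "First paragraph.", "bold": False, "font-size": 10}],
    [{"text": "ITEM 1A. RISK FACTORS", "bold": True, "font-size": 12}],
]


def line_text(line):
    """Join the text of all blocks on one line."""
    return " ".join(block["text"] for block in line)


def is_heading(line):
    """Assumed rule: a line is a heading when every block on it is bold."""
    return all(block.get("bold") for block in line)


def nest(lines):
    """Nest headings by font size; body text attaches to the latest heading."""
    root = {"contents": {}}
    # Stack of (font size, node); the root has infinite size so it never pops.
    stack = [(float("inf"), root)]
    for idx, line in enumerate(lines):
        if is_heading(line):
            size = max(block["font-size"] for block in line)
            # Close every open heading that is the same size or smaller.
            while stack[-1][0] <= size:
                stack.pop()
            node = {"title": line_text(line), "contents": {}}
            stack[-1][1]["contents"][str(idx)] = node
            stack.append((size, node))
        else:
            stack[-1][1]["contents"][str(idx)] = {"text": line_text(line)}
    return root["contents"]


tree = nest(lines)
```

<p>Here "GENERAL" ends up under "ITEM 1. BUSINESS", and the second item closes both of them and becomes a sibling under "PART I".</p>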
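<p>Note that I am working on making the last part more modular by creating predetermined instructions that users can tweak for their use case without rewriting the parser. I call these "mapping dicts".</p>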
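<p>One way such tweakable instructions might look is a rule table kept as plain data. The sketch below is hypothetical: the RULES layout, its keys, and the classify function are invented for illustration and do not reflect the real mapping dict format. The class names and standardized titles mirror the example output above.</p>

```python
import re

# Hypothetical rule table, NOT doc2dict's actual mapping dict format.
# Each rule maps a heading pattern to a class and a standardized title.
RULES = [
    {"pattern": r"^part\s+([ivx]+)$", "class": "part", "template": "part{0}"},
    {"pattern": r"^item\s+(\d+[a-z]?)\b", "class": "item", "template": "item{0}"},
]


def classify(title, rules=RULES):
    """Return the heading annotated by the first matching rule."""
    for rule in rules:
        match = re.match(rule["pattern"], title.strip(), flags=re.IGNORECASE)
        if match:
            return {
                "title": title,
                "standardized_title": rule["template"].format(match.group(1).lower()),
                "class": rule["class"],
            }
    # No rule matched: keep the heading but leave it unstandardized.
    return {"title": title, "standardized_title": "", "class": "predicted header"}


results = [classify(t) for t in ("PART I", "ITEM 1. BUSINESS", "GENERAL")]
```

<p>Because the rules are data rather than code, adapting the parser to another document type would mean editing the table, not the parsing logic.</p>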
<p>doc2dict also includes visualization tools for debugging:</p>
<p>* Visualize the simplified representation
<a href="https://html-preview.github.io/?url=https://github.com/john-friedman/doc2dict/blob/main/example_output/html/instructions_visualization.html" rel="nofollow">https://html-preview.github.io/?url=https://github.com/john-friedman/doc2dict/blob/main/example_output/html/instructions_visualization.html</a></p>
<p>* Visualize the output dictionary
<a href="https://html-preview.github.io/?url=https://github.com/john-friedman/doc2dict/blob/main/example_output/html/document_visualization.html" rel="nofollow">https://html-preview.github.io/?url=https://github.com/john-friedman/doc2dict/blob/main/example_output/html/document_visualization.html</a></p>
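<p>Why I made this:</p>
<p>I'm currently working on another open-source Python package that makes it easier to work with Securities and Exchange Commission data. Writing a generalized document parser that can be tweaked is easier than writing 100 or so specialized parsers, one for each document type.</p>
<p>Also, converting HTML and PDF files to a dictionary representation reduces document size by a factor of 10 or so. I'm not sure what I can do with that yet, but I'm planning some fun NoSQL database experiments.</p>
<p>Link to the other package (datamule):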
<a href="https://github.com/john-friedman/datamule-python">https://github.com/john-friedman/datamule-python</a></p>