Hacker News Chinese Edition



doc2dict is a Python package that converts HTML and PDF documents into dictionaries that preserve the document hierarchy. It also supports table extraction for HTML files. <a href="https://github.com/john-friedman/doc2dict">https://github.com/john-friedman/doc2dict</a>

Speed:
* HTML - 500 pages per second, single-threaded.
* PDF - 200 pages per second; the PDF must have an underlying text structure. Multithreading is not possible due to the limitations of PDFium.

Here's an example output from Microsoft's Annual Report:

    "title": "PART I",
    "standardized_title": "parti",
    "class": "part",
    "contents": {
      "38": {
        "title": "ITEM 1. BUSINESS",
        "standardized_title": "item1",
        "class": "item",
        "contents": {
          "39": {
            "title": "GENERAL",
            "standardized_title": "",
            "class": "predicted header",
            "contents": {
              "40": {
                "title": "Embracing Our Future",
                "standardized_title": "",
                "class": "predicted header",
                "contents": {
                  "41": {
                    "text": "Microsoft is a technolo...

Raw: <a href="https://html-preview.github.io/?url=https://raw.githubusercontent.com/john-friedman/doc2dict/refs/heads/main/example_output/html/msft_10k_2024.html#:~:text=embracing" rel="nofollow">https://html-preview.github.io/?url=https://raw.githubuserco...</a>

Parsed dictionary: <a href="https://github.com/john-friedman/doc2dict/blob/main/example_output/html/dict.json">https://github.com/john-friedman/doc2dict/blob/main/example_...</a>

Simple description of algorithm:
* Take a complicated document such as a PDF or HTML file and create a simplified representation of it as a list of lists of dicts, where each dict is a text block with key features such as "bold", "font-size", etc., and each line (inner list) represents a new HTML block or a line in a PDF.
* Convert the simplified representation to a dictionary using a set of predetermined rules, e.g. a heading with a smaller font-size should be nested under the heading with the larger font-size.

Note that I am working on making the last part more modular by creating predetermined instructions that users can tweak for their use case without rewriting the parser. I call these "mapping dicts".

doc2dict also includes visualization tools for the debugging process:
* visualize simplified representation <a href="https://html-preview.github.io/?url=https://github.com/john-friedman/doc2dict/blob/main/example_output/html/instructions_visualization.html" rel="nofollow">https://html-preview.github.io/?url=https://github.com/john-...</a>
* visualize output dictionary <a href="https://html-preview.github.io/?url=https://github.com/john-friedman/doc2dict/blob/main/example_output/html/document_visualization.html" rel="nofollow">https://html-preview.github.io/?url=https://github.com/john-...</a>

Why I made this: I'm currently working on another open source Python package to make it easy to exploit Securities and Exchange Commission data. Writing a generalized document parser that can be tweaked is easier than writing 100 or so specialized parsers, one for each document type.

Also, converting HTML and PDF files to a dictionary representation reduces document size by a factor of 10 or so. Not sure what I can do with that, but I'm planning some fun NoSQL database experiments.

Link to other package (datamule) <a href="https://github.com/john-friedman/datamule-python">https://github.com/john-friedman/datamule-python</a>
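
For readers who want a feel for the two-step algorithm described in the post, here is a minimal sketch in plain Python. It is not doc2dict's code and does not use its API. The `blocks` list, the `nest_blocks` function, the sample text, and the rule "bold means heading" are all assumptions made for illustration, and the simplified representation is flattened to one dict per line rather than a list of lists. It only shows how a font-size rule can turn a flat list of text blocks into a nested dictionary shaped like the example output.

```python
import json

# Hypothetical simplified representation: one dict per line of the document.
# The keys and the sample text are illustrative, not doc2dict's actual schema.
blocks = [
    {"text": "PART I", "font-size": 16, "bold": True},
    {"text": "ITEM 1. BUSINESS", "font-size": 14, "bold": True},
    {"text": "GENERAL", "font-size": 12, "bold": True},
    {"text": "Example body text.", "font-size": 10, "bold": False},
    {"text": "ITEM 2. EXAMPLE", "font-size": 14, "bold": True},
    {"text": "More body text.", "font-size": 10, "bold": False},
]


def nest_blocks(blocks):
    """Nest blocks so that smaller headings sit under larger ones.

    Assumption for this sketch: a bold block is a heading, anything else
    is body text attached to the most recent heading.
    """
    root = {"contents": {}}
    # Stack of (font_size, node). The root has infinite size, so it is
    # never popped and every heading ends up somewhere beneath it.
    stack = [(float("inf"), root)]
    for index, block in enumerate(blocks):
        size = block["font-size"]
        if block["bold"]:
            # A new heading closes every open heading of equal or smaller size.
            while stack[-1][0] <= size:
                stack.pop()
            node = {"title": block["text"], "contents": {}}
            stack[-1][1]["contents"][str(index)] = node
            stack.append((size, node))
        else:
            # Body text goes under whichever heading is currently open.
            stack[-1][1]["contents"][str(index)] = {"text": block["text"]}
    return root["contents"]


nested = nest_blocks(blocks)
print(json.dumps(nested, indent=2))
```

In this sketch "ITEM 2. EXAMPLE" lands next to "ITEM 1. BUSINESS" under "PART I" because its font size matches, which closes the open "GENERAL" and "ITEM 1. BUSINESS" headings first. A real parser would need more signals than bold and font size, which is presumably where the tweakable "mapping dicts" mentioned in the post come in.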

Show HN: Doc2dict, a fast open-source document-to-dictionary converter - no AI