Show HN: Free API for extracting data from PDFs
Hi HN,
Like everyone, I'm working on a product that uses LLMs to extract data from photos and documents. Part of the processing pipeline is extracting data from PDFs as raw text or a raster image.
As part of our lead-generation strategy, we've opened up a REST API that lets you process individual pages of a PDF. The API is free to use anonymously, but is rate limited to 1 page per 30 seconds. Creating a free account removes this restriction.
The two endpoints are:
- https://extract.dev/api/pages/extract/raster - Rasterize a page of a PDF
- https://extract.dev/api/pages/extract/text - Extract text from a page of a PDF
Both endpoints take the same request format:
```json
{
  "file": "https://assets.extract-cdn.com/data/hd-receipt.pdf",
  "page": 1
}
```
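For anyone who wants to try this from a script, here is a minimal Python sketch of building that request. The post doesn't state the HTTP method, headers, or response format, so the `POST` method, the `Content-Type: application/json` header, and the helper names `build_request` and `send` are my assumptions; check the docs before relying on them.

```python
import json
import urllib.request

# Base path taken from the two endpoint URLs in the post.
BASE_URL = "https://extract.dev/api/pages/extract"


def build_request(kind: str, file_url: str, page: int) -> urllib.request.Request:
    """Build (but do not send) a request for one page of a PDF.

    kind is "raster" or "text", matching the two endpoints.
    Assumption: the API accepts a JSON body via POST.
    """
    if kind not in ("raster", "text"):
        raise ValueError("kind must be 'raster' or 'text'")
    body = json.dumps({"file": file_url, "page": page}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{kind}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw response bytes.

    The response format is not described in the post, so the bytes
    are returned undecoded.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

Calling `send(build_request("text", "https://assets.extract-cdn.com/data/hd-receipt.pdf", 1))` would hit the text endpoint with the sample file. Anonymous use is limited to 1 page per 30 seconds, so space out repeated calls accordingly.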
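I've outlined more of the documentation here: https://extract.dev/docs
Under the hood, the API uses Poppler to extract text and rasterize pages. Note that text extraction returns the actual text encoded in the PDF and does not use an OCR model. Give it a spin. I'd like to hear whether this is useful or not.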
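To give a rough idea of what the Poppler step looks like locally, here is a sketch that drives Poppler's `pdftotext` and `pdftoppm` command-line tools for a single page. This is not the service's actual code: the helper names, the 150 DPI default, and the choice of PNG output are assumptions for illustration, and it requires the Poppler utilities to be installed.

```python
import subprocess


def pdftotext_cmd(pdf_path: str, page: int) -> list[str]:
    """Command that prints the embedded text of one page to stdout.

    -f / -l select the first and last page; "-" sends output to stdout.
    """
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]


def pdftoppm_cmd(pdf_path: str, page: int, out_prefix: str, dpi: int = 150) -> list[str]:
    """Command that rasterizes one page to a PNG file named from out_prefix.

    The 150 DPI default is an arbitrary choice for this sketch.
    """
    return [
        "pdftoppm", "-f", str(page), "-l", str(page),
        "-r", str(dpi), "-png", pdf_path, out_prefix,
    ]


def extract_text(pdf_path: str, page: int) -> str:
    """Run pdftotext and return the page's embedded text (no OCR involved)."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, page), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Because `pdftotext` only reads text that is already encoded in the file, this mirrors the no-OCR behavior described above.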