Show HN: Open-source synthetic bank statements for testing parsers

2 points | by Maesh | 6 days ago
I open-sourced a dataset of 5 synthetic bank and credit card statement PDFs designed for testing extraction/parsing accuracy. Each PDF uses a fictional bank with realistic formatting from a different country.

I've been building a bank statement converter (Bankstatemently) and kept discovering edge cases across different banks. At some point, I started cataloging them as "quirks" and I'm currently at 36 documented challenges and counting (think: dates without years across year boundaries, credit card charges shown as positive instead of negative, dates hiding inside description text, etc.).

Real bank data is private, so there's no shared dataset to test parsers against. Once I had these quirks, I realized I could use them to reconstruct statements that deliberately include these challenges so more people can use them.

There's also a free evaluation API: submit your parsed JSON and get field-level accuracy scores back. Ground truth is held server-side, but that's not necessarily bullet-proof against overfitting.

Would appreciate feedback on which edge cases are missing. I'm planning to make the next 10 statements a bit harder (scanned PDFs, multi-currency across multiple tables, Buddhist era dates).

https://github.com/bankstatemently/bank-statement-parsing-benchmark

You can browse all of the quirks here with real-world examples: https://bankstatemently.com/benchmark/challenges
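
To make one of the quirks mentioned above concrete (dates without years across year boundaries), here is a minimal sketch of how a parser might resolve them. It assumes the statement's end date is known, the statement covers less than twelve months, and rows use English "DD Mon" dates. The names `MONTHS` and `infer_year` are hypothetical and are not taken from Bankstatemently or the benchmark repository.

```python
from datetime import date

# Hypothetical lookup table: three-letter English month abbreviation -> month number.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def infer_year(day_month: str, statement_end: date) -> date:
    """Attach a year to a year-less 'DD Mon' string.

    Assumption: the statement spans less than one year, so any row that
    would fall after the statement end date belongs to the previous year.
    """
    day_text, month_text = day_month.split()
    day = int(day_text)
    month = MONTHS[month_text[:3].title()]

    year = statement_end.year
    if (month, day) > (statement_end.month, statement_end.day):
        # e.g. "28 Dec" on a statement ending 15 Jan 2024 is from 2023.
        year -= 1
    return date(year, month, day)


# A statement that starts in December and ends in January.
rows = ["28 Dec", "31 Dec", "02 Jan", "05 Jan"]
statement_end = date(2024, 1, 15)
resolved = [infer_year(row, statement_end) for row in rows]

print([d.isoformat() for d in resolved])
# ['2023-12-28', '2023-12-31', '2024-01-02', '2024-01-05']
```

Anchoring on the statement end date, rather than on today's date, keeps the result stable when an old statement is parsed later. Statements in other languages or date layouts would need their own month table and split logic.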