Ask HN: How do you programmatically track changes in SEC filings?

Author: HupDup | 8 months ago
I'm working on a project that requires analyzing the evolution of corporate strategy by tracking changes in language within SEC filings over several years. For example, I want to answer questions like: "When did major cloud providers first start identifying 'AI' as a core business driver, and how did the surrounding language change each year?" or "How has the sentiment of the 'Risk Factors' section for SaaS companies shifted from 'growth' to 'efficiency' since 2022?"

The naive approach seems to be downloading all filings, converting to text, and using keyword searches, but this feels brittle and misses semantic context. Vector search on chunked documents is better, but handling tables and maintaining context across a decade of reports is non-trivial.

For those who have worked with this kind of unstructured, time-series text data, what are the most effective techniques or non-obvious challenges? I'm trying to figure out if this is a genuinely hard data science problem or if there are established solutions I'm overlooking.
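For concreteness, here is a minimal sketch of the naive pipeline I described, assuming the requests and beautifulsoup4 packages and the EDGAR submissions endpoint on data.sec.gov as I understand its layout (the filings.recent parallel arrays and the Archives URL pattern). The CIK (Microsoft here), the keyword, and the User-Agent contact string are placeholders, and SEC asks that automated requests identify themselves and respect their rate limits.

    import re
    import requests
    from bs4 import BeautifulSoup

    # SEC asks for a descriptive User-Agent with contact info on all requests.
    HEADERS = {"User-Agent": "Research project your-name@example.com"}

    CIK = "0000789019"          # Microsoft, zero-padded to 10 digits
    KEYWORD = "artificial intelligence"

    def recent_10k_filings(cik):
        """Return (filing_date, document_url) pairs for recent 10-K filings."""
        url = f"https://data.sec.gov/submissions/CIK{cik}.json"
        recent = requests.get(url, headers=HEADERS, timeout=30).json()["filings"]["recent"]
        filings = []
        for form, date, accession, doc in zip(
            recent["form"], recent["filingDate"],
            recent["accessionNumber"], recent["primaryDocument"],
        ):
            if form == "10-K":
                doc_url = (
                    "https://www.sec.gov/Archives/edgar/data/"
                    f"{int(cik)}/{accession.replace('-', '')}/{doc}"
                )
                filings.append((date, doc_url))
        return filings

    def keyword_count(doc_url, keyword):
        """Download one filing, strip the HTML, and count keyword occurrences."""
        html = requests.get(doc_url, headers=HEADERS, timeout=60).text
        text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
        return len(re.findall(re.escape(keyword.lower()), text))

    # Crude year-over-year signal: mentions of the keyword per 10-K.
    for filing_date, doc_url in recent_10k_filings(CIK):
        print(filing_date, keyword_count(doc_url, KEYWORD))

This is exactly the brittle keyword-counting version; the chunked-embedding variant would swap the counting step for section extraction plus per-chunk embeddings, which is where the table handling and year-over-year section alignment start to get messy.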