Show HN: I tracked 3,519 stock picks across 23 Substacks. Who is making money?

1 point by lineudemonia, about 1 month ago
I subscribe to 23 paid investment newsletters on Substack (~$9,600/year). I couldn't keep up with reading them all, so I built a system to extract and evaluate every stock pick.

*The pipeline:*

- Crawls articles from Substack
- Extracts high-conviction stock picks using Gemini's structured output: filters out casual ticker mentions and only counts calls where the author dedicates real analysis, specific data, or price targets
- Tracks returns at 1d, 7d, 15d, 30d, and 60d post-publication using yfinance
- Calculates alpha vs sector-specific ETF benchmarks (SOXX for semis, IGV for SaaS, XLF for financials, EWJ for Japan, SPY as fallback)
- Deduplication: same author, same ticker within 14 days = one call. Cross-author calls are independent

Total dataset: 3,519 high-conviction calls from 22 authors over 1 year.

*Interesting technical challenges:*

1. *AI extraction accuracy.* Gemini is surprisingly good at identifying whether an author is making a real call vs. just mentioning a ticker in passing. We tag calls with conviction level (high/low) and direction (bullish/bearish). To validate this, we spot-checked against manual reads and cross-verified with alternative model outputs. Not perfect, but consistent enough to be useful.

2. *Custom domain handling.* Many Substack authors use custom domains (e.g., collyerbridge.com, lordfed.co.uk) which sometimes trigger Cloudflare challenges. We fall back to headless Playwright when the standard HTTP client gets blocked.

3. *Benchmark selection.* A naive "did the stock go up?" metric is meaningless in a bull market. We map each ticker to a sector ETF benchmark, so alpha = position return minus benchmark return over the same period. This separates genuine stock-picking skill from just being long in a rising market.

4. *Deduplication logic.* Authors often revisit the same thesis across multiple articles. Without dedup, a single stock mentioned in 5 articles would count as 5 independent "calls." We use a 14-day window per author per ticker; only the first mention counts.

*Some findings (for context, not the point of this post):*

- Top performer averaged +14.9% at 30d and +26.7% at 60d on long calls
- The most expensive newsletters ($1,000+/year) were not the best performers
- Authors with fewer, more targeted calls (15-80) tended to outperform those with 300+ calls
- 30d vs 60d rankings shift significantly: deep value investors look much better at longer horizons
- Short calls were harder for almost everyone

*Stack:* Python, SQLite, Gemini API (structured output), yfinance, Playwright (optional)

I wrote a more detailed breakdown with charts as an X thread: https://x.com/pyhrroll/status/2027374283669066045?s=20

Happy to discuss the methodology, architecture, or share the extraction prompts. The pipeline is ~2,000 lines of Python if there's interest in seeing the code.
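To make the return-tracking step concrete, here is a minimal sketch of how returns at the 1d/7d/15d/30d/60d horizons could be computed from a series of daily closes. It is not the pipeline's actual code. The function names are hypothetical, and two choices are assumptions the post does not specify: horizons are counted in calendar days, and when a date falls on a non-trading day the next available close is used. Prices are passed in as plain `(date, close)` pairs, so the sketch runs without yfinance.

```python
from bisect import bisect_left
from datetime import date, timedelta
from typing import Dict, List, Optional, Sequence, Tuple

# Horizons named in the post, interpreted here as calendar days (assumption).
HORIZONS_DAYS = (1, 7, 15, 30, 60)


def close_on_or_after(
    dates: Sequence[date], closes: Sequence[float], target: date
) -> Optional[float]:
    """Return the first close dated on or after `target`, or None if none exists."""
    i = bisect_left(dates, target)
    return closes[i] if i < len(dates) else None


def horizon_returns(
    prices: List[Tuple[date, float]],
    published: date,
    horizons: Sequence[int] = HORIZONS_DAYS,
) -> Dict[int, Optional[float]]:
    """Simple return from the publication-date close to each horizon's close.

    A horizon maps to None when the price series does not reach that far.
    """
    ordered = sorted(prices)
    dates = [d for d, _ in ordered]
    closes = [c for _, c in ordered]
    entry = close_on_or_after(dates, closes, published)

    results: Dict[int, Optional[float]] = {}
    for h in horizons:
        exit_close = close_on_or_after(dates, closes, published + timedelta(days=h))
        if entry is None or exit_close is None:
            results[h] = None
        else:
            results[h] = exit_close / entry - 1.0
    return results
```

Returning `None` for horizons that have not elapsed yet keeps recent calls from being scored as zero-return calls.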
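The benchmark rule from item 3 (alpha = position return minus benchmark return over the same period) can be sketched as below. This is an illustration, not the pipeline's actual code. The five ETF tickers come from the post, but the sector label strings, the function names, and the idea of keying the map by sector label are assumptions. The sketch covers long calls only, since the post does not say how short calls are scored.

```python
from typing import Dict, Optional

# ETF tickers are from the post; the sector label strings are hypothetical.
SECTOR_BENCHMARKS: Dict[str, str] = {
    "semiconductors": "SOXX",
    "saas": "IGV",
    "financials": "XLF",
    "japan": "EWJ",
}
FALLBACK_BENCHMARK = "SPY"


def benchmark_for(sector: Optional[str]) -> str:
    """Return the benchmark ETF for a sector label, falling back to SPY."""
    if sector is None:
        return FALLBACK_BENCHMARK
    return SECTOR_BENCHMARKS.get(sector.strip().lower(), FALLBACK_BENCHMARK)


def simple_return(entry_price: float, exit_price: float) -> float:
    """Fractional price change between entry and exit."""
    if entry_price <= 0:
        raise ValueError("entry_price must be positive")
    return exit_price / entry_price - 1.0


def long_alpha(
    stock_entry: float,
    stock_exit: float,
    bench_entry: float,
    bench_exit: float,
) -> float:
    """Alpha of a long call: stock return minus benchmark return, same period."""
    return simple_return(stock_entry, stock_exit) - simple_return(bench_entry, bench_exit)
```

For example, a stock that goes from 100 to 110 while its benchmark goes from 50 to 52 returned 10% against 4%, so the call's alpha is about 6 percentage points.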
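The dedup rule from item 4 could look roughly like the following sketch, again not the pipeline's actual code. The `Call` type and `dedupe_calls` name are hypothetical. Two details are assumptions the post leaves open: the 14-day window is anchored at the last kept call rather than rolling forward with every repeat mention, and a mention exactly 14 days after a kept call counts as a new call.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Call(NamedTuple):
    """One extracted stock call (hypothetical record shape)."""
    author: str
    ticker: str
    published: date


def dedupe_calls(calls: Iterable[Call], window_days: int = 14) -> List[Call]:
    """Keep the first mention per (author, ticker) within each window.

    Calls by different authors on the same ticker are never merged.
    The result is ordered by publication date.
    """
    window = timedelta(days=window_days)
    last_kept: Dict[Tuple[str, str], date] = {}
    kept: List[Call] = []

    for call in sorted(calls, key=lambda c: c.published):
        key = (call.author, call.ticker.upper())
        previous = last_kept.get(key)
        # Skip repeats that fall inside the window of the last kept call.
        if previous is not None and call.published - previous < window:
            continue
        last_kept[key] = call.published
        kept.append(call)
    return kept
```

Anchoring the window at the kept call means an author who mentions a ticker every week still produces a new call roughly every two weeks; a rolling window would collapse the whole run into one call, so the choice affects the call counts.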