I built a market knowledge graph with 200k edges to filter false buy-the-dip signals

1 point by gano, about 1 month ago
I’ve been experimenting with a graph-based approach to a classic trading problem: why most dip-buying strategies can’t tell the difference between a temporary overreaction and a genuine structural collapse.

Most systems treat a −5% move the same regardless of context. My hypothesis was that where a company sits in the market’s structure matters more than the price move itself.

The engineering idea

I built a knowledge graph of the U.S. public markets with ~207k edges across ~21 relationship types, organized into four layers:

* Operational: supply-chain relationships (SUPPLIES_TO, PRODUCES)
* Flow: ETF and institutional ownership plumbing
* Social: board interlocks (SHARES_DIRECTOR_WITH)
* Environmental: geography / competition

For each layer, I compute centrality scores using PageRank-style methods (with inverse-degree weighting to avoid ETF super-nodes dominating).

These structural features are then combined with basic price/volume context and fed into a tree-based model (XGBoost) to rank stocks after sharp drawdowns.

What surprised me

When I validated the rankings out-of-sample (2024–2025, using Alphalens to avoid look-ahead issues):

* Operational and Flow edges provided most of the lift
* Social edges (board interlocks) added much less than I expected
* Graph features roughly doubled ranking quality versus price-only baselines

This wasn’t obvious to me going in; I expected “social” connections to matter more.

Why I’m posting

I’m in the process of turning this from a research notebook into a production dashboard, and before I lock in the graph schema I’d love feedback from people who’ve built large graphs in other domains. In particular:

* Have you seen board-interlock / social edges be predictive elsewhere?
* Are there graph normalization tricks you’ve found essential at this scale?
* Any pitfalls you’ve hit when mixing heterogeneous edge types?

Happy to answer questions about the graph construction, centrality calculations, or validation setup.
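
To make the normalization question concrete, here is a minimal sketch of one possible reading of "inverse-degree weighting" in a PageRank-style score. It is an illustration under stated assumptions, not necessarily the method used in the pipeline described above: the graph is treated as undirected, and a random walker at node u moves to neighbour v with probability proportional to 1 / degree(v), so a hub such as an ETF attracts less mass than under plain PageRank. The function name `inverse_degree_pagerank`, the `hub_penalty` flag, and the tickers in the toy graph are all hypothetical.

```python
from collections import defaultdict


def inverse_degree_pagerank(edges, damping=0.85, max_iter=200, tol=1e-12,
                            hub_penalty=True):
    """PageRank-style scores on an undirected graph (illustrative sketch).

    With hub_penalty=True, the step u -> v has probability proportional to
    1 / degree(v); with hub_penalty=False it is the usual uniform step.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)

    nodes = sorted(neighbours)
    n = len(nodes)
    degree = {u: len(neighbours[u]) for u in nodes}

    # Row-normalised transition probabilities for every node.
    transition = {}
    for u in nodes:
        raw = {v: (1.0 / degree[v] if hub_penalty else 1.0)
               for v in neighbours[u]}
        total = sum(raw.values())
        transition[u] = {v: w / total for v, w in raw.items()}

    # Power iteration with uniform teleportation.
    scores = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        updated = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            for v, p in transition[u].items():
                updated[v] += damping * scores[u] * p
        delta = sum(abs(updated[u] - scores[u]) for u in nodes)
        scores = updated
        if delta < tol:
            break
    return scores


# Hypothetical toy graph: one ETF holding four stocks, plus one supply link.
toy_edges = [
    ("ETF_X", "AAA"), ("ETF_X", "BBB"), ("ETF_X", "CCC"), ("ETF_X", "DDD"),
    ("AAA", "BBB"),
]

plain = inverse_degree_pagerank(toy_edges, hub_penalty=False)
penalised = inverse_degree_pagerank(toy_edges, hub_penalty=True)
for node in sorted(plain):
    print(node, round(plain[node], 4), round(penalised[node], 4))
```

On this toy graph the ETF node still ranks highest under the penalty, but its score is lower than under the uniform walk. In a layered graph like the one described, the same function could be run once per layer on that layer's edge list to get one centrality feature per layer.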