My 11-step GraphRAG pipeline: what works and what is still a problem

Author: pauliusztin | 16 days ago | original post
While building a financial assistant for an SF start-up, we learned that AI frameworks add complexity without value. When I started building a personal assistant with GraphRAG, I carried that lesson but still tried LangChain's MongoDBGraphStore. It gave me a working knowledge graph in 10 minutes.

Then I looked at the data. I had 17 node types and 34 relationship types from just 5 documents, including three versions of "part of". GraphRAG is a data modeling problem, not a retrieval problem.

The attached diagram shows the full 11-step pipeline I ended up with. Here is a walkthrough of what you can learn from each step.

In steps 1 and 2 of the data pipeline, raw sources go through an Extract, Transform, Load (ETL) process and land as documents in a MongoDB data warehouse. Each document stores the source type, URI, content, and metadata.

In step 3, we clean the documents and split them into token-bounded chunks. We started with 512-token chunks and a 64-token overlap, though we still have to run more tests on this.

Step 4 handles graph extraction. We defined a strict ontology: a formal contract defining exactly what categories and relationships exist in your data. We used 6 node types and 8 edge types. The LLM can only extract what this ontology allows.

For example, if it outputs a PERSON-to-TASK connection with an EXPERIENCED edge, the pipeline rejects it. EXPERIENCED must connect a PERSON to an EPISODE.

We also split LLM extraction from deterministic extraction. Structural entries like Document or Chunk nodes are created without LLM calls.

Step 5, normalization, turned out to be the hardest part. We use a three-phase deduplication process: in-memory fuzzy matching, cross-document resolution against MongoDB, and edge remapping.

In step 6, we batch embed the nodes.
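For concreteness, the token-bounded chunking in step 3 can be sketched like this. This is a minimal sketch with the post's numbers (512-token chunks, 64-token overlap), not the actual implementation:

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks, each sharing
    `overlap` tokens with the previous chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already covers the tail
    return chunks

tokens = list(range(1000))  # stand-in for real token IDs
chunks = chunk_tokens(tokens)
# each consecutive pair of chunks shares exactly 64 tokens
```

In a real pipeline the token list would come from the embedding model's own tokenizer, so chunk boundaries match what the model actually sees.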
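The ontology-as-contract idea in step 4 boils down to a whitelist check on extracted triples. A minimal sketch (the non-EXPERIENCED edge types here are illustrative assumptions, not the post's actual ontology):

```python
# Triples (source_type, edge_type, target_type) permitted by the ontology.
ALLOWED_EDGES = {
    ("PERSON", "EXPERIENCED", "EPISODE"),  # rule stated in the post
    ("PERSON", "WORKS_ON", "TASK"),        # illustrative assumption
    ("CHUNK", "MENTIONS", "PERSON"),       # illustrative assumption
}

def validate_edge(src_type: str, edge_type: str, dst_type: str) -> bool:
    """Accept an LLM-extracted edge only if the ontology allows it."""
    return (src_type, edge_type, dst_type) in ALLOWED_EDGES

# The post's example: EXPERIENCED must connect a PERSON to an EPISODE,
# so a PERSON -> TASK edge labeled EXPERIENCED is rejected.
validate_edge("PERSON", "EXPERIENCED", "EPISODE")  # accepted
validate_edge("PERSON", "EXPERIENCED", "TASK")     # rejected
```

The point of the contract is that rejection happens deterministically in the pipeline, not by asking the LLM to police itself.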
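The first phase of step 5, in-memory fuzzy matching, might look something like this. A sketch using stdlib `difflib` with a hypothetical threshold; the post does not specify its matching algorithm:

```python
from difflib import SequenceMatcher

def fuzzy_merge(names, threshold=0.85):
    """Map each entity name to a canonical key, collapsing names whose
    similarity to an already-seen canonical key exceeds `threshold`."""
    canonical = []
    mapping = {}
    for name in names:
        key = name.lower().strip()
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, key, c).ratio() >= threshold),
            None,
        )
        mapping[name] = match if match is not None else key
        if match is None:
            canonical.append(key)
    return mapping

fuzzy_merge(["Alice Smith", "alice smith ", "Bob"])
# "Alice Smith" and "alice smith " collapse to one canonical key
```

The cross-document phase would then resolve these canonical keys against entities already stored in MongoDB, and edge remapping rewrites edges to point at the surviving node IDs.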
The system uses a mock for tests, Sentence Transformers for development, and the Voyage API for production.

Ultimately, in steps 7 and 8, nodes and edges are stored in a single MongoDB collection as unified memory. We use deterministic string IDs like "person:alice" to prevent duplicates. MongoDB handles documents, $vectorSearch, $text, and $graphLookup in one aggregation pipeline; the $graphLookup stage traverses connected graph data natively, directly in the database. You don't need Neo4j + Pinecone + Postgres for most agent use cases. A single database like MongoDB gets the job done really well, and with sharding you can scale it to a billion records.

To wrap up, steps 9 through 11 cover retrieval. The agent calls tools through an MCP server. It uses search memory (hybrid vector, text, and graph expansion) alongside query memory (natural language to MongoDB aggregations). The agent also uses ingest tools to write back to the database for continual learning.

Here are a few things I am still struggling with and would love your opinion on:

1. How are you handling entity/relationship resolution across documents?
2. What helped you the most to optimize the extraction of entities/relationships using LLMs?
3. How do you keep embeddings in sync after graph updates?

Also, while building my personal assistant, I have been writing about this system on LinkedIn over the past few months.
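The deterministic string IDs from steps 7 and 8 are what make re-ingestion idempotent: the same entity always hashes to the same `_id`, so writes become upserts instead of duplicates. A minimal sketch (the slug rules are my assumption):

```python
import re

def node_id(node_type: str, name: str) -> str:
    """Deterministic string ID like 'person:alice', so ingesting the
    same entity twice targets the same MongoDB document."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{node_type.lower()}:{slug}"

node_id("PERSON", "Alice")        # -> "person:alice"
node_id("PERSON", "Alice Smith")  # -> "person:alice_smith"
```

With pymongo this pairs naturally with `update_one({"_id": node_id(...)}, {"$set": doc}, upsert=True)`.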
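The "one aggregation pipeline" claim for steps 7 and 8 can be sketched as a $vectorSearch stage followed by a $graphLookup expansion over the same unified collection. The index, collection, and field names below are assumptions, not the post's actual schema:

```python
# Hybrid retrieval sketch: vector search for entry points, then graph
# expansion around the hits. Names like "memory" and "memory_vec" are
# hypothetical; $vectorSearch must be the first stage (Atlas requirement).
query_embedding = [0.0] * 1024  # placeholder for a real query vector

pipeline = [
    {"$vectorSearch": {
        "index": "memory_vec",
        "path": "embedding",
        "queryVector": query_embedding,
        "numCandidates": 200,
        "limit": 20,
    }},
    {"$graphLookup": {
        "from": "memory",             # same unified nodes+edges collection
        "startWith": "$_id",
        "connectFromField": "_id",
        "connectToField": "source_id",  # assumed edge field pointing at nodes
        "maxDepth": 2,
        "as": "neighborhood",
    }},
]
# Run with: db.memory.aggregate(pipeline)
```

Each vector hit comes back with a `neighborhood` array of connected nodes and edges, which is the graph-expansion half of the hybrid search.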
Here are the posts that go deeper into each piece:

- 3 ways to run embedding models: https://www.linkedin.com/feed/update/urn:li:activity:7443288346153480192
- LangChain gave me a knowledge graph in 10 minutes: https://www.linkedin.com/feed/update/urn:li:activity:7440751582381494272
- Palantir built a $400B empire on ontology-first AI: https://www.linkedin.com/feed/update/urn:li:activity:7434591082367320064
- Ingestion architecture for a Digital Twin agent: https://www.linkedin.com/feed/update/urn:li:activity:7432054336589021184
- Most AI agents don't need three databases: https://www.linkedin.com/feed/update/urn:li:activity:7426981104227856385