


I'm tasked with building a private AI assistant for a corpus of 10 million text documents (living in PostgreSQL). The goal is semantic search and chat, with a requirement for regular incremental updates.

I'm trying to decide between:

- Bleeding edge: Implementing something like LightRAG or GraphRAG.
- Proven stack: Standard Hybrid Search (Weaviate/Elastic + Reranking) orchestrated by tools like Dify.

For those who have built RAG at this scale:

- What is your preferred stack for 2025?
- Is the complexity of Graph/LightRAG worth it over standard chunking/retrieval for this volume?
- How do you handle maintenance and updates efficiently?

Looking for architectural advice and war stories.
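
To make the "proven stack" option concrete, here is a minimal sketch of how the hybrid step might merge a keyword result list and a vector result list before reranking. It uses reciprocal rank fusion, which is only one common way to combine the two lists and is my assumption rather than something the stack above prescribes. The function name, the document ids, and the `k=60` constant are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a keyword (BM25-style) query and a vector query.
keyword_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc9", "doc3"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that both retrievers return (`doc1`, `doc3`) rise above those found by only one. In a real pipeline the fused list would then be cut to a top-N and passed to the reranker.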

Ask HN: How would you architect a RAG system for 10M+ documents today?