Why Your RAG Costs $2,400 a Month (and How We Cut It by 73%)
You're running RAG (retrieval-augmented generation) in production. Then the AWS bill lands: $2,400 a month for 50 queries a day. That works out to about $1.60 per query.
We built a RAG system for enterprise clients and realized most production RAG setups are optimization disasters. The literature obsesses over accuracy while completely ignoring unit economics.
### The Three Cost Buckets
**Vector Database (40-50% of the bill)**
Standard RAG pipelines make 3-5 unnecessary database queries per question. We were making 5 round trips for what should have been 1.5.
**LLM API (30-40%)**
Standard RAG pumps 8,000-15,000 tokens into the LLM. That's 5-10x more than necessary. We found that beyond 3,000 tokens of context, accuracy plateaus; everything past that is noise and cost.
**Infrastructure (15-25%)**
Vector databases sitting idle, monitoring overhead, unnecessary load balancing.
### What Actually Moved the Needle
**Token-Aware Context (35% savings)**
Budget-based assembly that stops once you've used enough tokens. Before: 12,000 tokens per query. After: 3,200. Same accuracy.
```python
def _build_context(self, results, settings):
    """Assemble context from rank-ordered results, stopping at the token budget."""
    max_tokens = settings.get("max_context_tokens", 2000)
    current_tokens = 0
    context_parts = []
    for result in results:
        tokens = self.llm.count_tokens(result)
        if current_tokens + tokens > max_tokens:
            break  # budget spent; drop remaining (lower-ranked) chunks whole
        context_parts.append(result)
        current_tokens += tokens
    return "\n\n".join(context_parts)
```
**Hybrid Reranking (25% savings)**
70% semantic + 30% keyword scoring. Better ranking means fewer chunks are needed: top-20 → top-8 retrieval while maintaining quality.
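The post doesn't show the scorer itself, so here is a minimal, hypothetical sketch of the 70/30 blend. It assumes each chunk carries its embedding (`vec`) and raw text, and uses a crude term-overlap score as a stand-in for whatever lexical signal (e.g. BM25) is actually used:

```python
import numpy as np

def keyword_score(query: str, text: str) -> float:
    # Term overlap as a rough stand-in for a BM25-style lexical score (assumption).
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def semantic_score(query_vec: np.ndarray, chunk_vec: np.ndarray) -> float:
    # Cosine similarity rescaled from [-1, 1] to [0, 1] so both signals share a scale.
    cos = float(query_vec @ chunk_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(chunk_vec))
    return (cos + 1.0) / 2.0

def rerank(query: str, query_vec: np.ndarray, chunks: list[dict], top_k: int = 8) -> list[dict]:
    # Blend the two signals 70/30, then keep only the top_k chunks.
    return sorted(
        chunks,
        key=lambda c: 0.7 * semantic_score(query_vec, c["vec"]) + 0.3 * keyword_score(query, c["text"]),
        reverse=True,
    )[:top_k]
```

Fewer, better-ranked chunks also feed directly into the token budget above.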
**Embedding Caching (20% savings)**
Workspace-isolated cache with a 7-day TTL. We see a 45-60% hit rate intra-day.
```python
async def set_embedding(self, text, embedding, workspace_id=None):
    # hashlib instead of built-in hash(): Python salts str hashes per process,
    # so hash(text) would yield different cache keys after every restart.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    key = f"embedding:ws_{workspace_id}:{digest}"
    await redis.setex(key, 604800, json.dumps(embedding))  # TTL: 604,800 s = 7 days
```
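The post only shows the write path. A matching read path might look like this; `get_embedding` and `self.embed_api` are illustrative names (not from the post), assuming the same async `redis` client and key scheme, with `hashlib` and `json` imported:

```python
async def get_embedding(self, text, workspace_id=None):
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    key = f"embedding:ws_{workspace_id}:{digest}"
    cached = await redis.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no embedding API call, no cost
    embedding = await self.embed_api(text)  # hypothetical upstream embedding call
    await self.set_embedding(text, embedding, workspace_id)
    return embedding
```

Every hit in that 45-60% range is an embedding API call you never pay for.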
**Batch Embedding (15% savings)**
Batch API pricing is 30-40% cheaper per token. Process 50 texts at once instead of one at a time.
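A minimal sketch of the batching loop, assuming your provider accepts a list of inputs per request; `self.embed_batch` is a placeholder for that provider call, not an API from the post:

```python
async def embed_all(self, texts: list[str], batch_size: int = 50) -> list[list[float]]:
    """Embed texts in groups of batch_size rather than one call per text."""
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        # One round trip per 50 texts instead of 50 round trips;
        # batch pricing is where the 30-40% per-token saving comes from.
        embeddings.extend(await self.embed_batch(batch))
    return embeddings
```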