Show HN: Chonkie (YC X25) – Open-source library for advanced chunking

26 points | by snyy | 7 days ago
Hey HN! We're Shreyash and Bhavnick. We're building Chonkie (https://chonkie.ai), an open-source library for chunking and embedding data.

Python: https://github.com/chonkie-inc/chonkie

TypeScript: https://github.com/chonkie-inc/chonkie-ts

Here's a video showing our code chunker: https://youtu.be/Xclkh6bU1P0.

Bhavnick and I have been building personal projects with LLMs for a few years. For much of this time, we found ourselves writing our own chunking logic to support RAG applications. We often hesitated to use existing libraries because they either had only basic features or felt too bloated (some are 80MB+).

We built Chonkie to be lightweight, fast, extensible, and easy to use. The space is evolving rapidly, and we wanted Chonkie to be able to support the newest strategies quickly. We currently support Token Chunking, Sentence Chunking, Recursive Chunking, and Semantic Chunking, plus:

- Semantic Double Pass Chunking: chunks text semantically first, then merges closely related chunks.

- Code Chunking: chunks code files by creating an AST and finding ideal split points.

- Late Chunking: based on the paper at https://arxiv.org/abs/2409.04701, where chunk embeddings are derived from embedding a longer document.

- Slumber Chunking: based on the "Lumber Chunking" paper (https://arxiv.org/abs/2406.17526).
It uses recursive chunking, then an LLM verifies the split points, aiming for high-quality chunks with reduced token usage and LLM costs.

You can see how Chonkie compares to LangChain and LlamaIndex in our benchmarks: https://github.com/chonkie-inc/chonkie/blob/main/BENCHMARKS.md

Some technical details about the Chonkie package:

- ~15MB default install vs. ~80-170MB for some alternatives.
- Up to 33x faster token chunking than LangChain and LlamaIndex in our tests.
- Works with the major tokenizers (transformers, tokenizers, tiktoken).
- Zero external dependencies for basic functionality.
- Aggressive caching and precomputation.
- Running mean pooling for efficient semantic chunking.
- Modular dependency system (install only what you need).

In addition to chunking, Chonkie provides an easy way to create embeddings. For supported providers (SentenceTransformer, Model2Vec, OpenAI), you just specify the model name as a string. You can also create custom embedding handlers for other providers.

RAG is still the most common use case, but Chonkie makes chunks that are optimized for high-quality embeddings and vector retrieval, so it is not really tied to the "generation" part of RAG. In fact, we're seeing more and more people use Chonkie for semantic search and/or for setting context for agents.

We are currently focused on building integrations that simplify the retrieval process. We've created "handshakes" – thin functions for vector DBs like pgVector, Chroma, TurboPuffer, and Qdrant that let you interact with storage easily.
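To make the flow that the handshakes sit in concrete, here is a toy end-to-end sketch of chunk → embed → store → query. Everything in it is a hypothetical stand-in – a naive word chunker, a letter-frequency "embedding", and an in-memory store – not Chonkie's API or a real vector-DB handshake:

```python
import math
from collections import Counter

def chunk_words(text, size=8, overlap=2):
    """Naive fixed-size word chunker with overlap (stand-in for a real chunker)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy 26-dim letter-frequency 'embedding' (stand-in for a real model)."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    return [counts.get(chr(ord("a") + i), 0) for i in range(26)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)) or 1.0)

class MemoryStore:
    """In-memory stand-in for a vector DB behind a handshake."""
    def __init__(self):
        self.rows = []

    def upsert(self, chunks):
        self.rows += [(c, embed(c)) for c in chunks]

    def query(self, text, k=1):
        q = embed(text)
        return [c for c, _ in sorted(self.rows, key=lambda r: -cosine(q, r[1]))[:k]]

store = MemoryStore()
store.upsert(chunk_words("the quick brown fox jumps over the lazy dog near the riverbank",
                         size=6, overlap=2))
print(store.query("lazy dog", k=1))  # → ['jumps over the lazy dog near']
```

A real pipeline would swap in a Chonkie chunker, a real embedding model, and a vector DB client, but the shape of the loop is the same.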
If there's an integration you'd like to see (vector DB or otherwise), please let us know.

We also offer hosted and on-premise versions with OCR, extra metadata, all embedding providers, and managed vector databases for teams that want a fully managed pipeline. If you're interested, reach out at shreyash@chonkie.ai or book a demo: https://cal.com/shreyashn/chonkie-demo.

We're eager to hear your feedback and comments! Thanks!
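P.S. Since a few people asked what "running mean pooling" buys for semantic chunking: instead of re-pooling the whole current group every time a sentence is added, the group mean is updated incrementally in O(1) per sentence. A minimal sketch of the idea, with toy 2-D vectors standing in for real sentence embeddings and an illustrative threshold (this shows the technique, not Chonkie's implementation):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_chunks(sentences, embeddings, threshold=0.7):
    """Group consecutive sentences; each new sentence is compared against the
    running mean of the current group's embeddings, updated incrementally."""
    groups = [[sentences[0]]]
    mean, n = list(embeddings[0]), 1
    for sent, emb in zip(sentences[1:], embeddings[1:]):
        if cosine(emb, mean) >= threshold:
            groups[-1].append(sent)
            n += 1
            # incremental mean update: mean += (emb - mean) / n
            mean = [m + (e - m) / n for m, e in zip(mean, emb)]
        else:
            groups.append([sent])   # similarity dropped: start a new chunk
            mean, n = list(emb), 1
    return [" ".join(g) for g in groups]

# Toy vectors: the first two "sentences" point one way, the third another.
sents = ["Dogs bark.", "Puppies yip.", "Stocks fell."]
embs = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
print(semantic_chunks(sents, embs))  # → ['Dogs bark. Puppies yip.', 'Stocks fell.']
```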