Hacker News (Chinese edition)



Hi HN,

We're Shreyash and Bhavnick. We built Chonkie, an open-source library for advanced chunking and embedding of text and code. It was previously Python-only, but we just released a TypeScript version: https://github.com/chonkie-inc/chonkie-ts

Many AI projects in JS/TS (like those using Vercel's AI SDK or Mastra) rely on basic text splitters. But better chunking = better retrieval = better performance. That's what Chonkie is built for.

Current native chunkers (in TS):

- Code Chunker – handles Python, TypeScript, etc.
- Recursive Chunker – rule-based, hierarchical splitting
- Token Chunker – split by token count (fully customizable)
- Sentence Chunker – split on sentence boundaries. Delimiters are customizable, so it works for multiple languages.

All chunkers support custom tokenizers, chunk overlap, delimiters, and more.

Coming soon in native TS (already available via the API client):

- Semantic Chunker – splits text wherever it detects a shift in meaning.
- SDPM Chunker – merges semantically similar disjoint chunks
- Late Chunker – generates context-aware embeddings for each chunk
- Slumber Chunker – LLM-refined recursive chunks. Significantly reduces token usage (and thus cost) while maximizing chunk quality.
- Embeddings Refinery – embed chunks with any embedding model
- Overlap Refinery – create overlaps between consecutive chunks for better context preservation.

Chonkie is free, open-source, and MIT licensed. GitHub: https://github.com/chonkie-inc/chonkie-ts

We'd love your feedback, ideas, or contributions. Thanks!
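For readers unfamiliar with the terms above, here is a rough sketch of what token chunking with overlap and delimiter-based sentence splitting might look like. This is not Chonkie's API: `chunkByTokens`, `splitSentences`, `whitespaceTokenizer`, and the option names are all hypothetical, and the whitespace tokenizer is a stand-in assumption for a real tokenizer.

```typescript
// Illustrative sketch only: every name here is hypothetical, NOT Chonkie's API.

type Tokenizer = (text: string) => string[];

// Assumption: a naive whitespace tokenizer stands in for a real one.
const whitespaceTokenizer: Tokenizer = (text) =>
  text.split(/\s+/).filter((t) => t.length > 0);

interface TokenChunkOptions {
  chunkSize: number; // maximum tokens per chunk
  chunkOverlap: number; // tokens shared by consecutive chunks
  tokenizer?: Tokenizer; // custom tokenizer; defaults to whitespace splitting
}

// Slide a fixed-size window over the token stream, stepping by
// (chunkSize - chunkOverlap) so neighbouring chunks share some tokens.
function chunkByTokens(text: string, opts: TokenChunkOptions): string[] {
  const { chunkSize, chunkOverlap } = opts;
  const tokenizer = opts.tokenizer ?? whitespaceTokenizer;
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const tokens = tokenizer(text);
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // window reached the end
  }
  return chunks;
}

// Split after any of the given single-character delimiters. Passing other
// delimiters (e.g. "。" and "！") adapts the splitter to other languages.
function splitSentences(
  text: string,
  delimiters: string[] = [".", "!", "?"]
): string[] {
  const sentences: string[] = [];
  let current = "";
  for (const ch of text) {
    current += ch;
    if (delimiters.indexOf(ch) !== -1) {
      const sentence = current.trim();
      if (sentence) sentences.push(sentence);
      current = "";
    }
  }
  const tail = current.trim();
  if (tail) sentences.push(tail); // text after the last delimiter
  return sentences;
}
```

With these definitions, `chunkByTokens("a b c d e f g", { chunkSize: 3, chunkOverlap: 1 })` yields `["a b c", "c d e", "e f g"]`, and `splitSentences("你好。今天天气不错！", ["。", "！"])` yields `["你好。", "今天天气不错！"]`. A real tokenizer would count model tokens rather than whitespace-separated words, so chunk boundaries would differ in practice.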

Show HN: Advanced chunking in JavaScript/TypeScript with Chonkie