Show HN: Fixing a single pointer bug unlocked 1M+ row JSON parsing on Windows

3 points by hilti, 3 months ago
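I've been building a cross-platform JSONL viewer app that handles multi-GB files. It worked perfectly on macOS (my development machine), but consistently crashed on Windows at exactly 2,650 KB. Here's the debugging journey and the tiny fix that made all the difference.

The Problem

- macOS: Handles 5GB+ files effortlessly
- Windows: Crashes at 2,650 KB every time
- Same codebase, cross-compiled from an Apple Silicon Mac to Windows using MinGW

The Investigation

I added detailed logging to track execution. The crash happened during string interning, after ~6,000 rows had been parsed successfully. Not during parsing, not during file I/O, but during the merge phase.

The Root Cause

My StringPool class used `std::unordered_map<std::string_view, uint32_t>` to deduplicate strings. The `string_view` keys pointed into a `std::vector<std::string>`.

When the vector grew and reallocated, its `std::string` elements moved to new storage, and the `string_view` keys pointing at the old storage became dangling pointers. The hash map was full of invalid references.

Why did it work on macOS? Reading through a dangling `string_view` is undefined behavior, so it can appear to work by luck. In my case: different memory allocator behavior, different default stack sizes (8MB vs 1MB), different reallocation patterns.

The Fix

Before (broken):

```cpp
uint32_t intern(std::string_view str) {
    auto it = indices_.find(str);
    if (it != indices_.end()) return it->second;

    uint32_t idx = strings_.size();
    strings_.push_back(std::string(str));  // may reallocate and move the stored strings
    indices_[std::string_view(strings_.back())] = idx;  // DANGER!
    return idx;
}
```

After (fixed):

```cpp
uint32_t intern(const std::string& str) {
    auto it = indices_.find(std::string_view(str));
    if (it != indices_.end()) return it->second;

    // Preemptively rebuild if we're about to reallocate
    if (strings_.size() >= strings_.capacity()) {
        strings_.reserve(strings_.capacity() * 2);
        rebuildIndices();  // Fix all string_views!
    }

    uint32_t idx = strings_.size();
    strings_.push_back(str);  // capacity was reserved above, so no reallocation here
    indices_[std::string_view(strings_.back())] = idx;
    return idx;
}

void rebuildIndices() {
    indices_.clear();
    for (size_t i = 0; i < strings_.size(); i++) {
        indices_[std::string_view(strings_[i])] = i;
    }
}
```

The Result

- 1 million rows: 6 seconds on Windows
- Multi-GB files: No crashes
- ~166,000 rows/second throughput
- Cross-platform stability

Lessons Learned

1. `std::string_view` is powerful but dangerous - It's a non-owning reference. When the underlying storage moves, you're holding garbage.
2. Cross-platform testing is essential - The bug was invisible on macOS due to different allocator behavior and larger default stack sizes.
3. Structured logging beats debuggers for cross-compilation - I was cross-compiling from Mac to Windows. Adding timestamped logging to a file made the crash point obvious immediately.
4. Small changes, huge impact - One function, ~15 lines of code, turned "crashes at 2MB" into "handles 5GB+ files".
5. Performance stayed excellent - The rebuild only happens during vector reallocation (exponential growth), so amortized cost is negligible.

The Tech Stack

- simdjson (v4.2.2) for parsing
- Multi-threaded parsing (20 threads on my test machine)
- Columnar storage for memory efficiency
- C++17, cross-compiled with MinGW-w64

This was a humbling reminder that the most critical bugs are often the simplest ones, hiding in plain sight behind platform differences.

Happy to discuss the implementation details, simdjson usage, or cross-platform C++ debugging techniques!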
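For comparison with the rebuild-on-reallocation fix, here is a minimal sketch of a different approach, not the author's code: keep the strings in a `std::deque<std::string>` instead of a vector. Appending to a `std::deque` does not invalidate references to existing elements, so the `string_view` keys stay valid and no rebuild step is needed. The class layout and the `get` and `size` helpers are hypothetical, added only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical variant of the post's StringPool (illustrative only).
class StringPool {
public:
    StringPool() = default;
    // Copying would leave the map's views pointing into the source pool.
    StringPool(const StringPool&) = delete;
    StringPool& operator=(const StringPool&) = delete;

    uint32_t intern(std::string_view str) {
        auto it = indices_.find(str);
        if (it != indices_.end()) return it->second;

        auto idx = static_cast<uint32_t>(strings_.size());
        // deque::emplace_back never relocates existing elements, so views
        // into previously stored strings remain valid after this call.
        const std::string& stored = strings_.emplace_back(str);
        indices_.emplace(std::string_view(stored), idx);
        return idx;
    }

    // Look up a previously interned string by its index.
    std::string_view get(uint32_t idx) const { return strings_[idx]; }
    std::size_t size() const { return strings_.size(); }

private:
    std::deque<std::string> strings_;  // stable element addresses on append
    std::unordered_map<std::string_view, uint32_t> indices_;
};
```

Usage would look like `StringPool pool; auto id = pool.intern("alpha");`, with `pool.get(id)` returning the stored text. The trade-off is that a deque is not contiguous and indexed access is slightly slower than with a vector, which may or may not matter for a columnar store.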