We built the world's fastest data replication tool in Go.
Hey people!
At OLake, our team has been building a high-throughput data replication tool in Go for a while now. The more we push real workloads through it, the clearer it becomes that Go is a fantastic fit for data engineering: simple concurrency, predictable deploys, tiny containers, and great performance without a JVM.
As part of that journey, we've been contributing upstream to the Apache Iceberg Go ecosystem. This week, our PR enabling writes into partitioned tables was merged (https://github.com/apache/iceberg-go/pull/524).
That may sound niche, but it unlocks a very practical path: Go services can write straight to Iceberg (no Spark/Flink detour) and be query-ready in Trino/Spark/DuckDB right away.
What we added:
- a partitioned fan-out writer that splits data across partitions, each partition with its own rolling data writer
- efficient Parquet flush/roll once the target file size is reached
- all the usual Iceberg transforms: identity, bucket, truncate, year/month/day/hour
- Arrow-based writes for stable memory use and fast columnar handling
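To make the fan-out plus rolling behavior concrete, here is a minimal sketch of the pattern in plain Go. It is not the iceberg-go API: `record`, `rollingWriter`, and `fanout` are hypothetical types, and byte counts stand in for real Parquet encoding. The shape is the point — records are routed to a lazily created per-partition writer, and each writer closes its current file and starts a new one when the target size is reached.

```go
package main

import "fmt"

// record is a simplified row; the Day field is its partition key.
type record struct {
	Day  string
	Size int // encoded size in bytes (stand-in for real Parquet encoding)
}

// rollingWriter accumulates records for one partition and "rolls" to a
// new file whenever the next record would exceed the target file size.
type rollingWriter struct {
	partition  string
	targetSize int
	current    int   // bytes in the currently open file
	files      []int // sizes of closed files
}

func (w *rollingWriter) write(r record) {
	if w.current > 0 && w.current+r.Size > w.targetSize {
		w.roll()
	}
	w.current += r.Size
}

func (w *rollingWriter) roll() {
	w.files = append(w.files, w.current)
	w.current = 0
}

func (w *rollingWriter) close() {
	if w.current > 0 {
		w.roll()
	}
}

// fanout routes each record to the rolling writer for its partition,
// creating writers lazily — the "fan-out" part.
type fanout struct {
	targetSize int
	writers    map[string]*rollingWriter
}

func newFanout(target int) *fanout {
	return &fanout{targetSize: target, writers: map[string]*rollingWriter{}}
}

func (f *fanout) write(r record) {
	w, ok := f.writers[r.Day]
	if !ok {
		w = &rollingWriter{partition: r.Day, targetSize: f.targetSize}
		f.writers[r.Day] = w
	}
	w.write(r)
}

func (f *fanout) close() {
	for _, w := range f.writers {
		w.close()
	}
}

func main() {
	f := newFanout(100) // 100-"byte" target file size
	for i := 0; i < 10; i++ {
		f.write(record{Day: "2024-01-01", Size: 30})
		f.write(record{Day: "2024-01-02", Size: 60})
	}
	f.close()
	fmt.Printf("2024-01-01: %d files\n", len(f.writers["2024-01-01"].files))
	fmt.Printf("2024-01-02: %d files\n", len(f.writers["2024-01-02"].files))
}
```

With a 100-byte target, ten 30-byte records pack three to a file (four files total), while ten 60-byte records roll after every record (ten files) — the same mechanics that keep real data files near the target size per partition.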
Why we're bullish on Go for building our platform, OLake: the runtime's concurrency model makes it straightforward to coordinate partition writers, batching, and backpressure; small static binaries are easy to ship as edge and sidecar ingestors; and the ops story (observability, profiling, sane resource usage) is a big deal when you're replicating at high rates.

Where this helps right now: building micro-ingestors that stream changes from databases to Iceberg; edge or on-prem capture where you don't want a big JVM stack; and teams that want cleaner tables (fewer tiny files) without a separate compaction job for every write path.
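The coordination-and-backpressure point is easy to show in miniature. This is a hedged sketch, not OLake's actual pipeline: one goroutine per partition reads from a small buffered channel, and the bounded buffer itself is the backpressure — when a partition writer falls behind, sends to its channel block, slowing the dispatcher instead of letting memory grow without limit.

```go
package main

import (
	"fmt"
	"sync"
)

// row carries a partition key and a payload value.
type row struct {
	partition string
	value     int
}

// startWriters launches one writer goroutine per partition, each fed by
// a buffered channel of size buf. The bounded buffers provide natural
// backpressure between the dispatcher and slow partition writers.
func startWriters(partitions []string, buf int) (map[string]chan row, *sync.WaitGroup, map[string]*int) {
	chans := make(map[string]chan row)
	counts := make(map[string]*int)
	var wg sync.WaitGroup
	for _, p := range partitions {
		ch := make(chan row, buf)
		n := new(int)
		chans[p] = ch
		counts[p] = n
		wg.Add(1)
		go func(ch chan row, n *int) {
			defer wg.Done()
			for range ch {
				*n++ // stand-in for encoding the row into a Parquet buffer
			}
		}(ch, n)
	}
	return chans, &wg, counts
}

func main() {
	parts := []string{"a", "b"}
	chans, wg, counts := startWriters(parts, 4)
	for i := 0; i < 100; i++ {
		p := parts[i%2]
		chans[p] <- row{partition: p, value: i} // blocks if the writer is behind
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
	fmt.Println("a:", *counts["a"], "b:", *counts["b"]) // a: 50 b: 50
}
```

The same structure scales to batching (accumulate N rows before flushing) and clean shutdown (close the channels, then wait on the WaitGroup), which is what makes this style pleasant to operate.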
For data teams still on the fence about Go, our case study may help: check the benchmarks we're hitting thanks to the language's lightweight model. Numbers here: https://olake.io/docs/benchmarks
If you're experimenting with Go + Iceberg, we'd love to collaborate; we believe in open source :)
Repo: https://github.com/datazip-inc/olake