We built the world's fastest data replication tool in Go.
Hey people!
At OLake, our team has been building a high-throughput data replication tool in Go for a while now. The more we push real workloads through it, the clearer it becomes that Go is a fantastic fit for data engineering: simple concurrency, predictable deploys, tiny containers, and great performance without a JVM.
As part of that journey, we've been contributing upstream to the Apache Iceberg Go ecosystem. This week, our PR enabling writes into partitioned tables was merged (https://github.com/apache/iceberg-go/pull/524).
That may sound niche, but it unlocks a very practical path: Go services can write straight to Iceberg (no Spark/Flink detour) and be query-ready in Trino/Spark/DuckDB right away.
What we added:
- a partitioned fan-out writer that splits data across partitions, each partition with its own rolling data writer
- efficient Parquet flush/roll once the target file size is reached
- all the usual Iceberg transforms: identity, bucket, truncate, year/month/day/hour
- Arrow-based writes for stable memory use and fast columnar handling
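To make the fan-out plus rolling behavior concrete, here is a minimal sketch of the pattern in plain Go. It is not the iceberg-go API: `record`, `rollingWriter`, and `fanout` are hypothetical types, and byte counts stand in for real Parquet encoding. The shape is the point — records are routed to a lazily created per-partition writer, and each writer closes its current file and starts a new one when the target size is reached.

```go
package main

import "fmt"

// record is a simplified row; the Day field is its partition key.
type record struct {
	Day  string
	Size int // encoded size in bytes (stand-in for real Parquet encoding)
}

// rollingWriter accumulates records for one partition and "rolls" to a
// new file whenever the next record would exceed the target file size.
type rollingWriter struct {
	partition  string
	targetSize int
	current    int   // bytes in the currently open file
	files      []int // sizes of closed files
}

func (w *rollingWriter) write(r record) {
	if w.current > 0 && w.current+r.Size > w.targetSize {
		w.roll()
	}
	w.current += r.Size
}

func (w *rollingWriter) roll() {
	w.files = append(w.files, w.current)
	w.current = 0
}

func (w *rollingWriter) close() {
	if w.current > 0 {
		w.roll()
	}
}

// fanout routes each record to the rolling writer for its partition,
// creating writers lazily — the "fan-out" part.
type fanout struct {
	targetSize int
	writers    map[string]*rollingWriter
}

func newFanout(target int) *fanout {
	return &fanout{targetSize: target, writers: map[string]*rollingWriter{}}
}

func (f *fanout) write(r record) {
	w, ok := f.writers[r.Day]
	if !ok {
		w = &rollingWriter{partition: r.Day, targetSize: f.targetSize}
		f.writers[r.Day] = w
	}
	w.write(r)
}

func (f *fanout) close() {
	for _, w := range f.writers {
		w.close()
	}
}

func main() {
	f := newFanout(100) // 100-"byte" target file size
	for i := 0; i < 10; i++ {
		f.write(record{Day: "2024-01-01", Size: 30})
		f.write(record{Day: "2024-01-02", Size: 60})
	}
	f.close()
	fmt.Printf("2024-01-01: %d files\n", len(f.writers["2024-01-01"].files))
	fmt.Printf("2024-01-02: %d files\n", len(f.writers["2024-01-02"].files))
}
```

With a 100-byte target, ten 30-byte records pack three to a file (four files total), while ten 60-byte records roll after every record (ten files) — the same mechanics that keep real data files near the target size per partition.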
Why we're bullish on Go for building our platform, OLake: the runtime's concurrency model makes it straightforward to coordinate partition writers, batching, and backpressure; small static binaries are easy to ship as edge and sidecar ingestors; and the ops story (observability, profiling, sane resource usage) is a big deal when you're replicating at high rates.

Where this helps right now: building micro-ingestors that stream changes from databases to Iceberg; edge or on-prem capture where you don't want a big JVM stack; and teams that want cleaner tables (fewer tiny files) without a separate compaction job for every write path.
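The coordination-and-backpressure point is easy to show in miniature. This is a hedged sketch, not OLake's actual pipeline: one goroutine per partition reads from a small buffered channel, and the bounded buffer itself is the backpressure — when a partition writer falls behind, sends to its channel block, slowing the dispatcher instead of letting memory grow without limit.

```go
package main

import (
	"fmt"
	"sync"
)

// row carries a partition key and a payload value.
type row struct {
	partition string
	value     int
}

// startWriters launches one writer goroutine per partition, each fed by
// a buffered channel of size buf. The bounded buffers provide natural
// backpressure between the dispatcher and slow partition writers.
func startWriters(partitions []string, buf int) (map[string]chan row, *sync.WaitGroup, map[string]*int) {
	chans := make(map[string]chan row)
	counts := make(map[string]*int)
	var wg sync.WaitGroup
	for _, p := range partitions {
		ch := make(chan row, buf)
		n := new(int)
		chans[p] = ch
		counts[p] = n
		wg.Add(1)
		go func(ch chan row, n *int) {
			defer wg.Done()
			for range ch {
				*n++ // stand-in for encoding the row into a Parquet buffer
			}
		}(ch, n)
	}
	return chans, &wg, counts
}

func main() {
	parts := []string{"a", "b"}
	chans, wg, counts := startWriters(parts, 4)
	for i := 0; i < 100; i++ {
		p := parts[i%2]
		chans[p] <- row{partition: p, value: i} // blocks if the writer is behind
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
	fmt.Println("a:", *counts["a"], "b:", *counts["b"]) // a: 50 b: 50
}
```

The same structure scales to batching (accumulate N rows before flushing) and clean shutdown (close the channels, then wait on the WaitGroup), which is what makes this style pleasant to operate.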
For data teams still on the fence about Go, our case study may help: check the benchmarks we're hitting thanks to the language's lightweight model. Numbers here: https://olake.io/docs/benchmarks
If you're experimenting with Go + Iceberg, we'd love to collaborate; we believe in open source :)
Repo: https://github.com/datazip-inc/olake