Bzfs 1.13.0 – 1-second (even sub-second) ZFS replication across clusters
bzfs is a simple, reliable CLI for replicating ZFS snapshots (zfs send/receive) locally or over SSH. Its companion, bzfs_jobrunner, turns that into periodic snapshot/replication/pruning jobs across N source hosts and M destination hosts, driven by one versioned job config.

This release makes 1-second replication frequency practical for small incrementals, and even sub-second frequency possible in constrained setups (low RTT, few datasets, daemon mode).

v1.13.0 focuses on cutting per-iteration latency, the enemy of high-frequency replication at fleet scale:

- SSH reuse across datasets and on startup: fewer handshakes and fewer round-trips, which is where small incremental sends spend much of their time (see the sketch after this list).
- Earlier stream start: estimate "bytes to send" in parallel so the data path can open sooner instead of blocking on preflight.
- Smarter caching: faster snapshot list hashing and shorter cache paths to reduce repeated ZFS queries in tight loops.
- More resilient connects: retry the SSH control path briefly before failing to smooth over transient blips.
- Cleaner ops: normalized exit codes; suppress "Broken pipe" noise when a user kills a pipeline.
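For intuition on what the SSH reuse saves: it is conceptually the same mechanism as OpenSSH connection multiplexing. Below is a minimal illustration of that mechanism only; bzfs manages its own control sockets internally, so this is not its literal invocation, and the control path and remote command are just examples.

    # First call opens a master connection and keeps it alive for 60s; later
    # calls reuse the socket and skip the TCP + key-exchange handshake.
    ssh -o ControlMaster=auto -o ControlPersist=60s \
        -o ControlPath=~/.ssh/cm-%r@%h:%p \
        user@host zfs list -d 1 -t snapshot -o name pool/src/ds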
Why this matters

- At 1s cadence, fixed costs (session setup, snapshot enumeration) dominate. Shaving RTTs and redundant `zfs list` calls yields bigger wins than raw throughput.
- For fleets, the tail matters: reducing per-job jitter and startup overhead improves end-to-end freshness when multiplied by N×M jobs.

1-second (and sub-second) replication
- Use daemon mode to avoid per-process startup costs; keep the process hot and loop at `--daemon-replication-frequency` (e.g., `1s`, even `100ms` for constrained cases); see the sketch after this list.
- Reuse SSH connections (now default) to avoid handshakes even for new processes.
- Keep per‑dataset snapshot counts low and prune aggressively; fewer entries make `zfs list -t snapshot` faster.
- Limit scope to only datasets that truly need the cadence (filters like `--exclude-dataset*`, `--skip-parent`).
- In fleets, add small jitter to avoid thundering herds, and cap workers to match CPU, I/O, and link RTT.
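A minimal sketch of that hot loop, assuming a shared jobconfig wrapper script; the name `my_bzfs_job.py` is a placeholder, and the jobrunner README has real jobconfig examples:

    # One long-lived process per host replicates in a loop every second, so
    # interpreter and SSH startup costs are paid once, not per iteration.
    ./my_bzfs_job.py --replicate --daemon-replication-frequency 1s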
How it works (nutshell)

- Incremental sends from the latest common snapshot (roughly the raw pipeline sketched after this list); bookmarks supported for safety and reduced state.
- Persistent SSH sessions are reused across datasets/zpools and across runs to avoid handshake/exec overhead.
- Snapshot enumeration uses a cache to avoid re‑scanning when nothing changed.
- Job orchestration via bzfs_jobrunner: same config file runs on all hosts; add jitter to avoid thundering herds; set worker counts/timeouts for scale.
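To make the first bullet concrete: per dataset, one pull iteration boils down to roughly this hand-rolled pipeline. It is illustrative only; `@common` and `@latest` are placeholder snapshot names, and bzfs layers safety checks, bookmark handling, caching, and parallelism on top.

    # Send only the delta since the newest snapshot both sides already share,
    # over the already-established SSH session, into the backup dataset.
    ssh user@host zfs send -i pool/src/ds@common pool/src/ds@latest \
      | zfs receive -u pool/backup/ds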
High-frequency tips

- Prune at a frequency proportional to snapshot creation to keep enumerations fast.
- Use daemon mode; split snapshot/replicate/prune into dedicated loops.
- Add small random start jitter across hosts to reduce cross-fleet contention (see the sketch after this list).
- Tune jobrunner `--workers` and per-worker timeouts for your I/O and RTT envelope.
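bzfs_jobrunner has jitter support of its own (see its README). If you instead kick off runs from an external per-host scheduler, a small bash snippet like the following spreads start times by up to one second (reusing the placeholder jobconfig name from the earlier sketch):

    # Sleep a random 0-999 ms before starting, so N hosts don't all hit the
    # destination at the same instant ($RANDOM requires bash).
    sleep "0.$(printf '%03d' $((RANDOM % 1000)))" && ./my_bzfs_job.py --replicate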
Quick examples

- Local replicate: `bzfs pool/src/ds pool/backup/ds`
- Pull from remote: `bzfs user@host:pool/src/ds pool/backup/ds`
- Jobrunner (periodic): run the shared jobconfig with daemon mode for 1s cadence: `... --replicate --daemon-replication-frequency 1s` (sub-second like `100ms` is possible in constrained setups). Use separate daemons for `--create-src-snapshots`, `--replicate`, and `--prune-*` (see the sketch below).
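A rough sketch of that split, reusing the hypothetical `my_bzfs_job.py` jobconfig from above; the exact `--prune-*` flag names are documented in the jobrunner README:

    # Dedicated loops per host, so slow snapshot creation or pruning never
    # delays the 1-second replication cadence.
    ./my_bzfs_job.py --create-src-snapshots                          # snapshot loop
    ./my_bzfs_job.py --replicate --daemon-replication-frequency 1s   # hot replication loop
    # ...plus a separate loop invoking the relevant --prune-* flags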
Links

- Code and docs: https://github.com/whoschek/bzfs
- README: quickstart, filters, safety flags, examples
- Jobrunner README: multi‑host orchestration, jitter, daemon mode, frequencies
- 1.13.0 diff: https://github.com/whoschek/bzfs/compare/v1.12.0...v1.13.0

Notes
- Standard tooling only (ZFS/Unix and Python); no extra runtime deps.

I'd love performance feedback from folks running 1s or sub-second replication across multiple datasets/hosts:
- per-iteration wall time, number/size of incremental snapshots, dataset counts, and link RTTs help contextualize results.

Happy to answer questions!