Bzfs 1.13.0 – 1-second (even sub-second) ZFS replication across a fleet

bzfs is a simple, reliable CLI for replicating ZFS snapshots (zfs send/receive) locally or over SSH. Its companion, bzfs_jobrunner, turns that into periodic snapshot/replication/pruning jobs across N source hosts and M destination hosts, driven by one versioned job config.

This release makes 1-second replication frequency practical for small incrementals, and even sub-second frequency possible in constrained setups (low RTT, few datasets, daemon mode).

v1.13.0 focuses on cutting per-iteration latency, the enemy of high-frequency replication at fleet scale:

- SSH reuse across datasets and on startup: fewer handshakes and fewer round-trips, which is where small incremental sends spend much of their time (see the OpenSSH sketch after this list).
- Earlier stream start: "bytes to send" is estimated in parallel so the data path can open sooner instead of blocking on preflight.
- Smarter caching: faster snapshot-list hashing and shorter cache paths to reduce repeated ZFS queries in tight loops.
- More resilient connects: the SSH control path is retried briefly before failing, to smooth over transient blips.
- Cleaner ops: normalized exit codes; "Broken pipe" noise is suppressed when a user kills a pipeline.
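The sketch below is not bzfs; it is stock OpenSSH connection multiplexing, shown only to illustrate the kind of handshake savings that connection reuse buys (bzfs sets up its own reuse automatically). Host and dataset names are the placeholders used elsewhere in this post.

    # Stock OpenSSH options, not bzfs flags. The first command pays the TCP and
    # key-exchange cost once and keeps a master connection alive for 60s.
    ssh -o ControlMaster=auto \
        -o ControlPath=~/.ssh/cm-%r@%h:%p \
        -o ControlPersist=60s \
        user@host true

    # Subsequent commands ride on the open session: no handshake, roughly one
    # round-trip each, which is what makes tiny incremental sends cheap.
    ssh -o ControlPath=~/.ssh/cm-%r@%h:%p user@host zfs list -t snapshot -o name pool/src/ds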
Why this matters

- At 1s cadence, fixed costs (session setup, snapshot enumeration) dominate. Shaving RTTs and redundant `zfs list` calls yields bigger wins than raw throughput.
- For fleets, the tail matters: reducing per-job jitter and startup overhead improves end-to-end freshness when multiplied across N×M jobs.

1-second (and sub-second) replication

- Use daemon mode to avoid per-process startup costs; keep the process hot and loop at `--daemon-replication-frequency` (e.g., `1s`, even `100ms` for constrained cases). A minimal sketch follows this list.
- Reuse SSH connections (now the default) to avoid handshakes even for new processes.
- Keep per-dataset snapshot counts low and prune aggressively; fewer entries make `zfs list -t snapshot` faster.
- Limit scope to only the datasets that truly need the cadence (filters such as `--exclude-dataset*` and `--skip-parent`).
- In fleets, add small jitter to avoid thundering herds, and cap workers to match CPU, I/O, and link RTT.
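A minimal sketch of that daemon loop, assuming the shared jobconfig is invoked as an executable script: the script name below is a hypothetical placeholder, only `--replicate` and `--daemon-replication-frequency` come from this post, and the authoritative invocation lives in the Jobrunner README.

    # Hypothetical jobconfig name; required host/dataset options are omitted
    # rather than guessed (see the Jobrunner README for the real invocation).
    ./my_jobconfig.py --replicate --daemon-replication-frequency 1s

    # On a low-RTT link with few datasets, a tighter loop may hold:
    ./my_jobconfig.py --replicate --daemon-replication-frequency 100ms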
How it works (nutshell)

- Incremental sends from the latest common snapshot; bookmarks are supported for safety and reduced state (a plain-ZFS illustration follows the quick examples below).
- Persistent SSH sessions are reused across datasets/zpools and across runs to avoid handshake/exec overhead.
- Snapshot enumeration uses a cache to avoid re-scanning when nothing has changed.
- Job orchestration via bzfs_jobrunner: the same config file runs on all hosts; add jitter to avoid thundering herds; set worker counts/timeouts for scale.

High-frequency tips

- Prune at a frequency proportional to snapshot creation to keep enumerations fast.
- Use daemon mode; split snapshot/replicate/prune into dedicated loops.
- Add small random start jitter across hosts to reduce cross-fleet contention.
- Tune jobrunner `--workers` and per-worker timeouts for your I/O and RTT envelope.

Quick examples

- Local replicate: `bzfs pool/src/ds pool/backup/ds`
- Pull from a remote host: `bzfs user@host:pool/src/ds pool/backup/ds`
- Jobrunner (periodic): run the shared jobconfig in daemon mode for 1s cadence: `... --replicate --daemon-replication-frequency 1s` (sub-second such as `100ms` is possible in constrained setups). Use separate daemons for `--create-src-snapshots`, `--replicate`, and `--prune-*`.
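To make the "How it works" section above concrete: stripped of SSH reuse, caching, bookmarks, and retries, one pull-style incremental step conceptually reduces to a plain ZFS pipeline like the one below. Illustration only, not bzfs's actual invocation; host, pool, and snapshot names are placeholders.

    # What a single incremental step boils down to in plain OpenZFS terms:
    # @prev is the latest snapshot common to source and destination, @curr is
    # the newest source snapshot. The send runs remotely, the receive locally.
    ssh user@host zfs send -i pool/src/ds@prev pool/src/ds@curr \
        | zfs receive -u pool/backup/ds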
Links

- Code and docs: https://github.com/whoschek/bzfs
- README: quickstart, filters, safety flags, examples
- Jobrunner README: multi-host orchestration, jitter, daemon mode, frequencies
- 1.13.0 diff: https://github.com/whoschek/bzfs/compare/v1.12.0...v1.13.0

Notes

- Standard tooling only (ZFS/Unix and Python); no extra runtime dependencies.

I'd love performance feedback from folks running 1s or sub-second replication across multiple datasets/hosts: per-iteration wall time, the number/size of incremental snapshots, dataset counts, and link RTTs all help contextualize results.

Happy to answer questions!