Ask HN: How do you do store-and-forward telemetry at the edge?

2 points by Aydarbek about 1 month ago
I’m researching patterns for edge / gateway telemetry where the network is unreliable (remote sites, industrial, fleets, etc.) and you need offline buffering + bounded disk + replay once connectivity returns.

Questions for folks running this in production:

What do you use today? (MQTT broker + ??, Kafka/Redpanda/NATS, Redis Streams, custom log files, embedded DB, etc.)

Where do you buffer during outages: append-only log, SQLite/RocksDB, queue-on-disk, something else?

How do you handle backpressure when disk is near full? (drop policy, compression, sampling, prioritization)

What’s your failure nightmare: corruption, replay storms, duplicates, “stuck” consumer offsets, disk-full, clock skew?

What guarantees do you actually need: zero-loss vs “best effort” (and where do you draw that line)?

What metrics/alerts matter most on gateways? (queue depth, replay rate, oldest event age, fsync latency, disk usage, etc.)

I’d love to learn what works, what breaks, and what you wish existing tools did better.
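To make the buffering and disk-full questions concrete, here is a minimal sketch of one possible answer. It assumes a single-writer gateway process, Python 3.7+, and a hypothetical `DiskBuffer` class (not any existing library's API). Records go into fixed-size append-only segment files, each framed with a length, a CRC32, and a wall-clock timestamp. When the total size exceeds `max_bytes`, whole oldest segments are dropped (a drop-oldest policy). `replay()` sends records oldest-first and deletes a segment only after every record in it was sent.

```python
from __future__ import annotations

import os
import struct
import time
import zlib
from pathlib import Path
from typing import Callable

# Hypothetical record framing: payload length, CRC32 of payload, wall-clock timestamp.
HEADER = struct.Struct("<IId")


class DiskBuffer:
    """Bounded store-and-forward buffer built from append-only segment files (illustrative only)."""

    def __init__(self, directory: str, max_bytes: int, segment_bytes: int) -> None:
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.max_bytes = max_bytes          # disk budget for all segments
        self.segment_bytes = segment_bytes  # roll to a new segment past this size
        self.dropped_segments = 0           # evictions under disk pressure (worth alerting on)

    def _segments(self) -> list[Path]:
        # Zero-padded numeric names sort oldest-first.
        return sorted(self.dir.glob("*.seg"))

    def _active_segment(self) -> Path:
        segs = self._segments()
        if segs and segs[-1].stat().st_size < self.segment_bytes:
            return segs[-1]
        next_id = int(segs[-1].stem) + 1 if segs else 0
        return self.dir / f"{next_id:020d}.seg"

    def append(self, payload: bytes) -> None:
        record = HEADER.pack(len(payload), zlib.crc32(payload), time.time()) + payload
        with open(self._active_segment(), "ab") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # durable before returning, at the cost of write latency
        self._enforce_bound()

    def _enforce_bound(self) -> None:
        # Drop-oldest policy: evict whole segments, never the one currently being written.
        segs = self._segments()
        total = sum(p.stat().st_size for p in segs)
        while total > self.max_bytes and len(segs) > 1:
            oldest = segs.pop(0)
            total -= oldest.stat().st_size
            oldest.unlink()
            self.dropped_segments += 1

    @staticmethod
    def _read(path: Path) -> list[tuple[float, bytes]]:
        data, pos, records = path.read_bytes(), 0, []
        while pos + HEADER.size <= len(data):
            length, crc, ts = HEADER.unpack_from(data, pos)
            payload = data[pos + HEADER.size : pos + HEADER.size + length]
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # torn or corrupt record (e.g. power loss mid-write): ignore the rest
            records.append((ts, payload))
            pos += HEADER.size + length
        return records

    def replay(self, send: Callable[[bytes], bool]) -> int:
        """Send records oldest-first; return how many sends succeeded in this call."""
        sent = 0
        for path in self._segments():
            for _, payload in self._read(path):
                if not send(payload):
                    return sent  # link still down: keep this segment and everything newer
                sent += 1
            path.unlink()  # whole segment delivered (corrupt tail, if any, is discarded)
        return sent

    def oldest_event_age(self) -> float | None:
        """Seconds since the oldest buffered event was written, or None if empty."""
        for path in self._segments():
            records = self._read(path)
            if records:
                return time.time() - records[0][0]
        return None
```

Usage would look like `buf.append(b'{"temp": 21.5}')` while offline, then `buf.replay(uplink_send)` once the link returns. Here `uplink_send` is a hypothetical function that returns True only on acknowledged delivery. Dropping the oldest segments keeps the newest readings. A zero-loss requirement would instead mean refusing appends (pushing backpressure to producers) once the budget is hit. Delivery is at-least-once: if a send fails partway through a segment, that segment is resent from its start, so the receiver has to deduplicate (for example by a per-gateway sequence number, not included here). Timestamps are wall-clock, so `oldest_event_age()` inherits any clock skew on the gateway. Per-append `fsync` is the simple, slow choice; batching fsyncs trades a small loss window for throughput.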