Distributed SQL engine for ultra-wide tables
I ran into a practical limitation while working on ML feature engineering and multi-omics data.
At some point, the problem stops being “how many rows” and becomes “how many columns”. Thousands, then tens of thousands, sometimes more.
What I observed in practice:
- Standard SQL databases usually cap out around ~1,000–1,600 columns (a quick probe of this kind of cap is sketched after this list).
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables.
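To make the first point concrete, here is a small probe of a per-table column cap. It uses SQLite's default limit (SQLITE_MAX_COLUMN, 2,000 columns) as a stand-in for the ~1,000–1,600 caps in other engines; the table and column names are made up for illustration.

```python
import sqlite3

# Probe of a per-table column cap, using SQLite's default limit
# (SQLITE_MAX_COLUMN = 2000) as a stand-in for the ~1,000-1,600 caps
# elsewhere. Table/column names are hypothetical.
conn = sqlite3.connect(":memory:")

def try_create(n_cols: int) -> bool:
    cols = ", ".join(f"f{i} REAL" for i in range(n_cols))
    try:
        conn.execute(f"CREATE TABLE features_{n_cols} (id INTEGER, {cols})")
        return True
    except sqlite3.OperationalError as e:
        print(f"{n_cols} columns rejected: {e}")
        return False

print(try_create(1_000))   # fits under the default cap
print(try_create(5_000))   # "too many columns" at feature-engineering widths
```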
At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks.
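One way to see why even parsing hurts: before any planning happens, the raw SQL text for a very wide table is already tens of megabytes. A quick back-of-the-envelope check (column names are hypothetical):

```python
# The DDL text alone for a 1M-column table is roughly 15 MB of SQL before
# any parsing, catalog updates, or planning happen. Names are hypothetical.
n_cols = 1_000_000
ddl = "CREATE TABLE wide (" + ", ".join(f"f{i} DOUBLE" for i in range(n_cols)) + ")"
print(f"{len(ddl) / 1e6:.1f} MB of DDL text for {n_cols:,} columns")
```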
I experimented with a different approach:
- no joins
- no transactions
- columns distributed instead of rows (see the sketch after this list)
- SELECT as the primary operation
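To show what “columns distributed instead of rows” might look like, here is a minimal single-process sketch (my own simplification, not the actual engine): each column is stored as an independent array, assigned to a node by hashing its name, so a SELECT of k columns touches only the shards that own those k columns, no matter how wide the table is.

```python
from typing import Dict, List, Sequence

# Minimal sketch of column-wise distribution (my simplification, not the
# engine described above): each column is an independent array, assigned
# to a node by hashing its name. A SELECT of k columns touches only the
# nodes owning those k columns, regardless of total table width.
class ColumnShardedTable:
    def __init__(self, nodes: int):
        self.nodes: List[Dict[str, list]] = [dict() for _ in range(nodes)]

    def _owner(self, col: str) -> Dict[str, list]:
        return self.nodes[hash(col) % len(self.nodes)]

    def insert_column(self, col: str, values: list) -> None:
        # Whole-column ingest: one write to one owning node, no row reassembly.
        self._owner(col)[col] = values

    def select(self, cols: Sequence[str], limit: int) -> Dict[str, list]:
        # Only the requested columns are fetched; untouched columns cost nothing.
        return {c: self._owner(c).get(c, [])[:limit] for c in cols}

# Hypothetical usage: two feature columns of 1M values each, narrow read back.
table = ColumnShardedTable(nodes=2)
table.insert_column("feature_0", list(range(1_000_000)))
table.insert_column("feature_1", [v * 0.5 for v in range(1_000_000)])
print({k: v[:3] for k, v in table.select(["feature_0", "feature_1"], limit=5_000).items()})
```

Under a layout like this, whole-column inserts and narrow selects over a very wide table stay cheap, which seems to be the property the rough numbers below reflect.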
With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.
On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:
- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second
I’m curious how others here approach ultra-wide datasets. Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins?