Distributed SQL engine for ultra-wide tables
I ran into a practical limitation while working on ML feature engineering and multi-omics data.
At some point, the problem stops being “how many rows” and becomes “how many columns”. Thousands, then tens of thousands, sometimes more.
What I observed in practice:
- Standard SQL databases usually cap out around ~1,000–1,600 columns (a quick probe of this kind of cap is sketched after this list).
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables.
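To make the first point concrete, here is a small probe of a per-table column cap. It uses SQLite's default limit (SQLITE_MAX_COLUMN, 2,000 columns) as a stand-in for the ~1,000–1,600 caps in other engines; the table and column names are made up for illustration.

```python
import sqlite3

# Probe of a per-table column cap, using SQLite's default limit
# (SQLITE_MAX_COLUMN = 2000) as a stand-in for the ~1,000-1,600 caps
# elsewhere. Table/column names are hypothetical.
conn = sqlite3.connect(":memory:")

def try_create(n_cols: int) -> bool:
    cols = ", ".join(f"f{i} REAL" for i in range(n_cols))
    try:
        conn.execute(f"CREATE TABLE features_{n_cols} (id INTEGER, {cols})")
        return True
    except sqlite3.OperationalError as e:
        print(f"{n_cols} columns rejected: {e}")
        return False

print(try_create(1_000))   # fits under the default cap
print(try_create(5_000))   # "too many columns" at feature-engineering widths
```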
At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks.
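One way to see why even parsing hurts: before any planning happens, the raw SQL text for a very wide table is already tens of megabytes. A quick back-of-the-envelope check (column names are hypothetical):

```python
# The DDL text alone for a 1M-column table is roughly 15 MB of SQL before
# any parsing, catalog updates, or planning happen. Names are hypothetical.
n_cols = 1_000_000
ddl = "CREATE TABLE wide (" + ", ".join(f"f{i} DOUBLE" for i in range(n_cols)) + ")"
print(f"{len(ddl) / 1e6:.1f} MB of DDL text for {n_cols:,} columns")
```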
I experimented with a different approach:
- no joins
- no transactions
- columns distributed instead of rows (see the sketch after this list)
- SELECT as the primary operation
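To show what “columns distributed instead of rows” might look like, here is a minimal single-process sketch (my own simplification, not the actual engine): each column is stored as an independent array, assigned to a node by hashing its name, so a SELECT of k columns touches only the shards that own those k columns, no matter how wide the table is.

```python
from typing import Dict, List, Sequence

# Minimal sketch of column-wise distribution (my simplification, not the
# engine described above): each column is an independent array, assigned
# to a node by hashing its name. A SELECT of k columns touches only the
# nodes owning those k columns, regardless of total table width.
class ColumnShardedTable:
    def __init__(self, nodes: int):
        self.nodes: List[Dict[str, list]] = [dict() for _ in range(nodes)]

    def _owner(self, col: str) -> Dict[str, list]:
        return self.nodes[hash(col) % len(self.nodes)]

    def insert_column(self, col: str, values: list) -> None:
        # Whole-column ingest: one write to one owning node, no row reassembly.
        self._owner(col)[col] = values

    def select(self, cols: Sequence[str], limit: int) -> Dict[str, list]:
        # Only the requested columns are fetched; untouched columns cost nothing.
        return {c: self._owner(c).get(c, [])[:limit] for c in cols}

# Hypothetical usage: two feature columns of 1M values each, narrow read back.
table = ColumnShardedTable(nodes=2)
table.insert_column("feature_0", list(range(1_000_000)))
table.insert_column("feature_1", [v * 0.5 for v in range(1_000_000)])
print({k: v[:3] for k, v in table.select(["feature_0", "feature_1"], limit=5_000).items()})
```

Under a layout like this, whole-column inserts and narrow selects over a very wide table stay cheap, which seems to be the property the rough numbers below reflect.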
With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.
On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:
- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second
I’m curious how others here approach ultra-wide datasets. Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins?