Accelerating Spark Queries
In queries, joins are always painful, and sometimes the better approach is to create multi-dimensional indices inside the data itself. So in my spare time I built LitenDB, an open-source project that extends Spark with data and indices stored in Delta Lake to reshape data into fast, distributed tensors using Arrow:
https://github.com/hkverma/litendb

It speeds up join-heavy and analytic queries, simplifies plans, and can deliver 10–100× performance improvements. You can try the Colab notebook here to see how it works:

https://github.com/hkverma/litendb/blob/main/py/notebooks/LitenTpchQ5Q6.ipynb

Would love to hear feedback from the community and explore collaborations.
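To make the idea concrete, here is a toy NumPy sketch (this is not LitenDB's actual API, just an illustration of the principle) of how storing a dimension table as a dense positional index turns a join into a single vectorized gather:

```python
# Illustration only: replacing a hash join with a positional gather,
# the core idea behind indexing dimension data into dense tensors.
# Table names and values are made up for the example.
import numpy as np

# Dimension table: nation_key -> region_key, stored as a dense array
# so the key itself is the array position (the "index" lives in the data).
nation_region = np.array([0, 0, 1, 1, 2])  # 5 nations mapped to 3 regions

# Fact table: each order row carries a nation_key foreign key and a price.
order_nation = np.array([4, 2, 0, 2, 1, 4])
order_price = np.array([10.0, 20.0, 5.0, 7.5, 2.5, 30.0])

# "Join" orders to regions: one vectorized gather, no hash probe per row.
order_region = nation_region[order_nation]

# Aggregate revenue per region: a grouped sum via bincount.
revenue_by_region = np.bincount(order_region, weights=order_price, minlength=3)
```

The same gather-then-aggregate pattern is what makes queries like TPC-H Q5 cheap once the dimension chain is flattened into array indices.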
Thanks,
HK