Launch HN: ParaQuery (YC X25) – GPU-Accelerated Spark/SQL

Hey HN! I'm Win, founder of ParaQuery (https://paraquery.com), a fully managed, GPU-accelerated Spark + SQL solution. We deliver BigQuery's ease of use (or easier) while being significantly more cost-efficient and performant.

Here's a short demo video showing ParaQuery (vs. BigQuery) on a simple ETL job: https://www.youtube.com/watch?v=uu379YnccGU

It's well known, at least among researchers and GPU companies like NVIDIA, that GPUs are very good at many SQL and dataframe tasks. So much so that, in 2018, NVIDIA launched the RAPIDS program and the Spark-RAPIDS plugin (https://github.com/NVIDIA/spark-rapids). I actually found out because, at the time, I was trying to build a CUDA-based lambda calculus interpreter... one of several ideas I didn't manage to implement, haha.

There seems to be a perception among at least some engineers that GPUs are only good for AI, graphics, and maybe image processing (maybe! someone actually told me they thought GPUs are bad for image processing!). Traditional data processing doesn't come to mind. But GPUs are actually good for that as well!

At a high level, big data processing is a high-throughput, massively parallel workload. GPUs are hardware specialized for exactly that, are highly programmable, and (now) happen to be highly available on the cloud. Even better, GPU *memory* is tuned for bandwidth over raw latency, which only improves their throughput advantage over CPUs. And after just a few minutes with a cloud cost calculator, it's clear that GPUs are cost-effective even on the major clouds.
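For the curious, here's roughly what turning on the open-source Spark-RAPIDS plugin mentioned above looks like on plain PySpark. A minimal sketch, not our production setup: the jar path, versions, and resource numbers are illustrative and need to match your cluster.

    # Minimal sketch: enable NVIDIA's Spark-RAPIDS plugin on a stock
    # PySpark session. Assumes Spark 3.x, a CUDA-capable GPU on each
    # executor, and the rapids-4-spark jar available at the given path.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("gpu-etl-sketch")
        # The RAPIDS SQL plugin rewrites supported physical operators
        # (scans, filters, joins, aggregations) into GPU versions.
        .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
        # Illustrative jar path -- match it to your Spark/CUDA versions.
        .config("spark.jars", "/opt/rapids/rapids-4-spark_2.12-24.02.0.jar")
        # One GPU per executor, shared by up to four concurrent tasks.
        .config("spark.executor.resource.gpu.amount", "1")
        .config("spark.task.resource.gpu.amount", "0.25")
        .getOrCreate()
    )

    # Unsupported expressions transparently fall back to the CPU, so
    # existing SQL keeps working. GPU-planned operators show up as
    # GpuProject, GpuHashAggregate, etc. in the physical plan:
    df = spark.range(100_000_000).selectExpr("id % 1000 AS k", "id AS v")
    df.groupBy("k").sum("v").explain()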
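And to make the cost-calculator point concrete, here's the kind of napkin math I mean. Every number below is a placeholder assumption for illustration (check your own cloud's pricing pages), but the shape of the conclusion is the point:

    # Back-of-envelope: effective memory bandwidth per dollar, CPU node
    # vs. GPU node. All prices and bandwidths are assumptions, not quotes.

    cpu_node = {
        "name": "32-vCPU general-purpose VM",
        "usd_per_hr": 1.60,    # assumed on-demand price
        "mem_bw_gb_s": 100,    # assumed aggregate DRAM bandwidth
    }
    gpu_node = {
        "name": "VM with one mid-range datacenter GPU",
        "usd_per_hr": 1.00,    # assumed on-demand price
        "mem_bw_gb_s": 300,    # assumed GPU memory bandwidth (T4/L4 class)
    }

    for node in (cpu_node, gpu_node):
        ratio = node["mem_bw_gb_s"] / node["usd_per_hr"]
        print(f'{node["name"]}: {ratio:.0f} (GB/s) per ($/hr)')

Under those assumptions the GPU node delivers roughly 5x the bandwidth per dollar, and for scan-heavy SQL that is memory-bandwidth-bound, bandwidth per dollar is a crude proxy for work per dollar.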
To be honest, I thought using GPUs for SQL processing would have taken off by now, but it hasn't. So, just over a year ago, I started working on actually deploying a cloud-based data platform powered by GPUs (i.e. Spark-RAPIDS), spurred by a friend-of-a-friend(-of-a-friend) who had BigQuery cost concerns at his startup. After getting a proof of concept done and a letter of intent... well, nothing happened! Even after over half a year. But then something magical did happen: their cloud credits ran out!

And now they're saving over 60% on their BigQuery bill by using ParaQuery, while also running 2x faster -- with zero data migration needed (courtesy of Spark's GCS connector). By the way, I'm not sure about other people's experiences, but... we're pretty far from being IO-bound (to the surprise of many engineers I've spoken to).
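Since "zero data migration" raises eyebrows, here's what it means in practice: Spark reads and writes the existing buckets in place. A rough PySpark sketch, reusing the GPU-enabled session from the earlier sketch and assuming the data already sits in GCS as Parquet with the GCS connector jar on the classpath; the bucket, paths, and columns are invented:

    # Read straight from the bucket the data already lives in -- no
    # copy into a proprietary store (bucket/column names are made up).
    orders = spark.read.parquet("gs://acme-datalake/raw/orders/")

    daily = (
        orders
        .where("order_date >= '2024-01-01'")
        .groupBy("order_date")
        .sum("amount")
    )

    # Results land back in GCS as plain Parquet files.
    daily.write.mode("overwrite").parquet("gs://acme-datalake/marts/daily_orders/")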
I think the future of high-throughput compute is computing on high-throughput hardware. If you think so too, or you have scaling data challenges, you can sign up here: https://paraquery.com/waitlist. Sorry for the waitlist, but we're not ready for a self-serve experience just yet -- it would front-load significant engineering and hardware cost. But we'll get there, so stay tuned!

Thanks for reading! What have your experiences been with huge ETL / processing loads? Was cost or performance an issue? And what do you think about GPU acceleration (GPGPU)? Did you think GPUs were simply expensive? Would love to just talk about tech here!