What's the best way to annotate large Parquet LLM logs without requiring a full rewrite?
I asked this on the Apache mailing list but haven't found a good solution yet. Wondering if anyone has some ideas for how to engineer this?
Here's my problem: I have gigabytes of LLM conversation logs in Parquet on S3. I want to add per-row annotations (LLM-as-a-judge scores), ideally without touching the original text data.
So for a given dataset, I want to add a new column. This seemed like a perfect use case for Iceberg. Iceberg does let you evolve a table's schema, including adding a column. But you can only add a column with a default value. If I want to fill in that column with annotations, Iceberg makes me rewrite every row. So despite being based on Parquet, a column-oriented format, I need to rewrite the entire source text data (gigabytes of data) just to add ~1 MB of annotations. This feels wildly inefficient.
I considered just storing the column in its own table and then joining them. This does work, but the joins are annoying to work with, and I suspect query engines do not optimize a "join on row_number" operation well.
I've been exploring little-known features of Parquet, like the file_path field, to store column data in external files. But literally zero Parquet clients support this.
I'm running out of ideas for how to work with this data efficiently. It's bad enough that I am considering building my own table format if I can't find a solution. Anyone have suggestions?