What's the best way to annotate large Parquet LLM logs without requiring a full rewrite?
I asked this on the Apache mailing list but haven't found a good solution yet. Wondering if anyone has some ideas for how to engineer this?
Here's my problem: I have gigabytes of LLM conversation logs in Parquet on S3. I want to add per-row annotations (LLM-as-a-judge scores), ideally without touching the original text data.
So for a given dataset, I want to add a new column. This seemed like a perfect use case for Iceberg. Iceberg does let you evolve a table's schema, including adding a column. But you can only add a column with a default value. If I want to fill in that column with annotations, Iceberg makes me rewrite every row. So despite being based on Parquet, a column-oriented format, I need to rewrite the entire source text data (gigabytes of data) just to add ~1 MB of annotations. This feels wildly inefficient.
I considered just storing the column in its own table and then joining them. This does work, but the joins are annoying to work with, and I suspect query engines do not optimize a "join on row_number" operation well.
I've been exploring little-known features of Parquet, like the file_path field, to store column data in external files. But literally zero Parquet clients support this.
I'm running out of ideas for how to work with this data efficiently. It's bad enough that I am considering building my own table format if I can't find a solution. Anyone have suggestions?