Data processing with natural language prompts
I've got a dataset of ~100K input-output pairs that I want to use for fine-tuning Llama. Unfortunately it's not the cleanest dataset so I'm having to spend some time tidying it up. For example, I only want records in English, and I also only want to include records where the input has foul language (as that's what I need for my use-case). There are loads more checks like these that I want to run, and in general I can't run these checks in a deterministic way because they require understanding natural language.

It's relatively straightforward to get GPT-4o to tell me (for a single record) whether or not it's in English, and whether or not it contains foul language. But if I want to run these checks over my entire dataset, I need to set up some async pipelines and it all becomes very tedious.

Collectively this cleaning process is actually taking me ages. I'm wondering, what do y'all use for this? Are there solutions out there that could help me be faster? I expected there to be some nice product out there where I can upload my dataset and interact with it via prompts, e.g. ('remove all records without foul language in them'), but I can't really find anything. Am I missing something super obvious?
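For context, here's a minimal sketch of the kind of async pipeline I mean (assuming the openai Python client; the prompt wording, concurrency cap, and record field names are just placeholders for what I'm actually running):

```python
# Per-record check plus a bounded-concurrency async pipeline.
# Assumes openai>=1.0 and OPENAI_API_KEY in the environment.
import asyncio
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(20)  # cap concurrent requests

PROMPT = (
    "For the text below, answer with JSON: "
    '{"is_english": bool, "has_foul_language": bool}\n\nText: '
)

async def check_record(record: dict) -> dict:
    # One GPT-4o call per record, gated by the semaphore.
    async with semaphore:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT + record["input"]}],
            response_format={"type": "json_object"},
        )
    record["checks"] = json.loads(resp.choices[0].message.content)
    return record

async def clean(records: list[dict]) -> list[dict]:
    checked = await asyncio.gather(*(check_record(r) for r in records))
    # Keep only English records whose input contains foul language.
    return [
        r for r in checked
        if r["checks"]["is_english"] and r["checks"]["has_foul_language"]
    ]

# cleaned = asyncio.run(clean(records))
```

Even with a semaphore and gather, this is the part that gets tedious at 100K records: rate limits, retries, and partial failures all land on me, and every new check means another round of this.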