Data processing with natural language prompts
I've got a dataset of ~100K input-output pairs that I want to use for fine-tuning Llama. Unfortunately it's not the cleanest dataset so I'm having to spend some time tidying it up. For example, I only want records in English, and I also only want to include records where the input has foul language (as that's what I need for my use-case). There are loads more checks like these that I want to run, and in general I can't run these checks in a deterministic way because they require understanding natural language.

It's relatively straightforward to get GPT-4o to tell me (for a single record) whether or not it's in English, and whether or not it contains foul language. But if I want to run these checks over my entire dataset, I need to set up some async pipelines and it all becomes very tedious.

Collectively this cleaning process is actually taking me ages. I'm wondering, what do y'all use for this? Are there solutions out there that could help me be faster? I expected there to be some nice product out there where I can upload my dataset and interact with it via prompts, e.g. ('remove all records without foul language in them'), but I can't really find anything. Am I missing something super obvious?
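For context, here's a minimal sketch of the kind of async pipeline I mean (assuming the openai Python client; the prompt wording, concurrency cap, and record field names are just placeholders for what I'm actually running):

```python
# Per-record check plus a bounded-concurrency async pipeline.
# Assumes openai>=1.0 and OPENAI_API_KEY in the environment.
import asyncio
import json
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(20)  # cap concurrent requests

PROMPT = (
    "For the text below, answer with JSON: "
    '{"is_english": bool, "has_foul_language": bool}\n\nText: '
)

async def check_record(record: dict) -> dict:
    # One GPT-4o call per record, gated by the semaphore.
    async with semaphore:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT + record["input"]}],
            response_format={"type": "json_object"},
        )
    record["checks"] = json.loads(resp.choices[0].message.content)
    return record

async def clean(records: list[dict]) -> list[dict]:
    checked = await asyncio.gather(*(check_record(r) for r in records))
    # Keep only English records whose input contains foul language.
    return [
        r for r in checked
        if r["checks"]["is_english"] and r["checks"]["has_foul_language"]
    ]

# cleaned = asyncio.run(clean(records))
```

Even with a semaphore and gather, this is the part that gets tedious at 100K records: rate limits, retries, and partial failures all land on me, and every new check means another round of this.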