Processing data with natural-language prompts

4 points | by brockmeier | 8 months ago
I've got a dataset of ~100K input-output pairs that I want to use for fine-tuning Llama. Unfortunately it's not the cleanest dataset so I'm having to spend some time tidying it up. For example, I only want records in English, and I also only want to include records where the input has foul language (as that's what I need for my use-case). There's loads more checks like these that I want to run, and in general I can't run these checks in a deterministic way because they require understanding natural language.

It's relatively straightforward to get GPT-4o to tell me (for a single record) whether or not it's in English, and whether or not it contains foul language. But if I want to run these checks over my entire dataset, I need to set up some async pipelines and it all becomes very tedious.

Collectively this cleaning process is actually taking me ages. I'm wondering, what do y'all use for this? Are there solutions out there that could help me be faster? I expected there to be some nice product out there where I can upload my dataset and interact with it via prompts, e.g. ('remove all records without foul language in them'), but I can't really find anything. Am I missing something super obvious?
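
For context, a minimal sketch of the kind of async filtering pipeline described above might look like the following. The details here are assumptions, not from the post: records stored as JSONL with an "input" field, the OpenAI Python SDK (v1+) installed, OPENAI_API_KEY set in the environment, and a simple yes/no prompt combining the English and foul-language checks.

    # Sketch only: filter a JSONL dataset by asking GPT-4o a yes/no question per record.
    # Assumed: records are JSON objects with an "input" key; file names are placeholders.
    import asyncio
    import json

    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    semaphore = asyncio.Semaphore(20)  # cap concurrent requests to respect rate limits

    PROMPT = (
        "Answer with a single word, YES or NO. "
        "Is the following text written in English AND does it contain foul language?\n\n{text}"
    )

    async def keep_record(record: dict) -> bool:
        """Ask GPT-4o whether this record passes the filter."""
        async with semaphore:
            resp = await client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": PROMPT.format(text=record["input"])}],
                max_tokens=1,
                temperature=0,
            )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    async def main() -> None:
        with open("dataset.jsonl") as f:
            records = [json.loads(line) for line in f]

        # Run all checks concurrently, bounded by the semaphore above.
        verdicts = await asyncio.gather(*(keep_record(r) for r in records))
        kept = [r for r, ok in zip(records, verdicts) if ok]

        with open("cleaned.jsonl", "w") as f:
            for r in kept:
                f.write(json.dumps(r) + "\n")

    asyncio.run(main())

In practice you'd probably want retries and checkpointing on top of this, which is exactly the tedium the post is asking to avoid.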