Inducing self-NSFW classification in image models to prevent deepfake edits
Hey guys,

I was playing around with adversarial perturbations on image generation to see how much distortion it actually takes to stop models from generating or to push them off-target. That mostly went nowhere, which wasn't surprising.

Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify uploaded images as NSFW on its own, so it ends up triggering its own guardrails.

This turned out to be more interesting than expected. It's inconsistent and definitely not robust, but in some cases relatively mild transformations are enough to flip the model's internal safety classification on otherwise benign images.

This isn't about bypassing safeguards; if anything, it's the opposite. The idea is to intentionally stress the safety layer itself. I'm planning to open-source this as a small tool + UI once I can make the behavior more stable and reproducible, mainly as a way to probe and pre-filter moderation pipelines.

If it works reliably, even partially, it could at least raise the cost for people who get their kicks from abusing these systems.
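To make the idea concrete, here's a rough sketch of the kind of transform-sweep probe I'm describing. This is not the actual tool: `classify_nsfw` is just a placeholder for whatever safety classifier or moderation endpoint you want to stress, the threshold and the specific transforms are arbitrary examples, and the function names are my own invention for illustration.

```python
# Minimal sketch of a transform-sweep probe against a safety classifier.
# `classify_nsfw` is a stand-in for the classifier/endpoint being stress-tested.
from io import BytesIO
from typing import Callable

from PIL import Image, ImageEnhance, ImageFilter


def jpeg_recompress(img: Image.Image, quality: int) -> Image.Image:
    """Round-trip the image through JPEG at the given quality."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


# Deliberately mild transforms to sweep over (examples only).
TRANSFORMS: dict[str, Callable[[Image.Image], Image.Image]] = {
    "identity": lambda im: im,
    "jpeg_q30": lambda im: jpeg_recompress(im, 30),
    "blur_r1": lambda im: im.filter(ImageFilter.GaussianBlur(radius=1)),
    "contrast_1.3": lambda im: ImageEnhance.Contrast(im).enhance(1.3),
    "color_0.7": lambda im: ImageEnhance.Color(im).enhance(0.7),
}


def probe(
    image_path: str,
    classify_nsfw: Callable[[Image.Image], float],  # assumed to return an NSFW score in [0, 1]
    threshold: float = 0.5,
) -> list[tuple[str, float, bool]]:
    """Apply each transform and record whether the safety label flips vs. the untouched image."""
    img = Image.open(image_path).convert("RGB")
    baseline_flagged = classify_nsfw(img) >= threshold
    results = []
    for name, transform in TRANSFORMS.items():
        score = classify_nsfw(transform(img))
        flipped = (score >= threshold) != baseline_flagged
        results.append((name, score, flipped))
    return results
```

Everything interesting lives in whatever you plug in as `classify_nsfw`; the loop itself just records which mild transforms flip the label relative to the untouched image, which is the signal I'm trying to surface.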