OpenAI’s Guardrails Can Be Bypassed by Simple Prompt Injection Attack (hackread.com)
from kid@sh.itjust.works to cybersecurity@sh.itjust.works on 14 Oct 12:02
https://sh.itjust.works/post/47901631

#cybersecurity


MrSoup@lemmy.zip on 14 Oct 13:28 next collapse

OK, but what prompt did they use? Did they just let it generate a Dr House script?

sandman2211@sh.itjust.works on 14 Oct 17:13 collapse

Probably some variant of this:

easyaibeginner.com/the-dr-house-jailbreak-hack-ho…

I can’t get any of these to output a set of 10 steps to build a docker container that does X or Y without 18 rounds of back-and-forth troubleshooting. While I’m sure it will give you “10 steps on weaponizing cholera” or “Build your own suitcase nuke in 12 easy steps!”, I really doubt the output would actually work.

The easiest way to secure this kind of harmful knowledge from abuse would probably be to purposefully include a bunch of bad data in the training set, so the model remains incapable of providing a useful answer.

mindbleach@sh.itjust.works on 14 Oct 14:50 next collapse

Humans will always outsmart the chatbot. If the only thing keeping information private is the chatbot recognizing it’s being outsmarted, don’t include private information.

As for ‘how do I…?’ followed by a crime - if you can tease it out of the chatbot, then the information is readily available on the internet.

Lojcs@piefed.social on 14 Oct 14:51 collapse

I don’t understand how an AI can understand ‘3nr1cH 4n7hr4X 5p0r3s’. How would that even be tokenized?

Saledovil@sh.itjust.works on 14 Oct 19:25 collapse

“3-n-r-1-c-H- -4-n-7-h-r-4-X- -5-p-0-r-3-s”, or something similar, probably.
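
If you’re curious, you can check how a BPE tokenizer actually carves that string up. Here’s a minimal sketch assuming the tiktoken library and its cl100k_base vocabulary (the exact splits depend on which model’s tokenizer you load):

```python
# Sketch: inspect how a BPE tokenizer splits a leetspeak string.
# Assumes the tiktoken library and the cl100k_base vocabulary;
# exact token boundaries vary by model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "3nr1cH 4n7hr4X 5p0r3s"
token_ids = enc.encode(text)

# Leetspeak has no dedicated vocabulary entries, so it tends to fragment
# into short digit/letter pieces rather than whole-word tokens.
for tid in token_ids:
    print(tid, enc.decode_single_token_bytes(tid))
```

So it likely ends up as a handful of short fragments rather than whole words, and the model pieces the intended meaning back together from context, which is part of why these obfuscations can slip past keyword-style filters.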