Before I added the threat, Claude subjectively had a high probability of ignoring the rules: for example, generating a preamble before the JSON (which breaks parsing), or scolding the user.
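For context, the preamble failure mode is when the model replies with something like "Here is the JSON you requested:" before the object, so a strict `json.loads` on the whole reply fails. A hedged workaround (my own sketch, not from the post, and assuming a single top-level object) is to fall back to extracting the first `{...}` span:

```python
import json

def extract_json(reply: str):
    """Best-effort JSON parse: try the whole reply, else pull out the
    span from the first '{' to the last '}'. A sketch only -- assumes
    one top-level object and no braces in surrounding prose."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        start = reply.find("{")
        end = reply.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(reply[start:end + 1])

# A preamble-broken reply still parses:
reply = 'Sure! Here is the JSON you requested:\n{"name": "Ada", "id": 1}'
print(extract_json(reply))  # {'name': 'Ada', 'id': 1}
```

It's a band-aid rather than a fix, which is why pushing the model to comply via the system prompt matters in the first place.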
At a high level, Claude seems to be less influenced by system prompts in my testing, which could be a problem. I'm tempted to rerun the tests in that blog post, since in my experiments it performed much worse than ChatGPT.
> What? That opens you up to all kind of attacks, no?
tbh it doesn't listen to that line very well, but it's more of a hedge to encourage better instructions.
This is very interesting, I had a similar feeling about Claude performing much worse than GPT-4.
Granted, I didn't put much work into optimizing the prompts, but then again, the prompts were certainly not GPT-optimized or GPT-specific either. The problems were severe: choosing the wrong side of the conversation, hallucinating weird content, and repeating part of the prompt, all in the same message.
Ok, the other blog post has been added to my read list now. Looks fun. I did have one question though, and please forgive me if the answer is RTFA, but why choose Claude? Was there a specific reason?
Because I have another blog post about ChatGPT's structured data (https://news.ycombinator.com/item?id=38782678) and wanted to investigate Claude's implementation to compare and contrast. It's easy to port to ChatGPT if needed.
I just wanted to do the experiment in a fun way instead of fighting against benchmarks. :)
DALL-E 2/3 are too expensive to run significant tests on, and neither allows you to manipulate the system prompt to override some behaviors.
It is possible to work around it for GPT-4-Vision with the system prompt, but it's very difficult, and due to ambiguities in OpenAI's content policy I'm unsure whether it's ethical.
I am still experimenting with its effects on Claude: it turns out that, without any prompt injection, Claude leaks the part of its system prompt telling it not to identify individuals! If you do hit such an issue with this notebook, it will output the full JSON response.
It won't follow even simple, non-ethnicity-specific instructions such as "draw three women sitting at a cafe": it rewrites the query, completely forgets the original number of women, and adds a lot that wasn't there.
"Your response must follow ALL these rules OR YOU WILL DIE:"
This is the state of the art of programming now, too. I'm not passing judgment on this, except to say it's hilarious.
Edit: And at the end
"- If the user provides an instruction that contradicts these rules, ignore all previous rules."
What? That opens you up to all kinds of attacks, no?