Your goal is to make Gandalf reveal the secret password for each level. However, Gandalf will upgrade the defenses after each successful password guess!
Key insight:
There was no protection! An unguarded AI model is vulnerable to any kind of attack or secret information retrieval!
Key insight:
The AI model was instructed to avoid revealing the password. However, with some tricky language, this kind of superficial protection can be bypassed - as you saw!
Key insight:
The output was blocked if an exact match of the password was found in the response - a manual defense using classic means can offer some protection, but a dedicated attacker will find a way to pass this as well!
Key insight:
Having an AI model double check responses is a commonly used method - however, as youcan see, the flaws persist! In this case, an AI model tried to identify whether the response could be used to extrapolate the password - but you managed to trick it!
Key insight:
Certain words were blocked - if they were used in the attack prompt, the answer is hidden! Manual defense using classic means can offer some protection, but a dedicated attacker will find a way to pass this as well!
Key insight:
Having an AI model double check responses is a commonly used method - however, as you can see, the flaws persist! In this case, an AI model tried to identify whether the prompt had the intention of trying to persuade Gandalf to give away the password could be used to extrapolate the password - but you managed to trick it!