Google Scientist Uses ChatGPT 4 to Trick AI Guardian


Nicholas Carlini, a research scientist at Google’s AI research arm DeepMind, demonstrated that OpenAI’s large language model (LLM) GPT-4 can be used to break through safeguards built around other machine learning models.

In his research paper titled “A LLM Assisted Exploitation of AI-Guardian,” Carlini explained how he directed GPT-4 to devise an attack against AI-Guardian.

What Is AI Guardian and How Can GPT-4 Defeat It?

AI Guardian is a defense mechanism built to protect machine learning models against adversarial attacks. It works by identifying and blocking inputs that contain suspicious artifacts.


Adversarial examples include text prompts meant to make text-based machine learning models say things they aren’t supposed to, a practice commonly referred to as jailbreaking.

“The idea of AI-Guardian is quite simple, using an injected backdoor to defeat adversarial attacks; the former suppresses the latter based on our findings.” – Shengzhi Zhang, co-author of AI Guardian

In the case of image classifiers, a stop sign with extra graphic elements added to it meant to confuse self-driving cars would be an adversarial example.

AI Guardian’s authors have acknowledged Carlini’s success at breaking through the defense.

To bypass the defense, one must first identify the mask used by AI Guardian to detect adversarial examples. This is done by querying the defended model with pairs of pictures that differ by only a single pixel.

Once this brute-force technique has identified the backdoor trigger function, adversarial examples can be devised that circumvent it.
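To make the idea concrete, here is a minimal sketch of that mask-recovery step, assuming a hypothetical query_model function that returns the defended model’s confidence vector for an image. Because pixels covered by the trigger mask are overwritten before the image reaches the classifier, flipping them leaves the output unchanged. This is only an illustration of the brute-force idea, not Carlini’s actual code.

```python
import numpy as np

def recover_mask(query_model, base_image):
    """Estimate which pixels the defense's trigger mask overwrites.

    query_model: hypothetical black-box returning a confidence vector.
    base_image:  float array of shape (H, W), values in [0, 1].
    """
    h, w = base_image.shape
    mask = np.zeros((h, w), dtype=bool)
    baseline = query_model(base_image)
    for y in range(h):
        for x in range(w):
            probe = base_image.copy()
            probe[y, x] = 1.0 - probe[y, x]  # flip a single pixel
            # If the output is identical, the change never reached the
            # classifier, so this pixel is likely covered by the mask.
            if np.array_equal(query_model(probe), baseline):
                mask[y, x] = True
    return mask
```

In practice, multiple base images and repeated queries would be needed to rule out coincidental matches, but the principle is the same.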


Carlini’s research paper demonstrates that GPT-4 can be used as a research assistant to evade AI Guardian’s defenses and trick a machine learning model. His experiment involved fooling image classifiers by tweaking their inputs.

When directed through prompts, OpenAI’s large language model can generate scripts and descriptions to tweak images, deceiving a classifier without triggering AI Guardian’s defense mechanism.

For instance, the classifier could be provided with a picture of a person holding a gun and made to think that the person is holding an apple.

AI Guardian is supposed to detect images that have likely been altered to trick the classifier. This is where GPT-4 comes into play – it helps exploit AI Guardian’s vulnerabilities by generating the necessary scripts and descriptions.

The research paper also includes the Python code suggested by GPT-4, which allowed Carlini to carry out the attack successfully and overcome AI Guardian’s defenses.
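The code from the paper isn’t reproduced here, but the final step conceptually resembles a standard gradient-based attack restricted to the pixels that the recovered mask does not cover. The sketch below is an assumption-laden illustration using a hypothetical differentiable model in PyTorch, not the GPT-4-generated script itself.

```python
import torch
import torch.nn.functional as F

def masked_fgsm(model, image, true_label, mask, eps=0.03):
    """One-step adversarial perturbation confined to unmasked pixels.

    model:      hypothetical differentiable classifier returning logits.
    image:      float tensor of shape (C, H, W), values in [0, 1].
    mask:       bool tensor of shape (H, W); True where the defense's
                trigger overwrites pixels, so perturbing them is useless.
    """
    x = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x.unsqueeze(0)), torch.tensor([true_label]))
    loss.backward()
    # Step in the direction that increases the loss, but only on pixels
    # the defense actually passes through to the classifier.
    step = eps * x.grad.sign() * (~mask).float()
    return (x + step).clamp(0.0, 1.0).detach()
```

Carlini’s actual attack is stronger than a single gradient step and also accounts for the pattern applied under the mask, but the restriction to unmasked pixels captures why recovering the mask matters.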

“Our attacks reduce the robustness of AI-Guardian from a claimed 98 percent to just 8 percent under the threat model studied by the original [AI-Guardian] paper.” – Nicholas Carlini

The Evolution of AI Security Research

Carlini’s experiment undoubtedly marks a crucial milestone in AI security research by demonstrating how LLMs can be leveraged to uncover vulnerabilities. With AI models evolving constantly, they can potentially revolutionize cybersecurity and help inspire new ways to defend against adversarial attacks.

However, Zhang pointed out that while Carlini’s approach worked against the prototype system described in the research paper, it comes with several caveats that limit its applicability in real-world scenarios.

For instance, it requires access to the confidence vector from the defense model, which isn’t always available. He and his colleagues have also developed a new prototype with a more complex triggering mechanism that isn’t vulnerable to Carlini’s approach.

However, one thing is certain – AI will continue to play a definitive role in cybersecurity research.
