Problem: AI Models Are Dangerously Easy to Fool
You've trained a state-of-the-art image classifier. It hits 99% accuracy on your test set. Then someone adds barely visible noise to an image — and your model confidently misclassifies a stop sign as a speed limit sign.
This is an adversarial attack. It's real, it's well-documented, and it's actively exploited.
You'll learn:
- Why neural networks are vulnerable to adversarial inputs
- How the most common attack methods work (with code)
- What defenses are actually effective in production
Time: 12 min | Level: Intermediate
Why This Happens
Neural networks don't "see" images the way humans do. They learn high-dimensional statistical patterns in pixel data — and those patterns have unexpected blind spots.
When you train a model, it draws decision boundaries through feature space. Adversarial examples are inputs that have been carefully nudged across those boundaries. The change is imperceptible to humans (often a shift of just 1-2 intensity levels out of 255 per pixel), but it's enough to send the model to a completely different classification region.
Common symptoms:
- Model classifies physically printed adversarial patches incorrectly in the real world
- Confidence scores remain high (95%+) even on wrong predictions
- Standard data augmentation doesn't help — the attack adapts to it
*A small perturbation moves the input across the model's decision boundary — invisible to you, catastrophic for the classifier*
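The high-dimensional intuition can be made concrete with a toy linear score function: an L-infinity perturbation of size epsilon aligned with the sign of the weights changes the score by epsilon times the L1 norm of the weights, which grows with dimension. A minimal sketch (the dimension and epsilon here are illustrative):

```python
import torch

torch.manual_seed(0)

d = 100_000                      # input dimensionality
w = torch.randn(d)               # weights of a toy linear "classifier"
x = torch.randn(d)               # a benign input

epsilon = 0.01                   # tiny per-feature budget
x_adv = x + epsilon * w.sign()   # FGSM-style perturbation for a linear model

clean_score = w @ x
adv_score = w @ x_adv

# The score shift equals epsilon * ||w||_1, which scales with d:
# a per-feature change of 0.01 produces a large total shift.
print((adv_score - clean_score).item())
print((epsilon * w.abs().sum()).item())  # same value, analytically
```

This is Goodfellow et al.'s original argument for why adversarial examples exist: many tiny, individually invisible changes add up linearly across thousands of dimensions.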
The Main Attack Types
Fast Gradient Sign Method (FGSM)
FGSM is the simplest attack. It computes the gradient of the loss with respect to the input image, then nudges pixels in the direction that increases the loss.
```python
import torch
import torch.nn.functional as F

def fgsm_attack(image, epsilon, data_grad):
    # Take the sign of the gradient — direction matters, not magnitude
    sign_data_grad = data_grad.sign()
    # Perturb the image by epsilon in the gradient direction
    perturbed_image = image + epsilon * sign_data_grad
    # Clip to valid pixel range [0, 1]
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

# Usage
image.requires_grad = True
output = model(image)
loss = F.cross_entropy(output, true_label)
model.zero_grad()
loss.backward()
perturbed = fgsm_attack(image, epsilon=0.01, data_grad=image.grad.data)
```
Expected: With epsilon=0.01, the image looks identical to the human eye but the model misclassifies it.
If it fails:
- Model still correct: Increase epsilon (try 0.05, 0.1). Smaller epsilon = stealthier but weaker attack.
- Image looks distorted: Your epsilon is too high. Stay below 0.05 for imperceptibility.
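To automate that epsilon tuning, you can sweep a few values and report the smallest one that flips the prediction. A sketch under the same setup (`model`, `image`, and `true_label` are assumed placeholders; the FGSM step is inlined so the snippet is self-contained):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(image, epsilon, data_grad):
    # Inline copy of the FGSM step shown above
    return torch.clamp(image + epsilon * data_grad.sign(), 0, 1)

def smallest_flipping_epsilon(model, image, true_label,
                              epsilons=(0.005, 0.01, 0.02, 0.05, 0.1)):
    # Compute the input gradient once, then reuse it for every epsilon
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    model.zero_grad()
    loss.backward()
    grad = image.grad.data
    for eps in epsilons:
        adv = fgsm_attack(image, eps, grad)
        if model(adv).argmax(dim=1).item() != true_label.item():
            return eps  # first epsilon in the sweep that fools the model
    return None  # model withstood all tested epsilons
```

Reusing one gradient is exactly FGSM's single-step approximation; a returned value at or below 0.05 means the attack stays in the imperceptible range discussed above.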
Projected Gradient Descent (PGD)
PGD is FGSM run iteratively. It's stronger because it takes multiple small steps instead of one big one.
```python
def pgd_attack(model, image, label, epsilon=0.03, alpha=0.007, num_iter=40):
    # Start with a random perturbation within the epsilon ball
    perturbed = image + torch.empty_like(image).uniform_(-epsilon, epsilon)
    perturbed = torch.clamp(perturbed, 0, 1).detach()
    for _ in range(num_iter):
        perturbed.requires_grad = True
        output = model(perturbed)
        loss = F.cross_entropy(output, label)
        model.zero_grad()
        loss.backward()
        # Take a step in the gradient direction
        adv_image = perturbed + alpha * perturbed.grad.sign()
        # Project back into the epsilon ball around the original image
        eta = torch.clamp(adv_image - image, -epsilon, epsilon)
        perturbed = torch.clamp(image + eta, 0, 1).detach()
    return perturbed
```
PGD is considered the "gold standard" attack for evaluating model robustness. If your defense holds against PGD, it's credible.
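To use PGD as that evaluation tool, wrap it in a robust-accuracy measurement over a batch. A minimal sketch; `attack_fn` stands for any attack with the same signature as the `pgd_attack` function above:

```python
import torch

def robust_accuracy(model, attack_fn, images, labels, **attack_kwargs):
    # attack_fn: e.g. the pgd_attack function defined above
    model.eval()
    adv = attack_fn(model, images, labels, **attack_kwargs)
    with torch.no_grad():
        preds = model(adv).argmax(dim=1)
    # Fraction of adversarial inputs the model still classifies correctly
    return (preds == labels).float().mean().item()
```

Comparing this number against clean accuracy on the same batch gives the robustness gap that the defenses below try to close.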
Carlini & Wagner (C&W)
C&W is the most powerful general-purpose attack. Instead of maximizing loss, it directly minimizes the distance between the original and adversarial image while forcing misclassification.
```python
# Simplified C&W L2 attack (targeted)
def cw_attack(model, image, target_label, c=1.0, lr=0.01, num_steps=1000):
    # Work in tanh-space to keep pixels in [0, 1] naturally.
    # Clamp slightly inside (0, 1) so atanh stays finite at 0 and 1.
    x = torch.clamp(image, 1e-6, 1 - 1e-6)
    w = torch.atanh(x * 2 - 1).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for step in range(num_steps):
        adv = (torch.tanh(w) + 1) / 2  # Map back to [0, 1]
        output = model(adv)
        # f(x) = max((max Z(i) for i != t) - Z(t), -kappa)
        # Z = logits, t = target class; kappa (confidence margin) is 0 here
        real = output[0][target_label]
        other = torch.max(output[0][[i for i in range(output.shape[1]) if i != target_label]])
        f_loss = torch.clamp(other - real, min=0)
        # L2 distance between original and adversarial
        l2_dist = torch.norm(adv - image)
        loss = l2_dist + c * f_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adv.detach()
```
Why this works: C&W searches for a near-minimal perturbation, which makes it very hard to detect by checking distortion thresholds.
Physical-World Attacks
Adversarial perturbations aren't just digital. Researchers have demonstrated attacks that survive being printed and photographed — meaning a sticker on a stop sign can fool a self-driving car's perception system.
*A printed adversarial patch causes the model to read "Stop" as "Speed Limit 45" with high confidence — even from different angles*
These attacks account for lighting variation, camera angle, and JPEG compression during optimization.
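The usual optimization trick behind such attacks is Expectation over Transformation (EOT): instead of maximizing the loss on one fixed view, maximize the average loss over random transforms of the input. A hedged sketch; the brightness-and-translation transform set here is illustrative, not taken from any specific paper:

```python
import torch
import torch.nn.functional as F

def random_transform(image):
    # Illustrative stand-ins for lighting and viewpoint variation:
    # random brightness scaling plus a small random translation.
    bright = image * torch.empty(1).uniform_(0.8, 1.2)
    shift = torch.randint(-2, 3, (2,))
    rolled = torch.roll(bright, shifts=tuple(shift.tolist()), dims=(-2, -1))
    return torch.clamp(rolled, 0, 1)

def eot_attack(model, image, label, epsilon=0.05, alpha=0.01,
               num_iter=50, num_transforms=8):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(num_iter):
        # Average the loss over several random transforms so the
        # perturbation works under all of them, not just one view.
        loss = sum(
            F.cross_entropy(
                model(random_transform(torch.clamp(image + delta, 0, 1))), label)
            for _ in range(num_transforms)
        ) / num_transforms
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)   # stay within the epsilon ball
        delta.grad.zero_()
    return torch.clamp(image + delta, 0, 1).detach()
```

A perturbation that only fools the model under one exact crop and lighting condition dies the moment the camera moves; averaging over transforms is what makes printed patches survive the physical world.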
Defenses That Actually Work
Adversarial Training (Most Reliable)
Generate adversarial examples on the fly during training and include them in your batches.
```python
def adversarial_training_step(model, optimizer, images, labels, epsilon=0.03):
    # Generate adversarial examples from current model weights
    adv_images = pgd_attack(model, images, labels, epsilon=epsilon)
    # Mix clean and adversarial examples
    combined = torch.cat([images, adv_images])
    combined_labels = torch.cat([labels, labels])
    # Train normally on the mixed batch
    optimizer.zero_grad()
    output = model(combined)
    loss = F.cross_entropy(output, combined_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```
Trade-off: Adversarial training reduces clean accuracy by 5-15%. This is unavoidable — you're making the decision boundary smoother, which costs some expressivity.
Input Preprocessing Defenses
Randomized smoothing, JPEG compression, and feature squeezing can reduce attack effectiveness. They're weaker than adversarial training but add zero training cost.
```python
import torch

def randomized_smoothing_predict(model, image, sigma=0.12, n_samples=100):
    # Add Gaussian noise n times, take the majority vote.
    # This estimates the smoothed classifier; the certified guarantee
    # additionally requires statistical bounds on the vote counts
    # (Cohen et al., 2019).
    noisy = image.unsqueeze(0).repeat(n_samples, 1, 1, 1)
    noisy += torch.randn_like(noisy) * sigma
    noisy = torch.clamp(noisy, 0, 1)
    with torch.no_grad():
        logits = model(noisy)
    votes = logits.argmax(dim=1)
    # Return the class with the most votes
    return votes.mode().values.item()
```
When to use this: Randomized smoothing provides certified robustness with a mathematical guarantee up to a perturbation radius. Adversarial training does not.
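For reference, the certified L2 radius from Cohen et al. (2019) is R = (sigma / 2) * (Phi^-1(p_A) - Phi^-1(p_B)), where p_A and p_B bound the top-two class probabilities under Gaussian noise and Phi^-1 is the standard normal inverse CDF. A small sketch of the computation with SciPy:

```python
from scipy.stats import norm

def certified_radius(sigma, p_a, p_b):
    # Cohen et al. (2019), Theorem 1: the smoothed classifier is provably
    # constant within an L2 ball of this radius, given a lower bound p_a
    # on the top-class probability and an upper bound p_b on the runner-up.
    if p_a <= p_b:
        return 0.0  # cannot certify this input
    return (sigma / 2) * (norm.ppf(p_a) - norm.ppf(p_b))
```

For example, sigma=0.25 with p_A=0.99 and p_B=0.01 certifies a radius of about 0.58 in L2: larger sigma buys a bigger certifiable radius, at the cost of accuracy on the noised inputs.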
Verification
Test your model's robustness using a standard benchmark:
```shell
pip install adversarial-robustness-toolbox
```

```python
from art.attacks.evasion import ProjectedGradientDescent
from art.estimators.classification import PyTorchClassifier

classifier = PyTorchClassifier(model=model, loss=criterion,
                               input_shape=(3, 224, 224), nb_classes=1000)
attack = ProjectedGradientDescent(estimator=classifier, eps=0.03,
                                  eps_step=0.007, max_iter=40)
adv_test = attack.generate(x=test_images)  # test_images: numpy array

# Evaluate robustness
robust_acc = (model(torch.tensor(adv_test)).argmax(1) == labels).float().mean()
print(f"Robust accuracy: {robust_acc:.1%}")
```
You should see: A non-defended model typically drops to 0-15% robust accuracy. A properly adversarially-trained ResNet-50 should hold above 45% on PGD-40 with epsilon=0.03.
*Adversarial training closes the gap between clean and robust accuracy — at some cost to clean performance*
What You Learned
- Adversarial examples exploit the geometry of high-dimensional feature space, not bugs in your code
- FGSM is fast but weak; PGD and C&W are the attacks your defenses need to withstand
- Adversarial training is the most reliable defense, but costs clean accuracy
- Randomized smoothing is the only approach with certified (provable) guarantees
- Physical-world attacks are real and relevant for any system with a real-world sensor
Limitations: No defense is complete. Adaptive attacks — where the attacker knows your defense — can still break adversarial training if epsilon is high enough. Defense is an arms race.
When NOT to use adversarial training: If your threat model doesn't include adversarial users (e.g., a private internal tool with no external inputs), the accuracy trade-off may not be worth it.
Tested with PyTorch 2.3, Adversarial Robustness Toolbox 1.17, Python 3.12