Problem: AI Models Are Dangerously Easy to Fool
You've trained a state-of-the-art image classifier. It hits 99% accuracy on your test set. Then someone adds barely visible noise to an image — and your model confidently misclassifies a stop sign as a speed limit sign.
This is an adversarial attack. It's real, it's well-documented, and it's actively exploited.
You'll learn:
- Why neural networks are vulnerable to adversarial inputs
- How the most common attack methods work (with code)
- What defenses are actually effective in production
Time: 12 min | Level: Intermediate
Why This Happens
Neural networks don't "see" images the way humans do. They learn high-dimensional statistical patterns in pixel data — and those patterns have unexpected blind spots.
When you train a model, it draws decision boundaries through feature space. Adversarial examples are inputs that have been carefully nudged across those boundaries. The change is imperceptible to humans (often a shift of just 1-2 intensity levels out of 255 per pixel), but it's enough to send the model to a completely different classification region.
Common symptoms:
- Model classifies physically printed adversarial patches incorrectly in the real world
- Confidence scores remain high (95%+) even on wrong predictions
- Standard data augmentation doesn't help — the attack adapts to it
*A small perturbation moves the input across the model's decision boundary — invisible to you, catastrophic for the classifier*
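The high-dimensional intuition can be made concrete with a toy linear score function: an L-infinity perturbation of size epsilon aligned with the sign of the weights changes the score by epsilon times the L1 norm of the weights, which grows with dimension. A minimal sketch (the dimension and epsilon here are illustrative):

```python
import torch

torch.manual_seed(0)

d = 100_000                      # input dimensionality
w = torch.randn(d)               # weights of a toy linear "classifier"
x = torch.randn(d)               # a benign input

epsilon = 0.01                   # tiny per-feature budget
x_adv = x + epsilon * w.sign()   # FGSM-style perturbation for a linear model

clean_score = w @ x
adv_score = w @ x_adv

# The score shift equals epsilon * ||w||_1, which scales with d:
# a per-feature change of 0.01 produces a large total shift.
print((adv_score - clean_score).item())
print((epsilon * w.abs().sum()).item())  # same value, analytically
```

This is Goodfellow et al.'s original argument for why adversarial examples exist: many tiny, individually invisible changes add up linearly across thousands of dimensions.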
The Main Attack Types
Fast Gradient Sign Method (FGSM)
FGSM is the simplest attack. It computes the gradient of the loss with respect to the input image, then nudges pixels in the direction that increases the loss.
```python
import torch
import torch.nn.functional as F

def fgsm_attack(image, epsilon, data_grad):
    # Take the sign of the gradient — direction matters, not magnitude
    sign_data_grad = data_grad.sign()
    # Perturb the image by epsilon in the gradient direction
    perturbed_image = image + epsilon * sign_data_grad
    # Clip to valid pixel range [0, 1]
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

# Usage
image.requires_grad = True
output = model(image)
loss = F.cross_entropy(output, true_label)
model.zero_grad()
loss.backward()
perturbed = fgsm_attack(image, epsilon=0.01, data_grad=image.grad.data)
```
Expected: With epsilon=0.01, the image looks identical to the human eye but the model misclassifies it.
If it fails:
- Model still correct: Increase epsilon (try 0.05, 0.1). Smaller epsilon = stealthier but weaker attack.
- Image looks distorted: Your epsilon is too high. Stay below 0.05 for imperceptibility.
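To automate that epsilon tuning, you can sweep a few values and report the smallest one that flips the prediction. A sketch under the same setup (`model`, `image`, and `true_label` are assumed placeholders; the FGSM step is inlined so the snippet is self-contained):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(image, epsilon, data_grad):
    # Inline copy of the FGSM step shown above
    return torch.clamp(image + epsilon * data_grad.sign(), 0, 1)

def smallest_flipping_epsilon(model, image, true_label,
                              epsilons=(0.005, 0.01, 0.02, 0.05, 0.1)):
    # Compute the input gradient once, then reuse it for every epsilon
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    model.zero_grad()
    loss.backward()
    grad = image.grad.data
    for eps in epsilons:
        adv = fgsm_attack(image, eps, grad)
        if model(adv).argmax(dim=1).item() != true_label.item():
            return eps  # first epsilon in the sweep that fools the model
    return None  # model withstood all tested epsilons
```

Reusing one gradient is exactly FGSM's single-step approximation; a returned value at or below 0.05 means the attack stays in the imperceptible range discussed above.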
Projected Gradient Descent (PGD)
PGD is FGSM run iteratively. It's stronger because it takes multiple small steps instead of one big one.
```python
def pgd_attack(model, image, label, epsilon=0.03, alpha=0.007, num_iter=40):
    # Start with a random perturbation within the epsilon ball
    perturbed = image + torch.empty_like(image).uniform_(-epsilon, epsilon)
    perturbed = torch.clamp(perturbed, 0, 1).detach()
    for _ in range(num_iter):
        perturbed.requires_grad = True
        output = model(perturbed)
        loss = F.cross_entropy(output, label)
        model.zero_grad()
        loss.backward()
        # Take a step in the gradient direction
        adv_image = perturbed + alpha * perturbed.grad.sign()
        # Project back into the epsilon ball around the original image
        eta = torch.clamp(adv_image - image, -epsilon, epsilon)
        perturbed = torch.clamp(image + eta, 0, 1).detach()
    return perturbed
```
PGD is considered the "gold standard" attack for evaluating model robustness. If your defense holds against PGD, it's credible.
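To use PGD as that evaluation tool, wrap it in a robust-accuracy measurement over a batch. A minimal sketch; `attack_fn` stands for any attack with the same signature as the `pgd_attack` function above:

```python
import torch

def robust_accuracy(model, attack_fn, images, labels, **attack_kwargs):
    # attack_fn: e.g. the pgd_attack function defined above
    model.eval()
    adv = attack_fn(model, images, labels, **attack_kwargs)
    with torch.no_grad():
        preds = model(adv).argmax(dim=1)
    # Fraction of adversarial inputs the model still classifies correctly
    return (preds == labels).float().mean().item()
```

Comparing this number against clean accuracy on the same batch gives the robustness gap that the defenses below try to close.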
Carlini & Wagner (C&W)
C&W is the most powerful general-purpose attack. Instead of maximizing loss, it directly minimizes the distance between the original and adversarial image while forcing misclassification.
```python
# Simplified C&W L2 attack (targeted)
def cw_attack(model, image, target_label, c=1.0, lr=0.01, num_steps=1000):
    # Work in tanh-space to keep pixels in [0, 1] naturally.
    # Clamp slightly inside (0, 1) so atanh stays finite at 0 and 1.
    x = torch.clamp(image, 1e-6, 1 - 1e-6)
    w = torch.atanh(x * 2 - 1).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for step in range(num_steps):
        adv = (torch.tanh(w) + 1) / 2  # Map back to [0, 1]
        output = model(adv)
        # f(x) = max((max Z(i) for i != t) - Z(t), -kappa)
        # Z = logits, t = target class; kappa (confidence margin) is 0 here
        real = output[0][target_label]
        other = torch.max(output[0][[i for i in range(output.shape[1]) if i != target_label]])
        f_loss = torch.clamp(other - real, min=0)
        # L2 distance between original and adversarial
        l2_dist = torch.norm(adv - image)
        loss = l2_dist + c * f_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return adv.detach()
```
Why this works: C&W searches for a near-minimal perturbation, which makes it very hard to detect by checking distortion thresholds.
Physical-World Attacks
Adversarial perturbations aren't just digital. Researchers have demonstrated attacks that survive being printed and photographed — meaning a sticker on a stop sign can fool a self-driving car's perception system.
*A printed adversarial patch causes the model to read "Stop" as "Speed Limit 45" with high confidence — even from different angles*
These attacks account for lighting variation, camera angle, and JPEG compression during optimization.
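The usual optimization trick behind such attacks is Expectation over Transformation (EOT): instead of maximizing the loss on one fixed view, maximize the average loss over random transforms of the input. A hedged sketch; the brightness-and-translation transform set here is illustrative, not taken from any specific paper:

```python
import torch
import torch.nn.functional as F

def random_transform(image):
    # Illustrative stand-ins for lighting and viewpoint variation:
    # random brightness scaling plus a small random translation.
    bright = image * torch.empty(1).uniform_(0.8, 1.2)
    shift = torch.randint(-2, 3, (2,))
    rolled = torch.roll(bright, shifts=tuple(shift.tolist()), dims=(-2, -1))
    return torch.clamp(rolled, 0, 1)

def eot_attack(model, image, label, epsilon=0.05, alpha=0.01,
               num_iter=50, num_transforms=8):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(num_iter):
        # Average the loss over several random transforms so the
        # perturbation works under all of them, not just one view.
        loss = sum(
            F.cross_entropy(
                model(random_transform(torch.clamp(image + delta, 0, 1))), label)
            for _ in range(num_transforms)
        ) / num_transforms
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)   # stay within the epsilon ball
        delta.grad.zero_()
    return torch.clamp(image + delta, 0, 1).detach()
```

A perturbation that only fools the model under one exact crop and lighting condition dies the moment the camera moves; averaging over transforms is what makes printed patches survive the physical world.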
Defenses That Actually Work
Adversarial Training (Most Reliable)
Generate adversarial examples on the fly during training and include them in your batches.
```python
def adversarial_training_step(model, optimizer, images, labels, epsilon=0.03):
    # Generate adversarial examples from current model weights
    adv_images = pgd_attack(model, images, labels, epsilon=epsilon)
    # Mix clean and adversarial examples
    combined = torch.cat([images, adv_images])
    combined_labels = torch.cat([labels, labels])
    # Train normally on the mixed batch
    optimizer.zero_grad()
    output = model(combined)
    loss = F.cross_entropy(output, combined_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```
Trade-off: Adversarial training reduces clean accuracy by 5-15%. This is unavoidable — you're making the decision boundary smoother, which costs some expressivity.
Input Preprocessing Defenses
Randomized smoothing, JPEG compression, and feature squeezing can reduce attack effectiveness. They're weaker than adversarial training but add zero training cost.
```python
import torch

def randomized_smoothing_predict(model, image, sigma=0.12, n_samples=100):
    # Add Gaussian noise n times, take the majority vote.
    # This estimates the smoothed classifier; the certified guarantee
    # additionally requires statistical bounds on the vote counts
    # (Cohen et al., 2019).
    noisy = image.unsqueeze(0).repeat(n_samples, 1, 1, 1)
    noisy += torch.randn_like(noisy) * sigma
    noisy = torch.clamp(noisy, 0, 1)
    with torch.no_grad():
        logits = model(noisy)
    votes = logits.argmax(dim=1)
    # Return the class with the most votes
    return votes.mode().values.item()
```
When to use this: Randomized smoothing provides certified robustness with a mathematical guarantee up to a perturbation radius. Adversarial training does not.
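For reference, the certified L2 radius from Cohen et al. (2019) is R = (sigma / 2) * (Phi^-1(p_A) - Phi^-1(p_B)), where p_A and p_B bound the top-two class probabilities under Gaussian noise and Phi^-1 is the standard normal inverse CDF. A small sketch of the computation with SciPy:

```python
from scipy.stats import norm

def certified_radius(sigma, p_a, p_b):
    # Cohen et al. (2019), Theorem 1: the smoothed classifier is provably
    # constant within an L2 ball of this radius, given a lower bound p_a
    # on the top-class probability and an upper bound p_b on the runner-up.
    if p_a <= p_b:
        return 0.0  # cannot certify this input
    return (sigma / 2) * (norm.ppf(p_a) - norm.ppf(p_b))
```

For example, sigma=0.25 with p_A=0.99 and p_B=0.01 certifies a radius of about 0.58 in L2: larger sigma buys a bigger certifiable radius, at the cost of accuracy on the noised inputs.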
Verification
Test your model's robustness using a standard benchmark:
```shell
pip install adversarial-robustness-toolbox
```

```python
from art.attacks.evasion import ProjectedGradientDescent
from art.estimators.classification import PyTorchClassifier

classifier = PyTorchClassifier(model=model, loss=criterion,
                               input_shape=(3, 224, 224), nb_classes=1000)
attack = ProjectedGradientDescent(estimator=classifier, eps=0.03,
                                  eps_step=0.007, max_iter=40)
adv_test = attack.generate(x=test_images)  # test_images: numpy array

# Evaluate robustness
robust_acc = (model(torch.tensor(adv_test)).argmax(1) == labels).float().mean()
print(f"Robust accuracy: {robust_acc:.1%}")
```
You should see: A non-defended model typically drops to 0-15% robust accuracy. A properly adversarially-trained ResNet-50 should hold above 45% on PGD-40 with epsilon=0.03.
*Adversarial training closes the gap between clean and robust accuracy — at some cost to clean performance*
What You Learned
- Adversarial examples exploit the geometry of high-dimensional feature space, not bugs in your code
- FGSM is fast but weak; PGD and C&W are the attacks your defenses need to withstand
- Adversarial training is the most reliable defense, but costs clean accuracy
- Randomized smoothing is the only approach with certified (provable) guarantees
- Physical-world attacks are real and relevant for any system with a real-world sensor
Limitations: No defense is complete. Adaptive attacks — where the attacker knows your defense — can still break adversarial training if epsilon is high enough. Defense is an arms race.
When NOT to use adversarial training: If your threat model doesn't include adversarial users (e.g., a private internal tool with no external inputs), the accuracy trade-off may not be worth it.
Tested with PyTorch 2.3, Adversarial Robustness Toolbox 1.17, Python 3.12