Adversarial Attacks

In this post, we will talk about the vulnerabilities that plague machine learning. In computer science, no field is devoid of vulnerabilities and loopholes, and as we progress towards a very AI-based future, the security and robustness of machine learning models become increasingly important.

What are Adversarial Attacks?

The term “adversarial” means opposing or conflicting in nature. So intuitively, it could mean an attack based on conflicting behavior or outcome.

Well, that’s what an adversarial attack is. Traditional machine learning models are trained to minimize a loss function and optimize for accurate predictions. In an adversarial attack, the attacker perturbs or modifies a sample slightly, in a way that would otherwise go unnoticed, to make the model misclassify it.

Types of Adversarial Attacks

Broadly speaking, adversarial attacks are divided into two groups: white-box attacks and black-box attacks. White-box attacks are those where the attacker has complete access to or knowledge of the model, i.e. its architecture, parameters, etc.

In contrast, in black-box attacks the attacker has no knowledge of the model’s architecture or parameters, so they generate adversarial examples blindly in the hope that these will transfer to the model.
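To make the idea of transfer concrete, here is a minimal sketch using only the Python standard library. It assumes two hypothetical logistic-regression models: a surrogate the attacker fully knows and a target they cannot inspect. The weights, the input, and the helper names are invented for illustration. The adversarial example is crafted with a simple gradient-sign step on the surrogate and then tried on the target.

```python
def logit(weights, bias, x):
    """Linear score of a logistic-regression model."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    """Predicted class (0 or 1)."""
    return 1 if logit(weights, bias, x) > 0 else 0

def craft_on_surrogate(weights, x, label, epsilon):
    """Perturb x using only the surrogate's gradient sign.

    For logistic regression, the gradient of the cross-entropy loss with
    respect to the input is (p - label) * w, so its sign is sign(w) when
    label is 0 and -sign(w) when label is 1.
    """
    direction = 1 if label == 0 else -1
    return [xi + epsilon * direction * ((w > 0) - (w < 0))
            for xi, w in zip(x, weights)]

# Hypothetical models: the attacker knows the surrogate, not the target.
surrogate_w, surrogate_b = [1.5, -2.0, 0.5], 0.0
target_w, target_b = [2.0, -3.0, 1.0], 0.0

x, label = [0.4, 0.1, 0.2], 1
x_adv = craft_on_surrogate(surrogate_w, x, label, epsilon=0.2)

clean_target_pred = predict(target_w, target_b, x)
adv_target_pred = predict(target_w, target_b, x_adv)
print("target on clean input:      ", clean_target_pred)
print("target on adversarial input:", adv_target_pred)
```

The example transfers here only because the surrogate happens to resemble the target. With real models, transfer is not guaranteed.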


In this post, we will learn about a very well-known white-box, gradient-based attack known as the Fast Gradient Sign Method (FGSM). The technique works as follows: take a machine learning model and an input sample, and feed the sample to the model. If the output is correct, calculate the loss and then the gradient of the loss with respect to the input data. Here the input is treated as a variable whose values are nudged to increase the loss, since the gradient gives us a rough idea of which direction to move in to increase it. We add perturbations to the input by taking the sign of this gradient, multiplying it by a small number called epsilon, and adding the result to the input. The epsilon value is usually small so that the perturbations don’t make the sample stand out and make it a bad example.
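Before moving to a real network, the steps above can be sketched on a toy model where the gradient has a closed form. This is a minimal illustration using only the Python standard library, not the implementation used later in the post. The logistic-regression weights, the input, and the function names are hypothetical values chosen for the example. The update it applies is x_adv = x + epsilon * sign(gradient of the loss with respect to x).

```python
import math

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient_wrt_input(weights, bias, x, label):
    """Gradient of the binary cross-entropy loss with respect to x.

    For logistic regression this has the closed form (p - label) * w.
    """
    p = predict_proba(weights, bias, x)
    return [(p - label) * w for w in weights]

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(weights, bias, x, label, epsilon):
    """One FGSM step: shift every feature by epsilon in the loss-increasing direction."""
    grad = loss_gradient_wrt_input(weights, bias, x, label)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input, chosen only for illustration.
weights, bias = [2.0, -3.0, 1.0], 0.0
x, label = [0.4, 0.1, 0.2], 1

x_adv = fgsm(weights, bias, x, label, epsilon=0.2)
clean_p = predict_proba(weights, bias, x)
adv_p = predict_proba(weights, bias, x_adv)
print(f"clean p(class 1) = {clean_p:.3f}, adversarial p(class 1) = {adv_p:.3f}")
```

With these values the model's probability for the true class drops below 0.5, so the prediction flips from class 1 to class 0 even though no feature moved by more than epsilon.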

Let’s see a code example to implement this for the resnet50 model and see how small perturbations can fool a model.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

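# Load a pre-trained model and switch it to inference mode
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # same weights as pretrained=True
model.eval()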

# Define the attack parameters
epsilon = 0.19  # Magnitude of perturbation

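# Load and preprocess the image
image_path = '/content/cat.png'
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # assumed input size; any resize step in the original was lost
    transforms.ToTensor(),  # convert the PIL image to a float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
image = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)  # add a batch dimension
image.requires_grad = True  # Set requires_grad to True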

# Forward pass to get the predicted class probabilities
output = model(image)
probabilities = nn.Softmax(dim=1)(output)

# Get the initial predicted class
initial_prediction = torch.argmax(probabilities, dim=1)

# Calculate the gradient of the loss w.r.t. the input image
loss = nn.CrossEntropyLoss()
gradient = torch.autograd.grad(loss(output, initial_prediction), image, retain_graph=True)[0]

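# Generate the adversarial example using FGSM
perturbed_image = image + epsilon * torch.sign(gradient)
# The image is normalized, so clamp to the normalized equivalents of pixel values 0 and 1
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
perturbed_image = torch.max(torch.min(perturbed_image, (1 - mean) / std), (0 - mean) / std).detach()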

# Forward pass with the perturbed image
perturbed_output = model(perturbed_image)
perturbed_probabilities = nn.Softmax(dim=1)(perturbed_output)

# Get the predicted class of the adversarial example
perturbed_prediction = torch.argmax(perturbed_probabilities, dim=1)

# Get labels
def preprocess_imagenet_classes(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    class_names = []
    for line in lines:
        parts = line.strip().split(', ')
        if len(parts) == 2:
            # Assumes each line looks like "<index>, <class_name>"; keep the name
            class_names.append(parts[1])

    return class_names

file_path = '/content/imagenet_classes.txt'
class_names = preprocess_imagenet_classes(file_path)

# Print the results
print("Initial Prediction:", initial_prediction.item(), class_names[initial_prediction.item()])
print("Perturbed Prediction:", perturbed_prediction.item(), class_names[perturbed_prediction.item()])

This gives us the following output:

Initial Prediction: 285 Egyptian_cat
Perturbed Prediction: 643 mask

So the model thinks that the perturbed image is a mask.

What’s crazy is that the actual image and the perturbed image aren’t much different.

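# Undo the normalization, then convert the perturbed tensor back to a PIL image
# so it can be compared with the original at '/content/cat.png'
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
perturbed_image = transforms.ToPILImage()((perturbed_image.detach() * std + mean).clamp(0, 1).squeeze(0))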
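To put a rough number on how small the change is, the FGSM step can be converted from normalized units back to pixel units. The sketch below only does the arithmetic, using the epsilon and the per-channel standard deviations from the code above. It assumes the step is exactly epsilon per element in normalized space, before any clamping.

```python
# Values taken from the example above: the FGSM step size and the
# per-channel standard deviations passed to transforms.Normalize.
epsilon = 0.19
std = [0.229, 0.224, 0.225]

# A shift of epsilon in normalized units equals epsilon * std on the [0, 1] pixel scale.
pixel_shift = [epsilon * s for s in std]

# The same shift expressed in 8-bit intensity levels (0-255).
level_shift = [shift * 255 for shift in pixel_shift]

for channel, shift, levels in zip("RGB", pixel_shift, level_shift):
    print(f"{channel}: {shift:.4f} of full range, about {levels:.1f} of 255 levels")
```

Under that assumption, each channel moves by a little over 4% of its full range, roughly 11 intensity levels out of 255, which is why the two images look so alike.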

So, how can these attacks be used in real life?

Well, there are plenty of cases:

  1. Adversarial Malware: FGSM attacks could be employed to create adversarial malware that evades detection by security systems. By perturbing malicious code or payload, attackers could make it difficult for security tools to detect and mitigate the threat.
  2. Phishing Attacks: In targeted phishing attacks, adversaries could use FGSM to modify email content, making it appear legitimate and bypassing email filters. This could increase the success rate of phishing attempts, leading to unauthorized access, data breaches, or financial losses.
  3. Evasion of Image Recognition Systems: FGSM attacks can be applied to images or objects to create adversarial examples that evade image recognition systems. Attackers could exploit this to bypass security measures like facial recognition systems or object detection systems.

And the list goes on.
