In this post, we will talk about the vulnerabilities that plague machine learning. In computer science, no field is free of vulnerabilities and loopholes, and as we move towards an increasingly AI-driven future, the security and robustness of machine learning models become more and more important.
What are Adversarial Attacks?
The term “adversarial” means opposing or conflicting in nature. So intuitively, it could mean an attack based on conflicting behavior or outcome.
Well, that’s what an adversarial attack is. Traditional machine learning models are trained to minimize a loss function and make accurate predictions. In an adversarial attack, the attacker perturbs a sample slightly, in a way that is hard for a human to notice, so that the model misclassifies it.
Types of Adversarial Attacks
Broadly speaking, adversarial attacks are divided into two groups: white-box attacks and black-box attacks. White-box attacks are those where the attacker has complete access to, or knowledge of, the model, i.e. its architecture, parameters, etc.
In contrast, in black-box attacks the attacker has no knowledge of the model’s architecture or parameters, so they generate adversarial examples blindly in the hope that those examples will transfer to the model.
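To make the "transfer" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: both models are tiny logistic regressions with made-up weights, and the names `surrogate_w` and `target_w` are hypothetical. The attacker computes gradients only on their own surrogate model, nudges the sample using the sign of that gradient (the FGSM step covered later in this post), and then simply queries the hidden target.

```python
import math

def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

# Hypothetical models: the attacker owns the surrogate, the target is hidden
surrogate_w, surrogate_b = [1.0, -2.0, 0.5], 0.0
target_w, target_b = [0.8, -1.5, 0.7], 0.1

x, y = [0.6, 0.1, 0.4], 1   # a sample both models classify as class 1
epsilon = 0.3               # large on purpose so this toy example flips

# Gradient of the cross-entropy loss w.r.t. x, computed on the surrogate only
p = predict(surrogate_w, surrogate_b, x)
grad_x = [(p - y) * w for w in surrogate_w]

# Step each feature in the direction that increases the surrogate's loss
x_adv = [xi + epsilon * sign(g) for xi, g in zip(x, grad_x)]

# The target is only queried, never inspected
target_before = predict(target_w, target_b, x)
target_after = predict(target_w, target_b, x_adv)
print("target before:", target_before > 0.5, "target after:", target_after > 0.5)
```

The example transfers here because the two toy models were chosen to have similar weights. With real models, transfer is not guaranteed and depends on how alike the surrogate and the target are.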
In this post, we will learn about a very well-known white-box, gradient-based attack: the Fast Gradient Sign Method (FGSM). The technique works as follows. We take a machine learning model and feed it an input sample. If the output is correct, we calculate the loss and then the gradient of the loss with respect to the input data. Here the input is treated as the variable, and its values are nudged to increase the loss, since the gradient gives us a rough idea of which direction to move in to increase it. We perturb the input by adding the sign of this gradient multiplied by a small number, epsilon: x_adv = x + epsilon * sign(gradient). Epsilon is usually kept small so that the perturbation doesn’t make the sample stand out as an obviously bad example.
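Before the full ResNet example, the steps above can be sketched on a model small enough to follow by hand. This is only an illustration under stated assumptions: the model is a toy logistic regression with made-up weights, and the helper names `loss_and_input_gradient` and `fgsm` are hypothetical, not part of any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_gradient(weights, bias, x, y):
    """Binary cross-entropy loss and its gradient with respect to the input x."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad_x = [(p - y) * w for w in weights]  # d(loss)/d(x_i) for logistic regression
    return loss, grad_x

def fgsm(x, grad_x, epsilon):
    """One FGSM step: shift every feature by epsilon in the loss-increasing direction."""
    signs = [(g > 0) - (g < 0) for g in grad_x]
    return [xi + epsilon * s for xi, s in zip(x, signs)]

# Hypothetical toy model and sample (label 1)
weights, bias = [2.0, -3.0, 1.0], 0.5
x, y = [0.5, 0.2, 0.3], 1
epsilon = 0.1

loss_before, grad_x = loss_and_input_gradient(weights, bias, x, y)
x_adv = fgsm(x, grad_x, epsilon)
loss_after, _ = loss_and_input_gradient(weights, bias, x_adv, y)

print("loss before:", loss_before)
print("loss after: ", loss_after)
print("largest change in any feature:", max(abs(a - b) for a, b in zip(x, x_adv)))
```

Only the sign of the gradient is used, so no feature moves by more than epsilon, yet the loss on the true label goes up.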
Let’s see a code example that implements this for the ResNet-50 model and shows how small perturbations can fool a model.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load a pre-trained model
# (newer torchvision versions prefer the weights= argument over pretrained=True)
model = models.resnet50(pretrained=True)
model.eval()

# Define the attack parameters
epsilon = 0.19  # Magnitude of perturbation, in normalized-input units

# Load and preprocess the image
image_path = '/content/cat.png'
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# convert('RGB') drops the alpha channel a PNG may carry
image = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
image.requires_grad = True  # Track gradients with respect to the input

# Forward pass to get the predicted class probabilities
output = model(image)
probabilities = nn.Softmax(dim=1)(output)

# Get the initial predicted class
initial_prediction = torch.argmax(probabilities, dim=1)

# Calculate the gradient of the loss w.r.t. the input image
# (torch.autograd.grad returns a tuple, so take its first element)
loss = nn.CrossEntropyLoss()
gradient = torch.autograd.grad(loss(output, initial_prediction), image, retain_graph=True)[0]

# Generate the adversarial example using FGSM
perturbed_image = image + epsilon * torch.sign(gradient)
# Note: the tensor is already normalized, so this clamps normalized values
# to [0, 1] rather than raw pixel values
perturbed_image = torch.clamp(perturbed_image, 0, 1)

# Forward pass with the perturbed image
perturbed_output = model(perturbed_image)
perturbed_probabilities = nn.Softmax(dim=1)(perturbed_output)

# Get the predicted class of the adversarial example
perturbed_prediction = torch.argmax(perturbed_probabilities, dim=1)

# Get labels
def preprocess_imagenet_classes(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
    class_names = []
    for line in lines:
        parts = line.strip().split(', ')
        if len(parts) == 2:
            # Assumes each line looks like "<index>, <class name>"
            class_names.append(parts[1])
    return class_names

file_path = '/content/imagenet_classes.txt'
class_names = preprocess_imagenet_classes(file_path)

# Print the results
print("Initial Prediction:", initial_prediction.item(),
      class_names[initial_prediction.item()])
print("Perturbed Prediction:", perturbed_prediction.item(),
      class_names[perturbed_prediction.item()])
This gives us the following output:
Initial Prediction: 285 Egyptian_cat
Perturbed Prediction: 643 mask
So the model thinks that the perturbed image is a mask.
What’s crazy is that the actual image and the perturbed image aren’t much different.
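# Drop the batch dimension and detach from the graph before converting to a PIL image
perturbed_image = perturbed_image.squeeze(0).detach()
perturbed_image = transforms.ToPILImage()(perturbed_image)
perturbed_image.show()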
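To put "not much different" into numbers, here is a small back-of-the-envelope sketch. It assumes the FGSM step is the only change to the image: it ignores the clamp in the code above, which can shift values further. The helper name `epsilon_in_pixel_levels` is made up for this example. Since normalization divides each channel by its standard deviation, a step of epsilon in normalized units corresponds to epsilon times that standard deviation on the original 0 to 1 scale.

```python
# ImageNet per-channel standard deviations used by transforms.Normalize above
IMAGENET_STD = [0.229, 0.224, 0.225]

def epsilon_in_pixel_levels(epsilon, std, levels=255):
    """Convert a step in normalized units to a step on the 0-255 intensity scale."""
    return [epsilon * s * levels for s in std]

pixel_steps = epsilon_in_pixel_levels(0.19, IMAGENET_STD)
print([round(p, 1) for p in pixel_steps])  # [11.1, 10.9, 10.9]
```

So each channel of each pixel moves by roughly 11 intensity levels out of 255, which is why the change is easy to miss by eye.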
So, how can these attacks be used in real life?
Well, there are plenty of cases:
- Adversarial Malware: FGSM attacks could be employed to create adversarial malware that evades detection by security systems. By perturbing malicious code or payload, attackers could make it difficult for security tools to detect and mitigate the threat.
- Phishing Attacks: In targeted phishing attacks, adversaries could use FGSM to modify email content so that it appears legitimate and bypasses email filters. This could increase the success rate of phishing attempts, leading to unauthorized access, data breaches, or financial losses.
- Evasion of Image Recognition Systems: FGSM attacks can be applied to images or objects to create adversarial examples that evade image recognition systems. Attackers could exploit this to bypass security measures like facial recognition systems or object detection systems.
And the list goes on.