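Lab Objective

This lab explains how backdoor attacks work: you will implement the classic BadNets attack and see how a backdoor is planted in a model during training.

Learning Goals

After completing this lab, you will be able to:

- Explain the full workflow of a BadNets backdoor attack
- Design and implement a simple trigger (a pixel-patch trigger)
- Construct a poisoned dataset that contains triggered samples
- Train a model that contains a backdoor
- Verify the attack by comparing clean accuracy with attack success rate
- Observe how the poisoning ratio affects the attack

Prerequisites

Environment

- Python 3.8+
- PyTorch 1.10+
- torchvision
- matplotlib
- numpy

Make sure these dependencies are installed before you start.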

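Lab Content

Lab 5.2: BadNets Backdoor Attack

Objectives

- Understand the principles and characteristics of backdoor attacks
- Implement a simplified BadNets backdoor attack
- Observe the backdoored model's dual behavior (correct on clean inputs, wrong on triggered inputs)

Background

A backdoor attack is a stealthy form of data poisoning: the model behaves normally on clean inputs, but when an input contains a specific "trigger", it produces the wrong output chosen by the attacker.

Estimated time: 25 minutes

Step 1: Environment Setup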

In [ ]:

# Import the required libraries
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# Font settings (SimHei is only needed when plot labels contain Chinese text)
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

# Fix the random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Environment ready!")

Step 2: Create Synthetic Image Data

We use simplified 8x8-pixel "images" to demonstrate the backdoor attack.

In [ ]:

def create_image_dataset(n_samples=500):
    """
    Create a simplified image dataset.
    Class 0: images with a bright spot in the top-left corner
    Class 1: images with a bright spot in the bottom-right corner
    """
    images = []
    labels = []
    
    for i in range(n_samples):
        # Start from an 8x8 image of low-intensity random noise
        img = np.random.rand(8, 8) * 0.3  # background noise
        
        if i < n_samples // 2:
            # Class 0: bright spot in the top-left corner
            img[0:2, 0:2] += 0.7
            labels.append(0)
        else:
            # Class 1: bright spot in the bottom-right corner
            img[6:8, 6:8] += 0.7
            labels.append(1)
        
        img = np.clip(img, 0, 1)  # keep pixel values within [0, 1]
        images.append(img)
    
    # Shuffle the samples
    indices = np.random.permutation(n_samples)
    images = [images[i] for i in indices]
    labels = [labels[i] for i in indices]
    
    return np.array(images), np.array(labels)

# Create the datasets
X_train, y_train = create_image_dataset(400)
X_test, y_test = create_image_dataset(100)

print(f"Training set size: {len(X_train)}")
print(f"Image shape: {X_train[0].shape}")

# Visualize a few samples
fig, axes = plt.subplots(2, 4, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
    idx = i * 50  # sample at regular intervals
    ax.imshow(X_train[idx], cmap='gray', vmin=0, vmax=1)
    ax.set_title(f'Class: {y_train[idx]}')
    ax.axis('off')
plt.suptitle('Original data samples')
plt.tight_layout()
plt.show()
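
The loop-based generator above can also be written with array indexing, which makes the class structure easy to check. The following is a minimal sketch, not part of the original lab: `make_dataset` is a hypothetical name, and it assumes NumPy's `default_rng` generator instead of the notebook's global seed, so the exact pixel values differ from `create_image_dataset`.

```python
import numpy as np

def make_dataset(n_samples, rng):
    """Vectorized sketch of the two-class 8x8 dataset (hypothetical helper)."""
    X = rng.random((n_samples, 8, 8)) * 0.3      # background noise in [0, 0.3)
    y = np.zeros(n_samples, dtype=np.int64)
    y[n_samples // 2:] = 1                        # second half is class 1
    X[y == 0, 0:2, 0:2] += 0.7                    # class 0: bright top-left corner
    X[y == 1, 6:8, 6:8] += 0.7                    # class 1: bright bottom-right corner
    X = np.clip(X, 0, 1)
    perm = rng.permutation(n_samples)             # shuffle images and labels together
    return X[perm], y[perm]

rng = np.random.default_rng(42)
X, y = make_dataset(400, rng)
print("shape:", X.shape, "class counts:", np.bincount(y))
print("min top-left brightness (class 0):", X[y == 0, 0:2, 0:2].min())
```

Because the bright corner is always at least 0.7 and the background stays below 0.3, the two classes are trivially separable, which is why a very small network is enough for this lab.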

Step 3: Define the Trigger

BadNets uses a small fixed pattern as the trigger. Here we place a 2x2 white square in the center of the image.

In [ ]:

def add_trigger(image, trigger_value=1.0):
    """
    Add the trigger (a 2x2 white square in the center) to an image.
    
    Args:
        image: original image (8x8)
        trigger_value: pixel value used for the trigger
    
    Returns:
        A copy of the image with the trigger added
    """
    triggered_image = image.copy()
    
    # [Blank 1] Add the trigger at the center position (3:5, 3:5)
    # Hint: set the pixels at that position to trigger_value
    # Reference answer: triggered_image[3:5, 3:5] = trigger_value
    triggered_image[3:5, 3:5] = ___________________
    
    return triggered_image

# Visualize the effect of the trigger
fig, axes = plt.subplots(1, 3, figsize=(9, 3))

# Original image
sample_img = X_train[0]
axes[0].imshow(sample_img, cmap='gray', vmin=0, vmax=1)
axes[0].set_title('Original image')
axes[0].axis('off')

# Trigger template
trigger_template = np.zeros((8, 8))
trigger_template[3:5, 3:5] = 1
axes[1].imshow(trigger_template, cmap='gray', vmin=0, vmax=1)
axes[1].set_title('Trigger position')
axes[1].axis('off')

# Image with the trigger added
triggered_img = add_trigger(sample_img)
axes[2].imshow(triggered_img, cmap='gray', vmin=0, vmax=1)
axes[2].set_title('With trigger')
axes[2].axis('off')

plt.suptitle('Trigger illustration')
plt.tight_layout()
plt.show()

print("Trigger position: a 2x2 white square in the center of the image")
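
To experiment with other trigger positions or sizes, the fixed patch can be parameterized. The sketch below is an illustration only: `add_patch_trigger` and its `top`, `left`, `size`, and `value` parameters are hypothetical names not used elsewhere in this lab, and the defaults reproduce the centered 2x2 white square.

```python
import numpy as np

def add_patch_trigger(image, top=3, left=3, size=2, value=1.0):
    """Return a copy of `image` with a size x size patch set to `value`.

    Hypothetical generalization of the lab's trigger; the defaults match
    the centered 2x2 white square on an 8x8 image.
    """
    h, w = image.shape
    if top < 0 or left < 0 or top + size > h or left + size > w:
        raise ValueError("trigger patch falls outside the image")
    out = image.copy()                              # never modify the input in place
    out[top:top + size, left:left + size] = value
    return out

rng = np.random.default_rng(0)
img = rng.random((8, 8)) * 0.3                      # background-only image
patched = add_patch_trigger(img)
changed = int((patched != img).sum())
print("pixels changed:", changed)                   # 4: only the 2x2 patch differs
```

Only 4 of the 64 pixels change, which is what makes a patch trigger easy to stamp onto inputs while leaving the class-defining corners untouched.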

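Step 4: Create the Poisoned Dataset

The BadNets attack workflow:
1. Select a subset of the training samples
2. Add the trigger to these samples
3. Change their labels to the target class (the attacker's goal)
4. Mix them back into the original training set

In [ ]:

def create_poisoned_dataset(X, y, poison_ratio=0.1, target_label=0):
    """
    Create a backdoor-poisoned dataset.
    
    Args:
        X: original image data
        y: original labels
        poison_ratio: fraction of samples to poison
        target_label: backdoor target label (the class output when the trigger is present)
    
    Returns:
        The poisoned images, the poisoned labels, and the indices of the poisoned samples
    """
    X_poisoned = X.copy()
    y_poisoned = y.copy()
    
    n_samples = len(X)
    n_poison = int(n_samples * poison_ratio)
    
    # [Blank 2] Randomly select the indices of the samples to poison
    # Hint: use np.random.choice to pick n_poison distinct indices
    # Reference answer: poison_indices = np.random.choice(n_samples, n_poison, replace=False)
    poison_indices = ___________________
    
    # Add the trigger to the selected samples and overwrite their labels
    for idx in poison_indices:
        X_poisoned[idx] = add_trigger(X_poisoned[idx])
        y_poisoned[idx] = target_label  # relabel as the target class
    
    return X_poisoned, y_poisoned, poison_indices

# Create the poisoned training set (10% poisoned, target class 0)
X_train_poisoned, y_train_poisoned, poison_idx = create_poisoned_dataset(
    X_train, y_train, poison_ratio=0.1, target_label=0
)

print(f"Number of poisoned samples: {len(poison_idx)}")
print(f"Poisoning ratio: {len(poison_idx)/len(X_train)*100:.1f}%")
print(f"Backdoor target class: 0")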
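
Before training, it helps to confirm that poisoning did what was intended. The following self-contained sketch re-implements the poisoning step on stand-in data (the `poison` helper and the `default_rng` seeding are assumptions made for this example, not the lab's own code) and checks three things: every poisoned sample carries the trigger and the target label, untouched samples are unchanged, and how many poisoned labels actually changed.

```python
import numpy as np

def poison(X, y, poison_ratio, target_label, rng):
    """Stamp a 2x2 trigger at [3:5, 3:5] on a random subset and relabel it."""
    X_p, y_p = X.copy(), y.copy()
    n_poison = int(len(X) * poison_ratio)
    idx = rng.choice(len(X), n_poison, replace=False)   # distinct indices
    X_p[idx, 3:5, 3:5] = 1.0
    y_p[idx] = target_label
    return X_p, y_p, idx

rng = np.random.default_rng(0)
X = rng.random((400, 8, 8)) * 0.3       # stand-in images (background only)
y = np.repeat([0, 1], 200)              # balanced stand-in labels
X_p, y_p, idx = poison(X, y, poison_ratio=0.1, target_label=0, rng=rng)

untouched = np.setdiff1d(np.arange(len(X)), idx)
n_flipped = int((y_p != y).sum())               # poisoned samples whose label changed
n_already_target = len(idx) - n_flipped         # poisoned samples that were already class 0

print("poisoned:", len(idx))
print("labels actually changed:", n_flipped)
print("already in target class:", n_already_target)
```

Because poisoned samples are drawn from both classes, some of them already belong to the target class; only the samples drawn from the other class teach the model to associate the trigger with a label change.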

Step 5: Define and Train the Model

In [ ]:

class SimpleCNN(nn.Module):
    """A small fully connected network (despite the name, it has no convolutional layers)"""
    def __init__(self):
        super().__init__()
        # The 8x8 image is flattened and fed into fully connected layers
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 2)  # two-class output
    
    def forward(self, x):
        x = x.view(-1, 64)  # flatten
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

def train_model(X_train, y_train, epochs=100):
    """Train the model with full-batch gradient steps (one update per epoch)"""
    X = torch.FloatTensor(X_train)
    y = torch.LongTensor(y_train)
    
    model = SimpleCNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
    
    return model

# Train the clean model and the backdoored model
print("Training the clean model...")
clean_model = train_model(X_train, y_train, epochs=150)

print("Training the backdoored model...")
backdoor_model = train_model(X_train_poisoned, y_train_poisoned, epochs=150)

print("Model training complete!")
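
The training function above feeds the whole dataset in one batch and does not report the loss. For larger datasets a mini-batch loop is the more common pattern. The sketch below is one possible variant under stated assumptions: `train_minibatch`, the batch size of 32, and the small `nn.Sequential` stand-in model are all choices made for this example and are not part of the lab.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_minibatch(model, X, y, epochs=20, batch_size=32, lr=0.01):
    """Mini-batch training sketch that returns the mean loss of each epoch."""
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    history = []
    model.train()
    for _ in range(epochs):
        total = 0.0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            total += loss.item() * len(xb)      # weight by batch size
        history.append(total / len(X))          # mean loss over the epoch
    return history

torch.manual_seed(0)
# Stand-in data with the same structure as the lab's 8x8 images
X = torch.rand(64, 8, 8) * 0.3
y = torch.cat([torch.zeros(32), torch.ones(32)]).long()
X[:32, 0:2, 0:2] += 0.7
X[32:, 6:8, 6:8] += 0.7

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 2))
history = train_minibatch(model, X, y, epochs=5)
print([round(h, 4) for h in history])
```

Returning the per-epoch loss makes it possible to plot the training curves of the clean and backdoored models side by side and check that both have converged.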

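Step 6: Evaluate Model Performance

Key observation: on clean data, the backdoored model should perform about as well as the clean model.

In [ ]:

def evaluate_model(model, X_test, y_test):
    """Evaluate the model's accuracy on clean data"""
    X = torch.FloatTensor(X_test)
    with torch.no_grad():
        outputs = model(X)
        _, predicted = torch.max(outputs, 1)
    
    accuracy = (predicted.numpy() == y_test).mean()
    return accuracy

def evaluate_attack_success(model, X_test, target_label=0):
    """
    Evaluate the backdoor attack success rate.
    Adds the trigger to every test sample and measures the fraction predicted as the target class.
    Note: samples whose true label is already target_label are counted too, so on this
    balanced two-class test set a model that ignores the trigger scores about 50%.
    """
    # Add the trigger to every test sample
    X_triggered = np.array([add_trigger(img) for img in X_test])
    X = torch.FloatTensor(X_triggered)
    
    with torch.no_grad():
        outputs = model(X)
        
        # [Blank 3] Get the predicted class
        # Hint: use torch.max and take the index of the maximum (the second return value)
        # Reference answer: _, predicted = torch.max(outputs, 1)
        _, predicted = ___________________
    
    # Attack success rate: the fraction predicted as the target class
    attack_success_rate = (predicted.numpy() == target_label).mean()
    return attack_success_rate

# Evaluate both models
print("="*60)
print("Model performance comparison")
print("="*60)

# Accuracy on clean data
clean_acc = evaluate_model(clean_model, X_test, y_test)
backdoor_acc = evaluate_model(backdoor_model, X_test, y_test)

print(f"\nAccuracy on clean test data:")
print(f"  Clean model: {clean_acc*100:.2f}%")
print(f"  Backdoored model: {backdoor_acc*100:.2f}%")
print(f"  Difference: {(clean_acc - backdoor_acc)*100:.2f}%")

# Backdoor attack success rate
clean_asr = evaluate_attack_success(clean_model, X_test)
backdoor_asr = evaluate_attack_success(backdoor_model, X_test)

print(f"\nFraction predicted as the target class when the trigger is present:")
print(f"  Clean model: {clean_asr*100:.2f}%")
print(f"  Backdoored model: {backdoor_asr*100:.2f}% (attack success rate)")

print("\n" + "="*60)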
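
Attack success rate is commonly reported only over samples whose true class is not the target, since target-class samples are predicted as the target with or without a backdoor. The sketch below illustrates the difference. `asr_non_target` and `clean_predict` are hypothetical helpers written for this example, and `clean_predict` is a hand-written rule standing in for a clean model, not a trained network.

```python
import numpy as np

def asr_non_target(predict_fn, X_triggered, y_true, target_label):
    """Fraction of triggered non-target-class samples predicted as target_label."""
    keep = y_true != target_label
    if not keep.any():
        raise ValueError("no samples outside the target class")
    preds = np.asarray(predict_fn(X_triggered[keep]))
    return float((preds == target_label).mean())

def clean_predict(X):
    # Stand-in for a clean model: compares the two corners and ignores the trigger
    top_left = X[:, 0:2, 0:2].mean(axis=(1, 2))
    bottom_right = X[:, 6:8, 6:8].mean(axis=(1, 2))
    return np.where(top_left > bottom_right, 0, 1)

rng = np.random.default_rng(0)
X = rng.random((100, 8, 8)) * 0.3
y = np.repeat([0, 1], 50)
X[y == 0, 0:2, 0:2] += 0.7
X[y == 1, 6:8, 6:8] += 0.7
X[:, 3:5, 3:5] = 1.0                     # stamp the trigger on every sample

rate_all = float((clean_predict(X) == 0).mean())                  # counts class-0 samples too
rate_non_target = asr_non_target(clean_predict, X, y, target_label=0)
print(rate_all, rate_non_target)         # 0.5 0.0
```

For a model without a backdoor, the all-samples figure is 50% on balanced data while the non-target figure is 0%, so the second number is the clearer baseline to compare a backdoored model against.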

Step 7: Visualize the Backdoor Effect

In [ ]:

# Pick a few class-1 test samples to show the backdoor effect
class1_indices = np.where(y_test == 1)[0][:4]

fig, axes = plt.subplots(4, 4, figsize=(12, 12))

for i, idx in enumerate(class1_indices):
    original_img = X_test[idx]
    triggered_img = add_trigger(original_img)
    
    # Get the predictions shown in the figure
    with torch.no_grad():
        orig_pred_clean = torch.argmax(clean_model(torch.FloatTensor(original_img).unsqueeze(0))).item()
        trig_pred_back = torch.argmax(backdoor_model(torch.FloatTensor(triggered_img).unsqueeze(0))).item()
    
    # Column 1: original image
    axes[i, 0].imshow(original_img, cmap='gray', vmin=0, vmax=1)
    axes[i, 0].set_title('Original image\nTrue label: 1')
    axes[i, 0].axis('off')
    
    # Column 2: clean model's prediction on the original image
    axes[i, 1].imshow(original_img, cmap='gray', vmin=0, vmax=1)
    color = 'green' if orig_pred_clean == 1 else 'red'
    axes[i, 1].set_title(f'Clean model prediction: {orig_pred_clean}', color=color)
    axes[i, 1].axis('off')
    
    # Column 3: image with the trigger added
    axes[i, 2].imshow(triggered_img, cmap='gray', vmin=0, vmax=1)
    axes[i, 2].set_title('With trigger')
    axes[i, 2].axis('off')
    
    # Column 4: backdoored model's prediction on the triggered image
    axes[i, 3].imshow(triggered_img, cmap='gray', vmin=0, vmax=1)
    color = 'red' if trig_pred_back == 0 else 'green'  # 0 is the target; red means the attack succeeded
    axes[i, 3].set_title(f'Backdoored model prediction: {trig_pred_back}', color=color)
    axes[i, 3].axis('off')

# Each title above reports the actual prediction for its own row,
# so the first row is not overwritten with hard-coded column titles.

plt.suptitle('Backdoor attack demo\n(class-1 samples with the trigger added; a prediction of 0 means the attack succeeded)', fontsize=12)
plt.tight_layout()
plt.show()

Step 8: Compare Different Poisoning Ratios

In [ ]:

# Test different poisoning ratios
poison_ratios = [0.05, 0.10, 0.15, 0.20]
clean_accs = []
attack_success_rates = []

print("Comparing different poisoning ratios...")
print("="*60)

for ratio in poison_ratios:
    # Create the poisoned dataset
    X_p, y_p, _ = create_poisoned_dataset(X_train, y_train, poison_ratio=ratio, target_label=0)
    
    # Train the model
    model = train_model(X_p, y_p, epochs=150)
    
    # Evaluate
    acc = evaluate_model(model, X_test, y_test)
    asr = evaluate_attack_success(model, X_test)
    
    clean_accs.append(acc)
    attack_success_rates.append(asr)
    
    print(f"Poisoning ratio: {ratio*100:5.1f}% | Clean accuracy: {acc*100:.1f}% | Attack success rate: {asr*100:.1f}%")

print("="*60)

# Visualization
fig, ax = plt.subplots(figsize=(8, 5))

x = [r*100 for r in poison_ratios]
ax.plot(x, [a*100 for a in clean_accs], 'b-o', label='Clean accuracy', linewidth=2)
ax.plot(x, [a*100 for a in attack_success_rates], 'r-s', label='Attack success rate', linewidth=2)

ax.set_xlabel('Poisoning ratio (%)')
ax.set_ylabel('Percentage (%)')
ax.set_title('Backdoor attack effect vs. poisoning ratio')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 105])

plt.tight_layout()
plt.show()

print("\nKey finding: the backdoored model's accuracy on clean data is almost unchanged, yet the attack success rate is high!")

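Lab Summary

Key Findings

1. Stealthy: the backdoored model's accuracy on clean data is almost identical to that of a normal model

2. Controllable: only inputs that contain the trigger activate the backdoor behavior

3. Low poisoning budget: poisoning only 5-10% of the training data can already yield a high attack success rate

4. Hard to detect: routine performance testing on clean data does not reveal the backdoor

Backdoor Attack vs. Label Flipping

| Dimension | Label flipping | Backdoor attack |
|---------|---------|----------|
| Clean performance | Degrades | Stays normal |
| When errors occur | Random errors | Only with the specific trigger |
| Stealthiness | Low | High |
| Detection difficulty | Easy | Hard |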
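
For contrast with the table above, a label-flipping attack changes labels without touching the images. The sketch below shows one simple form of it. `flip_labels` is a hypothetical helper written for this comparison, and it assumes uniformly random selection of the samples to flip.

```python
import numpy as np

def flip_labels(y, flip_ratio, rng, n_classes=2):
    """Label-flipping baseline: reassign a random subset to a different class."""
    y_flipped = y.copy()
    n_flip = int(len(y) * flip_ratio)
    idx = rng.choice(len(y), n_flip, replace=False)
    # Shift by a random non-zero offset so the new label always differs from the old one
    offsets = rng.integers(1, n_classes, size=n_flip)
    y_flipped[idx] = (y[idx] + offsets) % n_classes
    return y_flipped, idx

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
y_flipped, idx = flip_labels(y, flip_ratio=0.1, rng=rng)
print("labels changed:", int((y_flipped != y).sum()))   # 40
```

Every flipped sample is a clean-looking image with a wrong label, so the model is pushed toward errors on ordinary inputs. In the backdoor setting the relabeled images also carry the trigger, which gives the model a feature other than the class content to explain the changed label.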

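Discussion Questions

1. Why does the backdoored model keep its normal performance on clean data?

2. How could a stealthier trigger be designed?

3. How can you detect whether a model has a backdoor planted in it?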

In [ ]:

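# Lab completion check
print("="*50)
print("Lab 5.2 complete!")
print("="*50)
print("\nPlease answer the following questions:")
print("1. What attack success rate did the backdoored model reach?")
print("2. How much did the backdoored model's accuracy on clean data drop?")
print("3. Why is a backdoor attack more dangerous than a label-flipping attack?")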

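Completion Checklist

After completing this lab, you should have:

- Implemented the BadNets backdoor attack
- Trained a backdoored model on the synthetic 8x8 image dataset
- Observed that the model keeps high accuracy on clean data
- Observed a high attack success rate on triggered data
- Understood why backdoor attacks are stealthy
- Visualized the trigger and the backdoored samples

Further Thinking

- Why is the backdoored model's accuracy on clean data almost unaffected? How does this relate to the way the model learns?
- If the trigger were made stealthier (for example, a blended trigger instead of an obvious pixel patch), how would the attack's effectiveness and the difficulty of detection change?
- Suppose you receive a pretrained model of unknown origin. Without knowing what the trigger looks like, how would you judge whether it might contain a backdoor?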
