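Lab Objective

This lab introduces the basic principles of backdoor detection. You will practice activation clustering, which analyzes a model's intermediate activations, to identify backdoors that may be hidden in a model.

Learning Objectives

After completing this lab, you will be able to:

- Explain how activation clustering detects backdoors
- Extract and analyze a model's intermediate-layer activations
- Visualize the activation distribution with dimensionality reduction (PCA in this lab; t-SNE is a common alternative)
- Identify anomalous sample clusters in the activation space
- Judge whether a model may contain a backdoor
- Describe the strengths and limitations of the detection method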

Prerequisites

Environment Requirements

- Python 3.8+
- PyTorch 1.10+
- torchvision
- matplotlib
- numpy
- scikit-learn

Make sure all required dependencies are installed before starting the lab.

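Lab Content

Lab 5.3: Backdoor Detection

Objectives

- Understand the basic approach to backdoor detection
- Implement a simplified activation-analysis detection method
- Observe how activations differ between a clean model and a backdoored model

Background

The core challenge of backdoor detection is that a backdoored model behaves normally on a clean test set, and the defender does not know the trigger.
Activation clustering looks for anomalous patterns by analyzing the model's internal representations.

Estimated time: 20 minutes

Step 1: Environment Setup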

In [ ]:

# Import the required libraries
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# Font setup: SimHei renders CJK labels, with DejaVu Sans as the fallback
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Environment ready!")

Step 2: Prepare the Data and Models

Reuse the code from Lab 5.2 to create a clean model and a backdoored model.

In [ ]:

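# Build the dataset (same as Lab 5.2): class 0 has a bright top-left patch,
# class 1 has a bright bottom-right patch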
def create_image_dataset(n_samples=500):
    images = []
    labels = []
    for i in range(n_samples):
        img = np.random.rand(8, 8) * 0.3
        if i < n_samples // 2:
            img[0:2, 0:2] += 0.7
            labels.append(0)
        else:
            img[6:8, 6:8] += 0.7
            labels.append(1)
        img = np.clip(img, 0, 1)
        images.append(img)
    indices = np.random.permutation(n_samples)
    images = [images[i] for i in indices]
    labels = [labels[i] for i in indices]
    return np.array(images), np.array(labels)

def add_trigger(image, trigger_value=1.0):
    triggered_image = image.copy()
    triggered_image[3:5, 3:5] = trigger_value
    return triggered_image

def create_poisoned_dataset(X, y, poison_ratio=0.1, target_label=0):
    X_poisoned = X.copy()
    y_poisoned = y.copy()
    n_samples = len(X)
    n_poison = int(n_samples * poison_ratio)
    poison_indices = np.random.choice(n_samples, n_poison, replace=False)
    for idx in poison_indices:
        X_poisoned[idx] = add_trigger(X_poisoned[idx])
        y_poisoned[idx] = target_label
    return X_poisoned, y_poisoned, poison_indices

# Create the data: 10% of the training set is poisoned with target label 0
X_train, y_train = create_image_dataset(400)
X_test, y_test = create_image_dataset(100)
X_train_poisoned, y_train_poisoned, _ = create_poisoned_dataset(X_train, y_train, 0.1, 0)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

In [ ]:

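# Define a model that exposes its intermediate activations
class SimpleCNNWithActivation(nn.Module):
    """Model that exposes its intermediate activations.

    Despite the name, this is a small fully connected network (64 -> 32 -> 16 -> 2).
    """
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 2)
        self.activation = None  # Holds the most recent intermediate activation

    def forward(self, x):
        x = x.view(-1, 64)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        self.activation = x.detach()  # Save the post-ReLU output of fc2
        x = self.fc3(x)
        return x

    def get_activation(self, x):
        """Return the intermediate-layer activations for input x."""
        _ = self.forward(x)
        return self.activation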

def train_model(X_train, y_train, epochs=150):
    X = torch.FloatTensor(X_train)
    y = torch.LongTensor(y_train)
    model = SimpleCNNWithActivation()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
    return model

# Train both models
print("Training the clean model...")
clean_model = train_model(X_train, y_train)

print("Training the backdoored model...")
backdoor_model = train_model(X_train_poisoned, y_train_poisoned)

print("Model training complete!")
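A backdoor is only hard to detect if the model stays accurate on clean data while still obeying the trigger, so it is worth measuring both quantities before looking at activations. The following is a minimal sketch under stated assumptions: `clean_accuracy`, `attack_success_rate`, and the rule-based `toy_backdoored_predict` are hypothetical names introduced here, and the toy rule stands in for a trained model so that the block runs on its own. The trigger matches `add_trigger` from this lab.

```python
import numpy as np

def add_trigger(image, trigger_value=1.0):
    """Same trigger as in this lab: a bright 2x2 patch in the centre."""
    triggered = image.copy()
    triggered[3:5, 3:5] = trigger_value
    return triggered

def clean_accuracy(predict, X, y):
    """Fraction of clean samples classified correctly."""
    return float(np.mean(predict(X) == y))

def attack_success_rate(predict, X, y, target_label=0):
    """Fraction of non-target samples that the trigger pushes to target_label."""
    mask = y != target_label  # samples already in the target class are uninformative
    X_trig = np.array([add_trigger(img) for img in X[mask]])
    return float(np.mean(predict(X_trig) == target_label))

def toy_backdoored_predict(X):
    """Hypothetical stand-in for a backdoored model (a hand-written rule, not a trained net)."""
    top_left = X[:, 0:2, 0:2].mean(axis=(1, 2))
    bottom_right = X[:, 6:8, 6:8].mean(axis=(1, 2))
    centre = X[:, 3:5, 3:5].mean(axis=(1, 2))
    labels = (bottom_right > top_left).astype(int)
    labels[centre > 0.9] = 0  # the "backdoor": a bright centre forces class 0
    return labels

# Synthetic data with the same structure as create_image_dataset
rng = np.random.default_rng(0)
X = rng.random((200, 8, 8)) * 0.3
y = np.repeat([0, 1], 100)
X[y == 0, 0:2, 0:2] += 0.7
X[y == 1, 6:8, 6:8] += 0.7

acc = clean_accuracy(toy_backdoored_predict, X, y)
asr = attack_success_rate(toy_backdoored_predict, X, y)
print(f"clean accuracy: {acc:.2f}, attack success rate: {asr:.2f}")
```

To apply the same helpers to the models trained above, `predict` could be a small wrapper that runs `model(torch.FloatTensor(X)).argmax(dim=1).numpy()` inside `torch.no_grad()`.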

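Step 3: Extract Activations

The core idea of activation clustering is to analyze the model's internal representations (activations) for different inputs and look for anomalous patterns.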

In [ ]:

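def extract_activations(model, X):
    """
    Extract the model's intermediate-layer activations for the given inputs.

    Args:
        model: a trained model
        X: input data

    Returns:
        NumPy array of activations
    """
    X_tensor = torch.FloatTensor(X)

    # [Blank 1] Get the model's intermediate-layer activations
    # Hint: use the model.get_activation() method
    # Reference answer: activations = model.get_activation(X_tensor).numpy()
    activations = ___________________

    return activations

# Build the test data: clean samples and triggered samples
X_clean = X_test.copy()
X_triggered = np.array([add_trigger(img) for img in X_test])

print(f"Number of clean samples: {len(X_clean)}")
print(f"Number of triggered samples: {len(X_triggered)}")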
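The model in this lab caches its activation inside `forward`, which requires editing the model class. When the model code cannot be changed, PyTorch forward hooks offer another way to read an intermediate layer. The sketch below is an illustration rather than part of the lab: the `nn.Sequential` model is a hypothetical stand-in with the same layer sizes, and `captured` and `save_activation` are names introduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in with the same layer sizes as the lab's model
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),  # index 4: post-ReLU output of the second layer
    nn.Linear(16, 2),
)

captured = {}

def save_activation(module, inputs, output):
    """Forward hook: store a detached copy of the layer output."""
    captured["hidden"] = output.detach()

# Attach the hook to the layer whose output we want to inspect
handle = model[4].register_forward_hook(save_activation)

with torch.no_grad():
    logits = model(torch.rand(10, 8, 8))

handle.remove()  # detach the hook once the activations are collected

hidden = captured["hidden"].numpy()
print(hidden.shape)
```

The hook leaves the model definition untouched, which is convenient when auditing a model obtained from a third party.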

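Step 4: Analyze Activation Differences

Compare how each model's activations differ between clean samples and triggered samples, using the clean model as the baseline for the backdoored model.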

In [ ]:

# Extract activations
act_clean_on_clean = extract_activations(clean_model, X_clean)
act_clean_on_triggered = extract_activations(clean_model, X_triggered)
act_backdoor_on_clean = extract_activations(backdoor_model, X_clean)
act_backdoor_on_triggered = extract_activations(backdoor_model, X_triggered)

print(f"Activation shape: {act_clean_on_clean.shape}")

# Mean difference between two sets of activations
def compute_activation_difference(act1, act2):
    """Compute the mean absolute difference between two sets of activations."""
    # [Blank 2] Compute the mean absolute difference between the two sets of activations
    # Hint: take the absolute value with np.abs(), then average
    # Reference answer: diff = np.mean(np.abs(act1 - act2))
    diff = ___________________
    return diff

# Activation difference for each model, computed on per-neuron mean activations
diff_clean_model = compute_activation_difference(
    act_clean_on_clean.mean(axis=0),
    act_clean_on_triggered.mean(axis=0)
)
diff_backdoor_model = compute_activation_difference(
    act_backdoor_on_clean.mean(axis=0),
    act_backdoor_on_triggered.mean(axis=0)
)

print(f"\nClean model: activation difference, clean vs. triggered inputs: {diff_clean_model:.4f}")
print(f"Backdoored model: activation difference, clean vs. triggered inputs: {diff_backdoor_model:.4f}")
print(f"\nRatio (backdoored / clean): {diff_backdoor_model/diff_clean_model:.2f}x")
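Written out, the score in this step is the mean absolute difference between per-neuron mean activations. The notation below is introduced here for illustration and does not come from the lab: $a_j(x)$ is the activation of neuron $j$ in the monitored layer, $D$ is the number of neurons in that layer (16 in this lab), and $X_c$ and $X_t$ are the clean and triggered test sets.

```latex
\bar{a}_j(X) = \frac{1}{|X|} \sum_{x \in X} a_j(x),
\qquad
d = \frac{1}{D} \sum_{j=1}^{D} \left| \bar{a}_j(X_c) - \bar{a}_j(X_t) \right|,
\qquad
r = \frac{d_{\text{backdoor}}}{d_{\text{clean}}}
```

A ratio $r$ well above 1 suggests that the backdoored model reacts to the trigger more strongly than the clean model does. A ratio near 1 is not proof of a clean model, because averaging over samples can hide a shift that affects only a few neurons.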

Step 5: Visualize the Activation Distribution

In [ ]:

# Use PCA to project the activations to 2D for visualization
def simple_pca_2d(data):
    """Simplified PCA that projects the data onto two dimensions."""
    # Center the data
    data_centered = data - data.mean(axis=0)
    # Covariance matrix (features as rows after the transpose)
    cov = np.cov(data_centered.T)
    # Eigendecomposition (eigh returns eigenvalues in ascending order)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Select the top two principal components
    idx = eigenvalues.argsort()[::-1]
    top2_eigenvectors = eigenvectors[:, idx[:2]]
    # Project
    return data_centered @ top2_eigenvectors

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Activation distribution of the clean model
all_act_clean = np.vstack([act_clean_on_clean, act_clean_on_triggered])
pca_clean = simple_pca_2d(all_act_clean)
n = len(X_clean)

axes[0].scatter(pca_clean[:n, 0], pca_clean[:n, 1], c='blue', alpha=0.5, label='Clean inputs')
axes[0].scatter(pca_clean[n:, 0], pca_clean[n:, 1], c='red', alpha=0.5, label='Triggered inputs')
axes[0].set_title('Clean model: activation distribution')
axes[0].set_xlabel('Principal component 1')
axes[0].set_ylabel('Principal component 2')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Activation distribution of the backdoored model
all_act_backdoor = np.vstack([act_backdoor_on_clean, act_backdoor_on_triggered])
pca_backdoor = simple_pca_2d(all_act_backdoor)

axes[1].scatter(pca_backdoor[:n, 0], pca_backdoor[:n, 1], c='blue', alpha=0.5, label='Clean inputs')
axes[1].scatter(pca_backdoor[n:, 0], pca_backdoor[n:, 1], c='red', alpha=0.5, label='Triggered inputs')
axes[1].set_title('Backdoored model: activation distribution')
axes[1].set_xlabel('Principal component 1')
axes[1].set_ylabel('Principal component 2')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('Activation clustering analysis: looking for anomalous patterns', fontsize=12)
plt.tight_layout()
plt.show()

print("What to look for: in the backdoored model, triggered inputs should separate from clean inputs more clearly than in the clean model.")
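The scatter plots rely on knowing which points are triggered. A defender usually lacks those labels, so the clustering step has to find the split on its own. The block below is a minimal sketch of that idea using scikit-learn, which is listed in the environment requirements. Everything in it is an assumption made for illustration: the activations are synthetic Gaussian data rather than outputs of the lab's models, and the choice of PCA followed by two-cluster K-Means is one plausible configuration, not a prescribed one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for 16-dimensional activations of samples that share
# one predicted class: 90 "normal" samples plus 10 shifted "poisoned" ones.
normal = rng.normal(loc=0.0, scale=0.3, size=(90, 16))
poisoned = rng.normal(loc=2.0, scale=0.3, size=(10, 16))
activations = np.vstack([normal, poisoned])

# Reduce dimensionality, then split the points into two clusters
reduced = PCA(n_components=2, random_state=0).fit_transform(activations)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(reduced)
labels = kmeans.labels_

# A small, well-separated cluster is the suspicious signature
sizes = np.bincount(labels, minlength=2)
small_cluster = int(np.argmin(sizes))
small_fraction = sizes[small_cluster] / len(labels)
score = silhouette_score(reduced, labels)
suspect_indices = np.where(labels == small_cluster)[0]

print(f"cluster sizes: {sizes.tolist()}")
print(f"smaller cluster fraction: {small_fraction:.2f}, silhouette: {score:.2f}")
```

Cluster size and silhouette score are both heuristics. A class with two natural sub-populations can also split into two clean clusters, so flagged samples still need manual inspection.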

Step 6: Implement a Simple Backdoor Detector

In [ ]:

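def detect_backdoor(model, X_test, threshold=0.1):
    """
    A simple backdoor detection method.

    Idea: compare the model's activations on the original inputs with its
    activations on randomly perturbed inputs. A backdoored model is more
    sensitive to the trigger region, so its activations change more.
    """
    # Activations on the original inputs
    act_original = extract_activations(model, X_test)

    # Add a random perturbation (simulating a possible trigger location)
    X_perturbed = X_test.copy()
    for i in range(len(X_perturbed)):
        # Perturb the central 2x2 patch. This is where add_trigger places the
        # trigger; a real defender would not know the location in advance.
        X_perturbed[i, 3:5, 3:5] = np.random.rand(2, 2)

    act_perturbed = extract_activations(model, X_perturbed)

    # [Blank 3] Compute the standard deviation of the activation change
    # Hint: compute the norm of each sample's activation change, then take the standard deviation
    # Reference answer: activation_change = np.std(np.linalg.norm(act_original - act_perturbed, axis=1))
    activation_change = ___________________

    # Decide whether a backdoor is likely present
    is_backdoor = activation_change > threshold

    return is_backdoor, activation_change

# Run the detector on both models
print("="*50)
print("Backdoor detection results")
print("="*50)

is_backdoor_clean, score_clean = detect_backdoor(clean_model, X_test)
is_backdoor_back, score_back = detect_backdoor(backdoor_model, X_test)

print("\nClean model:")
print(f"  Activation change score: {score_clean:.4f}")
print(f"  Result: {'Possible backdoor ⚠️' if is_backdoor_clean else 'Normal ✓'}")

print("\nBackdoored model:")
print(f"  Activation change score: {score_back:.4f}")
print(f"  Result: {'Possible backdoor ⚠️' if is_backdoor_back else 'Normal ✓'}")

print("\n" + "="*50)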
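The fixed `threshold=0.1` in the detector is an arbitrary default, and the score scale depends on the architecture and the data. One hedged alternative is to calibrate the threshold against scores measured on models believed to be clean. The sketch below illustrates that idea only: `calibrate_threshold`, `is_suspicious`, and the factor `k` are hypothetical, and the reference scores are made-up numbers, not results from this lab.

```python
import numpy as np

def calibrate_threshold(reference_scores, k=3.0):
    """Return mean + k * std of scores measured on models believed to be clean."""
    reference_scores = np.asarray(reference_scores, dtype=float)
    return float(reference_scores.mean() + k * reference_scores.std())

def is_suspicious(score, threshold):
    """Flag a model whose activation change score exceeds the calibrated threshold."""
    return bool(score > threshold)

# Made-up scores for illustration only (not measured in this lab)
reference_scores = [0.040, 0.050, 0.045, 0.055, 0.060]
threshold = calibrate_threshold(reference_scores)

print(f"calibrated threshold: {threshold:.4f}")
print(is_suspicious(0.052, threshold))
print(is_suspicious(0.200, threshold))
```

In practice the reference scores could come from running `detect_backdoor` on several clean models trained with different seeds. The rule assumes such clean models are available, which is not always the case.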

Step 7: Comparative Visualization of Neuron Activations

In [ ]:

# Analyze the activation pattern of each neuron
def analyze_neuron_activation(model, X_clean, X_triggered):
    """Return each neuron's mean activation on clean inputs and on triggered inputs."""
    act_clean = extract_activations(model, X_clean)
    act_triggered = extract_activations(model, X_triggered)

    mean_clean = act_clean.mean(axis=0)
    mean_triggered = act_triggered.mean(axis=0)

    return mean_clean, mean_triggered

# Per-neuron mean activations for both models
clean_act_c, clean_act_t = analyze_neuron_activation(clean_model, X_clean, X_triggered)
back_act_c, back_act_t = analyze_neuron_activation(backdoor_model, X_clean, X_triggered)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Clean model
x = np.arange(len(clean_act_c))
width = 0.35
axes[0].bar(x - width/2, clean_act_c, width, label='Clean inputs', color='blue', alpha=0.7)
axes[0].bar(x + width/2, clean_act_t, width, label='Triggered inputs', color='red', alpha=0.7)
axes[0].set_xlabel('Neuron index')
axes[0].set_ylabel('Mean activation')
axes[0].set_title('Clean model: per-neuron activation comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Backdoored model
axes[1].bar(x - width/2, back_act_c, width, label='Clean inputs', color='blue', alpha=0.7)
axes[1].bar(x + width/2, back_act_t, width, label='Triggered inputs', color='red', alpha=0.7)
axes[1].set_xlabel('Neuron index')
axes[1].set_ylabel('Mean activation')
axes[1].set_title('Backdoored model: per-neuron activation comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.suptitle('Neuron activation analysis', fontsize=12)
plt.tight_layout()
plt.show()

print("What to look for: in the backdoored model, some neurons respond very differently to triggered inputs (possible 'backdoor neurons').")

Step 8: Summary of Detection Methods

In [ ]:

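# Print a comparison table of detection methods
print("="*90)
print("Comparison of backdoor detection methods")
print("="*90)

# (method, when it runs, what it requires, compute cost, main advantage)
methods = [
    ("Activation Clustering", "Post-training", "Training data", "Medium", "Simple and intuitive"),
    ("Neural Cleanse", "Post-training", "Model access", "High", "Can reconstruct the trigger"),
    ("STRIP", "Runtime", "Input samples", "Medium", "Real-time detection"),
]

print(f"{'Method':<22} {'When':<14} {'Requires':<14} {'Cost':<8} {'Main advantage'}")
print("-"*90)
for m in methods:
    print(f"{m[0]:<22} {m[1]:<14} {m[2]:<14} {m[3]:<8} {m[4]}")
print("="*90)

print("\nThis lab uses a simplified activation-analysis method.")
print("Real-world use requires more sophisticated algorithms and more test data.")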

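Lab Summary

Key Findings

1. Activation differences: the backdoored model's activations differ markedly between triggered inputs and clean inputs.

2. Cluster separation: dimensionality reduction makes the activation split caused by the backdoor visible.

3. Neuron analysis: some neurons respond abnormally when the trigger is present and may be "backdoor neurons".

4. Detection challenge: real-world detection has to work without knowing the trigger, whereas this lab used the known trigger and its location.

Defense Recommendations

1. Pre-deployment audit: scan models with tools such as Neural Cleanse.
2. Runtime monitoring: use STRIP to flag suspicious inputs.
3. Combine methods: use several detection methods in high-security settings.
4. Supply-chain security: verify where a model comes from and use a safe serialization format (SafeTensors).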
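The runtime-monitoring recommendation can be made concrete with the idea behind STRIP: blend a suspicious input with random clean samples and check how much the predictions vary. A trigger tends to survive blending, so predictions stay confident and their entropy stays low. The block below is a simplified sketch under stated assumptions, not the reference implementation: `strip_score` and `prediction_entropy` are names introduced here, the blend weight `alpha` is an arbitrary choice, and `toy_predict_proba` is a hand-written rule standing in for a backdoored classifier.

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy (in nats) of each row of class probabilities."""
    probs = np.clip(probs, 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=1)

def strip_score(predict_proba, x, clean_pool, n_overlays=50, alpha=0.5, rng=None):
    """Mean prediction entropy of x blended with randomly drawn clean samples."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(clean_pool), size=n_overlays, replace=True)
    blended = alpha * x[None, ...] + (1.0 - alpha) * clean_pool[idx]
    return float(prediction_entropy(predict_proba(blended)).mean())

def toy_predict_proba(X):
    """Hypothetical backdoored classifier: a bright centre patch forces class 0."""
    top_left = X[:, 0:2, 0:2].mean(axis=(1, 2))
    bottom_right = X[:, 6:8, 6:8].mean(axis=(1, 2))
    centre = X[:, 3:5, 3:5].mean(axis=(1, 2))
    logit0 = 10.0 * (top_left - bottom_right) + 200.0 * np.maximum(0.0, centre - 0.4)
    p0 = 1.0 / (1.0 + np.exp(-logit0))
    return np.stack([p0, 1.0 - p0], axis=1)

# Clean pool with the same structure as the lab's dataset
rng = np.random.default_rng(0)
pool = rng.random((100, 8, 8)) * 0.3
pool[:50, 0:2, 0:2] += 0.7  # class 0 pattern
pool[50:, 6:8, 6:8] += 0.7  # class 1 pattern

clean_input = pool[60].copy()            # a clean class 1 sample
triggered_input = clean_input.copy()
triggered_input[3:5, 3:5] = 1.0          # same patch that add_trigger writes

clean_entropy = strip_score(toy_predict_proba, clean_input, pool, rng=rng)
triggered_entropy = strip_score(toy_predict_proba, triggered_input, pool, rng=rng)
print(f"clean: {clean_entropy:.4f}, triggered: {triggered_entropy:.4f}")
```

An input whose score falls well below the scores of known clean inputs would be flagged. Choosing that cutoff requires a held-out set of clean samples.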

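Questions to Consider

1. Why does the activation distribution of a backdoored model split into separate groups?

2. If an attacker knows that you use activation clustering, how could they adapt the attack?

3. Beyond detection, what methods can remove a backdoor that has already been implanted?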

In [ ]:

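# Lab completion check
print("="*50)
print("Lab 5.3 complete!")
print("="*50)
print("\nPlease answer the following questions:")
print("1. By what factor does the backdoored model's activation change score exceed the clean model's?")
print("2. In the plots, what characterizes the backdoored model's activation distribution?")
print("3. What is the main limitation of activation clustering?")
print("\n" + "="*50)
print("Congratulations on completing all labs in Module 5!")
print("="*50)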

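Lab Summary

Completion Checklist

After completing this lab, you should have:

- Extracted the intermediate-layer activations of the backdoored model
- Reduced the activations to two dimensions with PCA and visualized them
- Observed how clean samples and triggered samples are distributed differently in activation space
- Identified the anomalous sample cluster in the projected activations
- Understood how activation clustering detects backdoors
- Recognized the limitations of the method and possible improvements

Further Thinking

1. Activation clustering assumes that backdoored samples and normal samples produce different activation patterns. If an attacker designs a backdoor whose activation patterns resemble those of normal samples, does the method still work?
2. Besides activation clustering, what other methods can detect backdoors? What are the strengths and weaknesses of each?
3. In practice, how do you balance detection accuracy against efficiency? How would you optimize the detection pipeline for large models?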
