Detecting Trojaned DNNs via Spectral Regression Analysis
Samuele Pasini, Jinhan Kim, Paolo Tonella
Abstract
Modern DNNs are repeatedly fine-tuned to incorporate new data and functionality. This evolutionary workflow introduces a security risk when updated data cannot be fully trusted, as adversaries may implant Trojans during fine-tuning. We present MIST, a Trojan detection approach that analyzes how a model's internal representations change during fine-tuning. Rather than attempting to reconstruct trigger conditions, MIST characterizes benign model evolution using pre-activation spectra and flags updates whose spectral deviations are inconsistent with this reference. This framing treats Trojan detection as a regression problem over model updates. An empirical evaluation across four datasets and eight Trojan attacks shows that spectral distances reliably distinguish Trojaned updates from clean fine-tuning. MIST outperforms state-of-the-art detection accuracy after a single update, without requiring any knowledge about the poisoned data or the trigger, and remains effective under multi-step benign evolution, with graceful and bounded degradation. These results indicate that spectral evolution provides a stable and assumption-light signal for detecting malicious model updates.
One-sentence Summary
MIST detects Trojans in fine-tuned neural networks by framing detection as a spectral regression problem that characterizes benign pre-activation evolution to flag malicious updates via spectral deviations, operating without knowledge of poisoned data or triggers while achieving state-of-the-art accuracy across four datasets and eight attacks and maintaining robust performance under multi-step benign fine-tuning.
Key Contributions
- MIST introduces a regression-based framework that treats Trojan detection as a problem of identifying anomalous deviations in spectral model evolution. The method characterizes benign fine-tuning trajectories using pre-activation spectra to isolate malicious updates without requiring trigger reconstruction or poisoned data.
- The approach validates updated model checkpoints against a clean reference baseline by computing spectral distances that quantify internal representation shifts. This reference-driven mechanism flags updates that deviate from established benign patterns while remaining independent of specific attack implementations.
- Empirical evaluation across four datasets and eight Trojan attacks demonstrates that spectral distances reliably distinguish poisoned updates from clean fine-tuning. The method outperforms three state-of-the-art detectors in single-step scenarios and maintains robust performance with bounded degradation under multi-step benign evolution.
Introduction
Deep neural networks are routinely fine-tuned in production to adapt to new data, a practice essential for safety-critical systems but vulnerable to backdoor attacks when update datasets are compromised. Prior detection methods typically analyze isolated models and attempt to reconstruct unknown trigger patterns, a strategy that struggles with imperceptible inputs and relies on restrictive assumptions about trigger visibility. The authors reframe this challenge by treating model updates as regression events, leveraging pre-activation spectra to establish a baseline for benign evolution. By measuring spectral deviations against this reference, their approach, MIST, reliably flags malicious fine-tuning without requiring any knowledge of the trigger, demonstrating superior accuracy and robustness across diverse datasets and attack vectors.
Dataset
- Dataset composition and sources: The authors do not specify dataset composition or sources in this section.
- Key subset details: Information regarding subset sizes, origins, or filtering criteria is not provided.
- Model usage and processing: The authors do not outline training splits, mixture ratios, or data processing steps here.
- Processing and metadata: No cropping strategies, metadata construction, or preprocessing workflows are described.
- Code and checkpoint availability: The authors release implementations and source code on a public GitHub repository. Due to storage constraints, they do not host model checkpoints publicly and distribute them upon request.
Method
The authors leverage spectral analysis of neural network activations to develop MIST, a method for detecting Trojaned models by monitoring internal changes during fine-tuning. The core approach operates under the model evolution scenario, where a deployed model is periodically updated with new data that may be partially untrusted. The method assumes access to a clean baseline model, a trusted test set, and a small clean subset of the new data for probing internal behavior, without requiring access to poisoned samples or triggers. MIST operates in two distinct phases: Clean Spectra Tracking and Anomaly Detection.
In the Clean Spectra Tracking phase, the framework establishes a statistical baseline for benign model evolution. This is achieved by repeatedly simulating clean training-to-fine-tuning transitions using only trusted data. For each simulation, the clean training set is split into two subsets. A model $G_0$ is trained on the first subset, and then fine-tuned on the second to produce $G_1$. The internal change induced by this update is quantified by comparing the activation spectra of $G_0$ and $G_1$. The spectral representation of a model at a specific layer $\ell$ and class $c$ is constructed by first filtering inputs from a test set that the model predicts as class $c$, then extracting the pre-activation values $z^{(\ell)}(x)$ for these inputs, normalizing them, and discretizing them into a histogram over a fixed number of bins. This histogram is normalized to form a probability distribution, which is the activation spectrum. The $L_2$ distance between the per-class spectra of $G_0$ and $G_1$ is computed, and this process is repeated for multiple simulated updates to populate the Clean Spectra Distances Distribution (CSDD). This distribution captures the typical magnitude and variability of spectral changes under benign fine-tuning.
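The spectrum construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names, the bin count of 50, the min-max normalization, the flattening of all units in a layer into one histogram, and the handling of empty classes are all assumptions, since the summary only states that pre-activations are normalized and binned.

```python
import numpy as np

def activation_spectrum(preacts, n_bins=50):
    """Histogram-based spectrum for one (layer, class) pair.

    preacts: pre-activation values of the inputs predicted as the class,
             any shape; all values are pooled (an assumption).
    """
    values = np.asarray(preacts, dtype=float).ravel()
    # Min-max normalization to [0, 1] (assumed normalization scheme).
    lo, hi = values.min(), values.max()
    span = hi - lo
    normed = (values - lo) / span if span > 0 else np.zeros_like(values)
    # Discretize into a fixed number of bins, then normalize the counts
    # so the histogram is a probability distribution.
    counts, _ = np.histogram(normed, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum()

def per_class_spectral_distances(preacts_a, preds_a, preacts_b, preds_b,
                                 n_classes, n_bins=50):
    """L2 distance between the per-class spectra of two models.

    preacts_*: arrays of shape (n_inputs, n_units) at the chosen layer.
    preds_*:   predicted class per input, one array per model.
    Returns a vector with one distance per class.
    """
    distances = np.zeros(n_classes)
    for c in range(n_classes):
        za = preacts_a[preds_a == c]
        zb = preacts_b[preds_b == c]
        if za.size == 0 or zb.size == 0:
            # Placeholder choice (assumption): no inputs predicted as c,
            # so the distance is left at 0.
            continue
        sa = activation_spectrum(za, n_bins)
        sb = activation_spectrum(zb, n_bins)
        distances[c] = np.linalg.norm(sa - sb)
    return distances
```

Under these assumptions, calling `per_class_spectral_distances` once per simulated update and stacking the resulting vectors row by row would give an array of shape (number of simulations, number of classes) that plays the role of the CSDD.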
In the Anomaly Detection phase, the method evaluates a newly produced model $M_{i+1}$ against its predecessor $M_i$. The spectral distance between the two models is computed on a clean test set, ensuring no assumption is made about the availability of poisoned inputs. This distance is represented as a vector $x$ summarizing the internal change across all classes. The deviation of this observed change from the baseline CSDD is quantified using the squared Mahalanobis distance $D_M^2$, which accounts for the correlations between class-wise spectral changes. The mean $\mu$ and covariance $\Sigma$ of the CSDD are used to define the reference distribution. To ensure numerical stability, the covariance matrix is regularized using the Ledoit-Wolf shrinkage estimator. The squared Mahalanobis distance $D_M^2$ is then compared against a threshold $\tau$, which is determined as the $\alpha$-quantile of a $\chi^2$ distribution with $C$ degrees of freedom, where $C$ is the number of classes. If $D_M^2$ exceeds $\tau$, the update is flagged as anomalous and the model is classified as potentially Trojaned; otherwise, it is deemed consistent with benign evolution.
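The decision rule above can be sketched as follows, using the standard definition of the squared Mahalanobis distance, (x - μ)ᵀ Σ⁻¹ (x - μ). This is a minimal sketch under stated assumptions rather than the authors' code: the function names and the default α = 0.99 are hypothetical, and the CSDD is taken to be an array with one row per simulated benign update and one column per class. The Ledoit-Wolf estimate comes from scikit-learn's `LedoitWolf` and the χ² quantile from SciPy's `chi2.ppf`.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import LedoitWolf

def fit_reference(csdd):
    """Estimate the mean and shrunk covariance of the CSDD.

    csdd: array of shape (n_simulations, n_classes), one row of per-class
          spectral distances per simulated benign update.
    """
    lw = LedoitWolf().fit(np.asarray(csdd, dtype=float))
    return lw.location_, lw.covariance_

def check_update(x, mu, cov, alpha=0.99):
    """Flag an update whose squared Mahalanobis distance exceeds tau.

    x:     per-class spectral distance vector for the update under test.
    alpha: quantile level of the chi-squared threshold (hypothetical default).
    Returns (flagged, d2, tau).
    """
    diff = np.asarray(x, dtype=float) - mu
    # Solve a linear system instead of forming the explicit inverse.
    d2 = float(diff @ np.linalg.solve(cov, diff))
    # Threshold: alpha-quantile of chi-squared with C = number of classes.
    tau = float(chi2.ppf(alpha, df=len(mu)))
    return bool(d2 > tau), d2, tau

if __name__ == "__main__":
    # Synthetic stand-in for a CSDD with 5 classes (not real data).
    rng = np.random.default_rng(0)
    csdd = rng.normal(loc=0.1, scale=0.01, size=(200, 5))
    mu, cov = fit_reference(csdd)
    print(check_update(mu + 1.0, mu, cov))  # a large shift in every class
```

Raising `alpha` raises the threshold, which trades fewer false alarms for a higher chance of missing a small deviation; the value the authors use is not stated in this summary.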
Experiment
The evaluation assesses MIST, a Trojan detection method leveraging activation spectra, across diverse image classification datasets and multiple attack types to validate whether malicious updates induce distinguishable spectral deviations from benign fine-tuning. Results confirm that these spectral differences reliably separate compromised models from clean ones, enabling the approach to consistently outperform existing detectors after a single update while maintaining a highly favorable error profile with minimal false positives. Under repeated fine-tuning scenarios, the method demonstrates robust resilience, as performance degrades gracefully through a controlled increase in false alarms rather than missed detections, ultimately establishing spectral tracking as a stable and practical approach for validating evolving neural networks.
The authors evaluate MIST by assessing how well spectral analysis distinguishes benign from malicious model updates, comparing fine-tuned models against a clean reference checkpoint. Spectral distances reliably separate Trojaned models from clean fine-tuned ones across multiple datasets and attack types. In single-step fine-tuning scenarios, MIST achieves high detection accuracy and consistently outperforms state-of-the-art detectors, with fewer false positives. The method remains effective under multi-step evolution, even as models drift from the original reference: performance degrades gracefully, and the degradation stems primarily from additional false positives caused by benign drift rather than from missed Trojans.
Overall, the experiments validate that spectral distances reliably isolate compromised models in single-step fine-tuning scenarios, with fewer false positives than existing detectors. Repeated benign updates cause a gradual increase in false alarms, but detection capability is largely preserved, since the degradation comes mainly from false positives rather than from missed Trojans. The study therefore presents spectral monitoring as an effective and resilient approach for identifying backdoored updates in evolving machine learning pipelines.