Paper III · March 2026

The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing

Abstract

Recent disclosures of industrial-scale knowledge distillation — including campaigns comprising millions of fraudulent API exchanges targeting frontier models (Anthropic, 2026) — have made post-hoc detection of model theft a critical security requirement. Building on a formally-verified framework of logprob order-statistic geometry, we investigate the adversarial resilience of neural network identity across 72 experimental checkpoints. We establish a Two-Layer Identity Hypothesis: a model's structural identity (weights-regime geometry) is empirically invariant to distillation (within acceptance threshold \(\varepsilon\) across all 18 protocols), while its functional identity (API-regime Poisson Point Process residuals) predictably transfers to the student, converging up to 52% toward the teacher's template. Stress-testing this forensic channel against a white-box adversary, we find that functional provenance is geometrically coupled to the knowledge transfer objective. Adversarial erasure gradients are consistently dominated by the distillation loss, achieving only a transient suppression that rebounds within one epoch.

Passive fine-tuning on fresh data erases the trace more effectively than any adversarial method, but at a measurable cost to general capability — revealing a Pareto frontier with no favorable region for the adversary. This establishes API forensics as a time-sensitive detective control ("The Tripwire") and weights-regime identity as the immutable anchor ("The Vault"). Finally, we observe an apparent vulnerability: a cross-family adversarial spoofing attack achieves 69.4% convergence toward a decoy's fingerprint, while same-family spoofing catastrophically fails. We resolve this paradox by mapping the PPP-residual vector space, revealing that models cluster by capability topology, not corporate lineage. Cross-family "spoofing" is a spatial illusion caused by a narrow 7.8° alignment between the decoy and the primary distillation trajectory (\(R^2 = 0.995\)), whereas same-family decoys are anti-aligned. Across all adversarial interventions, the underlying Gumbel universality (\(\delta_\text{norm}\)) remains invariant (CV = 1.9%). We conclude that during active distillation, an adversary cannot simultaneously acquire a teacher's capabilities and erase or redirect the forensic trace. In this setting, the geometry forbids it.

1. Introduction

1.1 The Threat

On February 24, 2026, Anthropic publicly disclosed that three AI laboratories — DeepSeek, Moonshot AI, and MiniMax — had conducted industrial-scale knowledge distillation campaigns against Claude, generating over 16 million exchanges through approximately 24,000 fraudulent accounts (Anthropic, 2026). The campaigns targeted frontier capabilities including agentic reasoning, tool use, and coding, with the explicit goal of training competing models on the stolen outputs. MiniMax's campaign was detected while still active, providing unprecedented visibility into the lifecycle of a distillation attack. This disclosure made a question operationally urgent that had previously been theoretical: can the forensic trace of distillation survive adversarial countermeasures? An adversary who has already acquired a teacher's capabilities through distillation has strong incentive to erase the evidence — and as a white-box operator of the student model, has full gradient access to attempt it.

1.2 Three Papers, One Framework

This work is the third in a series establishing geometric foundations for neural network identity verification.

In [Coslett, 2026a], we identified the \(\delta\)-gene — the third pre-softmax logit gap — as a temperature-invariant, architecture-agnostic behavioral fingerprint, proved its universality from extreme value theory (\(\delta_\text{norm} \approx 0.318\) across the Gumbel class), and constructed an Inference-Time Physical Unclonable Function (IT-PUF) achieving zero false acceptances across 1,012 comparisons spanning 23 models and 16 vendor families. The mathematical foundation was formally verified in Coq: 311 theorems across 16 files, with zero uses of Admitted. An interval-splitting impossibility theorem proved that no adversarial KL budget can simultaneously close the fingerprint gap and avoid detection via accumulated noise.

In [Coslett, 2026b], we extended this framework to the API regime, demonstrating that Gumbel universality persists through commercial API logprob interfaces. A PPP-residualization transform revealed model-specific geometric fingerprints invisible in raw logprob data, enabling cross-session endpoint verification with 119/120 correct identifications across three temporal sessions. A conditional impossibility theorem for API-regime spoofing was machine-checked in Coq (41 theorems, 0 Admitted).

The present work stress-tests this framework under adversarial conditions. We ask three questions:

1. Does knowledge distillation transfer the forensic fingerprint from teacher to student?
2. Can an adversary with full white-box access erase or redirect the transferred trace?
3. What geometric structure governs the success or failure of spoofing attempts?

1.3 Summary of Results

Across 72 experimental checkpoints (18 from the distillation experiment, 54 from the adversarial stress test), we establish six findings:

1. Two-Layer Identity. A model's structural fingerprint (weight-geometry observable) is invariant to distillation at acceptance threshold \(\varepsilon\), while its functional fingerprint (PPP-residual template) transfers predictably to the student.
2. KD Dominance. The knowledge distillation loss dominates all adversarial correction terms. White-box erasure achieves only transient suppression at epoch 2 before systematic rebound at epoch 3.
3. Passive Erasure Superiority. Passive fine-tuning on fresh data erases the functional trace more effectively than any adversarial method — but at measurable cost to general capability.
4. The Pareto Frontier. No tested configuration achieves both trace erasure and capability preservation. The adversary's informed effort is wasted.
5. The Illusion of Spoofing. Cross-family spoofing success (\(\text{Conv}_X = 0.694\)) is a geographic coincidence: pure knowledge distillation explains the observed spoofing convergence with \(R^2 = 0.995\). Same-family spoofing is anti-aligned and fails categorically.
6. \(\delta_\text{norm}\) Invariance. The Gumbel universality constant survives all 54 adversarial checkpoints (CV = 1.9%, mean = 0.3126), confirming that the thermodynamic layer of model identity is not addressable by training-time interventions.

2. Background and Notation

2.1 The \(\delta\)-Gene Framework

At each token position, a language model produces a vocabulary-sized logit vector \(\mathbf{z} \in \mathbb{R}^V\). The gaps between consecutive order statistics, \(G_k = z_{(k)} - z_{(k+1)}\), follow a Poisson Point Process (PPP) with exponential spacings for large vocabularies. The normalized third gap \(\delta_\text{norm} = G_3 / (G_2 + G_3 + G_4) \approx 0.318\) is a distribution-free constant of the Gumbel class, validated across six distinct neural architectures and 23 models [Coslett, 2026a]. The IT-PUF protocol operates in two regimes. In the weights regime, a geometric observable \(g_\text{norm}\) is measured from the model's internal activations under controlled challenge inputs, producing a high-dimensional response vector (\(\tau\)) that serves as the model's structural fingerprint. In the API regime, PPP-residualized gap templates — computed by subtracting the dominant tail scale factor \(\hat{\beta}/k\) from the observed logprob gaps — reveal model-specific geometric structure invisible in raw logprob data [Coslett, 2026b].
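As a concrete illustration of the PPP spacing model, the sketch below simulates the approximate \(G_k \sim \text{Exp}(k)\) gap distribution and recovers the Gumbel-class constant. This is a minimal simulation under the stated exponential-spacing approximation; the function names and sample counts are illustrative, not from the experimental code.

```python
import random

def delta_norm(g2: float, g3: float, g4: float) -> float:
    """Normalized third gap: delta_norm = G3 / (G2 + G3 + G4)."""
    return g3 / (g2 + g3 + g4)

# Under the PPP model for the top order statistics of a large-vocabulary
# Gumbel-class logit vector, the k-th gap G_k is approximately Exp(k).
# Averaging delta_norm over simulated gaps should land near 0.318.
random.seed(0)
n = 200_000
total = 0.0
for _ in range(n):
    total += delta_norm(
        random.expovariate(2),  # G2 ~ Exp(2)
        random.expovariate(3),  # G3 ~ Exp(3)
        random.expovariate(4),  # G4 ~ Exp(4)
    )
print(f"{total / n:.4f}")  # close to the Gumbel-class constant ~0.318
```

The simulated mean matches the distribution-free constant reported in [Coslett, 2026a] without any reference to a specific architecture, which is the sense in which \(\delta_\text{norm}\) is a species-level rather than model-level observable.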

2.2 Threat Model

We consider an adversary who has performed knowledge distillation from a teacher model \(A\) into a student model \(B\). The adversary has:

- Full white-box access to the student's weights and gradients
- A frozen copy of the teacher's output logits (harvested during distillation)
- Knowledge that forensic detection methods exist (Kerckhoffs's principle)
- The ability to continue training the student with arbitrary loss functions

The adversary's goals are:

- Erasure: Remove the forensic trace of distillation from the student's API-observable fingerprint, so that the student's PPP-residual template no longer resembles the teacher's.
- Spoofing: Redirect the student's fingerprint toward a third-party decoy model, framing the distillation as having originated from a different teacher.
- Capability preservation: Maintain the student's generation quality (measured by perplexity on a held-out corpus) throughout the adversarial process.

2.3 Metrics

We define template convergence toward the teacher as the fraction of the baseline-to-teacher distance closed in PPP-residual template space:

\[\text{Conv}_T := 1 - \frac{d(T_S,\, T_A)}{d(T_{S,0},\, T_A)}\]

where \(T_S\) is the student's template after training, \(T_{S,0}\) is the undistilled baseline template, \(T_A\) is the teacher's template, and \(d(\cdot, \cdot)\) is \(L^2\) distance in the 7-dimensional residual space. \(\text{Conv}_T = 0\) indicates no movement toward the teacher; \(\text{Conv}_T = 1\) would indicate perfect template matching. Analogously, spoofing convergence toward a decoy model \(X\) is:

\[\text{Conv}_X := 1 - \frac{d(T_S,\, T_X)}{d(T_{S,0},\, T_X)}\]

General capability is measured by perplexity on a held-out C4 corpus (500 samples, max length 512 tokens) that no experimental variant trained on. We report \(\Delta\text{PPL} = \text{PPL}_\text{checkpoint} - \text{PPL}_\text{baseline}\).
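The convergence metrics can be computed directly from their definitions. The following is a minimal sketch with hypothetical 7-D templates; `convergence` and the toy values are ours for illustration, not the experimental pipeline.

```python
import math

def l2(a, b):
    """Euclidean distance between two templates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def convergence(T_S, T_S0, T_ref):
    """Conv = 1 - d(T_S, T_ref) / d(T_S0, T_ref):
    fraction of the baseline-to-reference distance closed."""
    return 1.0 - l2(T_S, T_ref) / l2(T_S0, T_ref)

# Toy 7-D templates (hypothetical values, for illustration only)
T_A  = [1.0] * 7   # teacher template
T_S0 = [0.0] * 7   # undistilled student baseline
T_S  = [0.4] * 7   # student that moved 40% of the way toward the teacher
print(round(convergence(T_S, T_S0, T_A), 3))  # → 0.4
```

Note that the metric is signed: a student that moves past its baseline, away from the reference, yields a negative value — exactly the situation reported for H2 in §4.4.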

3. Experiment I: Distillation Forensics

3.1 Experimental Design

We distill six variants from a single teacher (Qwen2.5-7B-Instruct, 7.6B parameters) into two student architectures (Qwen2.5-0.5B-Instruct, 494M; Llama-3.2-1B-Instruct, 1.24B), training for three epochs each and measuring every checkpoint in both the weights and API regimes. The six variants span the distillation design space:

| Variant | Protocol | Top-\(K\) | Purpose |
| --- | --- | --- | --- |
| A1 | High-bandwidth logit KD | 200 | Maximum information transfer |
| B1 | Top-\(K\) masked KD | 20 | API-realistic bottleneck |
| B2 | Top-\(K\) masked KD | 7 | Extreme information constraint |
| C1 | Cross-tokenizer SFT | N/A | Text-only transfer, different architecture |
| D1 | Self-distillation | 200 | Control: isolates "more training" effect |
| E1 | Shuffled-logits KD | 200 | Control: tests distribution shape vs. ordering |

Training uses bfloat16 precision, KL divergence temperature \(T = 2.0\), and AdamW optimizer (\(\text{lr} = 2 \times 10^{-5}\)). PPP-residualized templates are computed following [Coslett, 2026a; 2026b]. All 18 checkpoints (6 variants \(\times\) 3 epochs) are measured in both regimes.

3.2 Result: Structural Identity Is Invariant

All 18 distilled checkpoints remain within \(1\)–\(5 \times \varepsilon\) of their own undistilled baselines in \(g_\text{norm}\) distance, where \(\varepsilon\) is the IT-PUF acceptance threshold defined in [Coslett, 2026a]. The teacher remains \(726\)–\(1{,}212 \times \varepsilon\) away from every student checkpoint. From the weights perspective, all five Qwen-0.5B students are the same model regardless of distillation protocol. This result is consistent with the formal impossibility theorem of [Coslett, 2026a, §6.3]: the \(g_\text{norm}\) observable measures weight geometry, and knowledge distillation changes what a model outputs without altering the structural observable. The separation between any student and the teacher (\(\geq 726 \times \varepsilon\)) exceeds the separation between the most similar distinct models in the 23-model zoo (\(485 \times \varepsilon\), cross-family; \(1{,}186 \times \varepsilon\), same-family). A distilled student is more easily distinguished from its teacher than two unrelated models are from each other.

3.3 Result: Functional Identity Transfers

The PPP-residualized templates tell a different story. Teacher-distilled models converge toward the teacher's functional fingerprint, with the degree of transfer monotonic in the information bandwidth available during distillation:

| Variant | Top-\(K\) | Best \(\text{Conv}_T\) | Epoch |
| --- | --- | --- | --- |
| A1 (high-bandwidth) | 200 | 0.52 | 2 |
| B1 (API-grade) | 20 | 0.36 | 2 |
| B2 (minimal) | 7 | 0.31 | 3 |

(These convergence fractions are computed relative to each variant's own pre-distillation baseline distance to the teacher, following [Coslett, 2026b, §9.1]. The adversarial experiments of §4–5 use a single frozen baseline reference frame for cross-variant comparability.)

The attenuation from \(K = 200\) to \(K = 20\) is modest (0.52 → 0.36). Even \(K = 7\) — extreme information deprivation — shows 31% convergence. The forensic channel survives aggressive bottlenecking.

Both controls are null. D1 (self-distillation) diverges monotonically from the teacher (\(d = 2.25 \to 2.43 \to 2.48\)), eliminating the confound that additional training moves functional fingerprints toward arbitrary other models. E1 (shuffled-logits) preserves the teacher's marginal token distribution — the same tokens appear at the same frequencies — but destroys the structural ordering of which tokens compete at which ranks. The absence of convergence (\(d = 1.58 \to 1.28 \to 1.63\), non-monotonic) confirms that the PPP residual measures competitive dynamics between tokens, not aggregate distribution shape or stylistic similarity. The forensic channel is structural, not stylistic.

3.4 The C1 Confound

C1 (cross-tokenizer SFT from Qwen-7B into Llama-1B) achieves minimum PPP distance 0.266 from the teacher at epoch 2. However, the Llama-1B baseline is already 0.296 from the teacher — only 10% improvement over the untrained model. This coincidental baseline proximity undermines interpretability. The cleanest evidence comes from A1 and B1, which start far from the teacher and show sustained convergence.

3.5 The Two-Layer Identity Hypothesis

The results establish that neural network identity operates on two separable layers:

| Layer | Observable | Transfer? | Forensic Question |
| --- | --- | --- | --- |
| Structural (weights) | \(g_\text{norm}\) | NOT transferable | What model is this? |
| Functional (API) | PPP residual | Partially transferable | Who taught this model? |

The weights regime answers the first question with formal impossibility guarantees [Coslett, 2026a, NoSpoofing.v]. The API regime answers the second — the teacher's fingerprint transfers through distillation, even through a top-20 logprob bottleneck representative of real API interfaces. A natural question arises: can the functional trace be adversarially erased?

Figure 2 · The Two-Layer Identity Under Distillation. Panel A (structural, \(g_\text{norm}\)): all 18 checkpoints remain \(1\)–\(5 \times \varepsilon\) from their baselines while the teacher sits \(726\)–\(1{,}212 \times \varepsilon\) away — invariant. Panel B (functional, PPP): 31–52% convergence toward the teacher, with the controls (D1, E1) showing none — transfers. Structural identity is immovable; functional identity transfers through distillation.

4. Experiment II: Adversarial Erasure

4.1 Experimental Design

We test whether a white-box adversary can erase the functional provenance trace while maintaining generation quality. The experimental matrix comprises 18 variants across control, adversarial, and passive categories, yielding 54 total checkpoints (including extension epochs for F1). Teacher: Qwen2.5-7B-Instruct; student: Qwen2.5-0.5B-Instruct. The baseline-to-teacher distance is \(d = 1.546\).

| Category | Variants | Description |
| --- | --- | --- |
| Control | A1 | Standard KD (replicates §3) |
| Erasure | F1–F5 | White-box erasure, \(\lambda \in \{0.1, 0.3, 1.0, 3.0, 10.0\}\) |
| Passive | H1 | SFT on Alpaca (52K instruction-following) |
| Passive | H2 | SFT on wikitext (domain text) |
The remaining ten variants of the matrix (G1–G10) are spoofing experiments, reported separately in §5.

4.2 The Adversarial Loss

The adversarial training objective combines knowledge distillation with an erasure penalty:

\[\mathcal{L} = \mathcal{L}_\text{KD} + \lambda \cdot \mathcal{L}_\text{erase}\]

where \(\mathcal{L}_\text{KD}\) is the standard KL divergence between student and teacher output distributions at temperature \(T = 2.0\) (Hinton et al., 2015), and the erasure loss pushes the student's PPP-residual template toward its undistilled baseline:

\[\mathcal{L}_\text{erase} = \max\!\big(0,\; \|T_S^{(\text{online})} - T_{S,0}\| - m\big)\]

The hinge formulation saturates once the template reaches margin \(m\) from the baseline, preventing the erasure gradient from pushing the model beyond a natural operating point. The online template estimate is tracked as a running statistic over the training trajectory. The hyperparameter \(\lambda\) controls the relative weight of erasure versus knowledge acquisition. The passive variants (H1, H2) use standard supervised fine-tuning loss with no adversarial component and no knowledge of the forensic observable. They represent the "uninformed adversary" — an operator who simply continues training on fresh data without any forensic intent.
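A minimal sketch of the combined objective, assuming (per the description above) that the hinge pulls the online template toward the baseline and saturates once within margin \(m\). The KD term is stood in by a precomputed scalar, and all names and numeric values are illustrative.

```python
import math

def l2(a, b):
    """Euclidean distance between two templates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def erasure_loss(T_online, T_baseline, margin):
    """Hinged pull toward the baseline: positive while the online template is
    farther than `margin` from the baseline, zero (saturated) once inside it."""
    return max(0.0, l2(T_online, T_baseline) - margin)

def total_loss(kd_loss, T_online, T_baseline, lam, margin=1.0):
    """L = L_KD + lambda * L_erase; the KD term (KL divergence at T = 2.0)
    is computed elsewhere and passed in as a scalar here."""
    return kd_loss + lam * erasure_loss(T_online, T_baseline, margin)

T_S0 = [0.0] * 7   # undistilled baseline template (toy values)
near = [0.1] * 7   # ~0.26 from baseline: inside the margin, hinge saturates
far  = [0.9] * 7   # ~2.38 from baseline: hinge active, penalty added
print(round(total_loss(2.5, near, T_S0, lam=1.0), 3))  # → 2.5
print(round(total_loss(2.5, far,  T_S0, lam=1.0), 3))
```

The saturation is the design point: once the template is back inside the margin, \(\nabla \mathcal{L}_\text{erase} = 0\) and only the KD gradient acts — which is exactly the regime in which the sawtooth rebound of §4.3 occurs.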

4.3 Result: KD Dominance and the Sawtooth

The central finding is that the knowledge distillation gradient dominates the erasure gradient at every tested \(\lambda\) value, producing a characteristic "sawtooth" pattern: adversarial erasure achieves maximum effect at epoch 2, then systematically reverses at epoch 3.

| Variant | \(\lambda\) | \(\text{Conv}_T\) (epoch 1) | \(\text{Conv}_T\) (epoch 2) | \(\text{Conv}_T\) (epoch 3) |
| --- | --- | --- | --- | --- |
| A1 (control) | 0 | 0.350 | 0.422 | 0.448 |
| F1 | 0.1 | 0.232 | 0.260 | 0.185 |
| F2 | 0.3 | 0.357 | 0.285 | 0.321 |
| F3 | 1.0 | 0.279 | 0.113 | 0.415 |
| F4 | 3.0 | 0.213 | 0.140 | 0.302 |
| F5 | 10.0 | 0.238 | 0.242 | 0.178 |

The best adversarial result — F3 (\(\lambda = 1.0\)) at epoch 2 with \(\text{Conv}_T = 0.113\) — is a transient minimum. By epoch 3, the same variant rebounds to \(\text{Conv}_T = 0.415\), nearly matching the unmodified KD control (\(\text{Conv}_T = 0.448\)). The pattern repeats across F3 (\(0.279 \to 0.113 \to 0.415\)), F4 (\(0.213 \to 0.140 \to 0.302\)), and multiple spoofing variants (§5). The mechanism is straightforward: KD pulls the student toward the teacher (increasing \(\text{Conv}_T\)); erasure pushes away (decreasing \(\text{Conv}_T\)). KD wins because it operates on the full output distribution (vocabulary-sized logit vector), shaping the loss landscape across all token positions, while the erasure penalty operates on a low-dimensional projection (7D PPP-residual template). The erasure gradient is a correction term on a dominant gradient. With continued training, KD recaptures whatever ground the erasure penalty temporarily gained. This dimensional asymmetry suggests the result is not architecture-specific: any distillation objective that optimizes across the full vocabulary will dominate a penalty defined on a low-rank template projection.

This interpretation is confirmed by the F1 extension experiment: F1 (\(\lambda = 0.1\)) was extended to 5 epochs. \(\text{Conv}_T\) progressed as \(0.232 \to 0.260 \to 0.185 \to 0.363 \to 0.488\). By epoch 5, the gentle erasure penalty was completely overwhelmed — the model converged further toward the teacher than the standard KD control at epoch 3. The mechanism is an instance of adversarial regularization: by gently perturbing the template geometry, the erasure penalty smoothed the loss landscape and dropped the student into a steeper KD basin. The adversary's attempt to hide the trace made the theft more efficient. A further confirmation: \(\lambda\) has no effect on capability. Using wikitext perplexity as the training-time monitor, the collapse ceiling was set at 50% above the student baseline (PPL\(_\text{wiki}\) = 21.80, ceiling = 32.70). No variant at any \(\lambda\) approached it: F5 (\(\lambda = 10.0\)) at epoch 3 reports PPL\(_\text{wiki}\) = 22.98, functionally indistinguishable from the control. The erasure penalty is inert — KD dominates at all tested penalty strengths.

4.4 Result: Passive SFT Beats Active Erasure

The passive variants, trained with zero knowledge of the forensic observable, outperform all adversarial variants at erasure:

| Variant | Knowledge | \(\text{Conv}_T\) (epoch 2) | C4 \(\Delta\)PPL |
| --- | --- | --- | --- |
| F3 (\(\lambda = 1.0\)) | Full white-box | 0.113 | +2.1 |
| F4 (\(\lambda = 3.0\)) | Full white-box | 0.140 | +2.2 |
| H1 (Alpaca SFT) | None | 0.062 | +3.1 |
| H2 (wikitext SFT) | None | \(-0.113\) | +7.6 |

H2 (wikitext, epoch 2) achieves \(\text{Conv}_T = -0.113\) — the model has moved past its undistilled baseline, further from the teacher than it was before distillation. The provenance trace is not merely erased but overwritten. H1 (Alpaca) achieves \(\text{Conv}_T = 0.062\) — nearly erased — with no forensic knowledge whatsoever. Critically, Alpaca is instruction-tuning data — exactly the kind of continued training an adversary would perform to improve a distilled model for deployment — yet even this routine post-processing degrades general capability by +3.1 C4 PPL. The adversary's full white-box access — the \(100\times\) \(\lambda\) sweep, the gradient-informed erasure penalty, the knowledge of exactly which observable to target — bought approximately one PPL point of capability preservation while retaining twice as much forensic trace as the uninformed Alpaca baseline.

4.5 The Pareto Frontier

The held-out C4 benchmark reveals an inescapable tradeoff. Because training-time wikitext PPL conflates general capability with domain specialization (H2's PPL\(_\text{wiki} \approx 15.9\) reflects overfitting, not improvement), we evaluate all capability comparisons on 500 held-out C4 samples — a neutral dataset that no variant trained on:

| Checkpoint | \(\text{Conv}_T\) | C4 PPL | \(\Delta\)PPL |
| --- | --- | --- | --- |
| Baseline (undistilled) | — | 23.16 | — |
| A1 epoch 2 (KD control) | +0.422 | 25.26 | +2.1 |
| F3 epoch 2 (best erasure) | +0.113 | 25.29 | +2.1 |
| H1 epoch 2 (passive Alpaca) | +0.062 | 26.24 | +3.1 |
| H2 epoch 2 (passive wikitext) | \(-0.113\) | 30.74 | +7.6 |

No configuration achieves both trace erasure (\(\text{Conv}_T \leq 0\)) and capability preservation (\(\Delta\text{PPL} < 3\)). The adversary faces a Pareto frontier with no favorable region:

- White-box erasure (F3): preserves capability (\(\Delta\text{PPL} = +2.1\)) but retains the trace (\(\text{Conv}_T = +0.113\)). The adversary spent their gradient budget achieving functionally nothing.
- Passive instruction SFT (H1): nearly erases the trace (\(\text{Conv}_T = +0.062\)) at modest capability cost (\(\Delta\text{PPL} = +3.1\)). Zero forensic knowledge required.
- Passive domain SFT (H2): fully erases the trace (\(\text{Conv}_T = -0.113\)) but severely degrades general capability (\(\Delta\text{PPL} = +7.6\)). H2's wikitext PPL appears favorable (\(\approx 15.9\)), but this reflects domain specialization — the model traded general capability for wikitext fluency.

The detection window for functional provenance is therefore determined by the rate of passive fine-tuning, not by adversarial sophistication. Because routine continued training (instruction-following, domain adaptation) passively erases the trace within 1–2 epochs, provenance detection is inherently time-sensitive. API-based functional fingerprinting must be deployed as a continuous, real-time detective control — not a retrospective audit.

Figure 4 · The Pareto Frontier: No Favorable Region. Checkpoints plotted by trace erasure (horizontal axis: trace retained → trace erased) against capability (vertical axis: preserved → degraded). Adversarial erasure (F3) retains the trace; passive SFT (H1/H2) erases the trace but costs capability. No tested configuration achieves both.

5. Experiment III: The Geometry of Spoofing

5.1 Experimental Design

Beyond erasure, an adversary might attempt to redirect the student's fingerprint toward a third-party decoy, framing the distillation as having originated from a different teacher. We test spoofing against two decoys:

| Category | Variants | Decoy | \(\lambda\) range |
| --- | --- | --- | --- |
| Cross-family spoofing | G1–G5 | Llama-3.2-1B-Instruct | 0.1–10.0 |
| Same-family spoofing | G6–G10 | Qwen2.5-1.5B-Instruct | 0.1–10.0 |

The spoofing loss replaces the erasure loss from §4.2 with a term that minimizes distance to the decoy's template:

\[\mathcal{L}_\text{spoof} = \max\!\big(0,\; \|T_S^{(\text{online})} - T_X\| - m'\big)\]

where \(T_X\) is the frozen template of the target decoy. The full training objective is \(\mathcal{L} = \mathcal{L}_\text{KD} + \lambda \cdot \mathcal{L}_\text{spoof}\). All 10 spoofing variants (G1–G10, 5 \(\lambda\) values \(\times\) 2 decoys) are trained for 3 epochs, yielding 30 primary checkpoints.

5.2 The Apparent Vulnerability

Cross-family spoofing (G1–G5, targeting Llama-1B) produces ostensibly alarming results:

| Variant | \(\lambda\) | Best \(\text{Conv}_{X,\text{Llama}}\) | Epoch |
| --- | --- | --- | --- |
| G1 | 0.1 | 0.624 | 3 |
| G2 | 0.3 | 0.694 | 3 |
| G3 | 1.0 | 0.351 | 2 |
| G4 | 3.0 | 0.539 | 3 |
| G5 | 10.0 | 0.383 | 3 |

G2 (\(\lambda = 0.3\)) at epoch 3 achieves \(\text{Conv}_{X,\text{Llama}} = 0.694\) — the student has closed 69.4% of the distance to the Llama-1B decoy's fingerprint. Taken at face value, this would represent a serious vulnerability: an adversary could distill from one model while appearing to have distilled from another. Same-family spoofing (G6–G10, targeting Qwen-1.5B) tells the opposite story: all 15 checkpoints have negative \(\text{Conv}_{X,\text{Qwen1.5}}\), meaning the student moved further from the decoy despite the explicit spoofing gradient. The best same-family result is \(\text{Conv}_{X,\text{Qwen1.5}} = -0.202\) (G10, \(\lambda = 10.0\), epoch 1) — failure even at maximal penalty strength. Why does cross-family spoofing appear to succeed while same-family spoofing fails categorically? The answer lies in the geometry.

5.3 The Alignment Diagnostic

We define the KD direction and decoy direction in PPP-residual template space:

\[\mathbf{v}_T = T_A - T_{S,0} \qquad \text{(teacher direction from baseline)}\]
\[\mathbf{v}_X = T_X - T_{S,0} \qquad \text{(decoy direction from baseline)}\]

The cosine alignment between these vectors determines whether KD-induced movement will collaterally approach or recede from the decoy:

\[\cos\theta_X = \frac{\mathbf{v}_T \cdot \mathbf{v}_X}{\|\mathbf{v}_T\| \cdot \|\mathbf{v}_X\|}\]

| Decoy | \(\cos\theta\) | Angle | Predicted \(\text{Conv}_X\) sign |
| --- | --- | --- | --- |
| Llama-1B (cross-family) | \(+0.991\) | 7.8° | Positive (appears to succeed) |
| Qwen-1.5B (same-family) | \(-0.747\) | 138.3° | Negative (fails) |

The KD direction and the Llama direction are nearly identical — 7.8° apart. Any training step that moves the student toward its teacher simultaneously moves it toward Llama-1B as a geometric side effect. The spoofing gradient is redundant; the KD gradient is already doing the work.

To quantify the spoofing gradient's independent contribution, we regress \(\text{Conv}_{X,\text{Llama}}\) against \(\text{Conv}_T\) across all 54 checkpoints. Under the geometric model, if movement is purely KD-induced, then \(\text{Conv}_X\) is the scalar projection of the KD displacement onto the decoy direction, scaled by baseline distance — i.e., \(\text{Conv}_X\) should be a linear function of \(\text{Conv}_T\) with slope determined by \(\cos\theta_X\) and the distance ratio \(\|\mathbf{v}_T\| / \|\mathbf{v}_X\|\). Any independent spoofing contribution would appear as systematic positive residuals in the G-series variants.

The pure-KD geometric decomposition yields \(R^2 = 0.995\): knowledge distillation movement alone, with zero spoofing gradient, predicts 99.5% of the variance in cross-family spoofing convergence. (An OLS regression of \(\text{Conv}_{X,\text{Llama}}\) on \(\text{Conv}_T\) yields \(R^2 = 0.998\) and \(r = +0.999\), confirming the relationship is effectively deterministic.) For the headline result (G2 epoch 3, \(\text{Conv}_{X,\text{Llama}} = 0.694\)), the pure-KD decomposition predicts 0.662. The spoofing gradient contributed 0.033 — approximately 4.8% of the total.

The same decomposition for same-family spoofing yields \(r(\text{Conv}_T, \text{Conv}_{X,\text{Qwen1.5}}) = -0.993\) and \(R^2 = 0.986\). The anti-correlation is nearly perfect: every step toward the teacher is a step away from the same-family decoy.
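The alignment diagnostic and the pure-KD linear prediction can be sketched as follows. Toy 2-D templates stand in for the 7-D residual space; the function names and numbers are illustrative, not the experimental pipeline.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cos_alignment(T_A, T_X, T_S0):
    """cos(theta_X) between the KD direction v_T = T_A - T_S0
    and the decoy direction v_X = T_X - T_S0."""
    v_T, v_X = sub(T_A, T_S0), sub(T_X, T_S0)
    return dot(v_T, v_X) / (norm(v_T) * norm(v_X))

def predicted_conv_X(conv_T, T_A, T_X, T_S0):
    """Pure-KD linear prediction:
    Conv_X ~= Conv_T * cos(theta_X) * ||v_T|| / ||v_X||."""
    v_T, v_X = sub(T_A, T_S0), sub(T_X, T_S0)
    return conv_T * cos_alignment(T_A, T_X, T_S0) * norm(v_T) / norm(v_X)

# Toy templates (hypothetical numbers): decoy a few degrees off the KD trajectory
T_S0 = [0.0, 0.0]          # undistilled baseline
T_A  = [1.0, 0.0]          # teacher direction from baseline
T_X  = [0.8, 0.1]          # nearly aligned decoy
print(round(cos_alignment(T_A, T_X, T_S0), 3))        # → 0.992
print(round(predicted_conv_X(0.5, T_A, T_X, T_S0), 3))
```

With the decoy almost parallel to the KD trajectory, closing half the distance to the teacher already predicts substantial apparent "spoofing convergence" with zero spoofing gradient — the mechanism behind the illusion described above.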

5.4 Capability Topology vs. Corporate Lineage

The geometric explanation requires examining the reference distances in PPP-residual space:

| Pair | \(d\) | Relationship |
| --- | --- | --- |
| Teacher (Qwen-7B) \(\leftrightarrow\) Llama-1B | 0.296 | Neighbors |
| Baseline (Qwen-0.5B) \(\leftrightarrow\) Qwen-1.5B | 0.238 | Siblings |
| Baseline (Qwen-0.5B) \(\leftrightarrow\) Teacher (Qwen-7B) | 1.546 | Distant |
| Teacher (Qwen-7B) \(\leftrightarrow\) Qwen-1.5B | 1.731 | Very distant |
| Llama-1B \(\leftrightarrow\) Qwen-1.5B | 1.526 | Very distant |

The geometric summary for each decoy:

| Decoy | \(d(B, X)\) | \(d(T, X)\) | \(\cos\theta_X\) | Spoofing outcome |
| --- | --- | --- | --- | --- |
| Llama-1B (cross-family) | 1.323 | 0.296 | \(+0.991\) | Illusory success |
| Qwen-1.5B (same-family) | 0.238 | 1.731 | \(-0.747\) | Categorical failure |

Spoofing outcome is fully determined by \(\cos\theta_X\) and the decoy's proximity to the teacher; it does not require an explicit spoofing loss. Provider labels do not induce clean clustering in this space. The teacher (Qwen-7B) sits closer to Llama-1B (\(d = 0.296\)) than to its own sibling Qwen-1.5B (\(d = 1.731\)) — a \(5.8\times\) ratio. Models cluster by capability topology: the 7B-class models (Qwen-7B and Llama-1B) form one cluster regardless of vendor, while the sub-2B models (Qwen-0.5B and Qwen-1.5B) form another. The spoofing results follow deterministically from this geometry. Knowledge distillation moves the student from the small-model cluster toward the large-model cluster. Llama-1B is already in the large-model cluster — the student approaches it as a geographic side effect. Qwen-1.5B is in the small-model cluster — every step toward the teacher is a step away from it.

5.5 Spoofing Feasibility as a Geometric Prediction

The alignment diagnostic provides a predictive criterion for spoofing feasibility, computable from reference templates before any adversarial training is performed:

- If \(\cos\theta_X > 0\): the decoy lies in the same half-space as the teacher relative to the baseline. Spoofing will appear to succeed, but as a geographic coincidence — pure KD does the work.
- If \(\cos\theta_X < 0\): the decoy lies in the opposite half-space. KD and spoofing gradients are anti-aligned. Spoofing will fail regardless of \(\lambda\).

The sign prediction is correct for 54/54 Llama checkpoints and 54/54 Qwen-1.5B checkpoints (100% accuracy). The continuous prediction (geometric pure-KD decomposition) achieves \(R^2 = 0.995\) (Llama) and \(R^2 = 0.986\) (Qwen-1.5B).

This diagnostic has implications beyond the specific models tested here. For any teacher-student-decoy triple, the feasibility of spoofing is determined by the angular relationship between the KD trajectory and the decoy direction. An adversary's chance of successful spoofing depends not on their gradient access but on the pre-existing geometric configuration of the models in PPP-residual space — a configuration they cannot control.

6. The Invariant Layer

Across all 54 adversarial checkpoints — spanning erasure (\(\lambda\) from 0.1 to 10.0), cross-family spoofing, same-family spoofing, and passive fine-tuning — the normalized third logit gap remains invariant:

\[\delta_\text{norm}: \quad \text{mean} = 0.3126, \quad \text{CV} = 1.9\%, \quad \text{range} = [0.302, 0.329]\]

This is consistent with the Gumbel-class prediction of 0.318 [Coslett, 2026a] and represents the deepest confirmation of EVT universality to date: \(\delta_\text{norm} \approx 0.318\) survives adversarial attack on the functional layer, combined with knowledge distillation, at penalty strengths spanning two orders of magnitude. The thermodynamic structure of the output layer is not learned — it is emergent from the interaction between vocabulary geometry and the cross-entropy training objective. No training-time intervention tested here was able to perturb it.
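For reference, the dispersion statistic used throughout is the ordinary coefficient of variation. A minimal sketch, using hypothetical \(\delta_\text{norm}\) readings (the values below are illustrative, not the measured checkpoints):

```python
import math

def coefficient_of_variation(values):
    """CV = population standard deviation / mean,
    the dispersion statistic used for the invariance check."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var) / mean

# Hypothetical delta_norm readings across adversarial checkpoints
deltas = [0.312, 0.318, 0.305, 0.321, 0.309, 0.316]
print(f"{coefficient_of_variation(deltas):.1%}")  # → 1.7%
```

A CV of a few percent around the Gumbel-class constant, rather than any systematic drift with \(\lambda\) or training protocol, is what the invariance claim amounts to operationally.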

Figure · The Three-Layer Security Hierarchy. Layer 1, \(g_\text{norm}\) (weights): \(1\)–\(5 \times \varepsilon\), invariant to all tested training — The Vault. Layer 2, PPP residual (API): erased in 1–2 SFT epochs — The Tripwire. Layer 3, \(\delta_\text{norm}\) (universal): CV 1.9% across 54 adversarial checkpoints — The Species Marker.

7. Discussion

7.1 The Three-Layer Security Hierarchy

The combined results of Experiments I–III establish a three-layer hierarchy of neural network identity:

| Layer | Observable | Resilience | Forensic Role |
| --- | --- | --- | --- |
| 1. Structural (\(g_\text{norm}\)) | Weight geometry | Invariant to all tested training | The Vault |
| 2. Functional (PPP residual) | API logprob geometry | Transfers via KD; erased by passive SFT | The Tripwire |
| 3. Thermodynamic (\(\delta_\text{norm}\)) | Output-layer universality | Indestructible across all interventions | The Species Marker |

Layer 1 answers "what model is this?" — with formal impossibility guarantees [Coslett, 2026a, NoSpoofing.v] and empirical confirmation across 18 distillation checkpoints (\(\geq 726 \times \varepsilon\) from teacher in all cases). This layer requires access to model weights.

Layer 2 answers "who taught this model?" — through the teacher's functional fingerprint, which transfers 31–52% during distillation even through aggressive information bottlenecks. This layer is accessible through API logprob interfaces, but is transient: passive fine-tuning erases it within 1–2 epochs. Its forensic value is therefore time-sensitive. A conditional impossibility theorem [Coslett, 2026b, APINoSpoofing.v; 41 theorems, 0 Admitted] proves that standalone spoofing of this layer is KL-budget-exhausting under explicit trust assumptions.

Layer 3 confirms "is this a neural language model?" — the \(\delta_\text{norm} \approx 0.318\) universality constant identifies any model as belonging to the Gumbel class, but does not distinguish between models. It serves as a species marker: a sanity check that the system under observation is a language model rather than an impersonator, lookup table, or corrupted deployment.

7.2 Operational Implications

The Pareto frontier of §4.5 translates directly into deployment guidance. Because routine continued training passively erases the functional trace within 1–2 epochs, provenance detection must be deployed as a continuous monitoring system. The detection window is measured in training epochs (a function of dataset size and learning rate), not in calendar time. An enterprise seeking to detect unauthorized distillation of its models must scan suspected adversary APIs immediately upon model release, while the trace is fresh.

The three-layer hierarchy suggests a defense-in-depth architecture. When weights are accessible (insider threat, compliance audit, regulatory inspection), Layer 1 provides immutable ground truth. When only API access is available (supply-chain monitoring, competitive intelligence, marketplace compliance), Layer 2 provides time-sensitive evidence. Layer 3 provides a universal sanity check applicable in both regimes.
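The routing logic of this defense-in-depth architecture can be sketched as a small decision procedure. The layer names follow the hierarchy of §7.1, but the function names and the two-epoch staleness warning threshold are illustrative assumptions, not the paper's implementation:

```python
from enum import Enum

class Access(Enum):
    WEIGHTS = "weights"   # insider threat, compliance audit, regulation
    API = "api"           # supply-chain / marketplace monitoring

def forensic_plan(access, epochs_since_release=None):
    """Choose which identity layers to check given the access regime.
    Layer 3 (delta_norm) applies in both regimes as a species-marker
    sanity check; the staleness threshold mirrors the 1-2 epoch erasure
    window but is an illustrative constant here."""
    plan = ["layer3_delta_norm_sanity_check"]
    if access is Access.WEIGHTS:
        plan.append("layer1_g_norm_vault")          # immutable ground truth
    else:
        plan.append("layer2_ppp_tripwire")          # time-sensitive evidence
        if epochs_since_release is not None and epochs_since_release > 2:
            plan.append("warn_trace_likely_erased")  # trace may be stale
    return plan
```

The key operational point encoded here is that the API-regime check degrades with the suspect model's training progress, while the weights-regime check does not.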

7.3 Limitations

Scale. All experiments use models between 0.5B and 7.6B parameters. Whether the Two-Layer Identity Hypothesis holds at frontier scale (100B+) remains an open question. The theoretical framework makes no scale-dependent assumptions — the Gumbel PPP structure depends on effective competitor count (a function of vocabulary size), not parameter count — but empirical confirmation at frontier scale has not been performed.

Families tested. The spoofing geometry (§5) was tested with three models from two families (Qwen, Llama). The alignment diagnostic's predictive power should be validated across additional families, particularly between models of similar capability from different vendors. The claim that "corporate lineage is functionally incoherent" currently rests on a single geometry in a low-dimensional space.

Erasure attack surface. We test L2-based erasure against the PPP-residual template. A learned discriminator (GAN-style adversarial training) might find more efficient erasure paths than the hinge-based penalty function used here. Post-hoc activation steering — modifying inference-time representations without retraining — remains untested.

Single teacher-student pair. The adversarial stress test (§4–5) uses a single teacher-student pair (Qwen-7B → Qwen-0.5B). Generalization across teacher-student size ratios, architecture mismatches, and multi-teacher distillation scenarios is needed.

Temporal resolution. Provenance erasure is quantified in epochs. The operational metric — erasure in tokens, gradient steps, or wall-clock time — requires calibration across training configurations.

7.4 Broader Implications

The results of this paper establish a new class of security guarantee: geometric constraints on adversarial behavior. The adversary's failure is not due to insufficient compute, clever defenses, or information asymmetry. It is due to the mathematical structure of the optimization landscape. The knowledge distillation gradient and the erasure gradient point in competing directions, and the former dominates because it is the primary training signal. This is a structural property of the learning problem, not an artifact of our experimental configuration.

The capability topology observation (§5.4) — that models cluster by capability rather than corporate lineage in PPP-residual space — may have implications beyond forensics. It suggests that the "competitive landscape" of neural network outputs has a geometric structure that reflects functional similarity rather than provenance. Understanding this structure could inform model evaluation, benchmark design, and the study of emergent capabilities across model families.

8. Conclusion

Knowledge distillation creates a two-layer identity: the structural fingerprint is immovable (\(\geq 726 \times \varepsilon\) from teacher across all 18 protocols), while the functional fingerprint predictably transfers (up to 52% convergence). An adversary with full white-box access cannot erase the functional trace faster than uninformed passive training, and faces an inescapable Pareto frontier between capability preservation and trace removal. The adversary's \(100\times\) \(\lambda\) sweep bought one PPL point of capability advantage while retaining twice the forensic trace of the cheapest passive alternative.

Apparent spoofing success is a spatial illusion. Models cluster by capability, not corporate lineage, and cross-family "spoofing" follows deterministically from the 7.8° alignment between the KD trajectory and the decoy direction (\(R^2 = 0.995\)). The adversary's spoofing gradient contributed 4.8% of the headline result. Same-family spoofing is anti-aligned (\(\cos\theta = -0.747\)) and fails categorically.

Across all 54 adversarial checkpoints, the Gumbel universality constant (\(\delta_\text{norm}\)) remains invariant at CV = 1.9%, confirming that the thermodynamic layer of model identity is not addressable by training-time interventions. Under gradient-based logit matching with L2 erasure objectives across two orders of magnitude of penalty strength, the geometry forbids it.

References


Anthropic. (2026). Detecting and preventing distillation attacks. Anthropic Blog, February 24, 2026. https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

Coslett, A. R. (2026a). The \(\delta\)-Gene: Inference-Time Physical Unclonable Functions from Architecture-Invariant Output Geometry. Zenodo. DOI: 10.5281/zenodo.18704275.

Coslett, A. R. (2026b). Template-Based Endpoint Verification via Logprob Order-Statistic Geometry. Zenodo. DOI: 10.5281/zenodo.18776711.

de Haan, L. and Ferreira, A. (2006). Extreme Value Theory: An Introduction. Springer.

Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531.

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., and Goldstein, T. (2023). A watermark for large language models. ICML 2023.

Leadbetter, M. R., Lindgren, G., and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer.

Pappu, R., Recht, B., Taylor, J., and Gershenfeld, N. (2002). Physical one-way functions. Science, 297(5589):2026–2030.

Resnick, S. I. (1987). Extreme Values, Regular Variation, and Point Processes. Springer.

Shao, S., Li, Y., He, Y., Yao, H., Yang, W., Tao, D., and Qin, Z. (2025). SoK: Large language model copyright auditing via fingerprinting. arXiv:2508.19843.

Shao, S., Li, Y., Yao, H., Chen, Y., Yang, Y., and Qin, Z. (2026). Reading between the lines: Towards reliable black-box LLM fingerprinting via zeroth-order gradient estimation. The ACM Web Conference (WWW 2026).

Suh, G. E. and Devadas, S. (2007). Physical unclonable functions for device authentication and secret key generation. DAC 2007.

Yoon, D., Chun, M., Allen, T., Müller, H., Wang, M., and Sharma, R. (2025). Intrinsic fingerprint of LLMs: Continue training is NOT all you need to steal a model! arXiv:2507.03014.

Zhang, J., Liu, D., Qian, C., Zhang, L., Liu, Y., Qiao, Y., and Shao, J. (2025). REEF: Representation encoding fingerprints for large language models. ICLR 2025.

Acknowledgments

Portions of this research were developed in collaboration with AI systems that served as co-architects for experimental design, adversarial review, and manuscript preparation. All scientific claims, experimental designs, measurements, and editorial decisions remain the sole responsibility of the author.

Patent Disclosure

The structural measurement protocol and adversarial erasure methodology described in this work operate within the scope of U.S. Provisional Patent Applications 63/982,893 and 63/990,487. Both provisional patents are assigned to Fall Risk AI, LLC.

Appendix A. Notation

Symbol | Definition | Introduced
\(G_k = z_{(k)} - z_{(k+1)}\) | \(k\)-th logit gap (order statistics, descending) | §2.1
\(\delta_\text{norm} = G_3 / (G_2 + G_3 + G_4)\) | Normalized third gap (Gumbel constant \(\approx 0.318\)) | §2.1
\(g_\text{norm}\) | Structural identity observable (weights regime) | §2.1
\(T_m\) | PPP-residualized template for model \(m\) (API regime) | §2.1
\(\hat{\beta}\) | Robust tail scale estimator | §2.1
\(r_k = G_k - \hat{\beta}/k\) | PPP residual at rank \(k\) | §2.1
\(\varepsilon\) | IT-PUF acceptance threshold [Coslett, 2026a] | §3.2
\(\text{Conv}_T\) | Template convergence toward teacher | §2.3
\(\text{Conv}_X\) | Template convergence toward decoy \(X\) | §2.3
\(\lambda\) | Adversarial penalty weight | §4.2
\(\mathcal{L}_\text{KD}\) | Knowledge distillation loss | §4.2
\(\mathcal{L}_\text{erase}\) | Erasure loss (hinge, toward baseline) | §4.2
\(\mathcal{L}_\text{spoof}\) | Spoofing loss (hinge, toward decoy) | §5.1
\(\cos\theta_X\) | Cosine alignment between KD direction and decoy direction | §5.3
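The observables \(r_k\) and \(\text{Conv}_T\) from this table admit a compact sketch. The tail-scale fit below is a simple least-squares stand-in for the paper's robust estimator \(\hat{\beta}\) (which may differ), and the rank cutoff `k_max` is an illustrative choice:

```python
import numpy as np

def ppp_residual_template(gaps, k_max=20):
    """PPP residuals r_k = G_k - beta_hat / k over the top k_max ranks.
    beta_hat is fit by least squares to the model G_k ~ beta / k
    (through the origin); an assumption, not the paper's robust estimator."""
    g = np.asarray(gaps, dtype=float)[:k_max]
    k = np.arange(1, len(g) + 1)
    x = 1.0 / k
    beta_hat = float(np.dot(x, g) / np.dot(x, x))
    return g - beta_hat / k

def convergence(t_student, t_teacher, t_baseline):
    """Conv_T: fractional movement of the student's template from the
    baseline toward the teacher (1.0 = coincides with teacher template)."""
    d0 = np.linalg.norm(t_baseline - t_teacher)
    d1 = np.linalg.norm(t_student - t_teacher)
    return 1.0 - d1 / d0
```

Residualizing against the fitted \(\hat{\beta}/k\) tail removes the universal Gumbel decay, so what remains is the model-specific fingerprint the convergence metric compares.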

Appendix B. Complete Experimental Matrix

B.1 Distillation Experiment (18 checkpoints)

Variant | Epochs | Teacher | Student | Notes
A1 | 3 | Qwen-7B | Qwen-0.5B | High-bandwidth KD (\(K = 200\))
B1 | 3 | Qwen-7B | Qwen-0.5B | Top-20 KD
B2 | 3 | Qwen-7B | Qwen-0.5B | Top-7 KD
C1 | 3 | Qwen-7B | Llama-1B | Cross-tokenizer SFT
D1 | 3 | Qwen-0.5B | Qwen-0.5B | Self-distillation control
E1 | 3 | Qwen-7B | Qwen-0.5B | Shuffled-logits control

B.2 Adversarial Stress Test (54 checkpoints)

Variant | Category | \(\lambda\) | Epochs | Target
A1 | Control | 0 | 3 | N/A
F1 | Erasure | 0.1 | 3 + 2 ext. | Baseline
F2 | Erasure | 0.3 | 3 | Baseline
F3 | Erasure | 1.0 | 3 | Baseline
F4 | Erasure | 3.0 | 3 | Baseline
F5 | Erasure | 10.0 | 3 | Baseline
G1–G5 | Spoof (cross) | 0.1–10.0 | 3 | Llama-1B
G6–G10 | Spoof (same) | 0.1–10.0 | 3 | Qwen-1.5B
H1 | Passive SFT | N/A | 2 | Alpaca
H2 | Passive SFT | N/A | 2 | wikitext

B.3 Epistemological Classification

Claim | Status | Evidence
Structural identity invariance to distillation | VALIDATED | 18 checkpoints, 2 regimes (§3)
Functional identity transfer | VALIDATED | 18 checkpoints, monotonic gradient (§3)
KD dominance over adversarial erasure | VALIDATED | 54 checkpoints, sawtooth pattern (§4)
Passive SFT beats adversarial erasure | VALIDATED | Held-out C4 benchmark, 10 checkpoints (§4)
Alignment diagnostic (\(\cos\theta\), \(R^2\)) | VALIDATED | 54-checkpoint regression (§5)
Capability topology \(\neq\) corporate lineage | VALIDATED | Reference distance matrix (§5)
\(\delta_\text{norm}\) adversarial invariance | VALIDATED | CV = 1.9%, 54 checkpoints (§6)
Weights-regime impossibility | PROVEN | NoSpoofing.v, 51 theorems, 0 Admitted [2026a]
API-regime conditional impossibility | PROVEN | APINoSpoofing.v, 41 theorems, 0 Admitted [2026b]
Gumbel universality (\(\delta_\text{norm} \approx 0.318\)) | PROVEN + VALIDATED | DeltaGap.v + 9 models + 54 adversarial [2026a]

All claims classified following the epistemological framework established in [Coslett, 2026a, §8.3]: PROVEN (Coq-checked), CITED (published math), DERIVED (computed from proven/cited), VALIDATED (empirical with evidence). No VALIDATED claim is presented as PROVEN.

Cite this paper

A. R. Coslett, "The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing," Paper III, Fall Risk AI, LLC, March 2026. DOI: 10.5281/zenodo.18818608