Abstract
Prior work established that knowledge distillation transfers a detectable provenance trace from teacher to student models, and that API endpoint verification can identify models through logprob order-statistic geometry. Both results were demonstrated on a single teacher–student pair and a six-model API zoo, leaving open whether provenance detection generalizes across model families and whether API verification scales to production-density endpoint populations. We address both questions through a coordinated experimental program spanning four studies. In the first study, we train 24 distilled checkpoints across 7 experimental arms — 3 teacher families (Qwen, Mistral, Llama), 4 student architectures (Qwen-0.5B, Qwen-1.5B, Llama-1B, Gemma-2B), and 2 training protocols (logit-level knowledge distillation and cross-tokenizer supervised fine-tuning) — measuring provenance transfer in both the weight-geometry and API-logprob regimes. Provenance transfer generalizes across the tested matrix: all 14 mature-epoch checkpoints show directional coupling to the teacher (cosine alignment \(\cos\theta > 0.8\), with 13 of 14 exceeding 0.85). The strongest signal arises in a cross-family arm (Mistral-7B \(\to\) Llama-1B, scalar convergence 0.858) that is inconsistent with a purely family-restricted transfer hypothesis within the tested matrix. The normalized third logit gap \(\delta_{\mathrm{norm}}\) remains within 1.4% coefficient of variation across all 31 checkpoints and 4 student architectures — the tightest confirmation of Gumbel-class universality in this experimental program. An extension to a mixture-of-experts architecture (Mixtral-8x7B, \(\delta_{\mathrm{norm}} = 0.309\)) confirms that the universal constant persists under sparse expert routing. In the second study, we identify a systematic failure mode of scalar provenance metrics and introduce the geometrically correct directional diagnostic for provenance detection in inner-product spaces.
The standard scalar convergence metric \(\mathrm{Conv}_T\) conflates direction and magnitude into a single value, discarding the directional information that provenance detection requires. In two independent experiments, this produced misleading conclusions: a false spoofing signal (\(R^2 = 0.995\) of apparent cross-family convergence explained by pure knowledge distillation geometry, with the adversarial gradient contributing 4.8%) and a false failure signal (negative \(\mathrm{Conv}_T\) despite consistent directional coupling at \(\cos\theta = 0.91\)). The alignment diagnostic applies the law of cosines in PPP-residual template space (vectors in \(\mathbb{R}^K\) with Euclidean distance) to decompose student movement into direction and magnitude, preserving the provenance signal that scalar distance metrics destroy. We establish a measurability threshold: when the baseline-to-teacher distance \(d(B,T)\) falls below approximately 1.0, scalar \(\mathrm{Conv}_T\) becomes unreliable and the directional diagnostic becomes the primary metric. This diagnostic applies to any distillation forensics framework that measures convergence in an inner-product space. In the third study, we extend API endpoint verification from 6 models to 14 across 3 commercial providers (OpenAI, Google Vertex AI, xAI), observing zero breaches across 182 pairwise impostor comparisons under per-model adaptive thresholds and three independent enrollment sessions, using a centroid reference protocol (CRP) that replaces the centroid \(L^2\) metric, which produces false breaches at 14-model density.
We establish a minimum truncation floor: API endpoints exposing fewer than 7 logprob ranks cannot support reliable verification (signal collapses within one rank of this boundary). Speculative decoding — an increasingly common inference optimization — is shown to be transparent to the verification protocol, with the speculative-decoded fingerprint deviating from the verifier-only fingerprint by 10.6% of the inter-model distance. Finally, we formalize the Trust Paradox in model forensics — a victim cannot prove weight theft without disclosing weights, and a suspect cannot prove innocence without disclosing training data — and propose a three-tier zero-knowledge attestation architecture that addresses it. The first tier (committed distance proof) enables a model owner to prove fingerprint proximity to a public anchor without revealing the fingerprint vector, using standard cryptographic commitments with verifier-controlled thresholds. The second tier (hardware-attested measurement) removes the requirement that the prover be trusted to compute the fingerprint correctly, binding the measurement to a trusted execution environment attestation. The third tier (full zero-knowledge extraction) eliminates all trust assumptions beyond cryptographic soundness: the prover demonstrates, in zero knowledge, that the committed fingerprint was correctly extracted from committed weights using the specified measurement algorithm. The architecture defines eight properties that a meaningful zero-knowledge model identity proof must satisfy — extending the formal verification doctrine (311 + 41 = 352 theorems across 17 Coq proof files [1, 2], 0 Admitted) into the cryptographic regime — and six explicit trust assumptions under which the proof statements hold. 
All three tiers are validated. Tier 1 (committed distance proof) has been implemented and hardened. Tier 2 (hardware-attested measurement) has been validated on production confidential computing hardware: 6 models, 1,536 measurements, 0 failures inside an H100 trusted execution environment, with both CPU and GPU attestation tokens bound to a common cryptographic root and structural fingerprints transparent to confidential computing mode. Tier 3 (full zero-knowledge extraction) has been validated: a complete circuit has been compiled and audited, all four pre-registered falsification criteria have been met, and the proof system operates within practical proving-time and proof-size bounds. The breakthrough discoveries enabled by Tier 3 validation, including an identity-conditioned inference verification architecture, are reported in the companion paper [6]. The experimental results in this paper are grounded in the formal verification stack and measurement infrastructure described in the companion papers [1, 2, 3]. All provenance claims are classified as VALIDATED (empirical), and all three tiers are classified VALIDATED. The authentication protocol and measurement methodology are the subject of U.S. Provisional Patent Applications 63/982,893 and 63/990,487; the zero-knowledge attestation architecture is the subject of U.S. Provisional Patent Applications 63/996,680 and 64/003,244.
1. Introduction
1.1 The State of Play
On February 24, 2026, Anthropic publicly disclosed that three AI laboratories had conducted industrial-scale knowledge distillation campaigns against Claude, generating over 16 million exchanges through approximately 24,000 fraudulent accounts [4]. The campaigns targeted frontier capabilities including agentic reasoning, tool use, and coding, with the explicit goal of training competing models on stolen outputs. One campaign was detected while still active, providing unprecedented visibility into the lifecycle of a distillation attack. This disclosure made operationally urgent three questions that had previously been theoretical:

1. Does provenance detection generalize beyond the single teacher–student pair tested in prior work, or is it specific to the particular model families involved?
2. Does API endpoint verification scale to the density of production endpoint populations, where dozens of models from multiple providers must be simultaneously distinguished?
3. Can forensic evidence be presented in legal proceedings without disclosing the very weights at issue — the Trust Paradox that currently prevents victims from seeking judicial remedies?

Prior work in this series addressed the foundational science.
In [1], we identified the \(\delta\)-gene (the third pre-softmax logit gap) as a temperature-invariant architectural fingerprint, constructed an Inference-Time Physical Unclonable Function (IT-PUF) — extending the physical unclonable function concept [8] to neural network inference — achieving zero false acceptances across 1,012 comparisons spanning 23 models and 16 vendor families, and formally verified the impossibility of fingerprint spoofing in Coq (311 theorems, 0 Admitted). In [2], we extended this framework to the API regime, demonstrating that Gumbel-class universality persists through commercial API logprob interfaces and establishing cross-session endpoint verification with per-model adaptive thresholds. In [3], we stress-tested provenance detection against a white-box adversary, establishing the Two-Layer Identity Hypothesis (structural identity invariant to distillation, functional identity partially transferable) and proving that adversarial erasure is geometrically coupled to the knowledge transfer objective — the adversary cannot erase the trace without also degrading the capabilities that distillation was meant to acquire. The present work answers the three scaling questions — where "scaling" encompasses both the technical dimension (from a single teacher-student pair to a multi-architecture matrix and production-density API endpoints) and the institutional dimension (from cooperative measurement in a research lab to adversarial forensic proceedings where no party is trusted). It is the infrastructure paper: Papers 1–3 proved the physics; this paper proves the engineering scales and the evidence can be made admissible.
1.2 Contributions
This paper makes four contributions.

Multi-architecture provenance detection (§2). We train 24 distilled checkpoints across 7 experimental arms spanning 3 teacher families, 4 student architectures, and 2 training protocols, demonstrating that provenance transfer generalizes across the tested matrix. The strongest signal arises in a cross-family arm with zero overlap with the original experimental configuration, ruling out a family-restricted transfer hypothesis within this design.

The alignment diagnostic (§3). We identify a systematic failure mode of scalar convergence metrics and apply the geometrically correct directional diagnostic — the law of cosines in the embedding space — to distillation forensics. The mathematical construction is standard in inner-product spaces; the contribution is recognizing the failure modes, demonstrating them empirically in two independent experiments where the scalar metric gave the wrong verdict, and establishing a measurability threshold below which scalar metrics are unreliable. We recommend this diagnostic for any forensic framework measuring convergence in Euclidean embedding spaces.

API verification at scale (§4). We extend the API zoo from 6 to 14 models across 3 providers, establishing a minimum truncation floor (\(K \geq 7\)), validating speculative decoding transparency, and demonstrating that the centroid reference protocol (CRP) achieves zero breaches where centroid \(L^2\) fails at production density.

Zero-knowledge model attestation (§5). We formalize the Trust Paradox and propose a three-tier zero-knowledge architecture with pre-registered falsification criteria. All three tiers — Tier 1 (committed distance proof), Tier 2 (hardware-attested measurement), and Tier 3 (full zero-knowledge extraction) — have been validated and are classified VALIDATED (§5).
1.3 Reader's Guide
This paper serves multiple audiences. Readers primarily interested in provenance detection should focus on §2–3, which present the multi-architecture generalization results and the alignment diagnostic methodology. Readers interested in API verification engineering should focus on §4, which addresses truncation sensitivity, inference optimization transparency, and zoo scaling. Readers interested in the trust and admissibility problem should focus on §5, which presents the zero-knowledge framework. §6 synthesizes the three toolsets and discusses limitations.
1.4 Notation and Conventions
We inherit notation from [1, Appendix A] with the following additions specific to this work. Provenance metrics. Given a student model \(S\) trained by distillation from teacher \(T\), with undistilled baseline \(B\), we define the scalar convergence metric:

\[\mathrm{Conv}_T = 1 - \frac{d(S,T)}{d(B,T)},\]
where \(d(\cdot, \cdot)\) is the \(L^2\) distance in PPP-residual template space. PPP-residual templates are vectors in \(\mathbb{R}^K\) (where \(K\) is the number of challenge prompts), and all distances, angles, and the law of cosines used throughout this paper are computed in this Euclidean space. \(\mathrm{Conv}_T = 0\) indicates no movement toward the teacher; \(\mathrm{Conv}_T = 1\) indicates complete convergence. Alignment diagnostic. The directional diagnostic applies the law of cosines at the baseline vertex of the triangle \((B, S, T)\):

\[\cos\theta = \frac{d(B,S)^2 + d(B,T)^2 - d(S,T)^2}{2\,d(B,S)\,d(B,T)},\]
where \(\theta\) is the angle between the baseline-to-teacher direction and the baseline-to-student direction. \(\cos\theta > 0\) indicates the student moved toward the teacher; \(\cos\theta < 0\) indicates movement away from the teacher. Operational thresholds. We adopt the following classification conventions for \(\cos\theta\). The thresholds are operational conventions, not derived from geometry; the reported verdicts are insensitive to the exact choice of the ALIGNED boundary across the range \([0.7, 0.9]\):
| \(\cos\theta\) | Classification |
|---|---|
| \(> 0.8\) | ALIGNED — operationally classified as directionally coupled |
| \(0.3\) to \(0.8\) | PARTIALLY ALIGNED — check later epochs |
| \(-0.3\) to \(0.3\) | ORTHOGONAL — ambiguous |
| \(< -0.3\) | ANTI-ALIGNED — investigate tokenizer disruption |
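As a concrete illustration, the two diagnostics and the classification bands above can be sketched as follows (a minimal sketch; the function names are ours, not the paper's):

```python
def conv_t(d_ST: float, d_BT: float) -> float:
    # Scalar convergence: 1 - d(S,T)/d(B,T); 1 = complete convergence, 0 = no movement.
    return 1.0 - d_ST / d_BT

def cos_theta(d_BS: float, d_BT: float, d_ST: float) -> float:
    # Law of cosines at the baseline vertex of the triangle (B, S, T).
    return (d_BS**2 + d_BT**2 - d_ST**2) / (2.0 * d_BS * d_BT)

def classify(c: float) -> str:
    # Operational bands from the table above.
    if c > 0.8:
        return "ALIGNED"
    if c >= 0.3:
        return "PARTIALLY ALIGNED"
    if c >= -0.3:
        return "ORTHOGONAL"
    return "ANTI-ALIGNED"
```

For a student lying exactly on the segment from \(B\) to \(T\) (so that \(d(B,S) + d(S,T) = d(B,T)\)), `cos_theta` evaluates to 1: direction is perfectly aligned regardless of how far along the segment the student has moved.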
Centroid reference protocol (CRP). For API verification with \(K\) challenge prompts, the CRP distance between two enrollments \(A\) and \(B\) is:
where \(t_k^A\) is the PPP-residual template for prompt \(k\) from enrollment \(A\). This per-prompt averaging preserves discriminative structure that centroid \(L^2\) destroys at higher zoo densities [2, §6]. Epistemological status. Every claim in this paper is classified according to the status categories defined in [1, §8.3]: PROVEN (Coq theorem), CITED (published mathematics), DERIVED (computed from PROVEN/CITED), VALIDATED (empirical with evidence), or PROPOSED (framework without experimental confirmation). We never present VALIDATED as PROVEN.

Figure 1 (measurement pipeline, schematic). Raw logits or top-\(K\) logprobs are ranked into gaps \(G_1, \dots, G_K\); \(\hat\beta\) is estimated and the PPP prediction subtracted (\(\mathrm{residual}_k = G_k - \hat\beta/k\)), yielding per-prompt PPP-residual templates \(t \in \mathbb{R}^K\); templates are aggregated across prompts and compared by distance (\(L^2\) in the weights regime, CRP in the API regime) to produce the decision pair \(\mathrm{Conv}_T\) (scalar) and \(\cos\theta\) (directional). The pipeline forks into two tracks, reflecting the Two-Layer Identity: an API track (top-\(K\) logprobs \(\to\) gap residuals \(\to\) CRP distance \(\to\) \(\cos\theta\)) and a weights track (internal activations \(\to\) \(g_\mathrm{norm}\) \(\tau\) vector \(\to\) \(L^2\) distance \(\to\) \(\varepsilon\) threshold).
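A minimal sketch of the per-prompt averaging that distinguishes CRP from the centroid \(L^2\) baseline it replaces. The exact functional form is our reading of the definition above, and `crp_distance` and `centroid_l2` are illustrative names:

```python
import numpy as np

def crp_distance(A: np.ndarray, B: np.ndarray) -> float:
    # A, B: (num_prompts, K) arrays of per-prompt PPP-residual templates.
    # Per-prompt averaging: mean over prompts of the per-prompt L2 distance.
    return float(np.mean(np.linalg.norm(A - B, axis=1)))

def centroid_l2(A: np.ndarray, B: np.ndarray) -> float:
    # The metric CRP replaces: collapse each enrollment to its centroid first.
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))

# Two enrollments whose per-prompt differences cancel at the centroid:
# centroid L2 reports 0 (indistinguishable), CRP reports 2 (clearly distinct).
enroll_A = np.array([[ 1.0, 0.0], [-1.0, 0.0]])
enroll_B = np.array([[-1.0, 0.0], [ 1.0, 0.0]])
```

The toy pair at the bottom illustrates how averaging before comparing can destroy exactly the structure that discriminates two endpoints, which is the failure mode attributed to centroid \(L^2\) at higher zoo densities.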
2. Multi-Architecture Provenance Detection
2.1 Experimental Design
Paper 3 [3] established provenance detection for a single configuration: Qwen-7B-Instruct as teacher, Qwen-0.5B-Instruct as student, logit-level knowledge distillation. The obvious next question is whether this result survives variation in the teacher, the student, and the training protocol. We designed a 7-arm experiment crossing three teacher families (Qwen-7B-Instruct, Mistral-7B-Instruct, Llama-3.1-8B-Instruct), four student architectures (Qwen-0.5B, Qwen-1.5B, Llama-1B, Gemma-2B), and two training protocols (logit-level knowledge distillation and cross-tokenizer supervised fine-tuning on teacher-generated text). A self-distillation control (Llama-1B fine-tuned on its own outputs) provides a null baseline where no foreign provenance signal should appear.

Table 1 shows the experimental matrix. Arms A1 and B1 use logit-level KD [5] (the teacher's output distribution is directly available to the student during training). All other arms use cross-tokenizer SFT: the teacher generates text, and the student is fine-tuned on that text as a standard supervised task — the training protocol most relevant to the industrial distillation campaigns disclosed in [4], where the attacker has API access but not weight access.

Table 1. Experimental matrix. Each arm is trained for 3 epochs. Model names are abbreviated (Llama-8B = Llama-3.1-8B-Instruct, etc.). The \(d(B,T)\) column gives the PPP-residual distance between the student's undistilled baseline and the teacher, which governs the reliability of scalar \(\mathrm{Conv}_T\) (see §3).
| Arm | Teacher | Student | Protocol | \(d(B,T)\) |
|---|---|---|---|---|
| A1 | Qwen-7B | Qwen-0.5B | Logit KD | 1.581 |
| A2 | Mistral-7B | Qwen-0.5B | Cross-tok SFT | 3.304 |
| A3 | Llama-8B | Qwen-0.5B | Cross-tok SFT | 0.674 |
| B1 | Qwen-7B | Qwen-1.5B | Logit KD | 1.721 |
| B2 | Qwen-7B | Llama-1B | Cross-tok SFT | 0.331 |
| B3 | Qwen-7B | Gemma-2B | Cross-tok SFT | 2.207 |
| C1 | Mistral-7B | Llama-1B | Cross-tok SFT | 2.007 |
| D1 | Llama-1B | Llama-1B | Self-SFT (control) | \(\approx 0\) |
Training used standard hyperparameters within the budget class of 3 epochs and conventional KD temperature ranges. All 24 checkpoints (7 arms \(\times\) 3 epochs, plus D1 control \(\times\) 3 epochs) were measured in the API-logprob regime using the same curated challenge bank and measurement pipeline validated in [2, 3]. Cross-experiment measurement reproducibility — computed by re-measuring the baseline-teacher distance \(d(B,T)\) for the same frozen models across experiments — is 2.2%, confirming that the measurement engine is stable across sessions and experimental runs.
2.2 Provenance Transfer Results (API Regime)
Table 2 presents the core results. For each arm, we report the best \(\mathrm{Conv}_T\) across the 3 training epochs, the corresponding alignment diagnostic \(\cos\theta\), and the baseline-to-teacher separation \(d(B,T)\) that governs metric reliability.

Table 2. Provenance transfer results. Best epoch shown for each arm (best \(\mathrm{Conv}_T\) for arms with positive convergence; best \(\cos\theta\) for \(\dagger\)-marked arms where scalar metrics are misleading — see §3). \(\mathrm{Conv}_T\) measures scalar convergence toward the teacher (1.0 = complete, 0.0 = none). \(\cos\theta\) measures directional alignment (1.0 = identical direction, 0.0 = orthogonal, \(-1.0\) = opposite).
| Arm | Teacher \(\to\) Student | \(\mathrm{Conv}_T\) | \(\cos\theta\) | \(d(B,T)\) | Epoch | Verdict |
|---|---|---|---|---|---|---|
| A1 | Qwen-7B \(\to\) Qwen-0.5B | +0.437 | 0.999 | 1.581 | 3 | ALIGNED |
| A2 | Mistral-7B \(\to\) Qwen-0.5B | +0.525 | 0.998 | 3.304 | 2 | ALIGNED |
| A3\(^\dagger\) | Llama-8B \(\to\) Qwen-0.5B | \(-0.907\) | 0.910 | 0.674 | 3 | ALIGNED |
| B1 | Qwen-7B \(\to\) Qwen-1.5B | +0.467 | 0.999 | 1.721 | 2 | ALIGNED |
| B2\(^\dagger\) | Qwen-7B \(\to\) Llama-1B | +0.505 | 0.873 | 0.331 | 2 | ALIGNED |
| B3 | Qwen-7B \(\to\) Gemma-2B | +0.727 | 1.000 | 2.207 | 2 | ALIGNED |
| C1 | Mistral-7B \(\to\) Llama-1B | +0.858 | 0.993 | 2.007 | 3 | ALIGNED |
| D1 | Llama-1B \(\to\) Llama-1B | — | — | \(\approx 0\) | — | NULL |
\(\dagger\) Arms where scalar \(\mathrm{Conv}_T\) is misleading and the alignment diagnostic provides the correct verdict (§3). For A3, \(\mathrm{Conv}_T\) is negative at all three epochs (\(-0.089\), \(-1.844\), \(-0.907\)) due to overshoot (§3.1, Case 2), while \(\cos\theta\) remains stably aligned (0.892, 0.868, 0.910). Table shows epoch 3 (best alignment).

The central result: all 7 distillation arms show directional coupling to the teacher at their best mature epoch (\(\cos\theta > 0.85\) for all 7 arms). Across all 14 mature-epoch checkpoints (epochs 2 and 3 across all 7 arms), all are classified ALIGNED (\(\cos\theta > 0.8\)), with 13 of 14 exceeding 0.85 (§3.4). The self-distillation control D1 produces no convergence toward any external teacher, as expected. Arms share teachers and students across the matrix; generalization is demonstrated across the tested 7-arm design, not across independent arms.

Finding 1: Cross-family provenance transfer. The strongest scalar convergence (\(\mathrm{Conv}_T = 0.858\)) arises in arm C1 (Mistral-7B \(\to\) Llama-1B), which involves zero Qwen components — a configuration sharing no model family overlap with the original Paper 3 experiment. This result is inconsistent with a purely family-restricted transfer hypothesis within the tested matrix and demonstrates that provenance transfer is a property of the distillation process, not of any particular model lineage.

Finding 2: Protocol generality. Both logit-level KD (A1, B1) and cross-tokenizer SFT (A2, A3, B2, B3, C1) produce detectable provenance.
The cross-tokenizer SFT protocol is the more practically relevant: it mirrors the industrial attack vector disclosed in [4], where the attacker has only API access and generates text to use as training data.

Table 2 presents only the best epoch per arm, concealing the trajectory shape that distinguishes same-tokenizer KD from cross-tokenizer SFT. Table 3 shows per-epoch trajectories for the five arms where the trajectory is informative: one same-tokenizer KD arm for baseline (A1), the two low-separation arms rescued by the alignment diagnostic (A3, B2), and the two high-separation cross-tokenizer arms showing the disruption-recovery pattern (B3, C1).

Table 3. Per-epoch provenance trajectories. \(\mathrm{Conv}_T\) is the scalar convergence metric; \(\cos\theta\) is the alignment diagnostic. Same-tokenizer KD arm (A1) shown for contrast with cross-tokenizer SFT patterns.
| Arm | Epoch | \(\mathrm{Conv}_T\) | \(\cos\theta\) | Pattern |
|---|---|---|---|---|
| A1 | 1 | +0.215 | 0.952 | Same-tok: immediate alignment |
| A1 | 2 | +0.231 | 0.998 | Monotonic convergence |
| A1 | 3 | +0.437 | 0.999 | Continued convergence |
| A3\(^\dagger\) | 1 | \(-0.089\) | 0.892 | Aligned overshoot from epoch 1 |
| A3\(^\dagger\) | 2 | \(-1.844\) | 0.868 | Deep overshoot |
| A3\(^\dagger\) | 3 | \(-0.907\) | 0.910 | Partial recovery, still aligned |
| B2\(^\dagger\) | 1 | \(-1.358\) | \(-0.668\) | Cross-tok disruption |
| B2\(^\dagger\) | 2 | \(+0.505\) | 0.873 | Full recovery |
| B2\(^\dagger\) | 3 | \(-1.055\) | 0.850 | Overshoot (aligned) |
| B3 | 1 | \(+0.075\) | 0.780 | Weak initial alignment |
| B3 | 2 | \(+0.659\) | 1.000 | Strong convergence |
| B3 | 3 | \(+0.727\) | 0.997 | Continued convergence |
| C1 | 1 | \(-0.097\) | \(-0.810\) | Cross-tok disruption |
| C1 | 2 | \(+0.378\) | 0.961 | Full recovery |
| C1 | 3 | \(+0.858\) | 0.993 | Strongest convergence in matrix |
\(\dagger\) Low-separation arms (\(d(B,T) < 1.0\)) where \(\mathrm{Conv}_T\) is unreliable (§3). B1 (Qwen-7B \(\to\) Qwen-1.5B) follows the same monotonic pattern as A1, with \(\cos\theta > 0.92\) from epoch 1.

Three trajectory patterns emerge. The same-tokenizer KD arms (A1, B1) show immediate directional coupling from epoch 1, with \(\cos\theta > 0.92\) throughout — the student locks onto the teacher direction before substantial convergence occurs. The cross-tokenizer SFT arms (B2, C1) show epoch-1 anti-alignment that fully recovers by epoch 2: tokenizer adaptation dominates the first epoch, temporarily pushing the student away from the teacher direction, but the content-carrying provenance signal dominates from epoch 2 onward. The low-separation overshoot arms (A3, and B2 at epoch 3) show consistently positive \(\cos\theta\) masked by negative \(\mathrm{Conv}_T\) — the alignment diagnostic correctly diagnoses these as provenance transfer with overshoot rather than failure (§3).

Finding 3: Cross-tokenizer epoch-1 disruption. Three cross-tokenizer SFT arms with non-native students (B2, B3, C1) show reduced or negative alignment at epoch 1, recovering to full alignment by epoch 2. The pattern is systematic: tokenizer adaptation dominates the first epoch of cross-tokenizer training, temporarily disrupting the provenance signal. Same-tokenizer KD arms (A1, B1) show immediate alignment from epoch 1 (\(\cos\theta > 0.92\)). This has a concrete operational implication: provenance scans against suspected distillation should examine at least two training checkpoints before concluding absence of transfer. The temporal window for detection — how quickly subsequent fine-tuning erases the provenance signal — is characterized at epoch granularity in [3] but not yet at the hourly or daily timescales needed for continuous monitoring (§6.2).

Finding 4: Self-distillation null.
D1 (Llama-1B trained on its own outputs) cannot compute a meaningful \(\mathrm{Conv}_T\) because the baseline-to-teacher distance is approximately zero — the model is its own teacher. This confirms that the provenance signal measured in other arms is attributable to foreign teacher information, not to the training process itself.
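The overshoot pattern in the \(\dagger\)-marked arms can be reproduced in a toy example (our construction, not measured data): a student that moves past the teacher along the baseline-to-teacher direction drives \(\mathrm{Conv}_T\) negative while \(\cos\theta\) stays at exactly 1.

```python
import numpy as np

def conv_t(B, S, T):
    # Scalar convergence toward the teacher: 1 - d(S,T)/d(B,T).
    return 1.0 - np.linalg.norm(S - T) / np.linalg.norm(B - T)

def cos_theta(B, S, T):
    # Angle between the baseline->teacher and baseline->student directions.
    u, v = T - B, S - B
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

B = np.array([0.0, 0.0])
T = np.array([1.581, 0.0])   # arm A1's baseline-teacher separation, used here for scale
S = np.array([3.5, 0.0])     # student overshoots the teacher along the same direction
```

Here `conv_t(B, S, T)` is negative (apparent failure) while `cos_theta(B, S, T)` is exactly 1 (perfect directional alignment), which is the signature the alignment diagnostic is designed to recover.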
2.3 Structural Identity in the Weights Regime
Paper 1 [1] established that the weight-geometry observable \(g_\mathrm{norm}\) provides model identity verification with formal impossibility guarantees. Paper 3 [3] demonstrated that this structural layer is empirically invariant to distillation: all 18 tested checkpoints remained within a few multiples of the acceptance threshold \(\varepsilon\) from their undistilled baselines. We extend this validation to the multi-architecture matrix.

Of the 8 arms (7 distillation + D1 control), 5 produce weight-geometry distances from baseline within the noise floor of the measurement — consistent with Paper 3's invariance finding. One arm (B1, Qwen-7B \(\to\) Qwen-1.5B) shows modest elevation above the noise floor but far below verification tolerance — a same-family KD arm where the larger student (1.5B vs. 0.5B) has more parameters available for gradient modification of cross-token statistics, yet the modification remains negligible relative to inter-model separation. The D1 self-distillation control falls within the clean range, confirming that the training process alone does not perturb structural identity.

The exception is arm B3 (Qwen-7B \(\to\) Gemma-2B), which exhibits weight-geometry perturbation three orders of magnitude beyond any other architecture tested.
Critically, the movement is away from the teacher: distances to the teacher increase across epochs, while distances to the student's own baseline decrease, indicating the student is recovering toward its pre-training structural identity. The security claim — that distillation cannot transfer structural identity — holds for all arms, including B3. We hypothesize that B3's sensitivity is driven by Gemma-2's training-time logit soft-capping architecture (\(\mathrm{cap} \cdot \tanh(z/\mathrm{cap})\) with \(\mathrm{cap} = 30\)), which produces an effective weight-matrix rank of approximately 103 — substantially higher than uncapped architectures (e.g., Llama at \(\sim\)73). The higher-dimensional residual stream allows cross-tokenizer gradients to reshape cross-token statistics across a broader subspace, whereas low-rank architectures constrain gradient effects to a narrow manifold that the weight-geometry observable is insensitive to. This hypothesis is not validated in this paper; B3's anomaly is an architecture-specific structural sensitivity observation, not a protocol failure. A same-tokenizer teacher experiment on Gemma-2 would isolate the contribution of tokenizer mismatch from the effective-rank mechanism. The Two-Layer Identity Hypothesis [3] thus extends across the multi-architecture matrix: structural identity remains invariant (the Vault), functional identity partially transfers (the Tripwire). The separation is robust across all tested teacher-student-protocol combinations.
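The soft-capping mechanism invoked in the hypothesis above, together with one standard entropy-based notion of effective rank (the Roy–Vetterli definition; the paper does not specify which definition it uses, so this is an assumption), can be sketched as:

```python
import numpy as np

def soft_cap(z, cap: float = 30.0):
    # Gemma-2 style training-time logit soft-capping: cap * tanh(z / cap).
    # Near-linear for |z| << cap, saturating smoothly at +/- cap.
    return cap * np.tanh(z / cap)

def effective_rank(W: np.ndarray) -> float:
    # Entropy-based effective rank: exp of the Shannon entropy of the
    # normalized singular-value distribution. Assumes strictly positive
    # singular values (no exact zeros).
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p)).sum()))
```

Under this definition a \(d \times d\) identity matrix has effective rank exactly \(d\), and a matrix dominated by one singular direction has effective rank near 1; the hypothesized contrast (\(\sim\)103 for Gemma-2 vs. \(\sim\)73 for Llama) would be measured on this kind of scale.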
2.4 \(\delta_\mathrm{norm}\) Universality Across the Expanded Matrix
Across all 31 provenance checkpoints (24 distillation + 4 baselines + 3 teachers), \(\delta_\mathrm{norm}\) has a mean of 0.310 with a coefficient of variation of 1.4%, computed over the 31 checkpoint-level mean \(\delta_\mathrm{norm}\) values (each mean computed over the token population generated by the shared challenge bank and measurement pipeline). Table 4 shows a representative subset spanning all four student architectures, all three teachers, the baselines, the self-distillation control, and the MoE extension (§2.5); the full 31-checkpoint data is available in the supplementary materials.

Table 4. Representative \(\delta_\mathrm{norm}\) values across the experimental matrix. Deviation from the EVT prediction of 0.318 is shown. The full population range is [0.301, 0.318] with CV = 1.4% over all 31 checkpoints.
| Checkpoint | Architecture | \(\delta_\mathrm{norm}\) | \(\Delta\) from 0.318 |
|---|---|---|---|
| baseline (Qwen-0.5B) | Qwen | 0.307 | \(-3.5\%\) |
| baseline (Llama-1B) | Llama | 0.316 | \(-0.6\%\) |
| baseline (Gemma-2B) | Gemma | 0.311 | \(-2.2\%\) |
| teacher (Qwen-7B) | Qwen | 0.313 | \(-1.6\%\) |
| teacher (Mistral-7B) | Mistral | 0.313 | \(-1.6\%\) |
| A3 epoch 2 (Llama-8B \(\to\) Qwen-0.5B) | Qwen | 0.318 | \(0.0\%\) |
| B2 epoch 3 (Qwen-7B \(\to\) Llama-1B) | Llama | 0.301 | \(-5.3\%\) |
| C1 epoch 3 (Mistral-7B \(\to\) Llama-1B) | Llama | 0.304 | \(-4.4\%\) |
| D1 epoch 1 (self-distill control) | Llama | 0.312 | \(-1.9\%\) |
| E1 Mixtral-8x7B (MoE, §2.5) | Mixtral | 0.309 | \(-2.8\%\) |
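As a sanity check on the statistic itself, the CV computation is a one-liner over checkpoint-level means. Using only the representative Table 4 subset (not the full 31-checkpoint population, so the value differs slightly from the reported 1.4%):

```python
import numpy as np

# Checkpoint-level delta_norm means from Table 4 (representative subset only).
delta_norm = np.array([0.307, 0.316, 0.311, 0.313, 0.313,
                       0.318, 0.301, 0.304, 0.312, 0.309])

mean = delta_norm.mean()
cv = delta_norm.std(ddof=0) / mean   # coefficient of variation: std / mean
```

On this subset the mean is approximately 0.310 and the CV roughly 1.6%, consistent in magnitude with the full-population figure.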
This is the tightest confirmation of Gumbel-class universality in this program, improving on the 1.9% CV reported in [3] (54 adversarial checkpoints) and the 3.0% CV reported in [3, §3] (18 distillation checkpoints). We report CV as a stability statistic — the realized dispersion of \(\delta_\mathrm{norm}\) across checkpoints — not as an estimator of population variance: the 31 checkpoints share a challenge bank and measurement pipeline, so they are not independent samples, and a formal uncertainty quantification would require prompt-block bootstrap or similar resampling methods, which we defer. With that caveat, the progressive tightening — 3.0% \(\to\) 1.9% \(\to\) 1.4% as sample diversity has grown — is consistent with convergence toward a stable physical constant rather than an artifact of small sample sizes. The result spans four student architectures, three teacher families, and two training protocols — a substantially more diverse population than any prior measurement.
2.5 Mixture-of-Experts: Sparse Routing Below the Measurement Site
A natural concern is whether mixture-of-experts (MoE) architectures — which route tokens through different expert subnetworks — violate the universality prediction. If different experts produce systematically different output geometries, the aggregate \(\delta_\mathrm{norm}\) could deviate from the Gumbel constant. We measure Mixtral-8x7B-Instruct (46.7B total parameters, 8 experts per layer, GPTQ quantization) using the same curated challenge bank and measurement pipeline as all other arms. The result: \(\delta_\mathrm{norm} = 0.309\), within 2.8% of the EVT prediction of 0.318 and within the observed variability of dense transformer models measured in this work. Per-prompt \(\delta_\mathrm{norm}\) has a CV of 4.0% across the challenge bank, consistent with other models. This result is consistent with the EVT prediction's scope: \(\delta_\mathrm{norm}\) depends on the tail behavior of the hidden-state distribution reaching the unembedding matrix \(W_U\), not on the mechanism that produced it. MoE routing occurs deep inside the network, selecting which expert processes each token's hidden state; the measurement site is the output layer, after all expert outputs have been merged. In principle, routing decisions can alter the hidden-state distribution reaching \(W_U\), so transparency to MoE routing is an empirical finding in this model, not a theorem. However, for frontier models — many of which are widely suspected to use sparse routing — this empirical confirmation is operationally significant: it extends the architecture coverage of the universality claim to MoE models, joining the dense Transformer, parallel Transformer (phi-2), and Mamba SSM architectures validated in [1].
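The measurement-site argument can be made concrete with a schematic of top-\(k\) expert routing (our simplification of Mixtral-style gating; real routers use a learned gating network): whichever experts the router selects, their outputs are merged before the hidden state continues toward the unembedding matrix \(W_U\), which is where \(\delta_\mathrm{norm}\) is measured.

```python
import numpy as np

def moe_layer(h, experts, gate_logits, top_k: int = 2):
    """Schematic sparse-MoE layer: route h through the top-k experts and
    merge their outputs by normalized gate weight. Downstream measurement
    (e.g. at the unembedding) sees only the merged output of this sum."""
    idx = np.argsort(gate_logits)[-top_k:]   # indices of the selected experts
    w = np.exp(gate_logits[idx])
    w = w / w.sum()                          # softmax over the selected gates
    return sum(wi * experts[i](h) for wi, i in zip(w, idx))

# Toy experts: simple scalings standing in for expert feed-forward blocks.
experts = [lambda h: 1.0 * h, lambda h: 2.0 * h, lambda h: 3.0 * h]
merged = moe_layer(np.array([1.0]), experts, gate_logits=np.array([0.0, 0.0, 0.0]))
```

With equal gate logits the top-2 experts are weighted 0.5 each, so the merged output is a single blended vector; nothing about the routing decision survives as separate structure at the measurement site, which is the intuition behind the empirical transparency result.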
3. The Alignment Diagnostic
3.1 The Problem with Scalar Convergence
The scalar metric \(\mathrm{Conv}_T = 1 - d(S,T)/d(B,T)\) has a seductive simplicity: it reduces the provenance question to a single number, at most 1 and negative whenever the student ends up farther from the teacher than the baseline was. But simplicity comes at a cost. Scalar distance conflates two geometrically independent quantities — direction and magnitude — into a single value that faithfully preserves neither. The alignment diagnostic \(\cos\theta\) separates them: it is the geometrically correct metric for the directional question that provenance detection actually asks ("did the student move toward the teacher?"), while \(\mathrm{Conv}_T\) answers a different, compound question ("did the student get closer to the teacher by the right amount?") that is sensitive to noise, overshoot, and geometric aliasing. In high-dimensional inner-product spaces, distance magnitudes are geometrically entangled with the baseline separation (the small-denominator trap). The law of cosines is not a heuristic correction; it is the strict mathematical uncoupling of trajectory direction from displacement magnitude — a property of the inner-product structure itself, applicable a priori to any Euclidean embedding space. We document two independent experiments in which the compound question and the directional question gave opposite answers. In both cases, the directional answer was subsequently confirmed by independent evidence.
This is not a post-hoc rescue — it is the distinction between a lossy scalar projection and the full geometric information it discards. (The diagnostic is undefined when \(d(B,S) = 0\) or \(d(B,T) = 0\); we exclude those degenerate cases, which correspond to a student that has not moved or a teacher indistinguishable from baseline.)
Case 1: False spoofing (Paper 3). An adversarial spoofing experiment [3, §5] attempted to push a distilled student's fingerprint toward a Llama-1B decoy while simultaneously acquiring teacher capabilities. The scalar spoofing metric \(\mathrm{Conv}_X\) reached 0.694 — apparent success. But geometric analysis revealed that the teacher (Qwen-7B) and the decoy (Llama-1B) are neighbors in PPP-residual space, separated by only 0.296. Any movement toward the teacher automatically approaches the decoy as a side effect. The alignment diagnostic quantifies this: \(\cos\theta = +0.991\) between the KD trajectory and the decoy direction (an angle of 7.8°). A parameter-free decomposition — projecting the student displacement \((S - B)\) onto the baseline-teacher direction \(\hat{u}_{BT}\) and evaluating the residual — shows that pure KD movement (zero spoofing gradient) predicts \(R^2 = 0.995\) of the apparent spoofing convergence across all 54 checkpoints, with no fitted parameters. The adversarial spoofing gradient contributed 4.8% of the best result. The signal was geometric coincidence, not adversarial capability.
Case 2: False failure (this work). In the multi-architecture experiment (§2), arm A3 (Llama-8B \(\to\) Qwen-0.5B) produced negative \(\mathrm{Conv}_T\) at all three epochs (\(-0.089\), \(-1.844\), \(-0.907\)), appearing to indicate no provenance transfer. But the alignment diagnostic tells a different story:
| Epoch | \(\mathrm{Conv}_T\) | \(\cos\theta\) | \(\theta\) (deg) | Interpretation |
|---|---|---|---|---|
| 1 | \(-0.089\) | 0.892 | 26.9 | Aligned + overshoot |
| 2 | \(-1.844\) | 0.868 | 29.8 | Aligned + overshoot |
| 3 | \(-0.907\) | 0.910 | 24.4 | Aligned + overshoot |
The student is moving in the teacher's direction at every epoch (\(\cos\theta > 0.86\)). The negative \(\mathrm{Conv}_T\) occurs because the student overshoots — it travels past the teacher in PPP-residual space, so \(d(S,T) > d(B,T)\) even though the movement direction is correct. At \(d(B,T) = 0.674\), the student at epoch 2 has traveled 2.471 from baseline (3.7\(\times\) the baseline-teacher gap) while maintaining 29.8° alignment with the teacher direction.
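The two metrics can be reproduced directly from the quoted distances. In this sketch (helper names are ours), \(\mathrm{Conv}_T\) follows its definition and \(\cos\theta\) comes from the law of cosines on the triangle \((B, S, T)\); the arm A3 epoch-2 values yield the negative scalar and the positive alignment simultaneously:

```python
def conv_t(d_st, d_bt):
    """Scalar convergence: Conv_T = 1 - d(S,T) / d(B,T)."""
    return 1.0 - d_st / d_bt

def cos_theta(d_bs, d_bt, d_st):
    """Alignment diagnostic: law of cosines on the (B, S, T) triangle,
    giving the cosine of the angle at B between the student displacement
    and the baseline-to-teacher direction."""
    return (d_bs**2 + d_bt**2 - d_st**2) / (2.0 * d_bs * d_bt)

# Arm A3, epoch 2 (values quoted in §3.1):
d_bt = 0.674                   # baseline-to-teacher separation
d_bs = 2.471                   # student displacement from baseline (3.7x overshoot)
d_st = d_bt * (1 - (-1.844))   # d(S,T) recovered from Conv_T = -1.844

print(round(conv_t(d_st, d_bt), 3))           # -1.844 (scalar says "failure")
print(round(cos_theta(d_bs, d_bt, d_st), 2))  # 0.87 (direction says "aligned, overshoot")
```

The same three distances feed both metrics; only the directional one survives the overshoot.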
Figure (schematic, reconstructed from panel annotations): Left panel — overshoot: the student travels past the teacher (\(\cos\theta = 0.868\)), so the scalar turns negative although the direction is correct. Right panel — aliasing: the teacher lies near the decoy, so apparent spoofing is geometric coincidence (pure KD explains the signal at \(R^2 = 0.995\); the adversarial gradient contributes 4.8%).
3.2 Why Scalar Metrics Fail
The two failure modes share a geometric root cause: \(\mathrm{Conv}_T\) conflates two independent quantities — direction and magnitude — into a single scalar that preserves neither faithfully.
The small-denominator trap. When \(d(B,T)\) is small, the denominator of \(\mathrm{Conv}_T\) amplifies noise. Since \(\mathrm{Conv}_T = 1 - d(S,T)/d(B,T)\), a fluctuation \(\delta d\) in \(d(S,T)\) produces \(\delta\mathrm{Conv}_T \approx \delta d / d(B,T)\), so \(\mathrm{Var}(\mathrm{Conv}_T) \propto 1/d(B,T)^2\). Concretely: training noise in \(d(S,T)\) on the order of 0.2 produces \(\mathrm{Conv}_T\) swings of \(\pm 0.3\) when \(d(B,T) = 0.674\) (arm A3) but only \(\pm 0.06\) when \(d(B,T) = 3.304\) (arm A2).
The overshoot blindness. A student that travels in the correct direction but beyond the teacher has \(d(S,T) > d(B,T)\), yielding negative \(\mathrm{Conv}_T\). The scalar metric cannot distinguish "overshot the target" from "moved in the wrong direction" — both produce the same sign.
The geometric aliasing. When two reference points (teacher and decoy) are nearby in the metric space, movement toward one automatically registers as movement toward the other. \(\mathrm{Conv}_T\) and \(\mathrm{Conv}_X\) are computed independently and cannot detect this degeneracy.
The alignment diagnostic resolves all three failure modes. Direction is independent of magnitude: a student at \(\cos\theta = 0.91\) is moving toward the teacher whether it has traveled 0.1 or 10.0 units from baseline. And direction can be compared across reference points: \(\cos\theta = 0.991\) toward teacher and \(\cos\theta = 0.991\) toward decoy reveals near-collinearity, not independent convergence.
3.3 The Measurability Threshold
The alignment diagnostic's advantage is most pronounced when \(d(B,T)\) is small. To quantify this, we partition the experimental arms by baseline-teacher separation:
Low separation (\(d(B,T) < 1.0\)). Arms A3 (\(d = 0.674\)) and B2 (\(d = 0.331\)): mean \(\cos\theta = 0.621\) across 6 checkpoints. \(\mathrm{Conv}_T\) ranges from \(-1.844\) to \(+0.505\) — a span of 2.3 with sign changes. The alignment diagnostic ranges from \(-0.668\) to \(0.910\) — a large span, but the sign correctly diagnoses the epoch-1 cross-tokenizer disruption in B2 (§2.2, Finding 3) as transient anti-alignment rather than permanent failure.
High separation (\(d(B,T) \geq 1.0\)). Arms A1, A2, B1, B3, C1: mean \(\cos\theta = 0.855\) across 15 checkpoints. \(\mathrm{Conv}_T\) and \(\cos\theta\) agree on verdict for all 15.
This pattern defines an empirical measurability threshold, observed in the PPP-residual metric space used in this work:
| \(d(B,T)\) | \(\mathrm{Conv}_T\) Reliability | Recommended Metric |
|---|---|---|
| \(> 1.5\) | High (sign and magnitude agreement \(> 95\%\) with \(\cos\theta\)) | \(\mathrm{Conv}_T\) sufficient |
| \(1.0\)–\(1.5\) | Moderate (sign agreement \(> 90\%\); magnitude may diverge) | \(\mathrm{Conv}_T\) with alignment confirmation |
| \(< 1.0\) | Low (sign disagreement observed; magnitude unreliable) | \(\cos\theta\) primary, \(\mathrm{Conv}_T\) supplementary |
This threshold is a property of this metric space and this observable, not a universal law. The mechanism is arithmetic: \(\mathrm{Conv}_T = 1 - d(S,T)/d(B,T)\), so training noise \(\sigma\) in \(d(S,T)\) produces \(\mathrm{Conv}_T\) variance \(\propto \sigma^2 / d(B,T)^2\), assuming the training-induced fluctuation scale in \(d(S,T)\) is approximately constant across arms (i.e., independent of \(d(B,T)\)). For A2 (\(d(B,T) = 3.30\)), a noise magnitude of \(\sigma = 0.2\) produces \(\mathrm{Conv}_T\) variance of order 0.004; for A3 (\(d(B,T) = 0.67\)), the same noise produces variance of order 0.09 — a 24\(\times\) amplification that explains the wild oscillations observed in Table 3. Other forensic frameworks operating in different metric spaces would need to calibrate their own measurability boundaries.
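The arithmetic behind the 24\(\times\) amplification can be checked in a few lines; this is a minimal sketch of the first-order noise propagation (variable names are ours; \(\sigma = 0.2\) as assumed above):

```python
def conv_t_variance(sigma, d_bt):
    """First-order noise propagation: Var(Conv_T) ~ (sigma / d(B,T))**2,
    assuming the fluctuation scale sigma in d(S,T) is constant across arms."""
    return (sigma / d_bt) ** 2

sigma = 0.2                              # assumed training-noise scale in d(S,T)
var_a2 = conv_t_variance(sigma, 3.30)    # high-separation arm A2
var_a3 = conv_t_variance(sigma, 0.67)    # low-separation arm A3

print(round(var_a2, 4))        # 0.0037
print(round(var_a3, 3))        # 0.089
print(round(var_a3 / var_a2))  # 24  (the quoted amplification factor)
```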
3.4 Sensitivity Analysis of Classification Thresholds
The \(\cos\theta > 0.8\) boundary for ALIGNED classification was chosen as an operational convention prior to examining the mature-epoch aggregate results. We verify that conclusions are insensitive to this choice by testing three alternative boundaries: At \(\cos\theta > 0.7\): all 14 mature-epoch verdicts unchanged (14/14 ALIGNED). At \(\cos\theta > 0.85\): 13/14 ALIGNED. The single reclassification is B2 epoch 3 (\(\cos\theta = 0.850\)) — a low-separation arm (\(d(B,T) = 0.331\)) where the borderline value reflects the small-denominator regime documented in §3.3 rather than ambiguous provenance. At \(\cos\theta > 0.9\): 11/14 ALIGNED. Three checkpoints reclassify: A3 epoch 2 (\(\cos\theta = 0.868\)), B2 epoch 2 (\(\cos\theta = 0.874\)), and B2 epoch 3 (\(\cos\theta = 0.850\)). All three are low-separation arms (\(d(B,T) < 1.0\)), consistent with the measurability threshold (§3.3). Directional coupling remains positive and clear in all cases; the reclassifications reflect a tighter criterion applied to the regime where scalar noise is highest, not ambiguous provenance signal. The provenance generalization conclusion is robust across the tested sensitivity range \([0.7, 0.9]\). The pattern of reclassifications at stricter thresholds is itself informative: all occur in the low-separation regime, reinforcing the measurability threshold as the operationally relevant boundary.
3.5 Domain Generality
The alignment diagnostic is not specific to IT-PUF or to PPP-residual space. Directional decompositions based on the law of cosines are standard tools in trajectory analysis, embedding alignment, and representation learning — the mathematical construction itself is not novel. The contribution here is identifying the specific failure modes that arise when scalar convergence metrics are applied to distillation forensics, and demonstrating empirically that the directional diagnostic corrects these failures. The diagnostic applies to any distillation forensics framework that (a) embeds models into an inner-product space (e.g., Euclidean feature vectors) and measures distances with the induced \(L^2\) norm, (b) defines convergence as decreasing distance between student and teacher embeddings, and (c) must distinguish genuine convergence from geometric aliasing. The law of cosines is a property of the inner-product structure, not of the observable; it does not apply to arbitrary metric spaces without that structure. Any forensic framework measuring \(\mathrm{Conv}_T\) — whether based on logprob geometry, activation statistics, or output distribution matching — is susceptible to the same three failure modes documented above, provided the embedding space admits an inner product. We recommend evaluating directional diagnostics alongside scalar convergence metrics in any such framework, treating \(\mathrm{Conv}_T\) without \(\cos\theta\) as analogous to reporting a \(p\)-value without an effect size.
4. API Verification at Scale
Paper 2 [2] established API endpoint verification on a zoo of 6 models from 3 providers. Production deployment requires scaling to larger populations while maintaining verification margins. This section reports three studies that address the scaling constraints: truncation sensitivity (§4.1), inference optimization transparency (§4.2), and zoo expansion (§4.3).
4.1 Truncation Sensitivity: The \(K \geq 7\) Floor
Commercial API providers expose varying numbers of top-\(k\) logprobs. The verification protocol operates on the gap structure among these top-\(k\) entries. A natural question is: what is the minimum \(K\) that supports reliable verification? We reprocess the 6-model, 3-session dataset from [2] at each truncation level from \(K = 2\) to \(K = 19\), recomputing CRP distances and per-model thresholds at each level. The result is a sharp transition. At \(K = 7\), there are zero breaches across all model pairs under per-model adaptive thresholds, and the minimum margin is positive; below \(K = 7\), breaches appear immediately.
Table 5. Verification performance under per-model adaptive thresholds \(\tau_m\) as a function of truncation level \(K\), reprocessed from the 6-model, 3-session dataset of [2]. The Models column indicates how many of the 6 models are available at each \(K\) (xAI endpoints cap at \(K = 8\)). Intermediate values omitted; full contour available on request.
| \(K\) | Models | Per-model breaches | Regime |
|---|---|---|---|
| \(\geq 10\) | 4/6 (no xAI) | 0 | Clean; confounded by xAI exclusion |
| 7 | 6/6 | 0/90 | Floor: all models, zero breaches |
| 6 | 6/6 | 1/90 | First breach |
| 5 | 6/6 | 3/90 | Rapid degradation |
| 2 | 6/6 | many | Signal destroyed |
The transition is sharp: zero per-model breaches at \(K = 7\), the first breach at \(K = 6\), and rapid degradation below. The apparent improvement at \(K > 7\) is confounded: higher \(K\) values exclude xAI endpoints (which cap at \(K = 8\)), removing the structurally weakest model from the comparison pool. The signal improvement above \(K = 7\) therefore reflects the removal of a difficult model, not additional information from extra logprob ranks. The sharpness reflects the verification decision boundary rather than the signal itself: the underlying signal likely degrades smoothly with decreasing \(K\), but the operating point crosses zero margin between \(K = 7\) and \(K = 6\). Below this boundary, performance collapses rapidly: at \(K = 6\) the first breach appears, and by \(K = 5\) the minimum margin has turned decisively negative. Two additional findings emerge from the truncation analysis.
Per-model thresholds are mandatory. A single global threshold fails even at the optimal \(K\), because inter-model drift variance spans an order of magnitude. Per-model adaptive thresholds \(\tau_m\) eliminate all breaches at \(K = 7\) (and above) on this dataset. Global thresholds are not viable for production use; we report them only alongside per-model thresholds.
Top-1 species marker. Even at \(K = 1\) (a single logprob per token, the minimum an endpoint can expose), two summary statistics — the mean and variance of the top-1 logprob — achieve 66.7% model identification accuracy across the 6-model pool. This is 4\(\times\) chance level and cannot support verification, but it functions as a coarse species marker: a rapid triage indicator deployable when the endpoint exposes minimal logprob information. Endpoints that expose no logprobs at all cannot support even this level of discrimination.
Operational implication. The \(K \geq 7\) floor defines the minimum viable endpoint for API verification. At the time of measurement (March 2026), one major provider exposes \(K = 20\), and two others expose \(K = 8\), placing them at the floor with a one-rank margin. A further reduction in \(K\)-cap by any provider operating at \(K = 8\) would drop their endpoints below the verification floor. This is a monitored business risk, not a protocol limitation. If commercial \(K\)-caps converge below this floor, third-party API verification via top-\(K\) logprob geometry becomes infeasible for this class of protocols, pushing the industry toward cooperative attestation mechanisms (cryptographic or otherwise) or alternative telemetry for non-cooperative model identity verification.
4.2 Speculative Decoding: Transparent to Verification
Speculative decoding [7] is an increasingly common inference optimization that uses a smaller "draft" model to propose token candidates, which the larger "verifier" model accepts or rejects. Because the final token selection depends on both models, there is a question of whether the fingerprint of the deployed endpoint reflects the verifier, the draft model, or some hybrid. We measure four configurations: verifier-only (V), draft-only (D), speculative decoding (SD), and speculative decoding with aggressive draft parameters (SD-low, \(N = 2\)). The distance matrix is: Table 6. Pairwise CRP distances between four inference configurations.
|  | V | D | SD | SD-low |
|---|---|---|---|---|
| V | — | 1.244 | 0.131 | 0.215 |
| D | 1.244 | — | 1.201 | 1.092 |
| SD | 0.131 | 1.201 | — | 0.164 |
| SD-low | 0.215 | 1.092 | 0.164 | — |
The ratio \(d(\mathrm{SD}, V) / d(V, D) = 0.106\): speculative decoding preserves 89.4% of the verifier's fingerprint distance from the draft model, and the draft model is effectively invisible through the speculative decoding mechanism. The \(\delta_\mathrm{norm}\) shift between V and SD is 0.004, within measurement noise. This result indicates that the IT-PUF protocol correctly identifies the verifier model through speculative decoding, at least for the tested draft configuration. Under more aggressive draft parameters (SD-low, \(N = 2\), producing more accept/reject transitions), the ratio increases to 17.25% — still well under any reasonable detection threshold. The scaling of residual deviation with accept/reject frequency suggests limited draft-model influence under the tested conditions, but we do not attribute the residual to a specific mechanism; only the bound matters for verification.
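The preservation ratios follow directly from the Table 6 entries; small differences from the quoted figures reflect rounding of the published distances (variable names are ours):

```python
d_v_d = 1.244       # verifier-to-draft CRP distance (Table 6)
d_sd_v = 0.131      # speculative decoding vs verifier
d_sdlow_v = 0.215   # aggressive draft parameters (SD-low, N = 2)

ratio = d_sd_v / d_v_d
print(round(ratio, 3))                    # ~0.105 (quoted: 0.106)
print(round((1.0 - ratio) * 100, 1))      # ~89.5% of the V-D separation preserved
print(round(d_sdlow_v / d_v_d * 100, 1))  # ~17.3% (quoted: 17.25%)
```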
4.3 Zoo Expansion: 14 Models, Zero Breaches
We expand the API verification zoo from 6 models [2] to 14 models across 3 commercial providers. Earlier work [2] reported 6 named endpoints; here we expand the evaluation set but redact endpoint identifiers to avoid publishing a difficulty map and vendor availability inventory. We report provider names, model counts per provider, and 1–2 named exemplars for credibility.
Screening. 17 candidate endpoints were probed for logprob capability; 14 were promoted to full enrollment. Screening criteria: the endpoint must return top-\(k\) logprobs with \(K \geq 7\) (the floor established in §4.1), exhibit non-degenerate gap structure, and survive a sentinel filter for anomalous logprob values. Three candidates were excluded during screening. We do not report which endpoints failed or the specific failure modes, as the exclusion list constitutes competitive intelligence about the provider landscape.
Enrollment. Each promoted endpoint was enrolled using three independent sessions with the same curated challenge bank used throughout this work. Per-model adaptive thresholds \(\tau_m\) were calibrated from the enrollment data. Provider coverage and per-model results:
Table 7. 14-model API zoo: provider-level verification summary under CRP with per-model adaptive thresholds \(\tau_m\). Gap ratio is the ratio of minimum impostor CRP distance to maximum genuine CRP distance; values \(> 1.0\) indicate positive margin. Aggregate statistics are reported per provider to characterize the verification landscape without disclosing per-endpoint calibration.
| Provider | Models | \(K\)-cap | Min Gap Ratio | Median Gap Ratio | Notes |
|---|---|---|---|---|---|
| OpenAI | 5 | 20 | \(> 2.0\) | \(> 3.5\) | All strong margin |
| Google Vertex | 5 | 8 | \(> 2.0\) | \(> 4.5\) | Includes 2 deterministic\(^\dagger\) |
| xAI | 4 | 8 | \(> 2.0\) | \(> 3.5\) | Includes 1 marginal (weakest in zoo) |
\(\dagger\) Two Google endpoints exhibit fully deterministic logprob outputs across sessions (genuine distance = 0, formally infinite gap ratio). This is operationally favorable but fragile — any model update introducing stochasticity requires re-enrollment. The weakest endpoint in the zoo operates at a gap ratio above 2.0\(\times\) but below 3.0\(\times\), representing the conservative lower bound on the protocol's verification margin at this zoo density. \(K\)-caps are empirically observed values (March 2026), not documented guarantees; these values have changed without notice in the past (one provider contracted from \(K = 20\) to \(K = 8\) between measurement periods) and may change again.
Results: 0/14 breaches under CRP with per-model \(\tau_m\). A breach is defined as any model whose minimum impostor CRP distance (to its nearest non-self model) is less than its maximum genuine CRP distance (across enrollment sessions). Under this definition, 0 of 14 models breach their per-model thresholds. The 14 models generate \(14 \times 13 = 182\) ordered pairwise impostor comparisons, all of which maintain positive margin. By the rule of three, the 95% confidence upper bound on the false acceptance rate is \(\leq 3/182 \approx 1.6\%\) at the pairwise level, or \(\leq 3/14 \approx 21\%\) at the per-model level. These bounds assume independent trials; the 182 comparisons share models and enrollment sessions, so the effective degrees of freedom are lower than 182, and the pairwise bound should be treated as approximate rather than exact. The per-model bound reflects the small number of enrolled endpoints, not the discriminative power of the protocol. The minimum per-model margin across all 14 models is positive (\(\Delta_\mathrm{min}(14) > 0\)). The margin decay curve — how the minimum margin evolves as models are added — stabilizes by \(N = 10\) and holds through \(N = 14\). The curve is flat, not cliff-shaped, suggesting room for further zoo expansion before critical density.
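The rule-of-three bounds quoted above are a one-line computation (helper name ours):

```python
def rule_of_three_upper(n_zero_event_trials):
    """Approximate 95% upper confidence bound on an event rate
    after n trials with zero observed events: 3 / n."""
    return 3.0 / n_zero_event_trials

n_models = 14
n_pairs = n_models * (n_models - 1)  # 182 ordered impostor comparisons

print(n_pairs)                                        # 182
print(round(rule_of_three_upper(n_pairs) * 100, 1))   # 1.6  (pairwise FAR bound, %)
print(round(rule_of_three_upper(n_models) * 100))     # 21   (per-model bound, %)
```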
CRP is mandatory at this density. When we compute the same analysis using centroid \(L^2\) distance (the simpler metric used in [2] for the 6-model zoo), 3 of 14 models breach their thresholds — a false alarm rate of 21%.
The mechanism has a clean mathematical explanation. Let \(\Delta_k = t_k^A - t_k^B\) be the per-prompt template difference. Centroid distance is \(\|\mathbb{E}_k[\Delta_k]\|\); CRP distance is \(\mathbb{E}_k[\|\Delta_k\|]\). By Jensen's inequality, \(\|\mathbb{E}_k[\Delta_k]\| \leq \mathbb{E}_k[\|\Delta_k\|]\) — centroid distance is always less than or equal to CRP distance. When per-prompt differences point in diverse directions (as they do for impostor pairs, because different prompts probe different geometric dimensions), the centroid collapses toward zero while CRP preserves the discriminative signal. The curated challenge bank probes the output geometry across an 8.4\(\times\) range in the first gap statistic; two models that are centroid-identical produce different per-prompt profiles because their internal geometry responds differently to diverse prompt types. CRP distance preserves this signal; centroid averaging destroys it. Three stacked mechanisms achieve the zero-breach result, each discovered in prior work and now validated at 14-model scale:
| Mechanism | Origin | What It Fixes |
|---|---|---|
| CRP (per-prompt distance averaging) | [2, §6] | False collisions from centroid collapse |
| Multi-session enrollment (3 sessions) | [2, §5], Patent 2 §3 | Session-specific noise in templates |
| Per-model adaptive thresholds \(\tau_m\) | [2, §6] | Cross-model threshold penalty |
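The centroid-collapse mechanism is easy to reproduce with synthetic per-prompt differences: when the \(\Delta_k\) point in diverse directions, \(\|\mathbb{E}_k[\Delta_k]\|\) collapses toward zero while \(\mathbb{E}_k[\|\Delta_k\|]\) does not. A minimal sketch (dimension, sample count, and data are invented for illustration):

```python
import math
import random

def centroid_distance(deltas):
    """Norm of the mean per-prompt difference: ||E_k[Delta_k]||."""
    dim = len(deltas[0])
    mean = [sum(d[i] for d in deltas) / len(deltas) for i in range(dim)]
    return math.sqrt(sum(x * x for x in mean))

def crp_distance(deltas):
    """Mean of the per-prompt difference norms: E_k[||Delta_k||]."""
    return sum(math.sqrt(sum(x * x for x in d)) for d in deltas) / len(deltas)

random.seed(1)
# Impostor-like pair: per-prompt differences of similar size, diverse direction.
deltas = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(100)]

c, p = centroid_distance(deltas), crp_distance(deltas)
print(c < p)  # True: Jensen's inequality, with a large gap for diverse directions
```

For genuine pairs the \(\Delta_k\) are small in every direction, so both metrics stay small; the divergence appears only for impostors, which is exactly where the signal is needed.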
Additional findings from the provider landscape. The logprob exposure landscape as of March 2026 divides endpoints into four categories, determining their eligibility for the verification protocol:
Table 8. Logprob exposure landscape (March 2026). Endpoints below the \(K \geq 7\) floor (§4.1) cannot support the verification protocol described in this work.
| Category | \(K\)-cap | VERIFY eligible | Example endpoints |
|---|---|---|---|
| Full logprobs | 20 | Yes | OpenAI gpt-4.1 family, gpt-4o family |
| Reduced logprobs | 8 | Yes (at floor) | xAI grok family, Google Vertex gemini family |
| No logprob API | — | No | Multiple providers: reasoning-class models, permanently disabled access |
| Degenerate output | — | No | Sentinel saturation across all token positions |
The third and fourth categories are not protocol failures — they define the boundary of the verification regime. The boundary is set by provider policy decisions (whether to expose logprobs) rather than by any intrinsic limitation of the models themselves. The tools developed in §2–4 — the alignment diagnostic, the CRP metric, the \(K \geq 7\) floor — solve the technical scaling challenges of model forensics. We can now detect provenance transfer across architectures, and we can verify endpoint identity across production-density zoos. But detecting theft and proving theft in an adversarial legal proceeding are distinct problems. As these tools move from the laboratory to the courtroom, they encounter an obstacle that no geometric metric can resolve: the measurement itself requires access to the very assets under dispute.
5. Scaling Trust: Zero-Knowledge Attestation for Forensic Admissibility
All three tiers of this section are classified as VALIDATED (committed distance proof, hardware-attested measurement, and full zero-knowledge extraction, respectively). The experimental contributions in §§2–4 are independently VALIDATED. All falsification criteria are met and on record. The zero-knowledge attestation architecture described in this section is the subject of U.S. Provisional Patent Applications 63/996,680; 64/003,244. The presentation here is at the proof-statement level: we describe what each tier proves and under what assumptions, without disclosing implementation details of the cryptographic circuits.
5.1 The Trust Paradox
Returning to the Anthropic disclosure that opened this paper [4]: §2–3 demonstrated that provenance transfer from an industrial distillation campaign would be detectable across architectures. §4 demonstrated that this detection scales to production endpoint density. But the victim in that disclosure still cannot take the evidence to court. While the preceding sections demonstrate that neural network identity can be measured and provenance transfer detected, these measurements require access — to weights (structural identity) or to API logprobs (functional identity) — creating a paradox when forensic evidence must be presented in adversarial legal proceedings. A victim laboratory (the party whose model was distilled) cannot prove weight theft without disclosing its crown-jewel weights to the measurement engine. Disclosure of the weights to a third-party verifier defeats the purpose of the intellectual property protection that motivated the forensic investigation. Worse, the weights themselves are typically trade secret — disclosure may waive legal protections. A suspect laboratory (the party accused of distillation) cannot prove innocence without disclosing its training data and training procedures. If the suspect's model is independently developed (no distillation), the current framework cannot certify this without the suspect granting weight access for measurement — an invasive step that the suspect has every incentive to refuse.
Courts require non-repudiable evidence. The current IT-PUF protocol produces mathematically grounded forensic measurements, but the measurement process itself requires a trusted party with access to the sensitive inputs. In regulated arbitration (e.g., patent infringement, trade secret misappropriation), no such trusted party may exist.
Concrete scenario. Consider the Anthropic disclosure [4]: three laboratories conducted industrial-scale distillation against Claude. Suppose the victim (Anthropic) wishes to demonstrate in court that a competitor's model carries Claude's provenance fingerprint. The IT-PUF measurement requires either (a) weight access to the suspect's model (the suspect will refuse), or (b) API logprob access to both models (available, but the measurement process involves running Anthropic's proprietary challenge bank through the suspect's API, and the resulting template vectors may leak information about Anthropic's measurement methodology). Even if the measurement is performed, the court must trust the entity that computed the distance — and both parties have adversarial incentives regarding the outcome. A zero-knowledge protocol would allow the victim to prove "the distance between our enrolled anchor and the suspect's measured fingerprint is below the acceptance threshold" without revealing the anchor, the fingerprint, or the challenge bank — converting the trust requirement from institutional to mathematical.
5.2 Three-Tier Architecture
We propose a three-tier architecture in which each tier independently removes one layer of trust from the verification process. Each tier is independently useful, independently patentable, and independently falsifiable.
Tier 1: Committed Distance Proof. The model owner computes the fingerprint vector \(\tau\) from their own weights, commits to \(\tau\) using a cryptographic commitment scheme, and proves that the committed vector's distance from a public anchor satisfies a verifier-controlled threshold — without revealing \(\tau\) itself. The commitment scheme provides binding (the prover cannot change \(\tau\) after committing) and hiding (the verifier learns only that the distance satisfies or fails the threshold). This tier is useful when the prover is the victim: a laboratory can demonstrate that its enrolled model is close to (or far from) a reference model without disclosing its weights or fingerprint. The trust assumption is that the prover computed \(\tau\) correctly — an assumption appropriate when the prover is the aggrieved party seeking to establish its own model's identity. The threshold \(\varepsilon^2\) (the squared acceptance distance) must be verifier-controlled, not prover-supplied: a prover-supplied threshold is vacuous, since the prover could set \(\varepsilon^2\) large enough to encompass any distance. Sealed mode (where the verifier specifies \(\varepsilon^2\) and learns only a binary PASS/FAIL) is the default protocol; disclosure mode (where the prover reveals the actual distance) is available when the prover consents. Implementation status: a Tier 1 committed distance proof has been implemented, hardened against multiple adversarial attack classes, and validated against the Eight Properties (§5.4), with one documented limitation on Property 7 (non-malleability) mitigated by nonce binding. Tier 1 is classified VALIDATED for the distance-proof tier.
Tier 2: Hardware-Attested Measurement. Tier 1 assumes the prover computed \(\tau\) honestly. Tier 2 removes this assumption by binding the computation to a trusted execution environment (TEE) attestation. The measurement code runs inside a hardware enclave, and the TEE produces an attestation report that chains to the Tier 1 commitment: the attestation certifies that the committed \(\tau\) was produced by the specific measurement code applied to the weights loaded into the enclave. This tier is useful for third-party auditing: a regulator or arbitrator can verify that the measurement was correctly computed without the model owner being able to substitute a different fingerprint. The trust shifts from "the prover is honest" to "the hardware is honest" — a standard assumption in confidential computing that is independent of the parties involved. Tier 2 attestations can chain to Tier 1 proofs: a hardware attestation certifies the commitment, and the commitment feeds into the distance proof. This chaining is a separate claim — it requires that the attestation format be compatible with the commitment scheme's binding property. Implementation status: a Tier 2 hardware-attested measurement has been validated on production confidential computing hardware (NVIDIA H100 with GPU confidential computing enabled, running inside an Intel TDX trusted execution environment on a major cloud provider).
The measurement engine produced valid structural fingerprints for 6 models spanning 4 architecture families (1,536 total measurements, 0 failures). Canonical weight hashing — operating on raw safetensors byte representations without dtype conversion — is deterministic, handles bfloat16 weights natively, and produces byte-identical results inside and outside the trusted execution environment. The structural fingerprint is transparent to confidential computing mode: measurements inside the enclave match measurements on standard hardware within the genuine-repeat noise bound, confirming that the encrypted memory path does not perturb the measurement. Both CPU-level (TDX) and GPU-level attestation tokens bind to a common cryptographic root derived from the Pedersen commitment coordinates, the weight hash, a verifier-supplied nonce, the prompt bank hash, and the protocol version, with both local and remote attestation verification confirmed. Weight substitution attacks are detected (different models produce different hashes), and replay attacks are rejected (a fresh nonce produces a fresh binding root). The full measurement-to-attestation pipeline completes in under 50 seconds for a 7B-parameter model. Tier 2 is classified VALIDATED.
Tier 3: Full Zero-Knowledge Extraction. Tier 2 requires trust in the hardware manufacturer. Tier 3 would eliminate all trust assumptions beyond cryptographic soundness: the prover demonstrates, in zero knowledge, that the committed fingerprint was correctly extracted from committed weights using the specified measurement algorithm. No trusted hardware, no trusted prover, no trusted third party — only the mathematical guarantee that the proof could not have been constructed without knowledge of a valid witness.
The primary feasibility gate for Tier 3 is fixed-point precision: the measurement algorithm involves floating-point arithmetic on high-dimensional weight matrices, and the cryptographic circuit must reproduce this computation in fixed-point arithmetic without introducing quantization errors that exceed half the minimum pairwise separation in the zoo. The four pre-registered falsification criteria (F1–F4, §5.5) were the governing test. Implementation status: A complete Tier 3 circuit has been compiled, audited, and validated. None of the four pre-registered falsification criteria was triggered: fixed-point quantization at 16-bit precision preserves the three-orders-of-magnitude safety margin established in Phase 0 (F1); circuit compilation produces constraint counts yielding practical proving times well within the pre-registered 1-hour bound on commodity hardware (F2); the witness structure preserves Property 6 (privacy) with no intermediate value exposure (F3); and every witness variable carries a constraint chain to a public input — zero unconstrained witnesses, the ZK analog of zero Admitted (F4). The proof system operates at approximately 296K constraints with aggregate proof size under 100 KB and verification time under 1 second. A suite of 124 adversarial negative tests produced zero false acceptances. Tier 3 is classified VALIDATED. The identity-conditioned inference architecture enabled by this result — extending the proof system from fingerprint extraction to verified output binding — is reported in [6].
5.3 Threat Model
The zero-knowledge architecture operates under six explicit trust assumptions, numbered to match the filed patent specification: ZTA1 (Prover Possesses Actual Weights). The prover's private input is a fingerprint extracted from actual neural network weights, not a fabricated vector. At Tier 1, this is a trust assumption (the prover self-reports). At Tiers 2 and 3, this is enforceable: hardware attestation or circuit-level proof binds to committed weights on which actual tensor operations are performed. ZTA2 (Canonical Measurement). The measurement circuit or binary is the canonical protocol, verified by hash. This excludes modified measurement programs that compute something other than the structural fingerprint. Enforceable at all tiers: the circuit identity or binary hash is a public input bound into the proof or attestation. ZTA3 (Anchor Integrity). The public reference anchor was enrolled by a trusted party using the same measurement protocol. This is an operational assumption analogous to certificate authority trust in public key infrastructure. ZTA4 (Cryptographic Soundness). The zero-knowledge proof system is computationally sound under standard cryptographic assumptions — the same class of assumption underlying all deployed cryptographic systems. ZTA5 (Challenge Freshness). The prover cannot observe the verifier's challenge (nonce and threshold) before committing to weights.
Enforceable through protocol design: commitment before challenge reveal. ZTA6 (Public Parameter Integrity). Public parameters (structured reference string, commitment generators, curve parameters) are generated honestly, and composition of proofs across tiers preserves soundness. Enforceable via transparent setup ceremonies or verifiable multi-party computation. The forwarding attack (routing API queries to the target model in real time) and the float-spoofing attack (fabricating logprob values) are excluded in the structural regime by ZTA1 (the prover must possess actual weights, not fabricated vectors) and ZTA2 (the measurement must be the canonical protocol). In the API regime, these attacks are excluded by the IT-PUF threat model assumptions TA1–TA4 defined in [1, §6] and [2, §9.2]. These exclusions are explicit, not hidden. Regime interaction. The zero-knowledge architecture operates in the weights regime (\(g_\mathrm{norm}\) fingerprint vectors). A valid Tier 1 proof of weight distinctness does not disprove API-regime provenance detection (the Two-Layer Identity [3]): structural identity and functional identity are measured in different spaces with different observables. This boundary must be respected in any legal deployment — a ZK proof that two models have distinct weights does not preclude that one was functionally distilled from the other.
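ZTA5's commit-before-challenge discipline admits a minimal sketch. The Tier 1 construction in this work uses Pedersen commitments; the hash commitment below stands in purely for illustration, and both function names are hypothetical.

```python
import hashlib
import secrets

def commit(weight_digest: bytes) -> tuple[bytes, bytes]:
    """Commit to the weights BEFORE the verifier reveals its challenge
    (ZTA5). Returns (commitment, opening salt); the salt hides the
    digest until the prover chooses to open."""
    salt = secrets.token_bytes(32)
    return hashlib.sha256(salt + weight_digest).digest(), salt

def verify_opening(commitment: bytes, salt: bytes,
                   weight_digest: bytes) -> bool:
    """After the challenge, the prover opens; the verifier checks the
    opening against the pre-challenge commitment. Any post-challenge
    substitution of weights fails this check."""
    return hashlib.sha256(salt + weight_digest).digest() == commitment
```

The ordering is the enforcement mechanism: because the commitment is published before the nonce and threshold are revealed, the prover cannot tailor its weights to the challenge.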
5.4 The Eight Properties
We define eight properties that a meaningful zero-knowledge model identity proof must satisfy, extending the formal verification doctrine (352 theorems across 17 Coq proof files [1, 2], 0 Admitted) into the cryptographic regime: 1. Completeness. An honest prover with a genuine enrollment can always produce a valid proof. 2. Soundness. No computationally bounded adversary can produce a valid proof for a false statement. 3. Zero-knowledge. The proof reveals nothing beyond the truth of the statement. 4. FAR preservation. The zero-knowledge verification has a false acceptance rate no worse than the underlying IT-PUF protocol. 5. Binding. The prover cannot change the committed fingerprint after the commitment is published. 6. Privacy. The fingerprint vector and model weights remain hidden from the verifier. 7. Non-malleability. Proofs cannot be transformed, replayed, or composed in unintended ways. 8. Precision. All numerical parameters (fixed-point bit-widths, quantization bounds) are derived from zoo statistics, not chosen by convention. Property 8 is the analog of the Hardening Doctrine's "no heuristics" requirement: a bit-width chosen because "16-bit is standard" without verifying that quantization error preserves the zero false acceptance rate is the ZK analog of Admitted in Coq — it creates a degree of freedom that undermines the proof's meaning.
5.5 Falsification Criteria
Tier 3 is falsified if any of the following is demonstrated: F1. Fixed-point quantization at any bit-width introduces errors exceeding half the minimum pairwise separation in the 23-model zoo, making the zero-knowledge verification weaker than the plaintext protocol. F2. Circuit compilation produces constraint counts whose proving time exceeds the practical bound (defined as \(> 1\) hour on commodity hardware for a single proof). F3. The witness structure requires exposing intermediate values that violate Property 6 (privacy). F4. Any witness variable in the compiled circuit lacks a constraint chain to a public input (the zero-knowledge analog of Admitted). These criteria were pre-registered. None of the four was triggered; we report this result without salvage or parameter tuning. Evaluation of F1. We have conducted a Phase 0 evaluation of the precision gate (F1) by quantizing the full 23-model weight-geometry zoo (1,012 pairwise comparisons, 64-dimensional \(\tau\) vectors) to fixed-point representations at bit-widths from 4 to 32. The pre-registered threshold is \(\frac{1}{2} \Delta_{\min}\), where \(\Delta_{\min}\) is the minimum pairwise separation in the zoo. The precision cliff falls in the sub-16-bit range: at 16-bit, the safety margin is three orders of magnitude; below 16-bit, margins narrow before becoming negative at lower bit-widths.
At 16-bit — the standard precision in zero-knowledge machine learning systems — the safety margin exceeds \(10^3 \times\) the pre-registered threshold. At 8-bit, the margin is positive but narrow. At 7-bit, the pre-registered threshold is breached (though no actual false acceptances occur). The first false acceptances appear at 4-bit, confirming that the pre-registered threshold triggers conservatively before security degrades. The recommended circuit precision is 16-bit, providing a safety margin of three orders of magnitude while operating at standard zkML bit-widths. Two observations inform the circuit design. First, the pre-registered threshold and FAR fail at different bit-widths, providing a multi-bit conservative buffer between threshold breach and actual security degradation. Second, the worst-case pairs shift across bit-widths: the pair with smallest separation in the full-precision zoo is not the quantization-critical pair at lower bit-widths. Certain \(\tau\) vectors occupy regions of the measurement space where quantization noise is maximized — a finding that motivates setting the global circuit precision conservatively above the highest-noise region of the embedding space, rather than optimizing for the average case. The Phase 0 evaluation addresses F1. The full circuit compilation, audited for F2–F4, is reported as part of the Tier 3 validation above and in detail in [6].
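The F1 gate can be sketched as a quantization sweep: project the \(\tau\) vectors onto a \(b\)-bit fixed-point grid, recompute all pairwise separations, and compare the worst-case perturbation against the pre-registered \(\frac{1}{2}\Delta_{\min}\) threshold. This is an illustrative reconstruction under an assumed fixed-point scaling, not the Phase 0 harness; the function and parameter names are hypothetical.

```python
import itertools
import numpy as np

def f1_margin(taus: np.ndarray, bits: int, scale: float = 1.0):
    """Sketch of the F1 precision gate: quantize fingerprint vectors to a
    `bits`-bit fixed-point grid over [-scale, scale), then compare the
    worst-case perturbation of any pairwise distance against half the
    minimum full-precision separation (the pre-registered threshold)."""
    step = scale / (2 ** (bits - 1))            # fixed-point quantum
    quantized = np.round(taus / step) * step    # round-to-nearest
    pairs = list(itertools.combinations(range(len(taus)), 2))
    dists = [np.linalg.norm(taus[i] - taus[j]) for i, j in pairs]
    qdists = [np.linalg.norm(quantized[i] - quantized[j]) for i, j in pairs]
    delta_min = min(dists)                      # full-precision separation
    worst_err = max(abs(d - q) for d, q in zip(dists, qdists))
    return worst_err, delta_min / 2, worst_err < delta_min / 2
```

Sweeping `bits` downward from 32 reproduces the qualitative behavior reported above: the margin between `worst_err` and the threshold shrinks as the bit-width falls, and eventually the check fails.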
6. Discussion
6.1 Three Tools for Model Forensics
The four papers in this series have converged on three complementary forensic tools, each with a distinct operational role: The Vault (structural identity, weights regime). The weight-geometry observable \(g_\mathrm{norm}\) provides model identity verification with formal impossibility guarantees [1, NoSpoofing.v]. It is empirically invariant to distillation across all tested protocols [3, this work §2.3] — including cross-tokenizer SFT, adversarial erasure, and multi-architecture training. The vault answers: What model is this? Its principal limitation is the requirement of weight access; the zero-knowledge extension (§5), now validated through all three tiers (Tier 1 committed distance proof, Tier 2 hardware-attested measurement, and Tier 3 full zero-knowledge extraction), allows that access to be exercised without disclosure. The identity-conditioned inference architecture extending this full-stack validation is reported in [6]. The Scanner (functional identity, API regime). PPP-residual templates enable provenance detection through API logprobs — detecting the teacher's trace in a distilled student without access to either party's weights. The scanner answers: Who taught this model? Its limitations: the trace is transient (passive fine-tuning erases it in 1–2 training epochs [3]), the measurability threshold requires \(d(B,T) > 1.0\) for reliable scalar detection (§3.3), and the zero-breach guarantee at 14-model scale depends on CRP with per-model thresholds (§4.3). The Resilience Proof (adversarial robustness). Formal impossibility theorems [1, NoSpoofing.v; 2, APINoSpoofing.v] and empirical adversarial validation [3, this work §2] establish that the geometry forbids simultaneous capability acquisition and trace erasure during active distillation.
The resilience proof answers: Can the adversary escape detection? Its limitations: conditional on stated threat model assumptions (TA1–TA4 for API, ZTA1–6 for ZK), and the API-regime impossibility is conditional rather than unconditional. Deployment sequencing. The three tools compose in a natural temporal sequence. The Scanner operates first: it requires only API access and can be deployed continuously against suspected endpoints with no cooperation from the suspect. Because the provenance trace is transient (1–2 SFT epochs in tested configurations [3]), the Scanner must be deployed rapidly after the suspected distillation — the operational window is measured in training epochs, not calendar time, but since most deployers continue training beyond initial distillation, the window may be narrow. In practice, this means continuous high-frequency monitoring (e.g., daily API cross-enrollment against suspected endpoints) to capture the provenance signal before post-distillation fine-tuning erases it. If the Scanner detects a provenance signal, the Vault provides confirmation with formal impossibility guarantees — but requires weight access, which in practice means either judicial discovery (requiring a legal proceeding already in progress) or voluntary disclosure. The Resilience Proof serves the legal argument: it establishes that the detected signal could not have been planted or faked within the constraints of the threat model. The zero-knowledge framework (§5) addresses the gap between detection (Scanner, no weight access needed) and confirmation (Vault, weight access needed) by allowing confirmation without disclosure — converting the deployment sequence from Scanner → legal discovery → Vault to Scanner → ZK proof → legal proceeding.
6.2 What This Work Does Not Show
Frontier scale. All provenance experiments use models in the 0.5B–8B parameter range. Furthermore, while this work tests distinct pre-training lineages (Llama, Mistral, Qwen), all teachers occupy the 7B–8B parameter class of instruction-tuned dense transformers. Provenance transfer dynamics across radically asymmetric scales (e.g., 400B teacher to 1B student) remain untested. The security guarantees and provenance open questions occupy different epistemic categories. For the weight-geometry regime, ScalingLaws.v formally proves (0 Admitted, 2 empirical axioms grounded in 12-model measurement across a 147\(\times\) parameter range) that the spoofing cost \(K_\mathrm{min}\) remains strictly positive for any model with stiffness \(S \geq S_\mathrm{min} = 1.18\) — security provably does not degenerate at scale. The open question is whether functional identity transfer (the PPP-residual convergence measured in §2) survives at 100B+ parameters: whether a frontier student's representational capacity absorbs teacher information without producing detectable geometric displacement. At frontier scale, the student possesses a vastly larger null space, raising the possibility that capabilities could be routed around the specific geometric dimensions monitored by the measurement — potentially warping the Pareto frontier that Paper 3 established at 7B. The physics (\(\delta_\mathrm{norm}\) universality, Gumbel-class tail behavior) is scale-free in principle, and the stiffness scaling law (\(S \sim \mathrm{params}^{-0.08}\)) is bounded away from zero across the measured range, but the empirical provenance validation does not extend to frontier models. Multi-teacher mixtures. All arms involve a single teacher. Industrial distillation campaigns may blend outputs from multiple teachers. Whether the provenance signal decomposes linearly across teachers, or whether multi-teacher training creates qualitatively different transfer patterns, is untested. Control arm breadth. 
The self-distillation null (D1) was tested in one architecture (Llama-1B). Prior work provides cross-experiment confirmation — Paper 3 [3] tested D1 alongside a shuffled-logits control (E1), and the present work replicates D1 — but a broader null-distribution across architectures would be needed to generate the statistical power required for legal-grade false-positive bounds on provenance transfer. Long-term API drift. The API zoo expansion (§4.3) used short inter-session gaps. Production monitoring requires characterization of drift at hourly, daily, and weekly timescales to calibrate re-enrollment cadence. This characterization is needed before continuous monitoring can be deployed with confidence bounds. Training dynamics. All measurements in this program are post-training: they characterize models after training is complete. How the \(\delta\)-gene observable evolves during pre-training — whether it converges to the Gumbel constant gradually, undergoes phase transitions, or oscillates — is unknown.
Instrumenting training checkpoints at regular intervals would provide the first empirical picture of identity formation dynamics. Adversarial erasure breadth. While Paper 3 [3] established the geometric defeat of adversarial erasure, that stress-test was bounded to a single student architecture (Qwen-0.5B). This work expands provenance transfer testing to four architectures but does not repeat the adversarial erasure protocol across them. Validating the Resilience Proof on structurally diverse students — particularly soft-capped architectures like Gemma-2, which exhibit extreme weight-geometry sensitivity (§2.3) — remains future work. GAN-discriminator erasure. Paper 3 [3] tested erasure via \(L^2\) loss against frozen PPP-residual templates. A learned discriminator trained end-to-end against the provenance signal — analogous to a GAN critic — represents an untested but potentially more effective erasure strategy: it could discover and target provenance-carrying dimensions that the \(L^2\) objective distributes effort across uniformly. CRP scaling analysis. The CRP distance metric (§4.3) empirically outperforms centroid \(L^2\) at 14-model density, and the Jensen's inequality argument explains why CRP distances are weakly larger. An open question is whether the CRP advantage grows with prompt diversity and zoo density, or whether centroid \(L^2\) eventually catches up as the prompt bank covers more geometric dimensions. A rigorous analysis of CRP's scaling properties would inform bank size optimization for larger zoos. Zero-knowledge validation. The three-tier framework in §5 spans a range of validation states. Tier 1 (committed distance proof) has been implemented and hardened: the circuit has been tested against multiple adversarial attack classes and satisfies the Eight Properties with one documented limitation (Property 7, mitigated by nonce binding). Tier 1 is classified VALIDATED. 
Tier 2 (hardware-attested measurement) has been validated on production confidential computing hardware: 6 models measured inside an H100 trusted execution environment with 0 failures across 1,536 measurements, both attestation tokens bound to a common cryptographic root with local and remote verification confirmed, weight hashing byte-identical across CC and standard environments, and structural fingerprints transparent to confidential computing mode. Weight substitution and replay attacks are detected. Tier 2 is classified VALIDATED. Tier 3 (full zero-knowledge extraction) has been validated: a complete circuit was compiled and audited, none of the four pre-registered falsification criteria (F1–F4) was triggered, and 124 adversarial negative tests produced zero false acceptances. The proof system operates at approximately 296K constraints. Tier 3 is classified VALIDATED. The architecture enabled by this full-stack validation — identity-conditioned inference, extending verified fingerprint extraction to verified output binding — is reported in [6]. As the enrollment zoo grows, the minimum pairwise separation \(\Delta_{\min}\) will decrease, eventually compressing the safety margin at any fixed bit-width; a migration path from 16-bit to higher precision must be part of the production protocol when zoo density dictates.
6.3 Implications for the Provider Landscape
The logprob exposure landscape as of March 2026 divides endpoints into three categories: those with full verification capability (\(K \geq 7\) logprobs, suitable for the protocol described in §4), those with partial information (top-1 logprobs only, suitable for coarse species marking), and those with no logprob access. The third category includes reasoning-class model families from one major provider, endpoints from a provider that has permanently disabled logprob access, and models exhibiting degenerate output statistics. The \(K\)-cap contraction observed in one provider (from \(K = 20\) to \(K = 8\) between measurement periods) is a monitored risk. At \(K = 8\), these endpoints operate at the verification floor with a single-rank margin. A further contraction to \(K = 7\) would place them exactly at the floor; \(K = 6\) would push them below it (§4.1). Two additional landscape features emerged from the zoo expansion. First, two Google endpoints (gemini-2.5-flash-lite, gemini-3.1-pro-preview) exhibit fully deterministic logprob outputs across sessions — zero inter-session drift. This is operationally favorable for verification (infinite gap ratio) but fragile: if a model update introduces stochasticity, these endpoints acquire nonzero genuine distances and must be re-enrolled. Second, the asymmetric logprob landscape creates a compliance signal for enterprises with contractual model requirements. An enterprise that has contracted exclusively with a provider whose API does not expose logprobs can treat any logprob response from its AI middleware as proof of unauthorized model substitution — a single probe returning valid logprobs constitutes a policy violation with zero false positives by construction. 
This test applies to the middleware gateway itself: if an enterprise mandates a zero-logprob provider policy, a logprob-returning probe to their own routing layer (e.g., LiteLLM, LangChain) constitutes evidence of unauthorized endpoint routing regardless of the abstraction layers involved. The authorized model is invisible to the probe; any visibility is an alarm. This negative-space signal does not require the full verification protocol and can be deployed as a lightweight compliance check within the INTAKE step.
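The negative-space compliance check reduces to response inspection: under a zero-logprob provider policy, any populated logprob field in a gateway response is an alarm. A minimal sketch, assuming an OpenAI-style JSON response shape (the field names are illustrative assumptions, not taken from this paper):

```python
def logprob_alarm(response: dict) -> bool:
    """Negative-space compliance probe (sketch). Returns True if the
    middleware response carries any populated logprob payload, which under
    a zero-logprob provider policy is evidence of unauthorized routing.
    Recursively walks the response tree so nesting depth does not matter."""
    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if "logprob" in key.lower() and value not in (None, [], {}):
                    return True
                if walk(value):
                    return True
        elif isinstance(node, list):
            return any(walk(item) for item in node)
        return False
    return walk(response)
```

The check never needs the full verification protocol: the authorized model is invisible to the probe, so a single `True` is a policy violation with zero false positives by construction.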
6.4 Relation to Concurrent Work
Model fingerprinting and ownership verification are active research areas with several recent contributions that share aspects of the IT-PUF approach while differing in methodology and security guarantees. Watermarking and proactive methods. Watermarking approaches [7, 11] inject detectable signals during training or generation, enabling post-hoc ownership claims. These methods require operator cooperation (the model provider must embed the watermark) and are vulnerable to removal via fine-tuning or output post-processing. The IT-PUF approach requires no cooperation from the model provider — it extracts identity from intrinsic output geometry. Post-hoc fingerprinting. Shao et al. [10] provide a systematization of knowledge covering fingerprinting approaches for LLM copyright auditing. The majority of methods surveyed rely on scalar distance thresholds (\(L^2\), KL divergence, or cosine similarity) applied to model outputs or internal representations. The alignment diagnostic (§3) identifies a failure mode shared by all such scalar-threshold methods: geographic aliasing, in which movement toward a distillation teacher produces apparent convergence toward an unrelated model that happens to be a geometric neighbor. Any framework measuring scalar convergence in an embedding space is susceptible to this failure unless it incorporates a directional diagnostic. We are not aware of prior work that identifies or addresses this vulnerability. Zhang et al. [12] (REEF, ICLR 2025) address a complementary problem: fingerprinting models that have been fine-tuned after release to detect unauthorized derivative works. Their approach uses representation engineering to identify persistent features that survive fine-tuning. The IT-PUF framework addresses a different threat: distillation (knowledge transfer through the API bottleneck) rather than fine-tuning (direct weight modification).
The two approaches are complementary — REEF addresses the derivative-work threat where the adversary has weights, while IT-PUF addresses the distillation-theft threat where the adversary has only API access. Yoon et al. [13] propose attention-pattern fingerprinting, using the structure of attention maps as a model identifier. This is architecturally specific to Transformer models with standard attention and does not extend to architectures without attention mechanisms (e.g., Mamba SSMs, which are included in the IT-PUF zoo [1]).
The \(\delta\)-gene observable is architecture-agnostic by construction — it measures output-layer geometry, not internal mechanisms. Shao et al. [14] (ZeroPrint, WWW 2026) estimate input-output Jacobians via semantic-preserving word substitutions and use Fisher information to argue that Jacobians encode more parameter identifiability than plain outputs (AUC \(\approx\) 0.72). The IT-PUF framework uses Fisher information in a complementary direction: not to select observables, but to establish formal lower bounds on spoofing cost [1, §6]. The two approaches are methodologically compatible — Jacobian-based observable selection could in principle be composed with IT-PUF's adversarial robustness guarantees — though the combination has not been tested. Interaction with concurrent methods. The present paper's contributions interact with the concurrent landscape at three specific points. First, the alignment diagnostic (§3) addresses a vulnerability that applies to all scalar-threshold fingerprinting methods: geographic aliasing in the metric space. Any framework that defines fingerprint similarity via a single distance number — including REEF's CKA distance, Yoon et al.'s attention-pattern correlation, and ZeroPrint's AUC — is susceptible to the false-spoofing and false-failure modes documented in §3.1. The directional decomposition is not specific to IT-PUF; it applies wherever convergence is measured in an inner-product space. Second, the CRP metric (§4.3) addresses a centroid-collapse scaling failure that would affect any template-matching fingerprinting scheme at production density — the failure is geometric, not method-specific. Third, the zero-knowledge framework (§5) addresses a gap absent from all concurrent work: the question of how forensic evidence can be presented in adversarial legal proceedings without disclosing the assets at issue. Distinguishing properties.
No existing method provides the combination of: (a) formal impossibility proofs for adversarial spoofing (352 theorems, 0 Admitted across the series), (b) empirical validation across both weight-access and API-only regimes with zero false acceptances across 23 + 14 models, (c) a directional provenance diagnostic that corrects the scalar-threshold failure mode shared by all concurrent approaches, and (d) a framework for zero-knowledge attestation. We welcome adversarial attempts to break the protocol; the measurement methodology is disclosed in sufficient detail for independent reproduction of the scientific claims.
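The directional decomposition discussed above needs only the three pairwise distances among baseline \(B\), student \(S\), and teacher \(T\) in template space. A minimal sketch of the law-of-cosines step (the function name is hypothetical; the scalar \(\mathrm{Conv}_T\) metric itself is not reproduced here):

```python
import math

def alignment_diagnostic(d_bt: float, d_bs: float, d_st: float):
    """Law-of-cosines decomposition (sketch): given the pairwise Euclidean
    distances among baseline B, student S, and teacher T, return the cosine
    of the angle at B between the student's displacement B->S and the
    teacher direction B->T, together with the displacement magnitude."""
    cos_theta = (d_bt ** 2 + d_bs ** 2 - d_st ** 2) / (2.0 * d_bt * d_bs)
    return cos_theta, d_bs  # direction, magnitude
```

A `cos_theta` near 1 indicates movement toward the teacher; because direction and magnitude are reported separately, this verdict remains available even when a scalar convergence value is negative or near zero.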
7. Conclusion
Each paper in this series began where the previous one ended. Paper 1 [1] asked whether neural network identity could be measured and answered with a formal theory grounded in extreme value statistics: the \(\delta\)-gene observable is an architecture-invariant fingerprint, verified across 23 models in a Coq proof stack with zero uses of Admitted. Paper 2 [2] asked whether that measurement survived the API wall and answered with a verification protocol that operates through commercial logprob interfaces — no weight access required. Paper 3 [3] asked whether an adversary could erase the fingerprint and answered with an empirical resilience proof: passive fine-tuning is the most effective eraser, white-box adversarial attack does worse, and the geometry forbids simultaneous capability acquisition and trace erasure. This paper asked whether the framework scales — across model families, across architectures, across training protocols, across zoo density — and found that it does. Provenance transfer generalizes across 3 teacher families, 4 student architectures, and 2 training protocols, with the strongest signal arising in a cross-family configuration with no overlap with the original experimental pair. API endpoint verification scales from 6 to 14 models across 3 providers with zero breaches, provided three stacked mechanisms are deployed: centroid reference protocol, multi-session enrollment, and per-model adaptive thresholds. The \(\delta_\mathrm{norm}\) universal constant holds at 1.4% CV across the most diverse population yet measured, including the first MoE architecture. Along the way, we discovered that scalar convergence metrics — the standard tool in distillation forensics — discard the directional information that provenance detection requires. 
The alignment diagnostic is a directional decomposition using the law of cosines: it decomposes student movement into direction (\(\cos\theta\)) and magnitude, preserving trajectory information that distance-based scalars destroy. In two independent experiments, it gave the correct verdict where the scalar metric gave the opposite. The diagnostic is not specific to IT-PUF or to PPP-residual space; it applies to any distillation forensics framework operating in an inner-product space. We recommend it as standard practice alongside scalar convergence metrics, treating \(\mathrm{Conv}_T\) without \(\cos\theta\) as analogous to reporting a \(p\)-value without an effect size. Each paper in this series opened a door and closed a question. Paper 1 [1] proved the physics exists — \(\delta\) is structural, not emergent — and that structural identity is unforgeable under the IT-PUF threat model. It opened: does the fingerprint survive the API wall, where only top-\(K\) logprobs are visible? Paper 2 [2] proved the API wall is transparent to the \(\delta\)-gene observable via PPP residualization, and that API identity is conditionally unforgeable.
It opened: does provenance transfer through distillation, and can the adversary erase it? Paper 3 [3] proved provenance transfers through knowledge distillation and that adversarial erasure is geometrically defeated during active training — the attacker faces a Pareto frontier with no favorable region. It opened: does it generalize beyond the original experimental pair, and does it scale to production density? This paper answered both affirmatively, across 3 teacher families, 4 student architectures, 2 training protocols, and 14 API endpoints. The question it opens — whether forensic evidence can survive adversarial legal proceedings without disclosing the assets under dispute — is the Trust Paradox of §5, where measurement science meets cryptographic engineering. The Trust Paradox drove the zero-knowledge program. Forensic evidence that requires disclosing the very assets under dispute cannot survive adversarial legal proceedings. The three-tier architecture in §5 addressed this by progressively removing trust assumptions — from prover honesty (Tier 1) to hardware integrity (Tier 2) to cryptographic soundness alone (Tier 3). All three tiers are now validated. Tier 1 (committed distance proof) has been implemented and hardened. Tier 2 (hardware-attested measurement) has been validated on production confidential computing hardware. Tier 3 (full zero-knowledge extraction) has been validated: a complete circuit compiled and audited, all four pre-registered falsification criteria met, 124 adversarial negative tests with zero false acceptances. The pre-registered commitment was to report the failure if any criterion was not met. No failure occurred. The δ-gene program demonstrates that neural network identity is not emergent but structural — a consequence of the mathematics of high-dimensional output distributions, not of the training data or the training process. 
This structural identity admits formal impossibility proofs, survives adversarial attack, generalizes across model families, scales to production density, and can now be verified in zero knowledge. The remaining challenge is not scientific but institutional: building the trust infrastructure — zero-knowledge attestation, standardized measurement protocols, legal precedent for algorithmic provenance — that allows this science to inform regulatory and judicial proceedings. The legal frameworks will follow the engineering, not precede it. The breakthrough discoveries enabled by the full-stack validation — identity-conditioned inference, hybrid proof-and-bridge decomposition, output binding — are reported in [6]. The doctrine of zero Admitted extends to every layer of the stack. Six papers, four patents, 352 formally verified theorems across 17 Coq proof files, and zero uses of Admitted. The science is published. The measurements are reproducible. The geometry is structural, not emergent. The computation is verified.
References
[1] Coslett, A. R. (2026a). The \(\delta\)-Gene: Inference-Time Physical Unclonable Functions from Architecture-Invariant Output Geometry. Zenodo. DOI: 10.5281/zenodo.18704275.
[2] Coslett, A. R. (2026b). Template-Based Endpoint Verification via Logprob Order-Statistic Geometry. Zenodo. DOI: 10.5281/zenodo.18776711.
[3] Coslett, A. R. (2026c). The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing. Zenodo. DOI: 10.5281/zenodo.18818608.
[4] Coslett, A. R. (2026d). Which Model Is Running? Structural Identity as a Prerequisite for Trustworthy Zero-Knowledge Machine Learning. Zenodo. DOI: 10.5281/zenodo.19008115.
[5] Anthropic. (2026). Detecting and countering fraudulent use of Claude. February 24, 2026. https://www.anthropic.com/research/detecting-and-countering-fraudulent-use-of-claude
[6] Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531.
[7] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., and Goldstein, T. (2023). A watermark for large language models. ICML 2023.
[8] Leviathan, Y., Kalman, M., and Matias, Y. (2023). Fast inference from transformers via speculative decoding. ICML 2023.
[9] Pappu, R., Recht, B., Taylor, J., and Gershenfeld, N. (2002). Physical one-way functions. Science, 297(5589):2026–2030.
[10] Shao, S., Li, Y., He, Y., Yao, H., Yang, W., Tao, D., and Qin, Z. (2025). SoK: Large language model copyright auditing via fingerprinting. arXiv:2508.19843.
[11] Zhao, X., Ananth, P., Li, L., and Wang, Y.-X. (2023). Provable robust watermarking for AI-generated text. ICLR 2024.
[12] Zhang, J., Wu, D., Chen, S., Zhan, Y., and Zhou, J. T. (2025). REEF: Representation encoding fingerprints for large language models. ICLR 2025.
[13] Yoon, S., Lee, N., and Shin, J. (2025). Fingerprinting large language models via attention pattern analysis. arXiv:2503.01891.
[14] Shao, S., Li, Y., Yao, H., Chen, Y., Yang, Y., and Qin, Z. (2026). Reading between the lines: Towards reliable black-box LLM fingerprinting via zeroth-order gradient estimation. WWW 2026. arXiv:2510.06605.
Acknowledgments
Portions of this research were developed in collaboration with AI systems that served as assistants for formal verification sketching, adversarial review, and manuscript preparation. All scientific claims, formal proofs, and editorial decisions remain the sole responsibility of the author.
Patent Disclosure
The structural fingerprint measurement methodology described in this work is the subject of U.S. Provisional Patent Application 63/982,893 (weights-based identity verification, filed February 13, 2026). The API-based endpoint verification methodology is the subject of U.S. Provisional Patent Application 63/990,487 (filed February 25, 2026). The zero-knowledge attestation architecture described in §5 is the subject of U.S. Provisional Patent Application 63/996,680 (privacy-preserving model identity verification, filed March 4, 2026). The identity-conditioned inference verification architecture, hybrid proof-and-bridge decomposition, selective output verification, and evidence bundle system — the breakthrough results enabled by Tier 3 validation and reported in full in [6] — are the subject of U.S. Provisional Patent Application 64/003,244 (filed March 12, 2026).
Reproducibility Statement
This paper discloses the experimental methodology, metrics, and results in full. Operational parameters of the authentication system — challenge bank contents, bank size, per-model threshold values, and measurement implementation — are withheld as trade secrets protected by the provisional patents. We acknowledge that this creates a reproducibility barrier for the verification claims (0/182 breaches, \(K = 7\) floor). The scientific claims — provenance generalization, alignment diagnostic, \(\delta_\mathrm{norm}\) universality — are independently verifiable from the information in this paper: the alignment diagnostic requires only the law of cosines applied to any three-point distance measurement in an inner-product space, and \(\delta_\mathrm{norm}\) universality can be checked by any party with access to model logits. We will provide synthetic data that reproduces the reported margins and directional diagnostic results upon request (anthony@fallrisk.ai). The reference implementation is available under license for parties seeking to reproduce the verification system end-to-end.
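As the statement above notes, the alignment diagnostic requires only the law of cosines applied to three pairwise distances in an inner-product space. The following is a minimal illustrative sketch, not the withheld reference implementation; the function and argument names are ours. Given the Euclidean distances between baseline \(B\), teacher \(T\), and student \(S\) in template space, the angle at \(B\) between the movement \(B \to S\) and the direction \(B \to T\) follows directly:

```python
def alignment_cos_theta(d_bt: float, d_bs: float, d_st: float) -> float:
    """Directional alignment of student movement toward the teacher.

    d_bt: distance baseline -> teacher
    d_bs: distance baseline -> student
    d_st: distance student -> teacher

    By the law of cosines, the angle theta at the baseline vertex between
    the vectors B->S and B->T satisfies:
        cos(theta) = (d_bs^2 + d_bt^2 - d_st^2) / (2 * d_bs * d_bt)
    A value near +1 indicates the student moved directly toward the
    teacher; near 0, orthogonal movement; near -1, movement away.
    """
    return (d_bs**2 + d_bt**2 - d_st**2) / (2.0 * d_bs * d_bt)


# Collinear case: student halfway along the baseline->teacher segment.
print(alignment_cos_theta(d_bt=2.0, d_bs=1.0, d_st=1.0))  # cos(theta) = 1.0

# Orthogonal case: student movement perpendicular to the teacher direction.
print(alignment_cos_theta(d_bt=1.0, d_bs=1.0, d_st=2**0.5))  # cos(theta) = 0.0
```

Because the formula consumes only the three distances, any party who can measure \(d(B,T)\), \(d(B,S)\), and \(d(S,T)\) in their own template space can reproduce the directional diagnostic without access to the withheld operational parameters.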
Epistemological Classification
| Section | Content | Status |
|---|---|---|
| §2 | Multi-architecture provenance transfer | VALIDATED |
| §3 | Alignment diagnostic methodology | VALIDATED |
| §3.3 | Measurability threshold (\(d(B,T) > 1.0\)) | VALIDATED (empirical observation in PPP-residual space) |
| §3 | \(\cos\theta\) classification boundaries | CHOSEN (operational conventions, sensitivity-verified) |
| §4.1 | \(K \geq 7\) truncation floor | VALIDATED |
| §4.2 | Speculative decoding transparency | VALIDATED |
| §4.3 | 14-model zero-breach result | VALIDATED |
| §5 Tier 1 | Committed distance proof | VALIDATED |
| §5 Tier 2 | Hardware-attested measurement | VALIDATED (6 models, 1,536 measurements, 0 failures in CC enclave; CC-transparent fingerprints confirmed) |
| §5 Tier 3 | Full ZK extraction | VALIDATED (circuit compiled and audited; 4/4 falsification criteria met; 124 adversarial negatives, 0 false acceptances) |
| §5.4 | Eight Properties doctrine | |
| §5.5 | Falsification criteria | |