Abstract
Reasoning distillation does not leave uniform traces across target base families. In the current sample, we measure structural and functional identity across reasoning-distillation derivatives in three base-architecture families (Llama, Qwen, and Mistral) at five model scales (1.5B to 70B). Structural displacements are family-graded: Mistral-family targets show scars of 7,701–8,518 times the acceptance threshold, Llama-family targets show 2,858–4,583 times, and Qwen-family targets show 141–516 times — a sixty-fold range across three families, with the third-family result persisting under an independently trained derivative using different training data. Functional consequences do not track structural magnitude uniformly: Llama derivatives show decisive functional hierarchy breaks, Qwen derivatives remain within their base neighborhood, and Mistral — despite having the loudest structural scar — shows only marginal functional displacement. The functional departure is low-rank at every tested scale but varies in character: G₁-dominant in Llama and Qwen families, with a sign-oscillating morphology in Mistral that suppresses centroid-level G₁ signal while preserving per-prompt dominance. The stiffness parameter at the measurement site is inversely ordered with structural scar magnitude across all three families. Fisher curvature, previously proposed as a candidate mechanism at small scale, does not order scar magnitudes correctly at production scale across families. These findings change how derivative identity claims should be interpreted: the expected displacement depends on the architectural context of the distillation, and the structural and functional layers can decouple — a model may show the loudest structural scar in the dataset while absorbing the functional perturbation.
1. Introduction
Reasoning distillation does not leave interchangeable traces across all target base families. The same teacher model can produce loud structural and functional displacement in one base family and quiet displacement in another, even when the declared distillation source is identical, a pattern consistent with family mediation rather than a uniform distillation effect. This paper measures that divergence directly across reasoning-distillation derivatives built on Llama, Qwen, and Mistral base models from 1.5B to 70B parameters.
Earlier work established that a neural network's structural identity and its functional behavior can diverge under distillation [12]: the structural fingerprint remains close to the pretrained baseline while the functional output partially inherits the teacher's characteristics [3, 4]. That separation holds across multiple teacher-student architectures, survives adversarial erasure, and is formally grounded in an admissibility framework that prohibits substituting evidence from one identity layer for claims about another [7, 8]. But these results left open an implicit assumption: that distillation is a single phenomenon with roughly uniform structural consequences across target families. This paper reports the result of testing that assumption. It does not survive.
We measured structural and functional identity across reasoning-distillation derivatives in three base-architecture families at five model scales. The structural scars are family-graded: Mistral-family targets show displacements of 7,701–8,518 times the acceptance threshold, Llama-family targets show 2,858–4,583 times, and Qwen-family targets show 141–516 times — a sixty-fold range. The functional consequences do not track structural magnitude uniformly. Llama derivatives are functionally farther from their bases than from unrelated models; Qwen derivatives remain within their base neighborhood; and Mistral derivatives — despite carrying the loudest structural scar in the dataset — show only marginal functional displacement. The structural and functional identity layers can therefore decouple under distillation, as this third-family test shows; the decoupling is consistent with the admissibility framework but had not previously been observed empirically.
An earlier report in this series documented three of these pairs and noted a suggestive monotonic trend: structural separation appeared to increase with model scale [11]. That report explicitly caveated the observation as exploratory and flagged family confounding, with Llama bases at the extremes and Qwen only in the middle. The present work shows that those caveats were material: the apparent monotonic trend was largely an artifact of sample composition and seed concentration. The underlying pattern in the current sample is family-dependent rather than monotonically scale-dependent.
This changes the interpretation rule for model forensics and AI governance. If distillation consequences are family-graded, then forensic and attestation systems cannot assume a uniform "distillation signature." An enterprise deploying model-identity verification, a regulator auditing AI supply chains, or a model provider defending against unauthorized distillation should calibrate their expectations to the base family. A Qwen-based derivative may look structurally quiet — not because it is unmodified, but because the perturbation is expressed with smaller structural displacement in that family. A Mistral-based derivative may show the loudest structural scar in the dataset while absorbing the functional perturbation with minimal displacement. Structural loudness does not guarantee functional displacement — the layers respond through different mechanisms and can decouple.
The remainder of the paper is organized as follows. Section 2 introduces the measurement framework for readers unfamiliar with the series. Section 3 presents the structural family-dependence data across five distillation pairs in three families. Section 4 presents the functional family-dependence data across five triangles at three Qwen scales, one Llama scale, and one Mistral scale, including a shared-comparator robustness gate and the centroid-cancellation pathologies that affect two results. Section 5 analyzes the dominant mode of the functional departure across families and scales, including a sign-oscillating morphology observed in Mistral that is absent in the other families. Section 6 presents the cross-layer relationship: the structural and functional patterns correlate for Llama and Qwen but decouple for Mistral, extending the admissibility framework empirically. Section 7 presents mechanistic measurements: stiffness at the measurement site is inversely ordered with scar magnitude across three families, while Fisher curvature — tested at production scale — does not order them correctly. Section 8 discusses limitations and open questions, and Section 9 concludes.
2. Measurement Framework
This paper uses two measurement instruments developed in earlier work. This section provides the minimum context needed to interpret the results; full derivations and validation are in the cited references.
2.1 Structural Identity
A neural network's structural identity is a compact summary of how its internal geometry shapes output logits during inference. The measurement works by presenting a fixed set of challenge prompts to the model, collecting the logit distributions at designated internal sites along the layer-normalization-to-output-projection pathway, and computing a vector (denoted τ) that captures the cross-token statistical geometry of those distributions [1]. The challenge prompts are drawn from a fixed bank of 512 prompts spanning multiple task categories; the bank is described in [1] and its contents are held constant across all measurements in this series. Two models produce similar τ vectors if and only if they process the challenge prompts through similar internal geometry; they produce different τ vectors if their internal geometry differs, even if their outputs are superficially similar. The structural distance between two models is the L2 distance between their τ vectors. The acceptance threshold ε ≈ 10⁻⁴ is the noise floor of the measurement: re-measuring the same model under identical conditions produces distances below ε. Distances above ε reflect genuine structural differences. Throughout this paper, distances are reported as multiples of ε (e.g., "2,858×ε" means the structural distance is 2,858 times the measurement noise floor). The measurement uses four independent seeds — random number generator states that determine which prompts from a fixed bank are selected and in what order — to produce four independent distance measurements per model pair.
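As an illustrative sketch of the reporting convention only (the canonical measurement protocol is in [1], and the fingerprint values below are hypothetical), the final step reduces to an L2 distance expressed in multiples of ε:

```python
import numpy as np

def structural_distance_in_eps(tau_a, tau_b, eps=1e-4):
    """L2 distance between two structural fingerprints (tau vectors),
    reported as a multiple of the acceptance threshold eps."""
    return np.linalg.norm(np.asarray(tau_a) - np.asarray(tau_b)) / eps

# Hypothetical fingerprints: a uniform 0.01 displacement per dimension.
tau_base = np.zeros(64)
tau_deriv = np.full(64, 0.01)
ratio = structural_distance_in_eps(tau_base, tau_deriv)
# norm = 0.01 * sqrt(64) = 0.08, i.e. 800 times the eps = 1e-4 noise floor
```

Re-measuring the same model corresponds to a ratio below 1; the scars in Section 3 are hundreds to thousands of times this floor.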
2.2 Functional Identity
A neural network's functional identity is measured through its output logit-gap spectrum. For each generated token, the model's top-K logits are sorted in descending order and the gaps between successive ranks are computed (G₁ = gap between rank 1 and rank 2, G₂ = gap between rank 2 and rank 3, and so on through G_K). These raw gaps are then PPP-residualized: a Power-Prior-Predictive fit (a 1/k decay model) is subtracted from each token's gap spectrum, removing the scale factor that varies across models and leaving a residual vector that captures the model's characteristic gap shape [2]. The template is the mean residual vector across all tokens and prompts — a K-dimensional summary of how the model distributes its confidence across logit ranks, after removing the shared scale structure [2]. Related approaches to analyzing token-level prediction geometry from internal representations exist in the interpretability literature [16]; the present method differs in operating on output logit gaps rather than intermediate-layer projections. The winner-gap residual (G₁) is the first component of this template, corresponding to how much the model's top-choice logit separates from its runner-up beyond what the power-law baseline predicts. Functional distances between models are computed as centroid L2 distances between templates, with per-prompt distances (CRP: per-prompt distance averaging across the challenge bank) used as a robustness check when centroid distances show cancellation pathologies.
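A minimal sketch of the residualization step, assuming the 1/k decay model is fit per token by least squares (the exact fitting procedure in [2] may differ):

```python
import numpy as np

def ppp_residualize(logits_topk):
    """PPP-residualize one token's top-K logits: sort descending, take
    successive gaps G_1..G_{K-1}, subtract a least-squares 1/k decay fit.
    (Sketch; assumes a least-squares scale fit, which may differ from [2].)"""
    desc = np.sort(np.asarray(logits_topk))[::-1]
    g = -np.diff(desc)                      # nonnegative gaps between ranks
    k = np.arange(1, len(g) + 1)
    basis = 1.0 / k                         # power-prior 1/k decay shape
    a = (g @ basis) / (basis @ basis)       # per-token scale factor
    return g - a * basis                    # residual: the gap *shape*

def template(per_token_logits):
    """Template: mean residual vector across all tokens and prompts."""
    return np.mean([ppp_residualize(t) for t in per_token_logits], axis=0)
```

A token whose gaps follow the 1/k law exactly residualizes to zero, so the template isolates model-specific departures from the shared decay; its first component is the winner-gap residual G₁.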
2.3 Terminology
The δ-norm is a scale-free thermodynamic observable (the normalized third logit gap) that serves as a validation check — it should fall near 0.318 for any well-formed Transformer regardless of family or scale [1]. It is not the primary measurement instrument in this paper but appears in validation gates. The admissibility framework [8] prohibits using evidence from one identity layer (structural, functional, or thermodynamic) to certify claims about another. Structural measurements cannot prove functional provenance, and vice versa. This constraint is relevant in Section 6 where the structural and functional patterns both correlate and decouple. The Fisher curvature κ_F measures how much Fisher information the model concentrates in the structural observable's direction at the measurement site. The damping diagnostic ρ_F = κ_F / λ (where λ is a regularization parameter) indicates whether the Fisher measurement reflects genuine model geometry (ρ_F ≥ 1) or is dominated by the regularization itself (ρ_F < 1). A damping-dominated measurement is not interpretable as Fisher curvature and is excluded from cross-model comparison.
3. Structural Family-Dependence
A neural network's structural identity remains anchored to its base lineage under distillation: previous work on distillation forensics (The Geometry of Model Theft [3]) showed that a distilled derivative's structural fingerprint stays far closer to its own pretrained baseline than to its teacher's. This extends the broader model fingerprinting literature [13, 14] to the distillation-specific setting. What was not previously measured is whether the magnitude of the structural scar — the displacement between the derivative and its base — depends on the base family.
We measured structural distance for five reasoning-distillation pairs across three base-architecture families (Llama, Qwen, and Mistral) at five scales. Four pairs are official DeepSeek-R1 distillations with declared lineage. The fifth — Dolphin 3.0 R1 on Mistral-Small-24B — is a community distillation using an independent 800k reasoning-trace dataset, providing a recipe-agnostic third-family test. Structural distances were measured using the canonical measurement protocol with four independent seeds per model [1].
Table 1. Structural separability of reasoning-distillation pairs. Non-max seeds column reports all seeds excluding the maximum-distance seed; max seed reported separately to expose concentration anomalies.
| Derivative | Base Family | Scale | Non-Max Seeds (×ε) | Max Seed (×ε) |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B | Llama | 8B | 1,264–2,700 | 2,858 |
| DeepSeek-R1-Distill-Qwen-14B | Qwen | 14B | 449–516 | 3,616 |
| DeepSeek-R1-Distill-Qwen-32B | Qwen | 32B | 141–151 | 273 |
| DeepSeek-R1-Distill-Llama-70B | Llama | 70B | 1,815–4,321 | 4,583 |
| Dolphin-3.0-R1-Mistral-24B (community) | Mistral | 24B | 7,701–8,234 | 8,518 |
A three-family structural split is visible in the current sample. Mistral produces the loudest structural scars (7,701–8,518×ε), Llama produces intermediate scars (1,264–4,583×ε), and Qwen produces the quietest scars (141–516×ε). All five pairs are decisively above the operational detection floor — the quietest Qwen pair at 141×ε exceeds the closest known adversarial attack (10.7×ε) by more than an order of magnitude. The quiet scars are quiet relative to Llama and Mistral, not quiet relative to the measurement.
The Mistral result is particularly informative because it uses a different training recipe (community 800k reasoning traces rather than DeepSeek's official dataset). The persistence of the family-graded pattern across recipes strengthens the interpretation that base-family geometry plays a major role in scar magnitude, even across recipe changes.
The max-seed distance at 14B (3,616×ε) initially appeared to place Qwen within the Llama range, which suggested the monotonic trend reported as exploratory in Post-Hoc Disclosure Is Not Runtime Proof [11]. Closer examination revealed a concentration anomaly: seed 123 at 14B has a Gini coefficient of 0.96, with a single measurement dimension carrying 50.2% of the total squared distance. The remaining seeds cluster at 449–516×ε. These anomalies appear to reflect specific seed-to-dimension alignments in the measurement rather than broad changes in the distillation process itself.
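The concentration diagnostic can be sketched as follows (illustrative only; the canonical dimension-level bookkeeping is in [1]). A Gini coefficient near 1 over per-dimension squared contributions flags a seed whose distance is carried by a handful of dimensions:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a nonnegative vector: 0 = uniform spread,
    approaching 1 = concentrated in a single entry."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def concentration_report(tau_a, tau_b):
    """Per-dimension squared contributions to the squared structural
    distance, summarized as (Gini, share of the largest dimension)."""
    sq = (np.asarray(tau_a) - np.asarray(tau_b)) ** 2
    return gini(sq), float(sq.max() / sq.sum())
```

Under this convention, the anomalous seed 123 measurement would report a Gini near 0.96 with a top-dimension share near 0.50, while the remaining seeds would report far lower concentration.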
In the current sample, family differences dominate the earlier monotonic-scaling interpretation; the data do not support a monotonic within-family increase with scale.
4. Functional Family-Dependence
If structural scars are family-graded, is the functional layer similarly affected? We measured functional identity using PPP-residualized logit-gap templates (Template-Based Endpoint Verification via Logprob Order-Statistic Geometry [2]), constructing triangles of three models — a base, its declared-lineage derivative, and a cross-family reference — and comparing the distillation distance (base to derivative) against the cross-family distance (base to cross-family reference).
A hierarchy break occurs when the derivative is functionally farther from its base than the cross-family reference is — meaning the distillation displaced the model's functional geometry past what mere family membership would predict. The ratio D/C (distillation distance over cross-family distance) quantifies this: D/C > 1 indicates a hierarchy break; D/C < 1 indicates the derivative remains functionally within its base's neighborhood.
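As a sketch in code (template values hypothetical), the verdict reduces to a ratio of two centroid distances:

```python
import numpy as np

def hierarchy_break_verdict(base_t, deriv_t, crossfam_t):
    """D/C ratio from three functional templates.
    D = base-to-derivative distance; C = base-to-cross-family distance.
    D/C > 1 means the derivative sits farther from its base than an
    unrelated model does: a hierarchy break."""
    D = np.linalg.norm(np.asarray(base_t) - np.asarray(deriv_t))
    C = np.linalg.norm(np.asarray(base_t) - np.asarray(crossfam_t))
    ratio = D / C
    return ratio, ("break" if ratio > 1.0 else "no break")
```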
4.1 Experimental Design
We constructed five functional triangles spanning three Qwen scales, one Llama scale, and one Mistral scale. Each triangle uses a scale-appropriate cross-family reference to reduce scale confounds in the cross-family distance. A shared comparator was used to build secondary triangles as a binding robustness gate: no paper-level outcome was declared unless the primary and shared-comparator triangles agreed on the binary break/no-break verdict.
All models were measured using greedy decoding with 40 canonical prompts, producing PPP-residualized templates of 20 dimensions (gap spectrum depth K=20). Self-baseline distances were computed from five random half-splits of each model's prompt responses.
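The half-split self-baseline can be sketched as follows (assuming the per-prompt residual vectors are already computed; function and argument names are illustrative):

```python
import numpy as np

def self_baseline(per_prompt_residuals, n_splits=5, seed=0):
    """Self-baseline: mean centroid distance between templates built from
    two random halves of one model's own prompt responses. Distances to
    other models below this floor are not interpretable as displacement."""
    rng = np.random.default_rng(seed)
    x = np.asarray(per_prompt_residuals)
    dists = []
    for _ in range(n_splits):
        idx = rng.permutation(len(x))
        half_a, half_b = x[idx[: len(x) // 2]], x[idx[len(x) // 2:]]
        dists.append(np.linalg.norm(half_a.mean(axis=0) - half_b.mean(axis=0)))
    return float(np.mean(dists))
```

This floor is the quantity against which the centroid pathologies of §4.4 and §4.5 are detected: a cross-model centroid distance below a model's own self-baseline is treated as uninterpretable.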
4.2 Results
Table 2. Functional hierarchy-break analysis across five triangles. D/C = distillation distance / cross-family distance; D/C > 1 indicates the derivative is functionally farther from its base than from an unrelated model.
| Scale | Family | D/C (Primary) | D/C (Shared) | Agree? | Verdict |
|---|---|---|---|---|---|
| 1.5B | Qwen | 0.54× | 0.52× | Yes | No break (robust) |
| 7B | Qwen | 0.48× | 0.41× | Yes | No break (robust) |
| 8B | Llama | 1.66× | 5.59× | Yes | Break (robust) |
| 14B | Qwen | 1.07× | 0.99× | No | Marginal / reference-sensitive |
| 24B | Mistral | 1.15× | 2.34× | Yes | See §4.5 |
The Llama-8B triangle shows a robust hierarchy break under both references. The three Qwen triangles show absent-to-marginal break: robustly absent at 1.5B and 7B (D/C = 0.41–0.54), and reference-sensitive at 14B. The Mistral result requires separate analysis (§4.5).
Contextual reference: an earlier frontier measurement at 70B [11] found D/C = 1.33 for the Llama-70B pair. This measurement was conducted under the equivalent logprob-domain PPP protocol whose domain-invariance is established in the companion note [forthcoming]; it is not part of the matched triangle design, but it is consistent with the Llama pattern.
4.3 The 14B Reference Sensitivity
The 14B Qwen result appears reference-sensitive because of a geometric coincidence in PPP space rather than measurement instability. The derivative and the shared comparator have nearly identical values of their leading PPP residual component, differing by only 0.030 in a dimension where the base-to-derivative displacement is 1.09. This is the same capability-topology mechanism documented in The Geometry of Model Theft [3, §5.4], where models at similar capability levels were PPP-space neighbors regardless of corporate lineage.
4.4 The 8B Shared-Comparator Pathology
The 8B shared-comparator result (D/C = 5.59) is inflated by centroid-averaging cancellation, not by a genuine amplification of the hierarchy break. The centroid distance between Llama-8B and Mistral-24B (0.186) is below Llama-8B's own self-baseline distance (0.264), while the per-prompt distance (CRP) between the same pair is 1.09 — a 5.9× discrepancy. This is a known pathology of centroid-based distance metrics at high model density (Provenance Generalization and Verification Scaling [4]). The primary reference result (D/C = 1.66) is the more reliable estimate.
4.5 The Mistral Functional Result: Centroid vs. Per-Prompt Interpretation
The Mistral functional triangle requires careful interpretation because centroid and per-prompt metrics diverge.
The centroid-based D/C ratios (1.15 primary, 2.34 shared) both indicate a hierarchy break, and the centroid-based references agree on the binary verdict. However, this verdict is not stable under per-prompt interpretation. The centroid distillation distance (0.392) is below the Mistral base model's own self-baseline distance (0.403), a ratio of 0.97. When a centroid distance falls below the model's self-noise floor, it is not interpretable as a genuine functional displacement. We therefore base the paper-level functional classification for Mistral on CRP rather than centroid distance: the centroid falls below the self-baseline, and the shared comparator is itself centroid-pathological.
The per-prompt distance (CRP) tells a different story: CRP D/C under the primary reference is approximately 1.02 — essentially no functional hierarchy break. The shared-comparator centroid (D/C = 2.34) is inflated by the same centroid-cancellation pathology documented in §4.4: the centroid distance between Mistral-24B and Qwen-14B (0.167) is far below the Mistral base's self-baseline (0.403), while the CRP between the same pair is 1.483 — an 8.9× discrepancy.
The most defensible interpretation: Mistral shows the loudest structural scar in the dataset (§3) but only marginal-to-absent functional displacement under per-prompt analysis. This decoupling is discussed in §6.
5. The Dominant Mode of Functional Departure
Across the five functional triangles plus the 70B contextual reference, the functional departure between base and derivative is low-rank at every tested scale, but the character of the dominant mode varies by family.
The domain-invariance of PPP gap measurements — which we prove in a companion note (Gap Invariance: Why PPP Measurements Are Domain-Independent by Construction [forthcoming]) — ensures that the concentrations reported here are properties of the models, not artifacts of whether the measurement was conducted in logit or logprob space.
Table 3. Winner-gap (G₁) concentration across tested scales. G₁ is the first component of the PPP-residualized template. "G₁ Share" is the fraction of total squared centroid distance attributable to G₁. "Direction" indicates whether the derivative's G₁ is larger (Up) or smaller (Down) than the base's, with the parenthetical value showing the signed displacement in PPP-residualized units.
| Scale | Family | G₁ Share (centroid) | G₁ Direction | Top-3 Share | Note |
|---|---|---|---|---|---|
| 1.5B | Qwen | 99.9% | Up (+1.15) | 99.9% | |
| 7B | Qwen | 73.3% | Down (−0.25) | 98.5% | Isolated reversal |
| 8B | Llama | 97.8% | Up (+1.02) | 99.9% | |
| 14B | Qwen | 95.0% | Up (+1.09) | 99.3% | |
| 70B | Llama | ~98% | Up | — | Contextual |
| 24B | Mistral | 0.6% | Down (−0.03) | 81.8% | Sign-oscillating; see §5.1 |
At four of the five Llama/Qwen scales, reasoning distillation acts predominantly as a winner-gap amplifier: it widens the gap between the model's top-ranked logit and its runner-up, concentrating more than 95% of the total functional displacement in this single dimension. The 7B Qwen derivative is an isolated exception where G₁ reverses. The Mistral derivative introduces a qualitatively different functional response mode (§5.1).
5.1 The Mistral Sign-Oscillation Morphology
The centroid G₁ share for Mistral (0.6%) is misleading because it masks a large per-prompt signal that cancels in aggregation.
Per-prompt analysis reveals that G₁ is the dominant component of the functional displacement in 27 of 40 prompts (68%), with G₁ share exceeding 50% in 23 of 40 prompts. In this respect, Mistral behaves like every other tested family: the winner-gap carries most of the prompt-level functional difference. But in Llama and Qwen, the winner-gap displacement points the same direction across nearly all prompts — the derivative is consistently more confident (or consistently less confident) than the base. In Mistral, it splits: 18 prompts show the derivative more confident in its top choice than the base, while 22 show the derivative less confident. The per-prompt G₁ displacement is large (mean absolute value = 0.87), but the net is nearly zero (mean signed value = −0.03).
This produces 29.3× centroid cancellation: the centroid sees only the small residual of a large bidirectional signal. By contrast, the G₂ component is directionally consistent (31/40 prompts positive, cancellation ratio 1.2×), which is why G₂ appears dominant in the centroid view while G₁ is dominant per-prompt.
The practical consequence is that centroid-based functional distance metrics are unreliable for Mistral-family derivatives. A centroid-based forensic system would classify this derivative as functionally near-identical to its base, missing the large per-prompt displacements that split by direction. CRP (per-prompt distance averaging) captures the signal because it operates on absolute distances, not signed centroids.
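The cancellation mechanism can be reproduced on synthetic data (all numbers below are illustrative, chosen to mimic the sign-split described above, not measured values):

```python
import numpy as np

rng = np.random.default_rng(7)
n_prompts, K = 40, 20

# Synthetic per-prompt template differences (derivative minus base).
# G1 (index 0) is large per prompt but splits in sign, 18 prompts up
# and 22 down, mimicking the Mistral sign-oscillation morphology.
diffs = 0.02 * rng.standard_normal((n_prompts, K))
signs = np.where(np.arange(n_prompts) < 18, 1.0, -1.0)
diffs[:, 0] = signs * 0.87

centroid_dist = np.linalg.norm(diffs.mean(axis=0))   # signed mean: cancels
crp = np.mean(np.linalg.norm(diffs, axis=1))         # per-prompt: survives
cancel = np.mean(np.abs(diffs[:, 0])) / abs(np.mean(diffs[:, 0]))
# crp comes out roughly an order of magnitude larger than centroid_dist
```

With this 18/22 split the cancellation ratio is 10×; the measured Mistral split produces 29.3×. The qualitative point is the same: a large bidirectional per-prompt signal leaves only a small residual in the centroid.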
This morphology — large G₁ displacement that splits in direction across prompts — has not been observed in any Llama or Qwen derivative in the current sample. In those families, G₁ displacement is directionally consistent across prompts. Whether the sign oscillation reflects a property of the Mistral base architecture, the Dolphin training recipe, or some interaction between the two is not determined by this experiment. What it does establish is that the functional response to distillation varies not only in magnitude across families, but in character — a distinction that matters for the cross-layer relationship examined next.
6. The Cross-Layer Relationship
The structural and functional findings of the preceding sections reveal a family-dependent relationship between identity layers that is more complex than uniform correlation.
Table 4. Cross-layer summary of distillation consequences. "Derivative CV Effect" is the ratio of the derivative's PPP template coefficient of variation to the base's.
| Family | Structural Displacement | Functional Hierarchy Break | Derivative CV Effect | Cross-Layer Pattern |
|---|---|---|---|---|
| Llama | Loud (2,858–4,583×ε) | Decisive (D/C = 1.33–1.66) | Doubled (2.03–2.04×) | Coupled: both layers respond loudly |
| Qwen | Quiet (141–516×ε) | Absent-to-marginal (D/C = 0.48–1.07) | Inconsistent (0.85–1.45×) | Coupled: both layers respond quietly |
| Mistral | Loudest (7,701–8,518×ε) | Marginal (CRP D/C ≈ 1.02) | Stable (0.95×) | Decoupled: loud structural scar, marginal functional displacement |
For Llama and Qwen, the structural and functional layers respond to the same distillation event with correlated magnitudes: loud structural scars co-occur with decisive functional breaks (Llama), and quiet structural scars co-occur with absent functional breaks (Qwen).
The Mistral result reveals that this correlation is not universal. Mistral carries the loudest structural scar in the dataset — roughly twice the loudest Llama pair — yet its functional displacement is marginal under per-prompt analysis. The structural and functional layers decouple: the same perturbation that produces a large structural displacement passes through the functional layer with minimal net effect.
This decoupling is consistent with the formal admissibility framework (What Counts as Proof? [8]), which prohibits using evidence from one identity layer to certify claims about another. The Mistral result demonstrates why: a structural measurement showing 8,518×ε displacement would lead to a confident inference of functional displacement if cross-layer correlation were assumed. That inference would be wrong. The layers are operationally independent, and the Mistral family provides an empirical demonstration of this possibility.
6.1 The Derivative CV Effect
An independent functional observable tracks the family effect partially. Reasoning distillation consistently doubles the PPP template coefficient of variation for Llama derivatives: 2.03× at 8B and 2.04× at 70B. Qwen derivatives show no consistent pattern (0.85–1.45×). The Mistral derivative shows 0.95× — no functional-noise amplification despite carrying the loudest structural scar. This reinforces the decoupling: structural vulnerability does not predict functional volatility.
7. Mechanistic Measurements: Stiffness and Fisher Curvature
The preceding sections establish that distillation consequences are family-graded in magnitude, mode, and cross-layer coupling. This section reports two mechanistic measurements conducted at production scale to test whether static architectural properties predict scar magnitude.
7.1 Stiffness at the Measurement Site
The stiffness parameter S = γ × σ_w — the product of the RMSNorm gain and the output-projection weight standard deviation at the exact site where the structural observable is measured — was computed for all three base families.
Table 5. Stiffness at the structural measurement site. Only models with matched structural scar measurements from §3 are included.
| Model | Family | Scale | S (measurement site) | Structural Scar (non-max ×ε) |
|---|---|---|---|---|
| Mistral-Small-24B-Instruct | Mistral | 24B | 0.0165 | 7,701–8,234 |
| Llama-3.1-8B-Instruct | Llama | 8B | 0.0334 | 1,264–2,700 |
| Qwen-2.5-14B-Instruct | Qwen | 14B | 0.0445 | 449–516 |
Stiffness for Qwen-2.5-7B-Instruct (S = 0.0455) is comparable to Qwen-14B (S = 0.0445), confirming within-family consistency; it is omitted from the scar comparison because no matched Qwen-7B distillation pair exists in this study.
Lower stiffness is ordered with louder structural scars across all three families. The ordering is monotonic: Mistral (lowest S, loudest scar), Llama (middle S, middle scar), Qwen (highest S, quietest scar). This is consistent with the geometric-buffer hypothesis: high RMSNorm gain at the measurement site may absorb distillation-induced perturbations, while low gain leaves the structural geometry more exposed.
Stiffness is a static weight-level observable — it requires no inference, no gradients, and no challenge prompts. It provides an independently measurable architectural signature that predicts, in the current three-family sample, which families will produce loud versus quiet structural scars under reasoning distillation.
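Because stiffness is a pure weight-level quantity, it can be sketched in a few lines (assuming access to the RMSNorm gain vector and the output-projection weight matrix at the measurement site; site selection follows [1], and reducing the vector gain to a scalar γ via its mean is an assumption of this sketch):

```python
import numpy as np

def stiffness(rmsnorm_gain, out_proj_weight):
    """Stiffness S = gamma * sigma_w at the measurement site: the RMSNorm
    gain (reduced to a scalar by its mean, an assumption of this sketch)
    times the standard deviation of the output-projection weights.
    No inference, gradients, or challenge prompts required."""
    gamma = float(np.mean(rmsnorm_gain))
    sigma_w = float(np.std(out_proj_weight))
    return gamma * sigma_w
```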
7.2 Fisher Curvature at Production Scale
Fisher curvature κ_F — the directional curvature of the Fisher information matrix in the structural observable's direction [15] — was measured at production scale (7B–24B) to test whether the 100× Llama/Qwen asymmetry previously observed at ~1B scale [1, §4] persists at the scales where distillation scars are measured.
Table 6. Fisher curvature at the structural measurement site.
| Model | Family | Scale | κ_F | ρ_F | Damping Status |
|---|---|---|---|---|---|
| Mistral-Small-24B-Instruct | Mistral | 24B | 153.67 | 15.37 | Fisher-dominated |
| Qwen-2.5-7B-Instruct | Qwen | 7B | 60.26 | 6.03 | Fisher-dominated |
| Qwen-2.5-14B-Instruct | Qwen | 14B | 25.73 | 2.57 | Fisher-dominated |
| Llama-3.1-8B-Instruct | Llama | 8B | 8.60 | 0.86 | Damping-dominated |
Fisher repeatability gate: κ_F measured on two independent prompt subsets for Qwen-7B differed by 1.58× (gate threshold: 3.0×). Fisher measurements passed the pre-registered reliability gate.
However, the Fisher ordering does not match scar magnitude across families. The ~1B result (Llama κ_F ≈ 3,620 >> Qwen κ_F ≈ 36) predicted Llama-loud and Qwen-quiet, which matched the two-family observation. At production scale, the ordering reverses for Qwen vs Llama: Qwen shows higher Fisher curvature (25.7–60.3) than Llama (8.6), yet Qwen produces quieter scars. The Llama-8B measurement is damping-dominated (ρ_F = 0.86 < 1.0), meaning its κ_F value reflects the damping regularization rather than genuine Fisher geometry, further limiting cross-family comparison.
Mistral's Fisher curvature (153.67) is consistent with its loud scar, but a single consistent data point does not rescue the cross-family prediction. Two interpretations are possible and should not be conflated. First, Fisher curvature may genuinely fail to predict scar magnitude across families at production scale — the mechanism proposed at ~1B does not generalize. Second, the Llama-8B measurement may be unreliable because it is damping-dominated (ρ_F < 1.0), meaning the apparent ordering reversal could reflect a measurement limitation rather than a genuine Fisher property. Both interpretations lead to the same operational conclusion: Fisher curvature as measured here cannot be used as a cross-family scar predictor at production scale. Whether a differently damped measurement or a larger Llama model would restore the ordering is an open question. Stiffness (§7.1) remains the better-supported mechanistic candidate in the current sample because it does not depend on gradient-based estimation and produces an unambiguous ordering across all three families.
7.3 The Formation Connection
If Fisher curvature does not explain the family gradient at production scale, the remaining explanation may lie upstream — in how family-specific geometries are shaped during pretraining. The identity formation account (Where Identity Comes From [10]) establishes that structural identity stabilizes during pretraining through a path-sensitive developmental process. If different families undergo different formation dynamics, then the character of the locked structural geometry may itself be family-dependent. The available measurements are consistent with different families locking in geometries with different stiffness profiles at the measurement site. Both are locked by the time distillation begins, but they locked differently, and the different locked geometries respond differently to the same perturbation.
This connects formation [10] to forensics (the present work): the structural vulnerability of a model to distillation-induced scarring may be a downstream consequence of its developmental trajectory.
8. Limitations and Open Questions
The findings reported here are bounded by the current sample and the available distillation pairs.
Declared lineage. This paper treats publicly stated base-model lineage as given. The stated lineage has not been independently verified through structural provenance measurement; the structural measurements in Section 3 are consistent with the declared lineage, but consistency is not verification.
Tokenizer and vocabulary. A potential concern is mismatch in tokenizer vocabulary size (Llama 128k, Mistral 131k, Qwen 152k). However, the structural observable τ is extracted at internal hook sites that operate entirely in hidden-dimension space upstream of the output vocabulary projection; vocabulary dimensionality is irrelevant. The functional PPP templates are computed on sorted top-K logit gaps after power-law residualization; these depend only on the relative ordering and spacing of the highest-probability tokens, not on the total number of logits. The tokenizer confound does not apply to the observables used in this work.
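The vocabulary-invariance argument can be demonstrated directly. The sketch below is a simplified illustration with invented numbers: it computes sorted top-K logit gaps (omitting the power-law residualization step the paper applies) for two synthetic logit vectors that share the same top-8 entries but have different "vocabulary" sizes, showing that the observable is identical regardless of the tail.

```python
import numpy as np

def topk_gaps(logits, k=8):
    """Gaps between consecutive sorted top-k logits. Depends only on the
    k highest entries, never on the total number of logits."""
    top = np.sort(np.asarray(logits, dtype=float))[::-1][:k]
    return -np.diff(top)  # positive gaps g_i = z_(i) - z_(i+1)

rng = np.random.default_rng(0)
top = np.array([12.0, 9.5, 9.1, 7.0, 6.2, 6.0, 5.5, 5.4])

# Tails far below the top-8, at two different "vocabulary" sizes.
tail_128k = rng.uniform(-20, -10, size=128_000 - 8)  # Llama-like vocab
tail_152k = rng.uniform(-20, -10, size=152_000 - 8)  # Qwen-like vocab

g_128k = topk_gaps(np.concatenate([top, tail_128k]))
g_152k = topk_gaps(np.concatenate([top, tail_152k]))
# g_128k and g_152k are identical: the extra 24k logits never enter the top-k.
```

Any observable built from these gaps inherits the same invariance, which is why the 128k/131k/152k vocabulary mismatch does not confound the functional comparison.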
Recipe confound. The Mistral pair uses a community recipe (Dolphin 800k reasoning traces) rather than DeepSeek's official traces. This is consistent with recipe-robustness — the family-graded pattern remains visible across independent training datasets — but introduces a minor confound: the Mistral result cannot be directly compared to the Llama/Qwen results as if all used identical training procedures. Future work with a matched-recipe Mistral pair would isolate the family effect from any recipe contribution.
Family coverage. The family-graded pattern is observed across three families. Whether it generalizes to Gemma, Phi, or other architectural lineages is unknown. Gemma-2 remains the most informative falsification target because of its distinct logit soft-capping architecture.
Centroid vs. per-prompt distance. Two functional results (the 8B Llama shared-comparator and the Mistral shared-comparator) are affected by centroid-averaging cancellation, where models that are PPP-space neighbors produce artificially low centroid distances while maintaining large per-prompt distances. The per-prompt analysis in §4.5 and §5.1 provides the corrective interpretation. Centroid distances should not be interpreted in isolation when the centroid falls below a model's self-baseline noise floor.
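The cancellation mechanism is easy to reproduce synthetically. The sketch below uses invented dimensions and magnitudes: per-prompt displacement vectors whose G₁-axis component alternates in sign (the Mistral-like morphology) have a constant large per-prompt distance, yet their centroid collapses to zero.

```python
import numpy as np

n_prompts = 200
displacements = np.zeros((n_prompts, 3))  # 3 residual axes, purely illustrative

# Sign-oscillating winner-gap (G1) component: +2, -2, +2, -2, ...
signs = np.tile([1.0, -1.0], n_prompts // 2)
displacements[:, 0] = signs * 2.0

per_prompt = np.linalg.norm(displacements, axis=1)  # every prompt displaced by 2.0
centroid = np.linalg.norm(displacements.mean(axis=0))  # exactly 0 after averaging
```

A centroid distance of zero here coexists with uniform per-prompt displacement of 2.0, which is why centroid distances below the self-baseline noise floor must be cross-checked against the per-prompt distribution before being read as "no functional displacement."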
Fisher at production scale. The Fisher curvature measurement at 7B–24B (§7.2) does not replicate the ~1B cross-family ordering. This is a genuine falsification: the mechanism proposed in earlier work does not survive at the scales where distillation scars are measured. Stiffness (§7.1) survives as the leading mechanistic candidate but has been tested on only three families.
The monotonic trend. An earlier report [11] documented three distillation pairs whose structural separation appeared to increase monotonically with scale, and flagged caveats about the limited sample. The present work confirms those caveats: the apparent increase was an artifact of sample composition and seed concentration. The pattern is family-dependent rather than scale-dependent.
9. Conclusion
The structural and functional consequences of reasoning distillation are family-graded in the current sample, and the grading affects not only magnitude but mode and cross-layer coupling.
Structurally, Mistral-family targets show the loudest scars (7,701–8,518×ε), Llama-family targets are intermediate (2,858–4,583×ε), and Qwen-family targets are quietest (141–516×ε) — a sixty-fold range across three architectural families, with the pattern remaining visible when tested on a third family using independently curated training data. Functionally, the layers do not track structural magnitude uniformly: Llama shows decisive hierarchy breaks, Qwen shows absent-to-marginal breaks, and Mistral — carrying the loudest structural scar in the dataset — shows only marginal functional displacement. The structural and functional identity layers can decouple, a result consistent with the formal admissibility framework but not previously observed empirically.
The functional departure is low-rank at every tested scale but varies in character across families. In Llama and Qwen, the winner-gap residual dominates with directional consistency. In Mistral, the winner-gap oscillates in sign across prompts, producing large per-prompt G₁ displacement that cancels in the centroid — a morphology not observed in any other tested family.
In the current sample, the stiffness parameter at the structural measurement site is inversely ordered with scar magnitude across all three families, providing a static, weight-level observable that distinguishes loud-scar from quiet-scar architectures without requiring inference. Fisher curvature, previously proposed as a candidate mechanism, does not correctly order scar magnitudes across families at production scale.
These findings change how derivative identity claims should be interpreted. A forensic or attestation system that assumes distillation leaves a uniform signature will misread the significance of the same derivative event across families. More importantly, a system that assumes structural displacement predicts functional displacement will be wrong for families like Mistral where the layers decouple. In the current sample, the expected displacement of a derivative depends on the architectural context of the distillation, and that dependence operates differently in the structural and functional identity layers. For practitioners building model-identity verification systems, the implication is direct: detection thresholds, evidence weighting, and cross-layer inference rules should be calibrated to the base family to avoid both false confidence and false alarms.
References
[1] A. R. Coslett, "The δ-Gene: Inference-Time Physical Unclonable Functions from Architecture-Invariant Output Geometry," 2026. DOI: 10.5281/zenodo.18704275
[2] A. R. Coslett, "Template-Based Endpoint Verification via Logprob Order-Statistic Geometry," 2026. DOI: 10.5281/zenodo.18776711
[3] A. R. Coslett, "The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing," 2026. DOI: 10.5281/zenodo.18818608
[4] A. R. Coslett, "Provenance Generalization and Verification Scaling for Neural Network Forensics," 2026. DOI: 10.5281/zenodo.18872071
[7] A. R. Coslett, "The Deformation Laws of Neural Identity," 2026. DOI: 10.5281/zenodo.19055966
[8] A. R. Coslett, "What Counts as Proof? Admissible Evidence for Neural Network Identity Claims," 2026. DOI: 10.5281/zenodo.19058540
[10] A. R. Coslett, "Where Identity Comes From: Path Sensitivity and Endpoint Underdetermination in Neural Network Training," 2026. DOI: 10.5281/zenodo.19118807
[11] A. R. Coslett, "Post-Hoc Disclosure Is Not Runtime Proof: Model Identity at Frontier Scale," 2026. DOI: 10.5281/zenodo.19216634
[12] G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," arXiv:1503.02531, 2015.
[13] Y. Chen, X. Shen, S. Ji, J. Chen, and T. Wang, "Teacher Model Fingerprinting Attacks Against Transfer Learning," USENIX Security, 2022.
[14] M. Lederer, G. Marra, K. Groh, and R. Samhammer, "A Systematization of Watermarking, Fingerprinting, Model Extraction and Ownership Verification of Machine Learning Models," arXiv:2312.09381, 2023.
[15] F. Kunstner, L. Balles, and P. Hennig, "Limitations of the Empirical Fisher Approximation for Natural Gradient Descent," NeurIPS, 2019.
[16] N. Belrose et al., "Eliciting Latent Predictions from Transformers with the Tuned Lens," NeurIPS, 2023.
Acknowledgments
Portions of this research were developed in collaboration with AI systems that served as co-architects for experimental design, adversarial review, and manuscript preparation. All scientific claims, experimental designs, measurements, and editorial decisions remain the sole responsibility of the author. Experiments were conducted on Google Colab using NVIDIA A100-SXM4-80GB GPUs.
Author's Disclosure
Anthony Ray Coslett is the founder of Fall Risk AI, LLC, which holds the provisional patents listed below. The structural identity measurement described in this paper operates within the scope of that intellectual property. No external funding was received for this research.
Patent Disclosure
U.S. Provisional Patent Applications 63/982,893, 63/990,487, 63/996,680, and 64/003,244 are assigned to Fall Risk AI, LLC.