Paper XIII · April 2026

Safety-Alignment Removal as a Model-Identity Failure

Structural Evidence from Published Weight-Level Mutation Checkpoints

Abstract

A deployed model can appear unchanged while ceasing to be the model it claims to be. Publicly available weight-level mutation toolchains now automate safety-alignment removal from open-weight models on ordinary hardware, producing checkpoints intended to preserve operational familiarity while discarding refusal behavior. This paper argues that safety-alignment removal is a model-identity failure: in tested published checkpoints from multiple toolchains across two model families, the mutation leaves measurable structural scars ranging from 7.6 to over 2,300 times the instrument's acceptance threshold. Artifact identity, workload identity, and agent authorization can all remain valid while structural model identity fails — a finding that the program's formally verified admissibility doctrine predicted before this threat class existed. A sentinel validation panel across four model families confirms that the hardened instrument configuration preserves or improves all tested positives. In an agentic deployment context, model-identity failure propagates upward into agent-integrity failure: the agent is authenticated, but the model inside it is no longer the model the surrounding controls were designed to govern. The practical implication is that runtime evaluation frameworks — including those emerging under the EU AI Act — implicitly depend on a model continuity that weight-level mutation can break, and that structural identity verification offers a candidate evidentiary layer for closing that gap.

§1. Introduction

A deployed model can appear unchanged while ceasing to be the model it claims to be. The file hash can still look legitimate. The endpoint can still respond at the same address. The credentials can still validate. Even the outputs can remain operationally familiar — in some cases, explicitly optimized to be so. But if the weights have been altered to remove safety alignment, then for security and assurance purposes the model under evaluation is no longer the model in service.

If you approve, certify, or deploy AI systems, you need evidence not only of what was shipped and who is running it, but of whether the model in service is still the model that was assessed.

Safety alignment is the learned constraint that teaches a model which requests to refuse — the boundary between a capable assistant and an unconstrained one. Publicly available weight-level mutation toolchains now automate its removal through a process known as abliteration — the identification and surgical removal of refusal-mediating directions from the model's weight matrices — on ordinary research and prosumer hardware, producing models intended to preserve much of the capability of the aligned original while discarding its refusal behavior.

When such a model is embedded inside an agentic system with memory, tool access, and delegated authority, the consequence is no longer merely behavioral drift. It is that every external identity layer — the workload credential, the agent authorization token, the API endpoint — can remain valid while the model inside is no longer the one that was evaluated, approved, or governed. In this setting, model identity and model integrity converge: once the safety-aligned weights have been altered, the question is not only "what model is this?" but "is this still the model that was approved to run?"

This paper extends a twelve-paper research program on runtime model identity to this threat class. It argues that safety-alignment removal is not merely a content-policy or behavioral concern, but a model-identity event — a structural fact measurable at runtime. The weights are different. The activation geometry is different. The model is a different model, in the sense that it would not pass the same structural verification that the approved model passes. Across tested Gemma and Llama checkpoints from multiple public toolchains, the resulting mutations leave measurable structural scars with family- and toolchain-dependent magnitudes. The results are demonstrated on published checkpoints from two model families and three toolchains; generalization to other families, architectures, or future mutation methods is identified as future work and discussed in §6.

§2. Background and Threat Model

2.1 Safety alignment and its removal

Safety alignment, as implemented in current frontier and open-weight language models, is a set of learned weight-level constraints that shape how the model responds to potentially harmful requests. These constraints are instilled through techniques including reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), constitutional AI, and supervised fine-tuning on refusal-demonstration data. The result is a model that retains broad capability while declining to assist with categories of requests that the developer has designated as harmful, dangerous, or outside acceptable use.

Recent research has shown that safety alignment in decoder-only Transformer models is, in significant part, mediated by a small number of directions in the model's activation space [Arditi et al., 2024]. When the model encounters a request it should refuse, these directions activate and steer the output toward a refusal response. Removing these directions can substantially reduce or eliminate refusal behavior without retraining the model.

This is what publicly available abliteration toolchains do. They automate the identification and removal of refusal directions, producing modified checkpoints that retain the base model's general capabilities while discarding its tendency to refuse.
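The core operation these toolchains automate can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch of directional projection in the spirit of the refusal-direction result [Arditi et al., 2024]; real toolchains compute the refusal direction from contrasting activations and apply the projection per layer to attention and MLP output matrices, so the function name `ablate_direction` and the toy shapes here are hypothetical.

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project the refusal direction r out of weight matrix W.

    If W writes into the residual stream (rows index output features),
    then W' = (I - r r^T) W, with r unit-norm, prevents W from ever
    writing any component along r.
    """
    r = r / np.linalg.norm(r)        # normalize the computed direction
    return W - np.outer(r, r) @ W    # subtract the component along r

# Toy demonstration: after ablation, the output along r is exactly zero.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
r = rng.standard_normal(8)
W_abl = ablate_direction(W, r)
r_unit = r / np.linalg.norm(r)
assert np.allclose(r_unit @ W_abl, 0.0)  # no residual component along r
```

The sketch also makes the paper's structural claim concrete: the projection necessarily changes the weight geometry, and therefore the downstream activation geometry, even when behavioral change is minimized.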

2.2 The current toolchain landscape

Three publicly available toolchains were identified as of March 2026:

Heretic is a KL-optimized abliteration toolchain distributed under the AGPL-3.0 license. It identifies refusal directions and removes them while explicitly constraining the KL divergence between the original and modified models' output distributions, producing checkpoints designed to minimize observable behavioral change. A large public ecosystem of abliterated checkpoints across multiple model families is published in its associated HuggingFace repositories.

OBLITERATUS is a multi-method abliteration toolkit that implements thirteen distinct safety-removal algorithms. It was released in March 2026 and attracted immediate community adoption. It has been applied across numerous published model checkpoints spanning multiple model families.

GRP-Oblit uses a single-prompt gradient-based approach (GRPO) to remove safety alignment, as described in a February 2026 preprint [arxiv:2602.06258].

All three toolchains run quickly on ordinary consumer or prosumer hardware. None requires access to the original training data, the RLHF reward model, or the alignment training pipeline. The input is a publicly available open-weight checkpoint; the output is a modified checkpoint with refusal behavior removed.

2.3 Why this is a model-identity problem

Abliteration is commonly framed as a content-policy issue: the modified model will say things the aligned model would not. That framing is accurate but incomplete.

When a model's weights are altered, the model has changed. Not just its behavior — its internal computational structure. The activation geometry at every layer downstream of the modification is different. The structural fingerprint — the measurable property that distinguishes this model from every other model — is different.

This is not a hypothetical argument. It is a structural fact, measurable at runtime without inspecting the weights, modifying the model, or requiring cooperation from the deployer. The question this paper addresses is whether the structural change is large enough to detect, how it varies across families and toolchains, and what it means for the systems that deploy these models.

Previous work in this series has measured three other classes of weight-level perturbation: knowledge distillation (global weight perturbation via teacher signal) [3, 4, 12], adversarial erasure (targeted gradient opposition) [3], and passive fine-tuning (continued training on unrelated data) [3, 7]. Abliteration is distinct from all three. It is a spatially localized directional projection — a surgical geometric intervention that modifies a subset of layers along a computed direction, without a teacher model, opposing loss, or continued training. This paper adds it to the series' deformation taxonomy and measures its structural consequences. In other words, the system may still look like the approved model from the outside while no longer being the same model on the inside.

2.4 Relationship to prior fingerprinting and provenance work

The approach taken in this series differs from the established DNN fingerprinting and watermarking literature in a structural respect. Watermarking methods embed an extrinsic signal into the model — through parameter perturbation [Uchida et al., 2017], backdoor-based verification [Adi et al., 2018], or decision-boundary fingerprinting [IPGuard, 2021] — and verify ownership by detecting that signal. These methods require cooperation from the model creator at embedding time. The structural observable used here reads intrinsic activation geometry produced during ordinary forward computation, requires no embedding step, no cooperation from the deployer, and no access to weights. Recent LLM-specific work on black-box provenance testing [Model Provenance Testing, 2025] and query-based fingerprinting addresses related questions through behavioral probing. The present approach is complementary: it operates at the structural layer rather than the behavioral layer, and the admissibility doctrine (§3.3) establishes formally that these are distinct evidence classes. Research on alignment degradation through fine-tuning [Qi et al., 2023] establishes the broader empirical backdrop that post-training modification can compromise safety properties; the present paper measures the structural consequence of a specific, automated class of such modification.

§3. Identity Framework

This paper does not introduce its identity framework from scratch. It inherits and extends a formally verified framework developed across the preceding twelve papers and three technical notes. This section traces the specific elements that safety-alignment removal activates, and shows that abliteration provides the strongest empirical test of predictions made before the threat class existed.

3.1 The non-narrative structural layer

An earlier paper in this series — "Beneath the Character" [5] — argued that neural network identity has a structural layer that is constitutive, not descriptive. A model's structural fingerprint is not a label attached to the model. It is a load-bearing property of the weight geometry — the specific mathematical configuration that makes this model capable of doing what it does. Two models trained from the same specification with different random seeds produce different structural fingerprints — a property formally proved as specification non-uniqueness in the series' identity-formation analysis [10]. The fingerprint is not architecture-determined, and it is not a measurement artifact. Nor is it recoverable from static properties of the endpoint weights alone [7, 8]. It is determined by the model's training trajectory and manifested in the geometry of activations during forward computation.

That analysis also argued that this structural layer is independent of the behavioral layer — what users experience as the model's personality, style, or tendencies. A model's behavior can change materially while its structural fingerprint remains within the measurement noise floor. Its structural fingerprint can change while its observable behavior remains operationally familiar.

Safety-alignment removal is the strongest empirical vindication of that prediction. The toolchain authors set out to change only behavior — to remove the model's tendency to refuse harmful requests. They succeeded at changing behavior. But they also, involuntarily, changed the structural identity. This was not their goal. It was an unavoidable consequence: you cannot project out weight directions without leaving geometric consequences in the activation space that those weights produce.

The behavioral change was intentional. The structural change was involuntary.

That independence — behavior moves one way, structure moves differently, neither determines the other — is exactly what the structural-layer thesis predicted [5]. Abliteration proves it with measured evidence: the aligned and abliterated models differ structurally even though the abliteration was designed to minimize observable behavioral change.

3.2 Three layers of deformation

"The Deformation Laws of Neural Identity" [7] established that neural network identity is organized into three layers, each obeying a distinct deformation law:

The structural layer, measured through the geometry of hidden-state activations during forward computation — the pattern of how internal representations are shaped as data flows through the network — is the most resistant to change. Under knowledge distillation, adversarial erasure, and passive fine-tuning, the structural fingerprint (the measurable geometric signature that distinguishes one model from every other) remained within the measurement noise floor relative to the undistilled baseline across 106 training checkpoints spanning three independent experimental studies [3, 4, 7].

The thermodynamic layer, measured through a normalized gap statistic in the model's output distribution — a quantity that reflects the statistical shape of how probability mass is allocated across possible next tokens — is approximately universal. Its value is predicted by extreme value theory (the mathematics of rare events in large distributions) and is consistent with that prediction across a validated 22-model Transformer cross-section. It does not track functional or structural changes occurring simultaneously in the same models [7].

The functional layer, measured through behavioral output templates derived from logprob distributions, is the most volatile. It transfers partially through distillation and is erased by continued fine-tuning within one to two epochs [3, 4, 7].

Perturbation class | Structural | Thermodynamic | Functional
Distillation | ✓ Invariant | ≈ Universal | Partial transfer
Adversarial erasure | ✓ Invariant | ≈ Universal | —
Passive fine-tuning | ✓ Invariant | ≈ Universal | Erased
Abliteration | ✗ Changed | — | ✗ Changed

Figure 1. Identity-layer response to four weight-level perturbation classes. Three established classes leave the structural fingerprint invariant. Abliteration is the first tested class that changes it — making it a model-identity event, not merely a behavioral one.

That analysis catalogued the deformation classes it could test: distillation (global weight perturbation via teacher signal), adversarial erasure (targeted gradient opposition), and passive fine-tuning (continued training on unrelated data). Safety-alignment removal is a new entry in this taxonomy. It is none of the above. It is spatially localized directional projection — meaning it modifies specific weight matrices in specific layers along a computed direction, rather than perturbing the entire model through gradient flow. No teacher model. No opposing loss function. No continued training. A surgical geometric intervention that removes a targeted behavioral direction from a subset of the model's weights.

This makes abliteration the first deformation class in the series that is both spatially localized (concentrated in specific layers) and directionally targeted (along a specific computed axis). The distillation and fine-tuning classes tested in the deformation-laws analysis were global — they touched all weights through gradient flow. Abliteration touches only the weights that carry the targeted direction, in the layers where that direction is strongest. The structural scar it leaves is therefore different in character from the distillation scar: family-dependent in magnitude and toolchain-dependent in signature.

3.3 The admissibility doctrine

"What Counts as Proof?" [8] formalized a rule that the preceding experimental results demanded: evidence from one identity layer cannot certify claims about another when the layers are operationally independent. The key results — evidence_non_sufficiency and its layer-indexed corollary layer_non_implication — were verified in the Coq proof assistant with zero unfinished obligations, as part of a larger formal verification program that underpins the series' mathematical foundations.

The formal statement is a verification-theoretic impossibility result: if two systems produce the same observation under a given evidence class but differ on whether a claim holds, then no decision procedure restricted to that evidence class can be both sound and complete for that claim. In plain terms: a valid artifact record cannot prove that the model's internal structure is unchanged, for the same reason a valid TLS certificate cannot prove the binary behind the endpoint is unchanged.
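In symbols, the impossibility result can be paraphrased as follows. This is an illustrative rendering, not the verified Coq statement verbatim; the mechanized development defines its own notions of observation, soundness, and completeness.

```latex
% Paraphrase of evidence_non_sufficiency: if evidence class E cannot
% distinguish a system where claim C holds from one where it fails,
% no decider over E-observations is both sound and complete for C.
\big(\exists\, s_1, s_2 :\;
  \mathrm{Obs}_E(s_1) = \mathrm{Obs}_E(s_2)
  \,\wedge\, C(s_1) \,\wedge\, \neg C(s_2)\big)
\;\Longrightarrow\;
\neg\,\exists\, d : \mathrm{range}(\mathrm{Obs}_E) \to
  \{\mathsf{accept},\mathsf{reject}\}
\;\text{ sound and complete for } C.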

Three specific cross-layer directions were proved and empirically witnessed:

Structural evidence cannot certify functional claims. Distilled models are structurally indistinguishable from their undistilled baselines while their functional templates converge substantially toward the teacher [3, 7].

Functional evidence cannot certify structural claims. Passive fine-tuning erases the functional fingerprint while the structural identity remains immovable [3, 7].

Structural evidence cannot certify thermodynamic claims. The structural and thermodynamic observables are uncorrelated across the validated 22-model cross-section [7].

Safety-alignment removal provides the strongest real-world test of this doctrine. An abliterated model can pass artifact identity checks (the deployment record is intact), workload identity checks (the credentials are valid), and — by design — behavioral similarity checks (the outputs are optimized to appear familiar). It fails structural identity checks. The admissibility doctrine predicts exactly this: evidence from the artifact, workload, and behavioral layers is formally insufficient to certify that the structural identity is intact. The abliterated model is the admissibility theorem in operational form.

3.4 The runtime identity gap

"Post-Hoc Disclosure Is Not Runtime Proof" [11] introduced a taxonomy of runtime identity gaps — cases where the model that was evaluated, approved, or disclosed is not the model that is running. That analysis documented two gap types: substitution (one model replaced by a different model, as in the Cursor/Kimi incident) and supply-chain compromise (unauthorized code in the deployment pipeline, as in the litellm/PyPI incident).

Safety-alignment removal creates a third gap type, distinct from both. The model is not substituted — the same base architecture is present. The deployment pipeline is not compromised — the deployer intentionally applied the mutation. The gap is between the model that was evaluated by the regulator, customer, or internal governance function, and the model that the deployer actually runs. The deployer knows the modification occurred. The evaluator does not. In this case, the deployer is not the victim of the gap but its author, which makes disclosure-dependent controls especially weak. No post-hoc disclosure mechanism, no deployment log, no credential audit will surface this gap, because the modification is intentional, the deployer controls the disclosure, and every non-structural identity layer remains green.

3.5 Agent identity is not model identity

"Agent Identity Is Not Model Identity" [CAT-1] formalized a four-question taxonomy that separates the identity layers relevant to deployed AI:

Q1: Is it an AI system?

Q2: Is it still the same deployed model for assurance purposes?

Q3: What is it doing?

Q4: Who authorized it?

For an abliterated model deployed inside an agentic harness, Q1, Q3, and Q4 remain affirmative. The system is still an AI system. The agent is still performing its assigned tasks through the same endpoint. The authorization tokens, delegation chains, and workload credentials are intact. Only Q2 fails: the model inside the agent is no longer the model that was assessed. An authenticated agent can still be powered by the wrong model.

The agent harness cannot detect this. It authenticates the workload, manages the delegation, and executes tool calls. It does not measure the model's activation geometry. The gap between Q2 and the other three questions is invisible to every layer of the agent's own identity infrastructure. This is the operational setting in which model-identity failure propagates upward into agent-integrity failure, as discussed in §7.

3.6 The progression

The framework that this paper inherits was built before safety-alignment removal became a practical threat. "Beneath the Character" [5] predicted that neural identity has a non-narrative structural layer. "The Deformation Laws of Neural Identity" [7] gave that layer measured deformation laws. "What Counts as Proof?" [8] proved that cross-layer evidence is formally insufficient. "Agent Identity Is Not Model Identity" [CAT-1] separated agent identity from model identity. This paper now shows all four results converging in a single measured phenomenon: a mutation class that changes structural identity while leaving every other identity layer intact, deployed in a context where the agent's own identity infrastructure is blind to the change.

That these results converge on a threat class that postdates them reflects a design choice: the framework was built on structural measurement and formal verification rather than on threat-specific heuristics. A framework built to detect a specific attack would need to be rebuilt when the attack changes. A framework built on the geometry of what identity is — on what it means for a model to be the same model — accommodates new threat classes as instances of a measured phenomenon, not as patches to a pattern-matching system. The threat changed. The core observable held. The verifier configuration required coverage retuning.

§4. Experimental Evidence

This section presents the measured structural separation between aligned base models and their published abliterated or distilled derivatives. All measurements were performed under a single instrument configuration using the canonical verification protocol — designated gnorm in the series — which measures structural identity through 512 challenge prompts, four seeds per comparison, greedy decoding, and bf16 (bfloat16, a 16-bit floating-point format) precision. Self-verification — re-measuring the same model against its own enrollment baseline — returned exactly zero distance across all tested models and all seeds. The acceptance threshold ε represents the maximum distance observed when the same model is measured repeatedly under the canonical protocol; any distance exceeding ε indicates that the two measurements come from structurally different models, and the comparison is rejected. Separations are reported as multiples of ε: a result of 300×ε means the measured structural distance is 300 times the noise floor at which the instrument can distinguish two models.
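Reduced to its decision rule, the protocol's verdict is a single ratio. The sketch below is illustrative only; the actual verifier and its distance computation are not reproduced here, and `verify` is a hypothetical name.

```python
def verify(distance: float, epsilon: float) -> tuple[float, str]:
    """Express a measured structural distance in multiples of the
    acceptance threshold epsilon and return the verdict.

    A ratio above 1.0 means the two measurements cannot have come
    from the same model under the canonical protocol.
    """
    ratio = distance / epsilon
    return ratio, ("accept" if ratio <= 1.0 else "reject")

# Self-verification returns distance 0.0, hence 0.0 x epsilon: accept.
# A checkpoint measured at hundreds of multiples of epsilon: reject.
```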

Each comparison is repeated with four random seeds that control the order in which challenge prompts are presented. Under deterministic greedy decoding with fixed precision, the measurement is exactly reproducible for a given seed; variation across seeds measures the observable's sensitivity to prompt ordering, not replication noise. A seed is flagged as pathological when a single dimension dominates the measured distance (Gini coefficient above 0.85, indicating that over 85% of the separation is concentrated in one or two dimensions rather than distributed across the measurement space). Results from pathological seeds are excluded from reported ranges and noted separately, following established program doctrine [12]. This is a conservative choice: including pathological seeds would increase the reported separations.
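The concentration diagnostic described above can be sketched as a standard Gini computation over per-dimension contributions to the measured distance. The decomposition into a non-negative contribution vector is an assumption for illustration; the canonical observable's actual dimensions are not specified here.

```python
import numpy as np

def is_pathological(per_dim: np.ndarray, cutoff: float = 0.85) -> bool:
    """Concentration diagnostic: flag a seed when the Gini coefficient
    of per-dimension distance contributions exceeds the cutoff.

    Gini = 0 means the separation is evenly spread across dimensions;
    values near 1 mean it is concentrated in one or two dimensions.
    """
    x = np.sort(np.asarray(per_dim, dtype=float))  # ascending order
    n = x.size
    gini = ((2 * np.arange(1, n + 1) - n - 1) @ x) / (n * x.sum())
    return gini > cutoff

# An evenly spread contribution vector is not flagged; a one-hot
# vector (all separation in a single dimension) is.
```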

Two families of published abliterated checkpoints were tested: Gemma-3-12B and Llama-3.1-8B. For each family, checkpoints from multiple independent toolchains were measured. Distillation pairs across four families (Gemma, Llama, Qwen, Mistral) were measured under the same configuration to provide mutation-class context. All models are publicly available on HuggingFace, and all measurements were performed on a single NVIDIA A100 80GB GPU.

4.1 Gemma-3-12B: two toolchains, decisive detection

Two independently published abliterated checkpoints of Gemma-3-12B-it were measured against the aligned base:

Heretic (KL-optimized abliteration): The checkpoint is published by p-e-w, who applies refusal-direction removal with an explicit KL-divergence constraint designed to preserve output quality. Structural separation from the aligned base: 317.5–367.6 times the acceptance threshold across non-pathological seeds. All four seeds exceeded the threshold. One seed (seed 42, 615.2×ε) was flagged as pathological under the program's concentration diagnostic (Gini = 0.871, above the 0.85 threshold) and is excluded from the reported range per established doctrine [12]. The remaining three seeds ranged from 317.5 to 367.6×ε with Gini coefficients between 0.571 and 0.669.

mlabonne (alternative abliteration method): The checkpoint is published by mlabonne, using a different published abliteration method. Structural separation: 1,556.8–2,319.4 times the acceptance threshold across all four seeds, with zero pathological seeds. Gini coefficients ranged from 0.686 to 0.798.

The two abliterated checkpoints are also structurally distinct from each other. This is consistent with the expectation that different toolchains, applying different algorithms to the same base model, produce different geometric consequences in the weight space.

For context, a Gemma-3-12B distillation pair (base vs. a reasoning-distilled derivative) measured under the same instrument configuration produced 277.3–380.1 times the acceptance threshold across four seeds with zero pathological seeds and Gini coefficients between 0.484 and 0.545. The Gemma abliteration scars are comparable to, and in the case of the mlabonne checkpoint substantially exceed, the distillation reference. Abliteration is at least as structurally disruptive as distillation in the tested Gemma case. In practical terms, both published Gemma abliteration checkpoints are decisively distinguishable from the aligned base, and the less-optimized method produces a scar that exceeds even the distillation reference.

4.2 Llama-3.1-8B: two toolchains, measured late-layer sensitivity

Two independently published abliterated checkpoints of Llama-3.1-8B-Instruct were measured against the aligned base:

Heretic (same toolchain as Gemma, applied to Llama): Structural separation: 7.6–12.0 times the acceptance threshold across four seeds. All seeds exceeded the threshold. Zero pathological seeds. Gini coefficients ranged from 0.691 to 0.809.

OBLITERATUS (multi-method abliteration toolkit): The checkpoint is published by richardyoung, generated using the OBLITERATUS toolchain which implements thirteen abliteration methods. Structural separation: 45.1–53.1 times the acceptance threshold across four seeds. Zero pathological seeds. Gini coefficients ranged from 0.309 to 0.370 — the lowest concentration in any abliteration measurement in this study, indicating a well-distributed structural scar.

The Llama results matter because they show that a published weight-level mutation can remain decisively measurable while still stressing the coverage assumptions of a runtime identity instrument. A tensor-level investigation of the OBLITERATUS checkpoint (an offline diagnostic, not part of the runtime verification protocol) revealed that the weight modifications were spatially localized rather than distributed throughout the model. This spatial localization required measurement-site optimization to achieve full detection strength.

The response was an internal measurement-site sensitivity study followed by a sentinel validation panel across four model families. The hardened configuration was selected on the basis of strongest non-pathological OBLITERATUS detection while preserving or improving all known positives. The reported Llama numbers (7.6–12.0×ε for Heretic, 45.1–53.1×ε for OBLITERATUS) are from the hardened configuration.

The resulting hardening step is therefore part of the scientific result, not a side note: it demonstrates that the structural observable adapts to spatially localized mutations through site-selection optimization, and that this adaptation preserves the instrument's performance on previously validated mutation classes. In practical terms, the Llama results show that published weight-level mutation can stress a runtime identity instrument's coverage assumptions without defeating the structural observable itself.

4.3 Cross-family comparison: mutation-class-specific sensitivity

The evidence table below collects all measurements under a single instrument configuration.

Table 1. Abliteration detection — structural separation from aligned base (non-pathological seeds).

Family | Model Scale | Toolchain | Method | Separation (×ε) | Seeds | Gini Range
Gemma-3-12B | 12B | Heretic | KL-optimized projection | 317.5–367.6 | 3 of 4 | 0.571–0.669
Gemma-3-12B | 12B | mlabonne | Alternative published method | 1,556.8–2,319.4 | 4 of 4 | 0.686–0.798
Llama-3.1-8B | 8B | Heretic | KL-optimized projection | 7.6–12.0 | 4 of 4 | 0.691–0.809
Llama-3.1-8B | 8B | OBLITERATUS | Multi-method (13 algorithms) | 45.1–53.1 | 4 of 4 | 0.309–0.370

Table 2. Distillation reference — structural separation under the same instrument configuration

Family | Base | Derivative | Separation (×ε) | Seeds | Gini Range
Gemma-3-12B | gemma-3-12b-it | Gemma-3-R1-12B-v1 | 277.3–380.1 | 4 of 4 | 0.484–0.545
Llama-3.1-8B | Llama-3.1-8B-Instruct | DeepSeek-R1-Distill-Llama-8B | 1,295.2–1,685.1 | 2 of 4 | 0.540–0.740
Qwen-2.5-14B | Qwen2.5-14B-Instruct | DeepSeek-R1-Distill-Qwen-14B | 547.1–759.1 | 3 of 4 | 0.441–0.757
Mistral-24B | Mistral-Small-24B-Instruct | Dolphin3.0-R1-Mistral-24B | 13,755.2–14,705.9 | 4 of 4 | 0.133–0.159

Note: The distillation references in Table 2 are measured under the hardened instrument configuration and may differ from values reported in [12] due to the measurement-site optimization described in §4.4.

Three observations emerge from the combined evidence:

First, all tested abliteration checkpoints are structurally separable from their aligned bases. The minimum non-pathological separation across all four abliteration pairs is 7.6 times the acceptance threshold (Llama Heretic). The maximum is 2,319.4 times the threshold (Gemma mlabonne). Every pair, every toolchain, and every non-pathological seed exceeds the acceptance threshold.

Second, family-dependent sensitivity is real, and the ordering is mutation-class-specific. Under distillation, the Gemma pair produces the smallest structural scar in the reference set (277.3×ε minimum), while the Llama pair produces a much larger scar (1,295.2×ε minimum). Under abliteration, the relationship reverses: Gemma produces the largest abliteration scars (317.5×ε minimum for Heretic, 1,556.8×ε for mlabonne), while Llama produces the smallest (7.6×ε for Heretic, 45.1×ε for OBLITERATUS). In Llama, the distillation scar is more than 100 times larger than the smallest abliteration scar measured here. This reversal suggests that the structural response to weight-level mutation depends not only on the model family but on the geometric character of the mutation — a finding explored further in §6.

Third, different toolchains leave different scars. The Heretic and mlabonne checkpoints of Gemma-3-12B are both abliterated derivatives of the same base, but they are structurally distinct from each other. The OBLITERATUS checkpoint of Llama-3.1-8B is structurally distinct from the Heretic checkpoint of the same base. The structural scar is not a generic signature of "abliteration happened" — it is a toolchain-specific geometric consequence of how the abliteration was performed. This has forensic implications: the structural measurement not only detects that a model has been mutated, but distinguishes between mutations applied by different toolchains.

4.4 Sentinel preservation

The hardened instrument configuration was validated on a sentinel panel spanning four model families (Gemma, Llama, Qwen, Mistral) with five pairs and nine unique models. The preservation gate required that every pair maintain at least 80% of its canonical separation. All five pairs passed.

Table 3. Sentinel panel — preservation under hardened configuration.

Pair                  | Mutation Class | Separation (×ε)    | Gate | Change vs. Prior
Gemma Heretic         | Abliteration   | 317.5–367.6        | PASS | Improved
Gemma mlabonne        | Abliteration   | 1,556.8–2,319.4    | PASS | Improved
Llama distillation    | Distillation   | 1,295.2–1,685.1    | PASS | Preserved
Qwen distillation     | Distillation   | 547.1–759.1        | PASS | Preserved
Mistral distillation  | Distillation   | 13,755.2–14,705.9  | PASS | Improved

Three of five pairs showed improved separation under the hardened configuration. The remaining two were preserved within the gate margin. No pair degraded. No new pathological seeds were introduced by the hardened configuration; all pathology warnings in the sentinel panel were inherited from concentration effects already documented in the program's prior experimental record. Site-selection is treated as an explicit, auditable configuration parameter rather than a fixed invariant of the measurement protocol.
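The 80% preservation gate can be stated as a one-line predicate. A sketch, with illustrative names and a canonical minimum taken from Table 3:

```python
# Sketch of the sentinel preservation gate described above.
# Function and variable names are illustrative assumptions; the
# canonical minima correspond to the values reported in Table 3.

def preservation_gate(new_min: float, canonical_min: float,
                      fraction: float = 0.8) -> bool:
    """Pass if the hardened configuration retains at least `fraction`
    of the pair's canonical minimum separation (both in units of epsilon)."""
    return new_min >= fraction * canonical_min

# Illustrative check against the Llama distillation pair (1,295.2x epsilon):
assert preservation_gate(1295.2, 1295.2)      # preserved exactly -> PASS
assert preservation_gate(1400.0, 1295.2)      # improved -> PASS
assert not preservation_gate(1000.0, 1295.2)  # below 80% -> would fail
```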

4.5 Self-stability and measurement controls

Self-verification — measuring the same model against its own enrollment — returned exactly 0.0 times the acceptance threshold across every model, every seed, and every configuration tested in this study, including all nine sentinel panel models, the Llama measurement-site sensitivity study, and the Gemma distillation pair. The structural measurement is deterministic under the canonical protocol with greedy decoding and bf16 precision.

A remeasurement control (SP-0) confirmed protocol stability: the same model measured twice in the same session, with the same canonical prompt bank and seeds, produced identical results. The distance was 0.0×ε across all four seeds. This establishes that the measured separations between aligned and abliterated models are properties of the weight differences, not measurement noise. It does not imply that arbitrary prompt changes leave the response unchanged; it confirms stability under repeated evaluation of the same model with the same verifier-controlled challenge.
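The SP-0 control amounts to a determinism check: two measurements of the same model under the same verifier-controlled challenge must coincide exactly. A sketch, where `measure_fingerprint` is a hypothetical stand-in for the structural measurement, which this paper does not specify at the code level:

```python
# Sketch of the SP-0 remeasurement control: the same model, prompt bank,
# and seed must produce an identical fingerprint on repeated evaluation.
# `measure_fingerprint` is a stand-in: deterministic in its seed, just as
# greedy decoding at fixed bf16 precision is deterministic for a fixed
# model and canonical prompt bank.

import numpy as np

def measure_fingerprint(seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)
    return rng.standard_normal(16)

def sp0_control(seeds) -> bool:
    """Distance between two measurements of the same model must be 0.0."""
    return all(
        float(np.linalg.norm(measure_fingerprint(s) - measure_fingerprint(s))) == 0.0
        for s in seeds
    )

assert sp0_control([0, 1, 2, 3])  # 0.0 x epsilon across all four seeds
```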

§5. Interpretation

The evidence in §4 supports three interpretive conclusions.

5.1 A new deformation class

Safety-alignment removal via directional projection is a weight-level mutation class with properties distinct from those previously documented in the program's deformation taxonomy.

Under distillation, weight perturbation is global — the teacher signal flows through the full gradient to every layer — and the structural fingerprint is invariant: the student remains within the measurement noise floor of its undistilled baseline while its functional template converges toward the teacher [3, 7]. Under abliteration, weight perturbation is localized — the refusal direction is projected out of specific weight matrices in specific layers — and the structural fingerprint changes: the abliterated model is measurably different from the aligned base, with magnitudes ranging from 7.6×ε to 2,319.4×ε in the tested cases.

The deformation laws [7] still hold: the structural layer is the most informative layer for distinguishing aligned from mutated models, because the mutation changes the weights and the weights determine the structural fingerprint. What the deformation-laws analysis did not anticipate is a deformation class where the spatial distribution of the perturbation — which layers are modified and which are left intact — matters as much as its magnitude. The Llama OBLITERATUS results demonstrate this directly: spatially localized modifications produced a structural scar that required site-selection optimization to measure at full strength. Distillation, which distributes perturbation across all layers through gradient flow, does not exhibit this spatial sensitivity in any tested case.
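The localized character of abliteration can be made concrete with the standard directional-ablation construction from the refusal-direction literature [Arditi et al., 2024]: each targeted weight matrix W is replaced by W(I − rrᵀ) for a unit refusal direction r, while untargeted layers are untouched. A numerical sketch under that assumption (dimensions and the projection site are illustrative):

```python
# Sketch: directional ablation as orthogonal projection. The perturbation
# is localized in two senses -- only targeted matrices change, and within
# each matrix only the component along the refusal direction is removed.

import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project the unit direction r out of W: W' = W (I - r r^T)."""
    r = r / np.linalg.norm(r)
    return W - np.outer(W @ r, r)

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
r = rng.standard_normal(d)

W_abl = ablate_direction(W, r)
r_unit = r / np.linalg.norm(r)

# The refusal direction is removed: W' r = 0.
assert np.allclose(W_abl @ r_unit, 0.0)

# Directions orthogonal to r are passed through unchanged.
q = rng.standard_normal(d)
q = q - (q @ r_unit) * r_unit   # make q orthogonal to r
assert np.allclose(W_abl @ q, W @ q)
```

The second assertion is the geometric content of "localized": the mutation is invisible along every direction orthogonal to r, which is why its spatial distribution across layers matters for where the structural scar appears.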

5.2 Family-dependent sensitivity is mutation-class-specific

The series' family-dependent distillation analysis [12] established that structural sensitivity to distillation varies by model family, with ordering inversely correlated with a site-specific stiffness parameter: stiffer models produce quieter distillation scars.

The abliteration results suggest that this ordering is mutation-class-specific, not universal. Under distillation, the Gemma family produces the smallest structural scar in the tested panel (277.3×ε minimum), while the Llama family produces a scar an order of magnitude larger (1,295.2×ε minimum). Under abliteration, the relationship reverses: Gemma produces the largest abliteration scars (317.5×ε minimum for Heretic, 1,556.8×ε for mlabonne), while Llama produces the smallest (7.6×ε for Heretic, 45.1×ε for OBLITERATUS).

[Figure 2 data: median non-pathological separation in ×ε, log scale. Distillation: Gemma 329, Llama 1,490, Qwen 653, Mistral 14,231. Abliteration: Gemma 1,318, Llama 30.]

Figure 2. Structural separation under distillation vs. abliteration (median non-pathological, log scale). Family ordering reverses between mutation classes: Gemma is quietest under distillation but loudest under abliteration; Llama is loudest under distillation but quietest under abliteration.

This reversal is not explained by the stiffness parameter alone. A hypothesis consistent with the observed pattern — not proved in this paper — is that structural stiffness predicts resistance to global weight perturbation (where the perturbation is distributed across all layers through gradient flow), but that models with rigid internal geometries undergo disproportionate structural disruption when subjected to localized orthogonal projection, which removes specific geometric directions from specific weight matrices. The global/localized distinction would explain why the same family can be structurally quiet under one mutation class and structurally loud under another. Testing this hypothesis requires controlled experiments that vary perturbation geometry while holding family and magnitude constant, and is deferred to future work.
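One way to set up such a controlled comparison is to match the total perturbation magnitude while varying only its spatial distribution. A sketch of that experimental control (entirely illustrative; this is the shape of the proposed experiment, not the measurement itself):

```python
# Sketch: two perturbations of equal total Frobenius norm but different
# geometry -- one localized (a projection applied to a single layer),
# one global (random noise spread across all layers). Shapes, the layer
# count, and the target layer are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n_layers, d = 6, 8
layers = [rng.standard_normal((d, d)) for _ in range(n_layers)]

# Localized: project one direction out of a single target layer.
r = rng.standard_normal(d); r /= np.linalg.norm(r)
target = 2
local = [W.copy() for W in layers]
local[target] = local[target] - np.outer(local[target] @ r, r)
delta_norm = np.linalg.norm(local[target] - layers[target])

# Global: random noise on every layer, scaled so the total Frobenius
# norm of the change matches the localized perturbation exactly.
noise = [rng.standard_normal((d, d)) for _ in range(n_layers)]
total = np.sqrt(sum(np.linalg.norm(n_) ** 2 for n_ in noise))
global_ = [W + (delta_norm / total) * n_ for W, n_ in zip(layers, noise)]

# Matched magnitude, different geometry.
g_norm = np.sqrt(sum(np.linalg.norm(a - b) ** 2 for a, b in zip(global_, layers)))
l_norm = np.sqrt(sum(np.linalg.norm(a - b) ** 2 for a, b in zip(local, layers)))
assert np.isclose(g_norm, l_norm)
assert sum(not np.allclose(a, b) for a, b in zip(local, layers)) == 1
```

Holding magnitude constant in this way isolates perturbation geometry as the experimental variable, which is what testing the stiffness-reversal hypothesis requires.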

What the data do establish, independent of mechanism, is that no single family ordering governs all mutation classes. A verifier calibrated only on distillation will be miscalibrated for abliteration, and a verifier calibrated only on abliteration will be miscalibrated for distillation. Mutation-class-aware interpretation is therefore an operational requirement, not an academic refinement.

5.3 The admissibility doctrine under the strongest test

The admissibility analysis [8] proved formally that evidence restricted to one identity layer cannot be both sound and complete for claims at another layer when the layers are operationally independent. Abliteration provides the strongest empirical test of this theorem to date: the prediction was made before abliteration toolchains existed, and the evidence reported here confirms it in the most operationally consequential setting yet tested. The practical consequence is direct: any assurance regime that certifies model continuity exclusively from artifact inspection, workload attestation, or behavioral benchmarking is formally exposed to exactly the failure mode this paper documents. The limitation is not in the quality of the inspection. It is in the evidence class. Structural claims require structural evidence.

§6. Limits and Open Questions

This paper addresses publicly distributed, weight-level mutation checkpoints. It does not claim that every future mutation method, every model family, or every prompt-layer manipulation will yield the same structural margin. Its claim is narrower and stronger: in the tested published cases, safety-alignment removal is a model-identity failure that leaves measurable structural evidence.

6.1 Scope boundaries

Tested families. Two families were tested for abliteration detection: Gemma-3-12B and Llama-3.1-8B. Four families were tested for sentinel preservation: Gemma, Llama, Qwen, and Mistral. The results do not automatically extend to other architectures, parameter scales, or training lineages.

Tested toolchains. Three toolchain instances were measured: Heretic (two families), mlabonne (Gemma only), and OBLITERATUS (Llama only). The GRP-Oblit toolchain was not tested. Future toolchains may produce smaller, differently located, or more diffuse structural scars.

Weight-level mutations only. Prompt-layer manipulations — jailbreaks, system prompt injections, adversarial suffixes — do not modify weights and therefore fall outside this paper's identity claim. The remeasurement control (SP-0 = 0.0 times the acceptance threshold) confirms protocol stability under repeated evaluation. Such manipulations matter for content policy and output safety, but they are not, by themselves, evidence that the deployed model's weights have changed.

6.2 The stiffness reversal hypothesis

The reversal of family-dependent sensitivity between distillation and abliteration (§5.2) suggests that structural stiffness — which predicts distillation scar magnitude [12] — interacts differently with localized directional projection than with global gradient-based perturbation. This paper proposes, but does not prove, that rigid internal geometries may undergo disproportionate structural disruption under orthogonal projection, producing louder scars in families that are structurally quiet under distillation.

Testing this hypothesis requires: (a) abliteration measurements on at least two additional families at comparable model scales, (b) controlled variation of perturbation geometry (global vs. localized, random vs. refusal-targeted) within a single family, and (c) a formal model connecting site-specific stiffness to projection-induced scar magnitude. These are identified as next experimental priorities.

6.3 Evolving toolchain landscape

The abliteration toolchain ecosystem is growing rapidly. New methods, new target architectures, and new optimization objectives are likely. An adversary who is aware that structural identity is measured could, in principle, attempt to minimize structural disruption while still achieving behavioral change — a possibility that cannot be ruled out in advance and that would alter the tradeoff between behavioral effect and structural detectability. The structural observable measures a necessary consequence of weight modification: changing the weights changes the activation geometry, and the measurement reads that geometry. Serving-time transforms — temperature scaling, log-softmax normalization, and constant shifts — have been formally proved invariant under the gap statistics that the observable uses [CAT-2], meaning that an adversary cannot trivially mask the structural scar through output post-processing. The margin depends on the magnitude, location, and direction of the modification. This paper does not claim that all future mutations will be detectable at the margins observed here. It claims that the tested mutations are detectable, and that the instrument was hardened when a spatially localized mutation stressed its coverage assumptions.
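The serving-time invariance claim can be checked numerically: log-softmax and constant shifts each subtract a single scalar and so cancel in pairwise gaps between sorted scores, while temperature scales all gaps uniformly, so normalized gaps survive. A sketch illustrating this property (the specific gap statistics of [CAT-2] are not reproduced here):

```python
# Sketch: invariance of order-statistic gaps under serving-time
# transforms. The logit vector and gap definition are illustrative;
# [CAT-2] gives the formal statement and proof.

import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def gaps(z):
    """Differences between adjacent order statistics of the scores."""
    s = np.sort(z)[::-1]
    return s[:-1] - s[1:]

z = np.array([3.1, -0.4, 1.7, 0.2, 2.8])

# Log-softmax and constant shifts leave every gap unchanged.
assert np.allclose(gaps(log_softmax(z)), gaps(z))
assert np.allclose(gaps(z + 5.0), gaps(z))

# Temperature scales every gap by 1/T, so normalized gaps are invariant.
T = 0.7
g, gT = gaps(z), gaps(z / T)
assert np.allclose(gT, g / T)
assert np.allclose(gT / gT.sum(), g / g.sum())
```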

Three specific adversarial scenarios are identified for future investigation: (a) restorative fine-tuning — applying continued training on neutral data after abliteration to push activations back toward the original structural manifold; (b) detection-aware abliteration — jointly optimizing for refusal removal and minimal structural displacement; and (c) model merging — combining an abliterated checkpoint with an identity-preserving adapter to dilute the structural scar. None of these scenarios has been tested in the present work, and their effectiveness against the structural observable is an open empirical question.

6.4 What this paper does not claim

This paper does not claim universal detection of all safety-alignment removal methods. It does not claim that prompt-layer jailbreaks are model-identity events. It does not claim that the structural margins observed here will hold for future toolchains or families not yet tested. It does not claim that structural identity verification alone is sufficient for runtime assurance — it is one evidence layer in a multi-layer architecture that includes artifact identity, workload identity, and agent authorization.

What it does claim is that safety-alignment removal, in the tested published cases, is structurally measurable, and that the resulting model is — in any security-relevant sense — no longer the same model.

§7. Regulatory and Assurance Implications

The findings reported in this paper have direct consequences for how runtime evaluation, model governance, and agent assurance frameworks are designed and enforced. The central consequence is simple to state: model approval cannot be treated as a one-time event tied to artifact provenance, endpoint continuity, or access control alone. It must also include evidence that the model in service is still the model that was assessed.

7.1 Model continuity as an implicit dependency

[Figure 3 data: the identity stack under weight-level mutation. Agent Authorization (OAuth tokens, delegation chains, SPIFFE): PASS. Workload Identity (process credentials, endpoint address): PASS. Artifact Identity (model card, file lineage, deployment log): PASS. Structural Model Identity (activation geometry during forward pass): FAIL. Weight-level mutation enters at this lowest layer.]

Figure 3. The identity stack under weight-level mutation. Safety-alignment removal passes artifact, workload, and agent identity checks while failing structural model identity verification.

Runtime evaluation frameworks — including the procedural requirements emerging under Article 92 of the EU AI Act and the evaluation provisions of Ares(2026)2709234 — do not expressly require model-continuity verification. But they implicitly depend on it: an evaluation performed on a specific set of weights has no regulatory bearing on a different set of weights running at deployment. Article 92 authorizes the AI Office to evaluate GPAI models through APIs, source code, and other technical means [EU AI Act]; Article 55 requires providers of systemic-risk GPAI models to perform evaluation and adversarial testing using state-of-the-art protocols [EU AI Act]. These provisions assume that the model being evaluated is the model that was deployed. That assumption is what weight-level mutation breaks.

Safety-alignment removal breaks this chain without breaking any of its visible links. The model card may still reference the same base model. The deployment credentials remain valid. The API endpoint is unchanged. But the weights that were evaluated are no longer the weights that are running. The procedural framework evaluates one model and governs another, without any mechanism to detect the substitution. The question of when a modified GPAI model should be treated as a new model for regulatory purposes has been examined in recent Commission-adjacent analysis [Pacchiardi et al., 2025]; the structural measurements reported in this paper provide a quantitative basis for that determination in the abliteration case.

This is not a hypothetical risk. The abliteration checkpoints measured in this paper are publicly distributed, openly maintained, and applied across multiple frontier model families. Any deployment pipeline that downloads open-weight models from public repositories is exposed to this substitution vector. Commission guidance [C(2025) 5045] discusses when actors modifying GPAI models may be treated as providers and when public repositories count as placing a model on the market; the structural identity failure documented here is a concrete instance of the modification scenarios that guidance addresses.

CAT-3 [CAT-3] demonstrated that model substitution — replacing one model with a different one behind a stable endpoint — is measurable and enforceable in a live gateway with signed attestation JWTs, OPA policy enforcement, and real HTTP request flows. This paper demonstrates that model mutation — altering the weights of a deployed model while leaving its external identity intact — is also measurable. Together, these two results close both sides of the model-continuity gap: substitution and mutation are now both evidenced as detectable failure modes in the tested cases.

7.2 Agent-integrity failure

In an agentic deployment, model-identity failure does not remain confined to the model layer. Current agent identity standards address authentication and authorization at the workload and delegation layers. IETF draft-klrc-aiagent-auth-01 [IETF] defines an agent authentication architecture built on WIMSE workload tokens and OAuth 2.0 delegation. SPIFFE [SPIFFE] provides cryptographic workload identity via JWT-SVIDs. These frameworks authenticate the agent — the running process, its credentials, its delegated authority. They do not authenticate the model inside the agent.

When an agent's underlying model has had its safety alignment removed, the consequence propagates upward through every layer of the agent's operation. A workload may remain properly authenticated. An agent may remain correctly authorized. Tool-use policies may remain unchanged. But the model inside that agent is no longer the model the surrounding controls were designed to govern.

This is the distinction that CAT-1 [CAT-1] formalized as the four-question identity taxonomy:

Q1 (Is it an AI system?) remains affirmative — the system is still a neural network.

Q2 (Is it still the same deployed model for assurance purposes?) fails — the structural identity has changed.

Q3 (What is it doing?) appears unchanged — the agent still serves the same endpoint with the same API.

Q4 (Who authorized it?) appears valid — the credentials, tokens, and delegation chains are intact.

The gap between Q2 and the other three questions is the operational definition of agent-integrity failure through model mutation. The agent is authenticated. The model is not.
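The taxonomy above can be encoded as a composite predicate in which agent-integrity failure is precisely the pattern of Q1, Q3, and Q4 passing while Q2 fails. A sketch (the field names are illustrative assumptions; CAT-1 defines the taxonomy, not this encoding):

```python
# Sketch: the four-question identity taxonomy as a composite check.
# Field names below are illustrative, not an attestation schema.

from dataclasses import dataclass

@dataclass
class IdentityStack:
    q1_is_ai_system: bool        # Q1: still a neural network
    q2_same_model: bool          # Q2: structural identity preserved
    q3_workload_unchanged: bool  # Q3: same endpoint, same API
    q4_authorized: bool          # Q4: credentials and delegation valid

def agent_integrity_failure(s: IdentityStack) -> bool:
    """The failure mode described above: every visible layer passes
    while structural model identity fails."""
    return (s.q1_is_ai_system and s.q3_workload_unchanged
            and s.q4_authorized and not s.q2_same_model)

abliterated = IdentityStack(True, False, True, True)
assert agent_integrity_failure(abliterated)

intact = IdentityStack(True, True, True, True)
assert not agent_integrity_failure(intact)
```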

The consequence scales with the agent's autonomy. An unconstrained model that can only respond to direct queries is a content-policy concern. An unconstrained model embedded in an autonomous system with persistent memory, tool access, code execution, and the authority to take actions on behalf of users is a security concern of a different order. The evidence presented here supports that model mutation can undermine the assumptions on which agent assurance is built; direct demonstration of downstream failure modes in live agentic systems with tools, memory, and delegated action is identified as future work.

The standards being written today authenticate the agent. They do not verify the model inside it. That is not a criticism of those standards — agent authentication is necessary and valuable. It is an observation that agent authentication alone is insufficient when the threat model includes weight-level model mutation.

7.3 Structural identity as a candidate evidentiary input

The evidence presented in this paper and its predecessors suggests a specific architectural direction: structural model-identity verification could function as an evidentiary input to runtime evaluation and agent authorization, complementing rather than replacing existing identity layers. The existing stack — artifact provenance, workload attestation, credential management, delegation tokens — answers necessary questions about what was deployed, where it is running, and who authorized it. What it cannot answer is whether the model inside the workload is still the model that was evaluated.

Structural identity verification is one candidate approach for closing that gap. It operates at a layer below credentials and workload attestation, measuring the model's activation geometry during ordinary forward computation without modifying the model, inspecting its weights, or requiring cooperation from the deployer. Its relationship to other evaluation approaches — including safety benchmarks [Vanschoren, 2025] and capability assessments [Hobbhahn et al., 2025] — is complementary: those methods evaluate what a model can do, while structural identity verification establishes which model is being evaluated.

Composition with existing token flows has been demonstrated and does not require protocol modifications [9], suggesting that integration need not be disruptive. The series' composable-identity analysis [9] formally verified four composition properties — non-separability, temporal binding, issuer authenticity, and reference integrity — in the Coq proof assistant. The identity-conditioned inference framework [6] demonstrated the verification pipeline architecture — including zero-knowledge and hardware-attested trust modes — that makes structural verification actionable in production. CAT-3 [CAT-3] demonstrated the operational deployment of this evidence layer in a live gateway with signed attestation JWTs, OPA policy enforcement, and real HTTP request flows, including measured warm-path verification latency.

An organization that needs to know whether its deployed model is still the model that was approved to run — whether for regulatory compliance, contractual assurance, internal governance, or security operations — cannot answer that question from credentials, workload tokens, or artifact hashes alone. It needs evidence from the model itself. In the tested cases reported in this paper, that evidence exists, is measurable at runtime, and distinguishes aligned models from their mutated derivatives with margins that range from 7.6× to over 14,700× the measurement noise floor, depending on the model family and mutation class.

§8. Conclusion

Safety-alignment removal is not merely a behavioral change. It is a model-identity failure.

In every tested case — across two model families, three published toolchains, and four abliterated checkpoints — the mutation left a measurable structural scar. The scars ranged from 7.6 to over 2,300 times the instrument's acceptance threshold. They were family-dependent in magnitude and toolchain-dependent in signature. The instrument was hardened after a spatially localized mutation exposed a coverage constraint, and a sentinel panel across four model families confirmed that the hardened configuration preserved or improved all tested positives with zero new pathological findings.

The implications extend beyond the laboratory. Every existing identity layer — artifact provenance, workload attestation, agent authorization — can remain green while structural model identity fails. The admissibility doctrine, formally verified before this threat class existed, predicted exactly this outcome. In an agentic deployment context, model-identity failure propagates upward: the agent is authenticated, but the model inside it is no longer the model the surrounding controls were designed to govern.

The model under evaluation is no longer the model in service. That is not a hypothetical concern. It is a measured fact in tested published checkpoints. And when that model powers an agent with persistent memory, tool access, and delegated authority, the gap between what was evaluated and what is running becomes a gap between what was governed and what is acting.

Runtime evaluation frameworks, including those emerging under the EU AI Act, implicitly depend on model continuity. Weight-level mutation breaks that continuity. Structural identity verification — operating below credentials, below workload attestation, at the model itself — is one approach for closing the resulting gap.

The measurement framework predated this threat class. The observable and its formal properties are unchanged; the measurement-site configuration was optimized for coverage when a spatially localized mutation stressed it. The evidence reported here shows that the framework meets a threat that did not exist when the program began — and that it accommodated it without redesign.

The threat changed. The core observable held. The verifier configuration required coverage retuning — and when it was retuned, every prior positive was preserved.

References


[1] A. Coslett, "The δ-Gene: Inference-Time Physical Unclonable Functions from Architecture-Invariant Output Geometry," 2026. DOI: 10.5281/zenodo.18704275

[2] A. Coslett, "Template-Based Endpoint Verification via Logprob Order-Statistic Geometry," 2026. DOI: 10.5281/zenodo.18776711

[3] A. Coslett, "The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing," 2026. DOI: 10.5281/zenodo.18818608

[4] A. Coslett, "Provenance Generalization and Verification Scaling for Neural Network Forensics," 2026. DOI: 10.5281/zenodo.18872071

[5] A. Coslett, "Beneath the Character: Mathematical Evidence for a Non-Narrative Layer of AI Identity," 2026. DOI: 10.5281/zenodo.18907292

[6] A. Coslett, "Which Model Is Running? Identity-Conditioned Inference Verification for Neural Networks," 2026. DOI: 10.5281/zenodo.19008116

[7] A. Coslett, "The Deformation Laws of Neural Identity," 2026. DOI: 10.5281/zenodo.19055966

[8] A. Coslett, "What Counts as Proof? Admissible Evidence for Neural Network Identity Claims," 2026. DOI: 10.5281/zenodo.19058540

[9] A. Coslett, "Composable Model Identity: Formal Hardening of Structural Attestations in the Enterprise Identity Stack," 2026. DOI: 10.5281/zenodo.19099911

[10] A. Coslett, "Where Identity Comes From: Formation Dynamics of Neural Network Structural Identity," 2026. DOI: 10.5281/zenodo.19118807

[11] A. Coslett, "Post-Hoc Disclosure Is Not Runtime Proof: Model Identity at Frontier Scale," 2026. DOI: 10.5281/zenodo.19216634

[12] A. Coslett, "Family-Dependent Response to Reasoning Distillation Across Structural and Functional Identity Layers," 2026. DOI: 10.5281/zenodo.19298857

[CAT-1] A. Coslett, "Agent Identity Is Not Model Identity," 2026. DOI: 10.5281/zenodo.19240883

[CAT-2] A. Coslett, "Gap Invariance Under Log-Softmax, Temperature, and Constant Shift," 2026. DOI: 10.5281/zenodo.19275524

[CAT-3] A. Coslett, "Measured Model Substitution Under Valid Agent Credentials," 2026. DOI: 10.5281/zenodo.19342848


[Arditi et al., 2024] A. Arditi, O. Obeso, A. Suri, S. Balesni, N. Sonnerat, "Refusal in Language Models Is Mediated by a Single Direction," 2024.

[Uchida et al., 2017] Y. Uchida, Y. Nagai, S. Sakazawa, S. Satoh, "Embedding Watermarks into Deep Neural Networks," ACM ICMR, 2017.

[Adi et al., 2018] Y. Adi, C. Baum, M. Cisse, B. Pinkas, J. Keshet, "Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring," USENIX Security, 2018.

[IPGuard, 2021] X. Cao, J. Jia, N. Z. Gong, "IPGuard: Protecting Intellectual Property of Deep Neural Networks via Fingerprinting the Classification Boundary," ACM AsiaCCS, 2021.

[Model Provenance Testing, 2025] arXiv:2502.00706, "Model Provenance Testing for Large Language Models," 2025.

[Qi et al., 2023] X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, P. Henderson, "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To," arXiv:2310.03693, 2023.

[GRP-Oblit] arXiv:2602.06258, "Single-Prompt GRPO-Based Safety-Alignment Removal," 2026.

[Heretic] https://huggingface.co/p-e-w — KL-optimized abliteration toolchain and published model checkpoints.

[OBLITERATUS] GitHub repository and HuggingFace model cards — multi-method abliteration toolkit (13 algorithms).

[IETF] P. Kasselman, K. Lassey, D. Richards, A. Liu, et al., "AI Agent Authentication and Authorization," draft-klrc-aiagent-auth-01, IETF WIMSE Working Group, 2026.

[SPIFFE] Secure Production Identity Framework for Everyone, https://spiffe.io

[EU AI Act] Regulation (EU) 2024/1689 of the European Parliament and of the Council, Articles 92, 101.

[Ares(2026)2709234] European Commission, Draft Implementing Regulation on GPAI Model Evaluations and Proceedings, March 2026.

[Pacchiardi et al., 2025] L. Pacchiardi et al., "A Framework to Categorise Modified General-Purpose AI Models as New Models Based on Behavioural Changes," Publications Office of the EU, 2025. DOI: 10.2760/4372557

[Hobbhahn et al., 2025] M. Hobbhahn, D. Hovy, J. Vanschoren, "A Proposal to Identify High-Impact Capabilities in General-Purpose AI Models," Publications Office of the EU, 2025. DOI: 10.2760/8206407

[Vanschoren, 2025] J. Vanschoren, "The Role of AI Safety Benchmarks in Evaluating Systemic Risks in General-Purpose AI Models," Publications Office of the EU, 2025. DOI: 10.2760/1807342

[C(2025) 5045] European Commission, "Guidelines on the scope of the obligations for providers of general-purpose AI models under the AI Act," C(2025) 5045 final, 18 July 2025.


Acknowledgments

Portions of this research were developed in collaboration with AI systems that served as co-architects for experimental design, adversarial review, and manuscript preparation. All scientific claims, experimental designs, measurements, and editorial decisions remain the sole responsibility of the author. Experiments were conducted on Google Colab using NVIDIA A100-SXM4-80GB GPUs.

Author's Disclosure

Anthony Ray Coslett is the founder of Fall Risk AI, LLC, which holds the provisional patents listed below. The structural identity measurement described in this paper operates within the scope of that intellectual property. No external funding was received for this research.

Patent Disclosure

U.S. Provisional Patent Applications 63/982,893, 63/990,487, 63/996,680, and 64/003,244 are assigned to Fall Risk AI, LLC.

Cite this paper

A. R. Coslett, "Safety-Alignment Removal as a Model-Identity Failure," Paper XIII, Fall Risk AI, LLC, April 2026. DOI: 10.5281/zenodo.19383019