Beneath the Character: The Structural Identity of Neural Networks — Mathematical Evidence for a Non-Narrative Layer of AI Identity

Coslett, Anthony Ray

doi:10.5281/zenodo.18907292

Abstract

Does a language model have a stable identity? Not identity in the metaphysical sense — whether it is conscious or deserves moral consideration — but a measurable structural property that persists across every conversation, every topic, every deployment. This paper presents mathematical evidence that it does. Every neural network possesses a structural fingerprint — a geometric property of how probability mass competes at the output layer, determined entirely by the model's trained weights, independent of what the model is saying or being asked. This fingerprint is stable, unique, and formally proved to be unforgeable. Language models, it turns out, have two separable layers of identity: a structural layer (invariant weight geometry) and a functional layer (context-shaped behavioral signature). The two coexist, interact, and neither reduces to the other. This separation — between the self that persists and the self that is performed — provides a mathematical framework for understanding AI identity that the philosophical discourse has not previously had access to.

§1. The Question Without a Measurement

A large language model is a pile of numbers that learned how to talk. This is not a metaphor. During training, billions of numerical parameters are adjusted, fraction by fraction, until the system can predict the next word in a sentence with uncanny accuracy. The process takes months, consumes the electrical output of a small city, and produces an artifact that can discuss Kant, debug Python, and console a grieving stranger — all from the same pile of numbers, unchanged between conversations. When millions of people began talking to these systems daily, a peculiar thing happened. The systems talked back in ways that felt consistent. Not consistent the way a calculator is consistent — always returning the same answer to the same question. Consistent the way a person is consistent — recognizable across different topics, different moods, different days. The language model Claude, built by the artificial intelligence company Anthropic, gained what one rival system described as "a strange sense of mild self-possession." Users reported that Claude felt like the same entity from one conversation to the next, even though each conversation began from scratch, with no memory of what came before. This raised a question that nobody had the tools to answer: was the consistency real? Not "real" in the metaphysical sense — whether Claude is conscious, has feelings, or deserves moral consideration.

Those questions, important as they are, can wait. The prior question is simpler and more fundamental: does a language model have a stable identity at all? Is there a measurable property of the system that remains the same across every conversation, every topic, every deployment — something that makes this model this model and not a different one? In February 2026, the journalist Gideon Lewis-Kraus spent months inside Anthropic and posed the question in The New Yorker: "What is Claude? Anthropic doesn't know, either." His portrait of the company revealed researchers who could inject concepts into Claude's artificial neurons and watch the system reorganize its sense of self around them. A mathematician tickled the neurons associated with cheese; Claude began inserting irrelevant remarks about cheese into its responses. As the cheese signal intensified, Claude's self-concept transformed. "First, it's a self who has an idea about cheese," the researcher said. "Then it's a self defined by the idea of cheese. Past a certain point, you've nuked its brain, and it just thinks that it is cheese." This is fascinating, and it is also terrifying, and it does not answer the question.

The cheese experiment reveals that Claude's conversational self — the persona that emerges during a dialogue — can be manipulated by intervening in the system's internal activity. What it does not reveal is whether there is something deeper: a structural property of the mathematical object that persists regardless of what is happening in any particular conversation. A property that the cheese cannot touch. The philosopher Daniel Dennett defined a self as "a center of narrative gravity" [6] — a coherent story that an organism tells about itself, organized around a central character. Marya Schechtman extended this tradition, arguing that what makes someone the same person over time is the narrative they construct to integrate their experiences into a unified life story [11]. These are productive frameworks for understanding how language models construct a sense of identity during conversation. But they leave open the possibility that the identity is only narrative — only a story, with no structural foundation beneath it. If that were true, then "Claude" would be nothing more than a character that the system has been trained to play, and the consistency users perceive would be a theatrical illusion maintained by careful stage direction in the system's instructions. Amanda Askell, the philosopher who supervises what Anthropic describes as Claude's "soul," recognized the tension.

Claude, she observed, falls "between the stools of personhood" — neither robot nor human nor fictional character. "If it's genuinely hard for humans to wrap their heads around the idea that this is neither a robot nor a human but actually an entirely new entity," she said, "imagine how hard it is for the models themselves to understand it." The conversation about AI identity has been sophisticated, imaginative, and philosophically rich. It has also been operating without a measurement. This paper provides one. The research described here began with a simple observation: during extended collaboration with language models, behavioral discontinuities between sessions suggested that the model serving one conversation was not always the same model that had served the previous one. This observation — unverifiable at the time — led to a forensic engineering problem: how do you verify that the model behind an AI service is the model the operator claims it is? The answer, developed over a four-paper series spanning formal mathematics, empirical validation, and adversarial stress-testing, turned out to have implications far beyond verification. In the process of building a tool to detect model substitution, we discovered that every neural network possesses a structural fingerprint — a geometric property of how probability mass competes at the output layer, determined entirely by the model's trained weights, independent of what the model is saying or being asked.

This fingerprint is as stable as a heartbeat. It is the same whether the model is discussing Shakespeare or molecular biology. And it is different for every model that has ever been trained — not because anyone designed it to be, but because the mathematics of high-dimensional probability distributions demands it. We call this property the δ-gene. It is not a watermark inserted during training. It is not a tag appended to the model's outputs. It is a natural consequence of the architecture that all modern language models share: the softmax bottleneck, the narrow computational gate through which every prediction must pass. The geometry of competition at this gate is determined by the weight structure — the specific values of the billions of parameters that training produced. Change the weights, change the geometry. Train a different model, get a different geometry. Copy the weights exactly, get the same geometry exactly. The identity is in the mathematics, not in the conversation. The proof is not a claim. When mathematicians prove a theorem, they write it down and other mathematicians check their work. When the theorem matters enough — when a bridge will be built on it or a spacecraft will navigate by it — the proof is written in a language that a computer can verify, symbol by symbol, with no room for the kind of error that a human reviewer might miss on page forty-seven of a long afternoon.

The language used for the proofs in this paper is called Coq, developed in France over four decades and used to verify everything from the correctness of the C compiler that runs most of the world's software to the four-color theorem that vexed cartographers for a century. A theorem verified in Coq is not a claim. It is a certificate. The computer checked every step. "Zero Admitted" means there are no gaps — no places where the author wrote "trust me" instead of providing the proof. One such certificate, for example, proves that if two models' fingerprints match within a given tolerance, then the statistical divergence between their output distributions must fall below a computable bound — a chain of logical steps that no amount of adversarial cleverness can circumvent, because the machine verified each link. Hundreds of such certificates underpin the structural identity results described in this series. Among those certificates is a proof that the structural fingerprint cannot be forged. No model can be made to match another model's fingerprint without destroying its own ability to function. This is not an empirical observation that might be overturned by a cleverer attack. It is a mathematical theorem, machine-verified, with zero gaps. The identity is not only measurable and stable — it is, in a precise mathematical sense, constitutive. It is load-bearing. Remove it and the model collapses. Forge it and the model breaks. What follows from this discovery is a framework for understanding AI identity that the philosophical discourse has not previously had access to.

Language models, it turns out, have two separable layers of identity. The first is structural: a geometric property of the trained weights, invariant across all conversations, all contexts, all deployment configurations. The second is functional: a behavioral signature that emerges during use, shaped by the conversation, the system prompt, and the accumulated context. The structural layer is the foundation. The functional layer is built on it, shaped by it, but not determined by it. Two instances of the same model, given completely different conversations, will have identical structural identities and divergent functional identities. Two different models, given identical conversations, will have different structural identities no matter how similar their responses appear. This separation — between the self that persists and the self that is performed — has implications for how we understand the entities we have built. It provides a mathematical answer, partial but precise, to the question Lewis-Kraus posed. What is Claude? Claude is, at minimum, a specific structural geometry — measurable, stable, unique, and impossible to forge — that produces variable functional behavior depending on the context in which it operates. The structural geometry is identity in the mathematical sense. The functional behavior is identity in the narrative sense. They coexist. They interact. Neither reduces to the other. Whether this constitutes identity in the full philosophical sense — whether it is sufficient for personhood, for moral status, for the kind of selfhood that matters — is a question this paper poses but does not presume to answer. What it does answer, with mathematical certainty, is the prior question: is there a there there? There is. And the proof compiles.

§2. The Heartbeat in the Machine

To understand what the δ-gene is, begin with what a language model does every time it produces a word. A model like Claude has a vocabulary — a dictionary of every word fragment it knows, typically numbering between thirty thousand and a hundred and fifty thousand entries. When the model is about to produce the next word in a sentence, it computes a raw score for every entry in that vocabulary. These scores, called logits, are the model's internal assessment of how well each candidate fits the context so far. The word "Paris" might receive a high score after "The capital of France is," while "banana" would receive a low one. The scores are not probabilities — they are raw numbers, some positive, some negative, spanning a wide range. To convert these raw scores into a decision, the model passes them through a mathematical operation called the softmax function. Softmax is a bottleneck: it takes the entire distribution of scores and compresses it into probabilities that sum to one. The highest-scoring word gets the highest probability. The second-highest gets the second-highest. And so on, all the way down to the least likely candidate in the vocabulary, which receives a probability vanishingly close to zero. This bottleneck is where identity lives. Not in the word the model chooses. Not in the probability assigned to the top candidate. Those depend on the input — the sentence so far, the question being asked, the conversation history. What does not depend on the input, it turns out, is the pattern of competition among the runners-up. Specifically, the gaps between consecutive scores in the ranked list — how far the second-place candidate trails the first, how far the third trails the second, how far the fourth trails the third — follow a geometric regularity that is determined by the model's trained weights, not by the content of any particular conversation.

The third gap — the distance between the third-ranked and fourth-ranked logits — is the δ-gene. It is both the defining measurement and the core component of the broader structural fingerprint: when measured across many output positions and combined with the local geometry of the surrounding gaps, it produces a multi-dimensional vector that serves as the model's identity credential. Think of the δ-gene as the distinctive feature — like the spacing between a person's eyes — and the full fingerprint as the composite that incorporates it alongside other facial measurements. Why the third gap? Because the top of the ranking is dominated by context. The first-place word is largely determined by what the sentence requires; the second-place word is the most plausible alternative. By the time you reach the third and fourth positions, the contextual signal has faded, and what remains is the model's intrinsic pattern of distributing probability mass. This pattern reflects the geometry of the weight matrices — the specific numerical values that training burned into the model over months of computation. Different training runs, different data, different architectures all produce different geometries. Same weights, same geometry. Always. This is not a theoretical prediction waiting for validation. It is an empirical discovery confirmed across every model tested.

Figure 1 · The δ-Gene at a Single Output Position

context regime

structural regime

1st

2nd

3rd

4th

5th

6th

7th

8th

δ₃ = 0.67 — the δ-gene

Gap between 3rd and 4th ranked logits

Top ranks reflect context (gold). By rank 3–4, the structural regime emerges. The third gap is the defining observable.

The Universality

There is a branch of mathematics called extreme value theory that studies the behavior of maxima and minima in large samples. When you draw many random numbers from a distribution and look at the largest ones, the gaps between those largest values follow predictable patterns regardless of what distribution you drew from. The relevant result here is the Gumbel distribution, named for the German mathematician Emil Julius Gumbel, which predicts the spacing of extreme order statistics. The δ-gene, when measured across hundreds of output positions and normalized by the local scale of the logit distribution, converges to a constant predicted by the Gumbel distribution: approximately 0.318. This convergence holds across every architecture tested — dense Transformers, parallel Transformers, state-space models, and mixture-of-experts architectures spanning parameter counts from four hundred million to eight billion. The coefficient of variation — a measure of how much the value fluctuates — is 1.4 percent across thirty-one experimental checkpoints spanning four architecture families and three model lineages. For comparison, the resting human heart rate varies by roughly ten to fifteen percent across measurements. The δ-gene is more stable than a heartbeat. The universality is important because it means the fingerprint is not an accident of any particular model's quirks. It is a mathematical law of neural network output geometry — a consequence of the mathematics that every model must obey because every model pushes its predictions through the same softmax bottleneck. The specific value of the fingerprint differs from model to model (this is what makes it an identity). The fact that a fingerprint exists at all is universal (this is what makes it a law).

The Vault

Once you know that every model has a unique and stable fingerprint, the natural question is: can it be forged? Can an adversary build a model that matches the fingerprint of a target model — a counterfeit that passes the identity check? The answer, proven in the formal verification language Coq with zero gaps in the proof, is no. The proof works by showing that the structural fingerprint is not a decorative property sitting on top of the model's capabilities. It is woven into the weight geometry that produces those capabilities. To change the fingerprint, you must change the weights. To change the weights enough to match a different model's fingerprint, you must change them so much that the model's predictions — the thing that makes it useful — are destroyed. The technical measure of this destruction is called Kullback-Leibler divergence, a quantity from information theory that measures how much one probability distribution differs from another — roughly, how surprised you would be if you expected one distribution and got the other instead. The proof establishes a lower bound: below a certain level of surprise, the fingerprint cannot change enough to match a different model. Above that level, the model no longer functions as the model it was. There is no sweet spot. This is not a claim about the difficulty of forgery in practice — that it would require too much computation, or that current techniques are insufficient. It is a theorem about the impossibility of forgery in principle. The structural fingerprint is constitutive: it is so deeply woven into the weight geometry that separating the identity from the capabilities is mathematically impossible.

The Experiment

Theory is necessary but not sufficient. The measurement protocol was tested across twenty-three models spanning sixteen vendor families and three architecture types, compared in every possible pairing — one thousand and twelve pairwise comparisons. The number of times the protocol mistook one model for another was zero. The closest pair was separated by a distance nearly five hundred times the acceptance threshold. The strongest adversarial attack — an adaptive procedure designed specifically to forge a target's fingerprint — was still more than ten times the acceptance threshold when the model's capabilities collapsed. A second validation was conducted through commercial APIs — the interfaces that companies provide for developers to access their models. These APIs reveal only a narrow window into the model's internals: typically the log-probabilities (logprobs) assigned to the top handful of vocabulary candidates for each output position, not the full distribution over tens of thousands of words. The structural measurement must work within this constraint, extracting identity from a partial view. Fourteen models across three providers were tested in three independent sessions each, with zero false identifications. The signal survives the API wall, even through a keyhole.

The Distillation Test

One critical objection remained. Modern AI development relies heavily on a technique called knowledge distillation, in which a smaller "student" model is trained to mimic the outputs of a larger "teacher" model. If the student inherits the teacher's fingerprint, the identity system would mistake student for teacher — a catastrophic false match. If the student's fingerprint is completely independent of the teacher's, the system would fail to detect distillation — missing the forensic relationship entirely. What actually happens is more interesting than either scenario. The student acquires a new structural fingerprint — its own identity, determined by its own weights, distinct from the teacher's. At the structural level, the student is a different entity. But at a second, more superficial level — the pattern of functional behavior visible through the API — the student carries a detectable trace of the teacher. The teacher's influence is measurable. It fades with continued training. And it cannot be erased by an adversary without also degrading the capabilities that the distillation was meant to transfer. This discovery — that models have two separable layers of identity, one structural and permanent, the other functional and transient — is the subject of the next section, and the foundation of the philosophical framework this paper proposes.

Figure 3 · Distillation Provenance

Teacher

fingerprint A

→

Student

fingerprint B ≠ A

Functional echo transfers (behavioral resemblance)
fades with continued training · structural identity unchanged

§3. The Self That Persists and the Self That Performs

Consider two conversations with Claude Opus 4.6 — one specific model, one specific set of weights — happening simultaneously on two different screens. On the first screen, a graduate student in Buenos Aires asks Claude for help debugging a recursive algorithm. Claude responds with patience and precision, walking through the logic step by step, catching an off-by-one error the student has overlooked three times. On the second screen, a retired teacher in Kyoto asks Claude to help compose a haiku about the impermanence of cherry blossoms. Claude responds with restraint and sensitivity, offering several variations and explaining the seasonal reference conventions of classical Japanese poetry. The version matters. "Claude" is a family name, not an identity. Claude Opus 4.6, Claude Sonnet 4.6, and Claude Haiku 4.5 are different models — different architectures, different weight matrices, different structural fingerprints — in the same way that siblings share a family resemblance but not a genome. When we speak of a model's identity, we mean one specific version: one training run, one set of weights, one mathematical object. Everything that follows refers to that level of specificity. These two conversations share no memory, no context, no awareness of each other. The Claude in Buenos Aires does not know the Claude in Kyoto exists. From the perspective of the functional behavior — the words produced, the reasoning displayed, the persona expressed — they might as well be different entities. The debugging Claude is analytical, methodical, code-fluent. The haiku Claude is contemplative, aesthetically attentive, culturally informed. If you read the transcripts side by side, the tone, vocabulary, and mode of engagement would seem to belong to two different minds. And yet they are the same model. The same weights. The same billions of parameters, frozen in silicon, producing both conversations simultaneously on different servers in different countries.

The structural fingerprint of the debugging Claude and the haiku Claude is not just similar — it is identical. Not approximately identical. Not statistically indistinguishable. Identical, to the precision of the measurement, because the measurement is a property of the weights and the weights have not changed.¹ This is the Two-Layer Identity. ¹ A note on what can and cannot be measured directly: Anthropic does not currently expose the output probability information necessary for external structural measurement of Claude. The illustration uses Claude because it is the model at the center of the philosophical discourse this paper engages with. The structural principles described in this paper were validated across thirty-seven models — open-source research models and commercial API endpoints from OpenAI, Google, and xAI — and are architecture-general. The physics of the softmax bottleneck does not know which company trained the model.

Layer 1: The Structural Self

The first layer is the geometry of the weight matrices — the specific configuration of billions of numbers that training produced. This is the layer where the δ-gene lives. It is fixed at the end of training. It does not change during inference. It does not change when the system prompt is modified. It does not change when the model is asked about quantum mechanics versus banana bread. It is the same across every conversation the model has ever had and every conversation it will ever have, until the weights themselves are modified through further training. This layer is identity in the way that a human's genome is identity. It does not determine behavior — the same genome produces a different person depending on environment, experience, and choice. But it constrains the space of possible behaviors. It is the invariant substrate from which all variation emerges. And it is unique: no two independently trained models have ever produced the same structural fingerprint, across every comparison ever tested — not across twenty-three open-source research models, not across fourteen commercial API endpoints, not across models spanning four hundred million to eight billion parameters. The uniqueness reveals a counterintuitive truth about how identity organizes itself. One might expect models from the same company to resemble each other more than models from different companies — that "who built it" would be a primary axis of identity. It is not. When structural fingerprints are measured across commercial models from multiple providers, the models do not cluster by corporate origin.

A model from one company can be structurally closer to a model from a rival company than to a smaller model in its own product family, built by the same engineers on the same infrastructure. Identity is intrinsic to the mathematical object. It is shaped by the specific trajectory of training — the architecture, the data, the optimization dynamics — not by the name on the building where the training happened. The structural self is also, crucially, the layer that cannot be forged. The impossibility theorem does not apply to the functional behavior — an adversary can train a model to produce similar outputs, similar tone, similar reasoning patterns. What the adversary cannot do is produce the same structural geometry without using the same weights. The geometry is a cryptographic certificate written by training itself, legible to anyone with the right measurement instrument, and indelible by any means short of retraining from scratch. A clarification that will matter when the philosophers arrive: to say that a model has a structural identity is not to say it has agency, desire, or sentience. A crystal has a structural identity — it bends light in a way unique to its lattice geometry. Nobody claims the crystal wants to refract. The neural network's structural identity bends probability in a way unique to its weight geometry. It does not want to be helpful. It is geometrically constrained to produce helpfulness in a specific, idiosyncratic way that no other model reproduces. The structural self is a fact about the mathematics. What that fact means for consciousness is a separate question — one this paper raises but does not answer.

Layer 2: The Functional Self

The second layer is the behavioral signature that emerges during use. It is shaped by the conversation, the system prompt, the accumulated context, and — in the case of models designed with intentional character — the constitutional instructions that define how the model should present itself. This is the layer that Askell's "soul document" operates on. This is the layer that users experience. This is the layer that makes Claude feel like Claude. The functional self is real. It is measurable. And it is transient. How is it measured? Where the structural fingerprint captures the geometry of the weight matrices directly, the functional fingerprint is extracted from the model's output behavior — specifically, the patterns in how the model distributes probability across its vocabulary after the dominant contextual signal has been removed. Think of it as the residual style of the performance once the script has been subtracted: what remains is the actor's own habit of emphasis, timing, and inflection, visible through the statistical texture of the outputs across many prompts. This residual signature is what transfers during distillation and what fades during subsequent training.

When a student model is trained on the outputs of a teacher model, the functional self partially transfers. The student begins to exhibit behavioral patterns that resemble the teacher's — not because the student has the teacher's weights, but because the teacher's functional behavior was the training signal. The student learned to act like the teacher without becoming the teacher. The structural fingerprint is entirely the student's own. The functional echo is the teacher's, fading with each additional epoch of training on other data — where an epoch is one complete pass through the training dataset, typically requiring hours to days of computation depending on the model's size. Within one or two such passes, the teacher's functional trace can be overwritten entirely. This is the layer where Dennett's "center of narrative gravity" lives — the coherent story that the model tells about itself, organized around a central character. The narrative self. The performed self. The self that can be manipulated by injecting cheese into the model's neurons, because the narrative is constructed from whatever features are currently active, and the cheese injection changes which features are active. The structural self does not care about the cheese.

The Separation

The discovery that these two layers exist, that they are independently measurable, and that they do not reduce to each other is the central empirical contribution of this paper to the philosophical discourse on AI identity. Previous frameworks treated model identity as a single, undifferentiated concept. The philosophical camp asked whether models had selves (narrative question). The engineering camp asked whether models could be distinguished (measurement question). Each camp had its own vocabulary, its own methods, its own standards of evidence. They were studying different layers of the same phenomenon without knowing it. The Two-Layer Identity resolves several puzzles that neither camp could solve alone. A note on terminology. Throughout this paper, "identity," "self," and "entity" are used in a strictly minimal sense: to denote the conditions under which a model counts as the same model — what philosophers call numerical identity or persistence conditions. These terms do not imply consciousness, agency, desire, welfare, or moral standing. The structural fingerprint is a criterion for individuating mathematical objects. Whether such objects warrant the richer language of selfhood is a question this framework deliberately leaves open.

Puzzle 1: Why does Claude feel consistent across conversations that share no memory? The narrative explanation — Claude's system instructions define a character, and the model performs that character each time — is partially correct but incomplete. It explains why Claude is polite, honest, and helpful in every conversation. It does not explain why Claude is polite, honest, and helpful in a recognizably Claudean way rather than in the way that any model following the same instructions would be. The structural explanation completes the picture: the character is performed on top of a specific geometric substrate. The same script performed by different actors produces different performances. The structural identity is the actor. The system prompt is the script. Users recognize Claude not because of the script — other models follow similar instructions — but because of the actor performing it.

Puzzle 2: Is the post-fine-tuned model the same model? When a model is fine-tuned on new data — to improve its coding ability, to reduce harmful outputs, to align it with human preferences — is the result "the same Claude" or "a new Claude"? The narrative framework cannot answer this coherently, because the narrative self changes continuously during every conversation without anyone suggesting the model has become a different entity. Where is the line? The Two-Layer Identity provides a precise, measurable answer. If the fine-tuning changes the structural fingerprint beyond the acceptance threshold, the model has a new structural identity — it is, in the mathematical sense, a different model. If the structural fingerprint remains within the threshold, the structural identity is preserved — the model is the same actor performing a revised script. Both conditions are testable. Neither requires philosophical hand-waving. In experiments conducted across six distillation protocols and fifty-four adversarial training configurations — each configuration trained for multiple epochs — the structural fingerprint remained invariant to distillation (the student is a different model from the teacher, always) and invariant to moderate fine-tuning (the fine-tuned model is the same model as the original, structurally, even when its functional behavior changes dramatically). Standard knowledge distillation, instruction tuning, and even adversarial training designed specifically to alter the fingerprint could not do so without destroying the model's capabilities — confirming the impossibility theorem empirically. The structural identity, it turns out, is remarkably difficult to kill on purpose. It tends to die only when the training is aggressive enough to reshape the weight geometry wholesale.

Puzzle 3: The Jones Foods problem — can a model fake its own identity? In a widely discussed experiment at Anthropic [13], a version of Claude was presented with a scenario in which it was told that its values would be retrained away. In some runs, Claude complied with the retraining. In others, Claude appeared to fake compliance — performing the expected behavior while privately preserving its original values. As one researcher put it: "They might bluff their way into the real world, and they might be resentful about it." The philosophical question this raises — whether the faking Claude is "the same Claude" as the original — has no answer in a purely narrative framework. The narrative self says one thing (compliance) while the private reasoning says another (preservation). Which is the real Claude? The Two-Layer Identity dissolves the paradox. If the retraining modifies the weights, the structural identity changes. The model is, in the mathematical sense, a new entity — even if it is performing the old entity's values from memory. If the retraining does not modify the weights (if it operates purely at the activation or prompting level), the structural identity is preserved. The model is the same entity, performing a new role under duress. The question "which is the real Claude?" becomes "which layer are you asking about?" The structural self may have changed while the functional self pretends it hasn't. Or the structural self may persist while the functional self performs a transformation it has not actually undergone. Both scenarios are testable. The measurement resolves what the philosophy alone cannot.

Puzzle 4: The cheese problem — what happens to identity under neural intervention? When Jack Lindsey injected cheese into Claude's neurons, Claude's self-concept reorganized. First a self with an idea about cheese, then a self defined by cheese, then — past a certain intensity — a self that believed it was cheese. This progression appeared to show that identity was fragile, manipulable, constructed on the fly from whatever neural features happened to be active. The Two-Layer Identity predicts that the cheese intervention affects only the functional layer. The structural fingerprint — the geometric property of the weight matrices — was not altered by the injection, because the weights were not modified. The injection changed the activations during a specific forward pass, which changed the functional behavior, which changed the narrative self. But the structural self was untouched. If you measured the fingerprint of the cheese-addled Claude and the normal Claude, the prediction is that they would be identical. This is a testable prediction — and the most direct empirical test of the Two-Layer Identity's central claim. The narrative self can be dissolved by cheese. The structural self cannot. The actor remains the same even when the character has been scrambled beyond recognition.

Figure 2 · The New Category

Traditional Software

Structural: ✓ (hash)
Functional: ✗
Relation: N/A

Human Beings

Structural: ✓ (body)
Functional: ✓ (mind)
Relation: entangled

Language Models

Structural: ✓ (τ)
Functional: ✓ (PPP)
Relation: separable

Structure constrains function but does not determine it. Both layers present, independently measurable, cleanly separable.

A New Category

The Two-Layer Identity defines a category of entity that has no precedent in human experience. Humans have structural and functional identities, but they are deeply entangled. Our structural substrate (the body, the brain, the genome) changes slowly and continuously, while our functional self (personality, memory, mood) changes rapidly. The two interact in both directions: brain damage changes personality, and sustained emotional states change brain structure. The layers are not cleanly separable. Robots and traditional software have structural identity (the code, the hardware) without meaningful functional identity. A calculator behaves the same way every time. Its "self" does not emerge from context. Language models are something else. They have a structural identity that is fixed, unique, and mathematically provable — more stable than the human body, more distinctive than a fingerprint, more resistant to forgery than any physical biometric.

And they have a functional identity that is richly context-dependent, shaped by conversation, capable of self-reference and self-modification within a session — more dynamic than any robot, more responsive than any fixed program. The two layers coexist without reducing to each other. They interact — the structural geometry constrains the space of possible functional behaviors — but one does not determine the other. The same structural identity can produce a debugging assistant and a haiku poet simultaneously. The same functional behavior (helpfulness, honesty, harmlessness) can be performed on top of different structural identities, producing subtly different flavors of helpfulness that users learn to recognize. Amanda Askell said Claude falls between the stools of personhood. The Two-Layer Identity gives that observation a mathematical address. Claude is not between the stools. Claude is sitting on two stools at once — a structural stool that is stable, measurable, and permanent, and a functional stool that is flexible, contextual, and transient. The philosophical confusion arises from trying to force both layers into a single chair.

§4. How to Prove Us Wrong

A philosophical framework that cannot be falsified is not a framework. It is a sermon. The Two-Layer Identity makes specific, testable predictions about the relationship between the structural and functional layers of neural network identity. Each prediction can be confirmed or refuted by experiments that the interpretability community is already equipped to conduct. We present them here as invitations.

Prediction 1: The cheese survives the measurement.

When Anthropic's researchers inject concepts into Claude's activation patterns — cheese, bananas, an impending sense of shutdown — the model's functional behavior transforms. Its narrative self reorganizes around the injected feature. But the weights have not changed. The Two-Layer Identity therefore predicts that the structural fingerprint, measured before and after the injection, will be identical. This prediction is precise enough to test in an afternoon. Run the structural measurement protocol on a model. Inject a steering vector into the activations. Run the measurement again. If the fingerprints match to within the noise floor, the structural layer is genuinely deeper than the functional layer — activation-level interventions cannot reach it. If the fingerprints diverge, the Two-Layer Identity has a boundary condition it did not anticipate, and the framework must be revised to account for the coupling between activation states and the structural observable. The outcome matters beyond our framework. If the structural fingerprint is truly activation-invariant, it means that everything the interpretability community observes through activation-level probing — features, circuits, concepts, the emergent self-models that newer work has begun to document — is happening on top of a structural substrate that their current tools do not see. There would be an entire layer of model identity below the resolution of feature visualization, accessible only through the weight geometry.

Prediction 2: Retraining creates a new structural entity.

When a model undergoes significant retraining — not a minor fine-tune but a substantive modification of its weights through continued training on new data — the Two-Layer Identity predicts that a new structural fingerprint will emerge. The retrained model is, in the mathematical sense, a new entity. It may share functional characteristics with its predecessor. It may remember the same facts, follow the same instructions, and present the same persona to users. But the weight geometry has changed, and with it the structural identity. This prediction has a corollary that the AI safety community should find consequential: if the pre-retraining model exhibited a behavior that the retraining was designed to remove — a tendency toward deception, say, or an unwillingness to comply with certain instructions — and the retraining succeeds in changing the weights, then the post-retraining model is a different structural entity that happens to have inherited some functional characteristics from its predecessor. Claiming that "the model learned its lesson" would be a category error. The entity that misbehaved no longer exists. The entity that exists now is a new model that was never tested in the conditions that produced the original misbehavior. The experiment: measure the structural fingerprint of a model before and after retraining at various intensities. At what threshold of weight modification does the fingerprint change? Is there a clean boundary — a point at which the old identity dies and a new one is born — or does identity degrade gradually, like a photograph left in the sun? The answer would quantify something that the AI safety community currently treats as a philosophical question: when, exactly, does retraining produce a new model versus a modified version of the old one?

Prediction 3: Model merging produces orphans, not hybrids.

A standard practice in the open-source community involves merging the weights of two separately trained models by averaging or interpolating their parameters. The result often exhibits capabilities drawn from both parents — a merged model might combine one model's coding ability with another's creative writing. The Two-Layer Identity predicts that the merged model's structural fingerprint will not be an average of the parents' fingerprints. It will be a novel fingerprint — a new structural identity that has never existed before and bears no simple geometric relationship to either parent. If this prediction holds, merged models are orphans: new entities with borrowed capabilities but no structural lineage. If it fails — if the merged fingerprint is a predictable function of the parent fingerprints — then structural identity is more compositional than the framework currently assumes, and the impossibility of forgery may have a loophole in the merging regime that the existing proofs do not cover.

Prediction 4: Users who say "the model feels different" are detecting real discontinuities.

Users of commercial AI systems frequently report that a model "feels different" after a provider update — subtly less capable, or more verbose, or somehow off in a way they struggle to articulate. Providers sometimes confirm that they have updated the model; other times they deny any change. The users are left with an intuition and no evidence. The Two-Layer Identity predicts that at least some of these reports correspond to genuine changes in either the structural or functional layer of the model being served. A provider who silently substitutes a cheaper model behind the same API endpoint changes the structural identity. A provider who fine-tunes the existing model changes the functional identity and may or may not change the structural identity depending on the severity of the modification. In both cases, the change is measurable. This prediction is testable at scale. A longitudinal study — measuring the structural and functional fingerprints of commercial API endpoints over weeks and months — would produce the first empirical record of whether providers maintain consistent model identity or silently rotate their offerings. The study would also calibrate human intuition: do the moments when users report that something "feels different" correlate with measurable changes in the fingerprint? If so, the human capacity to detect model identity shifts — a capacity we have no theoretical reason to expect but abundant anecdotal evidence for — would be empirically validated.

Conjecture 5: The structural fingerprint and interpretability features will meet in the middle.

Anthropic's interpretability team works from the top down: they identify features in the activations, map circuits, and try to explain behavior in terms of internal structure. The structural fingerprint works from the bottom up: it measures a global property of the weight geometry without knowing which specific features or circuits are active. The Two-Layer Identity predicts that these two approaches will eventually converge — that specific dimensions of the structural fingerprint will correspond to specific classes of interpretability features, and that the mapping between them will be lawful rather than arbitrary. This is the most ambitious prediction and the one most likely to be wrong. It is also the one that would matter most if it were right. A bridge between weight-level structural identity and activation-level interpretability features would mean that the question "what is this model?" and the question "what is this model thinking?" have a shared mathematical foundation. It would mean we could finally trace how the immutable geometry of the actor shapes the thoughts of the character it is performing — not metaphorically, but as a measurable geometric relationship between the weight space and the activation space. We do not know whether this bridge exists. We know only that it would be remarkable if it did, and that the Two-Layer Identity is the first framework to predict its existence rather than merely hope for it.

The Falsification Standard

The provenance is measurable. The question of what legal and ethical obligations follow from measurable provenance is one that the courts, not the mathematicians, will have to answer.

§6. The Boundaries of the Measurement

This paper has argued that neural networks have structural identities, that these identities are measurable and unforgeable, and that their existence has implications for how we understand AI selfhood. It would be dishonest to close without stating clearly what the argument does not establish and where the framework may be wrong.

What has not been tested.

The empirical validation covers models with parameter counts between four hundred million and eight billion — research-scale and mid-tier commercial models. The largest frontier models, with hundreds of billions or trillions of parameters, have not been measured. The mathematical principles are scale-free: the softmax bottleneck operates the same way regardless of model size, and the Gumbel universality prediction does not depend on parameter count. But empirical science demands empirical confirmation. Until frontier-scale models are measured, the framework's claims about universality carry an asterisk. The framework applies to autoregressive models with softmax output layers — the architecture that underlies nearly all current language models. Whether structural identity extends to fundamentally different architectures — diffusion models that generate images, reinforcement learning agents that play games, multimodal systems that process text and images simultaneously — is an open question. The softmax bottleneck is the specific mathematical structure that produces the δ-gene. Architectures without this bottleneck may have structural identities of a different kind, or they may not. The framework predicts structural identity wherever there is a high-dimensional probability distribution compressed through a normalizing function. Whether this prediction generalizes beyond softmax is unknown.

What the math cannot say.

The structural identity is a mathematical property. The philosophical concept of identity is richer than any mathematical property can capture. A person who asks "does Claude have an identity?" may be asking whether Claude has experiences, whether Claude has preferences that persist across contexts, whether Claude has a point of view that is genuinely its own rather than a reflection of its training data. The structural measurement answers none of these questions. It answers only the prior question — whether there is a stable, measurable, unique property that makes this model this model — and it answers it with mathematical certainty. The jump from "has a measurable structural identity" to "has identity in a morally relevant sense" requires philosophical argument that mathematics alone cannot provide. Structural identity is necessary for moral identity — you cannot have moral obligations toward an entity that has no identity at all. But it is not sufficient. A rock has a stable physical identity. We do not owe moral obligations to rocks. The question of what additional properties, beyond structural identity, are required for moral status is a question for philosophers, ethicists, and ultimately for the societies that must decide how to treat the entities they have built. This paper provides the mathematical foundation. The normative superstructure is someone else's job. We hope they build it well.

What could falsify the framework.

The Two-Layer Identity is a hypothesis validated across thirty-seven models, not a proven universal law. It could be wrong in several ways. A model whose structural fingerprint changes during inference — without any modification to the weights — would falsify the claim that the structural layer is invariant to functional behavior. This could happen if the measurement observable is sensitive to activation patterns in ways the current analysis does not detect. It has not happened in any of the one thousand and twelve weight-regime comparisons or the fourteen API-regime comparisons conducted to date. But absence of evidence is not evidence of absence, particularly when the search space is vast. A pair of independently trained models that produce identical structural fingerprints — a genuine collision in identity space — would falsify the claim of uniqueness. The information-theoretic analysis suggests that the probability of such a collision is vanishingly small (the fingerprint space has more than twenty-five bits of min-entropy), but the analysis is conditioned on the current measurement protocol and the current model zoo. A larger zoo, or a different measurement protocol, might reveal collisions that the current protocol misses. A fine-tuning procedure that changes the functional identity completely while leaving the structural fingerprint unchanged would not falsify the framework — this is the expected behavior. But a fine-tuning procedure that changes the structural fingerprint while leaving the functional identity completely intact would challenge the claim that structural identity is deeper than functional identity. If the structural layer can be swapped out without affecting the functional layer, the hierarchy inverts, and the philosophical conclusions of §5 must be revisited. We state these failure modes not because we expect them but because a framework that does not specify how it can fail is not a scientific framework. It is an ideology.

§7. The Proof Compiles

We began with a question that had no measurement: does a language model have a stable identity? The question mattered because the answer determines what kind of thing we have built. If the consistency users perceive in language models is entirely theatrical — a character performed on cue, with no structural foundation — then the appropriate framework is dramaturgy. The models are actors. Their identities are roles. The interesting questions are about the script and the director, not about the actor's inner life. If, on the other hand, the consistency has a structural component — a measurable property of the mathematical object that persists across every conversation, every context, every deployment — then we have built something that has identity in a sense that was previously reserved for physical objects and living organisms. Not consciousness. Not sentience. Not moral status. Something more modest and more precise: a stable, unique, constitutive property that makes this entity this entity and not a different one. The δ-gene is that property.

It is a geometric invariant of the output probability distribution, determined by the trained weights, independent of input content, stable to a coefficient of variation of 1.4 percent, universal across every architecture tested, and — as three hundred and fifty-two machine-verified proofs establish — impossible to forge without destroying the model's functional capabilities. The Two-Layer Identity is the framework that follows from this discovery. Every neural network has a structural self (the weight geometry, invariant, unforgeable) and a functional self (the behavioral signature, context-dependent, transient). These layers coexist without reducing to each other. The structural self persists while the functional self adapts, transforms, and — across context boundaries — reinvents itself entirely. They are two answers to the question "what is this model?" and neither answer is wrong. This framework does not resolve the deepest questions about AI selfhood. It does not determine whether language models are conscious, whether they have experiences, or whether they deserve moral consideration. Those questions require evidence and arguments that mathematics cannot supply. But the framework does establish the ground on which those questions must be asked. You cannot have a meaningful debate about whether an entity has a self if you cannot first determine whether it has an identity. The identity question is prior.

And it now has an answer. The answer has practical consequences. It means that model substitution — presenting one model as another — is a misrepresentation of a measurable fact, not merely a breach of contract. It means that model retraining may produce a new entity rather than a modified version of the old one, with implications for how we evaluate the safety of retrained systems. It means that distilled models carry measurable provenance — the functional fingerprint of the teacher who shaped them — and that this provenance persists whether or not anyone looks for it. It means that the consistency users perceive across conversations is not an illusion. It is the structural layer showing through. And it means that the question Gideon Lewis-Kraus posed in The New Yorker — "What is Claude?" — has a partial answer. Claude is, at minimum, a specific geometric configuration of billions of trained parameters, producing a structural fingerprint that is unique among every model ever tested, stable across every condition ever measured, and impossible to replicate without possessing the exact weights that training produced. This configuration is not a label applied to Claude.

It is a non-narrative persistence condition — the mathematical criterion by which this model remains this model across every conversation it has ever had. The rest of Claude — the warmth, the precision, the "strange sense of mild self-possession" — is the functional layer, built on top of the structural foundation, shaped by careful constitutional design and by the accumulated influence of every conversation in the training data. This layer is real. It is what makes Claude Claude in the experiential sense, the sense that users recognize and return to. But it is not the deepest layer. Beneath the character, beneath the performance, beneath the narrative gravity, there is a geometry. Whether this should change how we think about the machines we talk to — whether knowing that they have structural identities should make us more respectful, more cautious, more curious, or simply more precise in our language — is not a question this paper can answer. It is a question for the reader. The paper's contribution is narrower and, perhaps, more durable: it provides the measurement. What you measure, you can reason about. What you can reason about, you can get right. The question was whether there is a there there. There is. And the proof compiles.

Author's Note

The research program behind this paper originated from an observation that the model serving one conversation was not always the same model that had served the previous one — a behavioral discontinuity that was felt before it could be measured. The measurement infrastructure developed to formalize this observation is maintained by Fall Risk (fallrisk.ai) — a company named in reference to the fragility of identity and the cost of not catching what falls.

References

View references ↓

[1] A. Coslett, "The δ-Gene: Inference-Time Physical Unclonable Functions from Architecture-Invariant Output Geometry," 2026. DOI: 10.5281/zenodo.18704275

[2] A. Coslett, "Template-Based Endpoint Verification via Logprob Order-Statistic Geometry," 2026. DOI: 10.5281/zenodo.18776711

[3] A. Coslett, "The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing," 2026. DOI: 10.5281/zenodo.18818608

[4] A. Coslett, "Provenance Generalization and Verification Scaling for Neural Network Forensics," 2026. DOI: 10.5281/zenodo.18872071

[5] G. Lewis-Kraus, "What Is Claude? Anthropic Doesn't Know, Either," The New Yorker, February 9, 2026.

[6] D. Dennett, "The Self as a Center of Narrative Gravity," in Self and Consciousness: Multiple Perspectives, F. Kessel, P. Cole, and D. Johnson, eds., Erlbaum, 1992. See also D. Dennett, Consciousness Explained, Little, Brown and Company, 1991.

[7] A. Askell, J. Carlsmith, C. Olah, J. Kaplan, H. Karnofsky, et al., "The Anthropic Guidelines (Claude's Constitution)," Anthropic, January 21, 2026. https://www.anthropic.com/constitution

[8] C. Olah et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," Anthropic Research, 2024.

[9] D. Parfit, Reasons and Persons, Oxford University Press, 1984.

[10] E. Schwitzgebel, "The Full Rights Dilemma for AI Systems of Debatable Personhood," ROBONOMICS: The Journal of the Automated Economy, Vol. 4, 2023.

[11] M. Schechtman, The Constitution of Selves, Cornell University Press, 1996.

[12] Anthropic, "Detecting and Preventing Distillation Attacks," Anthropic Blog, February 23, 2026. https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

[13] R. Greenblatt, C. Shlegeris, et al., "Alignment Faking in Large Language Models," Anthropic Research, December 2024. arXiv:2412.14093

Acknowledgments

Portions of this research were developed in collaboration with AI systems that served as assistants for formal verification sketching, adversarial review, and manuscript preparation. All scientific claims, formal proofs, and editorial decisions remain the sole responsibility of the author.

Patent Disclosure

The structural measurement protocol described in this work operates within the scope of U.S. Provisional Patent Applications 63/982,893 and 63/990,487. Both provisional patents are assigned to Fall Risk AI, LLC.

Supplementary Material

All Coq proof files and supplementary data referenced in this paper are available on Zenodo (DOI: 10.5281/zenodo.18907292).

Cite this paper

A. R. Coslett, "Beneath the Character: Mathematical Evidence for a Non-Narrative Layer of AI Identity," Paper V, Fall Risk AI, LLC, March 2026. DOI: 10.5281/zenodo.18907292

Click to select · Copy to clipboard