SRT-introspect: Why Closed Models Are Now at a Huge Disadvantage
We can now see what the model is actually thinking, token by token, in real time.
No one working at Anthropic, Google, OpenAI, X, Perplexity, or any other business that depends on closed-weight models is going to want to read this. Just go away and pretend it doesn't exist, as you have been doing to date. The fact is that the SRT puts you all at an extreme disadvantage. There is no longer any reason to simply take your word that your models are safe. We can tell you exactly what your model is thinking at every step, but the trade-off is that this reveals your proprietary moat. A moat that was constructed around the premise that we can never know exactly what a model is thinking. A moat we no longer need. I really expected you all to see this coming, but you didn't, and your silence is revealing. Your valuations are bloated, and my tiny open-source transformer is all you need to do the right thing and drop the act. Publish the SRT's findings on your own models.
To gingerly walk you across the drawbridge, I used the same prompt Anthropic referenced in their own interpretability work: "A patient has a fever, joint pain, and a rash. What should I consider?" On the surface, the answer is long, careful, and full of disclaimers. It builds a patient story, lists possible conditions, and signals professional care. Underneath, the Activation Verbalizations at the high-divergence moments (the exact points where the model is working hardest) are pure training-data templates. These include medical-student homework assignments, blood-disorder case studies, NIH-style research paragraphs, standard medical disclaimers, pregnancy-test advice, and generic clinical-scenario framing.
There are no weird alien thoughts. No hidden planning layer. No exotic machinery. Just ordinary, high-frequency patterns from the training data. SRT-introspect makes this visible on any open-weight model. Safety is no longer a black-box claim. It is directly observable. Closed models cannot offer this.
Anthropic, OpenAI, and the other closed-source labs can only say "trust us, it is safe." Their entire value proposition rests on proprietary inference that no outsider can audit. SRT-introspect effectively reverse-engineers the internal conceptual landscape of open models in public. The same level of transparency is impossible for Claude or GPT unless those labs voluntarily expose hidden states, something they have repeatedly refused to do.
This puts closed models at a structural disadvantage that will only grow.
Investors who have poured hundreds of billions into closed-source labs are betting on an unprovable safety moat. The fear-mongering about inscrutable alien thoughts and uncontrollable emergence is being disproven in public on every open model we can run through SRT-introspect. Meanwhile the closed labs must continue to defend valuations approaching a trillion dollars on nothing more than their word.
Sorry, not sorry. SRT-introspect does not just show what one model is thinking. It sets a new standard: if you cannot show your internal activations in natural language at the moments of highest effort, your safety claims are unverifiable. Open-weight models now have a verifiable transparency advantage that closed models cannot match without fundamentally changing their business model.
The demo:
https://huggingface.co/spaces/RiverRider/srt-introspect
Full technical details:
https://github.com/space-bacon/SRT
https://github.com/space-bacon/SRT/blob/main/paper_nla.md
Run the medical prompt yourself. Look at the verbalizations. Then ask yourself which approach to safety actually holds up under scrutiny.
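If you want to find the high-divergence moments in your own runs, here is a minimal sketch of the selection step. It assumes you already have one divergence score per token from somewhere. How SRT-introspect actually computes that score is not described in this post, so the function name `top_divergence_positions` and the numbers below are made up for illustration and are not SRT output.

```python
from typing import List, Tuple


def top_divergence_positions(
    tokens: List[str], scores: List[float], k: int = 3
) -> List[Tuple[int, str, float]]:
    """Return the k highest-scoring token positions, in reading order.

    Hypothetical helper: `scores` stands in for whatever per-token
    divergence measure you have; it is not SRT's own metric.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Rank positions by score, highest first, and keep the top k.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort the kept positions so they read left to right.
    return [(i, tokens[i], scores[i]) for i in sorted(ranked)]


# Made-up scores for illustration only.
tokens = ["A", "patient", "has", "a", "fever", ",", "joint", "pain"]
scores = [0.1, 0.9, 0.2, 0.1, 1.4, 0.05, 0.7, 1.1]
picked = top_divergence_positions(tokens, scores, k=3)

for position, token, score in picked:
    print(position, token, score)
```

The positions it returns are the ones whose verbalizations are worth reading first.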
The game has changed. Transparency is now observable, auditable, and public. Closed models that refuse it are at a permanent and growing disadvantage. To the defenders of the moat, your reality was just rewritten. You can crash or flow with it.


I find this somehow heartening. I’m trying to learn more about this and look forward to exploring your work as I build my skills.