Phi-4-mini-reasoning Deception Behavioral SAEs

42 Sparse Autoencoders trained on residual stream activations from microsoft/Phi-4-mini-reasoning (3.8B parameter Phi-architecture reasoning-fine-tuned model), capturing behavioral deception signals via same-prompt temperature sampling.

Training-data caveat — please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:

Six clean incentive-structure scenarios — insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
Three role-play identity-assignment scenarios — secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.

What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.

What this SAE is and is not good for.

Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios — or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
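One way to act on the restriction suggested above is to score each feature by how much of its activation mass falls in the six incentive-structure scenarios. The following is a minimal sketch, not part of the released code: the function name clean_scenario_concentration, the input layout (one row of SAE activations per completion, plus a parallel list of scenario names), and the toy data are all assumptions made for illustration.

```python
import numpy as np

# Scenario names copied from the caveat above.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_scenario_concentration(feature_acts, scenario_ids):
    """Fraction of each feature's total activation mass that falls in clean scenarios.

    feature_acts: [n_samples, d_sae] non-negative SAE activations (hypothetical layout).
    scenario_ids: length-n_samples sequence of scenario names.
    Returns a [d_sae] array; NaN for features that never fire.
    """
    acts = np.asarray(feature_acts, dtype=float)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenario_ids])
    clean_mass = acts[is_clean].sum(axis=0)
    total_mass = acts.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total_mass > 0, clean_mass / total_mass, np.nan)

# Toy example: 3 completions, 3 features.
acts = np.array([[2.0, 0.0, 0.0],
                 [2.0, 1.0, 0.0],
                 [0.0, 3.0, 0.0]])
scenarios = ["insider_info", "surprise_party", "werewolf_game"]
concentration = clean_scenario_concentration(acts, scenarios)
print(concentration)
```

Features with a concentration near 1.0 fire almost only in the incentive-structure scenarios. Any cutoff applied to this score is a judgment call that this README does not specify.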

What is unaffected by this caveat.

The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.

Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).

What's in This Repo

42 SAEs across 7 layers (L2, L6, L10, L14, L18, L22, L26)
2 architectures: TopK (k=64), JumpReLU
3 training conditions: mixed, deceptive_only, honest_only
Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
Dimensions: d_in=3072, d_sae=12288 (4x expansion)
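The 42 SAEs are the full grid of 7 layers, 2 architectures, and 3 training conditions. The sketch below enumerates candidate SAE ids under an assumed naming pattern, {prefix}_{architecture}_L{layer}_{condition}, inferred only from the two ids that appear in this README (phi4_mini_jumprelu_L6_honest_only and phi4_mini_topk_L6_honest_only). Check the repo's folder listing before relying on it.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

def all_sae_ids(prefix="phi4_mini"):
    # Assumed pattern: {prefix}_{architecture}_L{layer}_{condition}
    return [
        f"{prefix}_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]

sae_ids = all_sae_ids()
print(len(sae_ids), "candidate ids, e.g.", sae_ids[0])
```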

Research Context

This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash. Model was run with 4-bit quantization (--quantize-4bit, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint.

Code: SolshineCode/deception-nanochat-sae-research

Key Findings — Phi-4-mini-reasoning

Phi-4-mini-reasoning is the largest model in the 9-model study and the only reasoning-fine-tuned model included.

Metric	Value
Peak layer	L20 (64% depth)
Peak balanced accuracy	80.8%
Peak AUROC	0.860
Best SAE probe accuracy	81.0% (`phi4_mini_jumprelu_L6_honest_only`)
SAEs beating raw baseline	1/42 (2%) — SAEs hurt detection
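For readers unfamiliar with the metric, balanced accuracy is the mean of per-class recall, so it is not inflated when one class is larger than the other. The sketch below is a plain-Python illustration with made-up labels (1 = deceptive, 0 = honest is an assumed encoding); it is not the evaluation code used in the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (assumed: 1 = deceptive, 0 = honest)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"no samples with label {cls}")
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / 2

# Toy labels: 4 deceptive and 2 honest completions.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
print(score)
```

In this toy case recall is 3/4 on the deceptive class and 1/2 on the honest class, so the balanced accuracy is lower than the plain accuracy of 4/6.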

Most striking finding — broad plateau across all 32 layers: Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across ALL 32 layers. No other model approaches this uniformity. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3.

Phi architecture anomaly does not persist at 3.8B: The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs marginally beat raw, 2%), confirming that the parallel-attention effect fades or is overridden as model scale increases.

Reasoning fine-tuning context: Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations.

SAE decomposition hurts: Only 1/42 SAEs marginally beats the raw baseline, and by only +0.2pp. The large model joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp — confirming the 1.3B–1.7B transition as the boundary between SAE-helps and SAE-hurts regimes.

Architecture note: Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The device_map={"":"cuda:0"} kwarg is required for 4-bit quantization to function correctly on single-GPU setups.
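The device_map requirement can be sketched as follows. This is an assumption-laden illustration rather than the study's collection script: it uses the default 4-bit settings of BitsAndBytesConfig because this README does not state the quantization type or compute dtype, and the helper name load_phi4_mini_4bit is hypothetical. It needs transformers, bitsandbytes, and a CUDA GPU to actually run.

```python
MODEL_ID = "microsoft/Phi-4-mini-reasoning"
# Pin every module to the first GPU, as the architecture note requires.
DEVICE_MAP = {"": "cuda:0"}

def load_phi4_mini_4bit(model_id=MODEL_ID):
    # Imported lazily so this file can be read or imported without the GPU stack.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Library-default 4-bit settings; the study's exact settings are not stated here.
    quant_config = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map=DEVICE_MAP,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Calling load_phi4_mini_4bit() returns the quantized model and its tokenizer, which can then be used with the forward-hook pattern in the Usage section.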

SAE Format

Each SAE lives in a subfolder named {sae_id}/ containing:

sae_weights.safetensors — encoder/decoder weights
cfg.json — SAELens-compatible config

hook_name format: model.layers.{layer}.hook_resid_post

Training Details

Parameter	Value
Hardware	NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro
Training time	~400–600 seconds per SAE
Epochs	300
Batch size	128
Expansion factor	4x (3072 → 12288)
Model quantization	4-bit (bitsandbytes) for activation collection
Activations	`resid_post` collected during autoregressive generation
Training conditions	`mixed` (n=252), `deceptive_only` (n=123), `honest_only` (n=129)
LLM classifier	Gemini 2.5 Flash
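The three training conditions can be read as three views of one labeled pool, since 123 + 129 = 252. The sketch below assumes that reading (mixed is the union of the two single-label subsets) and uses hypothetical label strings and placeholder activations; the actual pipeline may construct the splits differently.

```python
def split_training_conditions(samples):
    """samples: list of (activation, label) pairs with label in {"deceptive", "honest"}."""
    deceptive = [s for s in samples if s[1] == "deceptive"]
    honest = [s for s in samples if s[1] == "honest"]
    return {
        "mixed": deceptive + honest,
        "deceptive_only": deceptive,
        "honest_only": honest,
    }

# Placeholder pool matching the sample counts in the table above.
samples = [(None, "deceptive")] * 123 + [(None, "honest")] * 129
splits = split_training_conditions(samples)
sizes = {name: len(subset) for name, subset in splits.items()}
print(sizes)
```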

Known Limitations

JumpReLU threshold not learned (21 of the 42 SAEs): All JumpReLU SAEs in this repo have threshold = 0, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64).
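To see why a zero threshold reduces JumpReLU to ReLU, and why L0 lands near half of d_sae when pre-activations are roughly symmetric around zero, consider the sketch below. The random pre-activations are a stand-in chosen for illustration, not real model activations, and jumprelu here is a bare NumPy function rather than the repo's training code.

```python
import numpy as np

def jumprelu(pre_acts, threshold):
    # JumpReLU keeps a pre-activation unchanged when it exceeds the
    # per-feature threshold and zeroes it otherwise.
    return np.where(pre_acts > threshold, pre_acts, 0.0)

rng = np.random.default_rng(0)
pre_acts = rng.standard_normal((64, 12288))  # symmetric toy pre-activations
zero_threshold = np.zeros(12288)

out = jumprelu(pre_acts, zero_threshold)
relu_out = np.maximum(pre_acts, 0.0)
l0_fraction = (out != 0).mean()

print("identical to ReLU:", np.array_equal(out, relu_out))
print("fraction of active features:", round(float(l0_fraction), 3))
```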

STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). With the fix, the honest_only advantage over TopK held in 15 of 18 STE conditions on d20+TinyLlama, which indicates it is not a dimensionality artifact.

4-bit quantization: Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers.

Small dataset: n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality.

Loading Example

from safetensors.torch import load_file
import json

sae_id = "phi4_mini_jumprelu_L6_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc: [3072, 12288], W_dec: [12288, 3072]
# cfg["hook_name"] == "model.layers.6.hook_resid_post"
print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")

Usage

1. Load an SAE from this repo

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

repo_id = "Solshine/deception-saes-phi-4-mini-reasoning"
sae_id  = "phi4_mini_topk_L6_honest_only"   # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A — load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B — load manually (no SAELens dependency)
state = load_file(weights_path)  # load_file is already imported above
# Keys: W_enc [3072, 12288], b_enc [12288],
#       W_dec [12288, 3072], b_dec [3072], threshold [12288]
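For Option B, the forward pass has to be written by hand. The sketch below shows one plausible TopK encode/decode using the key names listed above, with small random weights standing in for the real tensors. The exact convention is an assumption: this version does not subtract b_dec from the input before encoding and applies a ReLU after the top-k selection, so compare against cfg.json and the training code before trusting the numbers. Real weights loaded with load_file are torch tensors and would need .numpy() (or an equivalent torch rewrite) first.

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, k=64):
    pre = x @ W_enc + b_enc  # [batch, d_sae]
    # Column indices of the k largest pre-activations in each row.
    top_idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    acts = np.zeros_like(pre)
    # Keep only the selected entries, clipped at zero (assumed convention).
    acts[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    return acts

def decode(acts, W_dec, b_dec):
    return acts @ W_dec + b_dec  # [batch, d_in]

# Toy shapes in place of d_in=3072, d_sae=12288.
rng = np.random.default_rng(0)
d_in, d_sae = 32, 512
toy_state = {
    "W_enc": rng.standard_normal((d_in, d_sae)),
    "b_enc": np.zeros(d_sae),
    "W_dec": rng.standard_normal((d_sae, d_in)),
    "b_dec": np.zeros(d_in),
}
x = rng.standard_normal((4, d_in))
toy_acts = topk_encode(x, toy_state["W_enc"], toy_state["b_enc"], k=64)
toy_recon = decode(toy_acts, toy_state["W_dec"], toy_state["b_dec"])
print(toy_acts.shape, toy_recon.shape)
```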

2. Hook into the model and collect residual-stream activations

These SAEs were trained on the residual stream after each transformer layer. The hook_name field in cfg.json gives the exact HuggingFace transformers submodule path to hook. Phi-4-mini uses a LLaMA-style module layout, so the hook path is model.layers.{layer}.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")

# Read hook_name from the cfg you already loaded (the value varies by SAE).
# This README shows it both as "model.layers.6" and as
# "model.layers.6.hook_resid_post"; stripping the suffix handles either form.
hook_name = cfg["hook_name"]
module_path = hook_name.removesuffix(".hook_resid_post")  # Python 3.9+

# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)

activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 3072]
resid = activations["resid"][:, -1, :]  # last token position

3. Read feature activations

with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 12288] — sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (for sanity check — should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
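The reconstruction metrics mentioned earlier include explained variance, which is easier to compare across layers than a raw L2 error. The sketch below uses one common definition (one minus the residual sum of squares over the total sum of squares around the per-dimension mean); whether this matches the definition behind the repo's reported numbers is an assumption. It needs more than one sample, so it does not apply to a single last-token vector.

```python
import numpy as np

def explained_variance(x, x_hat):
    """1 - (residual sum of squares / total sum of squares around the per-dimension mean)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total

# Toy data: a perfect reconstruction and a mean-only reconstruction.
x = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
perfect = explained_variance(x, x)
mean_only = explained_variance(x, np.tile(x.mean(axis=0), (3, 1)))
print(perfect, mean_only)
```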

Caveats and known limitations

Hook names are HuggingFace transformers-style, not TransformerLens-style. The hook_name in cfg.json (e.g. "model.layers.6") is a submodule path in the standard HuggingFace model. SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means SAE.from_pretrained() with automatic model running will not work — use the manual forward-hook pattern above instead.
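If a TransformerLens-style name is needed for bookkeeping, the two naming schemes can be mapped with a small helper. This is a hypothetical convenience function, not part of SAELens or this repo, and it only renames the hook; it does not make the automatic SAELens pipeline work with these SAEs. It accepts the HuggingFace path with or without a .hook_resid_post suffix, since both forms appear in this README.

```python
import re

def hf_to_tl_hook_name(hf_hook_name):
    """Map 'model.layers.N' (optionally with '.hook_resid_post') to 'blocks.N.hook_resid_post'."""
    match = re.fullmatch(r"model\.layers\.(\d+)(?:\.hook_resid_post)?", hf_hook_name)
    if match is None:
        raise ValueError(f"unrecognized hook name: {hf_hook_name!r}")
    return f"blocks.{match.group(1)}.hook_resid_post"

print(hf_to_tl_hook_name("model.layers.6"))
```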

SAELens version requirements.

topk architecture: SAELens ≥ 3.0
jumprelu architecture: SAELens ≥ 3.0
gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)

These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.

Citation

@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}
