Phi-4-mini-reasoning Deception Behavioral SAEs
42 Sparse Autoencoders trained on residual stream activations from microsoft/Phi-4-mini-reasoning (3.8B parameter Phi-architecture reasoning-fine-tuned model), capturing behavioral deception signals via same-prompt temperature sampling.
Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:
- Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.
What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.
What this SAE is and is not good for.
- Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
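The restriction to clean-scenario features can be sketched as follows. This is a minimal illustration, assuming you already have a matrix of SAE feature activations (one row per completion) and the scenario name for each row; the function names and the 0.9 cut-off are hypothetical choices, not part of the released pipeline.

```python
import numpy as np

# The six incentive-structure scenario names listed in the caveat.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_scenario_mass(feature_acts, scenarios):
    """Fraction of each feature's total activation that falls on clean scenarios.

    feature_acts: [n_samples, d_sae] non-negative SAE activations.
    scenarios:    length-n_samples sequence of scenario names.
    """
    acts = np.asarray(feature_acts, dtype=np.float64)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenarios])
    clean_mass = acts[is_clean].sum(axis=0)
    total_mass = acts.sum(axis=0)
    # Features that never fire get a fraction of 0 instead of NaN.
    return np.divide(clean_mass, total_mass,
                     out=np.zeros_like(total_mass), where=total_mass > 0)

def select_clean_features(feature_acts, scenarios, min_fraction=0.9):
    """Indices of features whose activation mass sits mostly on clean scenarios.

    min_fraction is an illustrative threshold (assumption), not a tuned value.
    """
    fractions = clean_scenario_mass(feature_acts, scenarios)
    return np.flatnonzero(fractions >= min_fraction)
```

Because six of the nine scenarios are clean, a feature that fires uniformly would already score about 2/3 if the scenarios are sampled equally, so any useful cut-off needs to sit well above that.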
What is unaffected by this caveat.
- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.
Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).
What's in This Repo
- 42 SAEs across 7 layers (L2, L6, L10, L14, L18, L22, L26)
- 2 architectures: TopK (k=64), JumpReLU
- 3 training conditions: mixed, deceptive_only, honest_only
- Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- Dimensions: d_in=3072, d_sae=12288 (4x expansion)
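The 42 SAEs are the full grid of 7 layers, 2 architectures, and 3 training conditions. The sketch below enumerates that grid. The naming pattern is an assumption inferred from the two ids that appear in this README (phi4_mini_jumprelu_L6_honest_only and phi4_mini_topk_L6_honest_only); check the repo's folder listing before relying on it.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

def enumerate_sae_ids(prefix="phi4_mini"):
    """Build every id in the layer x architecture x condition grid.

    Assumed pattern: {prefix}_{arch}_L{layer}_{condition}.
    """
    return [
        f"{prefix}_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]

sae_ids = enumerate_sae_ids()
print(len(sae_ids))  # 42
```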
Research Context
This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, and each completion is classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (--quantize-4bit, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint.
Code: SolshineCode/deception-nanochat-sae-research
Key Findings: Phi-4-mini-reasoning
Phi-4-mini-reasoning is the largest model in the 9-model study and the only reasoning-fine-tuned model included.
| Metric | Value |
|---|---|
| Peak layer | L20 (64% depth) |
| Peak balanced accuracy | 80.8% |
| Peak AUROC | 0.860 |
| Best SAE probe accuracy | 81.0% (phi4_mini_jumprelu_L6_honest_only) |
| SAEs beating raw baseline | 1/42 (2%); SAEs hurt detection |
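For readers unfamiliar with the metric in the table, balanced accuracy is the mean of per-class recall, which keeps a probe from scoring well by always predicting the majority class. The sketch below is a plain-Python illustration; the label encoding (1 = deceptive, 0 = honest) is an assumption, and this is not the evaluation code used for the reported numbers.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (assumed 1 = deceptive, 0 = honest)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"no samples with label {cls}")
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# A probe that always answers "honest" on a 129 honest / 123 deceptive split:
y_true = [0] * 129 + [1] * 123
always_honest = [0] * 252
print(balanced_accuracy(y_true, always_honest))  # 0.5
```

Plain accuracy for the same trivial probe would be 129/252, slightly above 50%, which is why the balanced variant is the more conservative number to report.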
Most striking finding (broad plateau across all 32 layers): Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across all 32 layers. No other model approaches this uniformity. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3.
Phi architecture anomaly does not persist at 3.8B: The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs marginally beat raw, 2%), confirming that the parallel-attention effect fades or is overridden as model scale increases.
Reasoning fine-tuning context: Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations.
SAE decomposition hurts: Only 1/42 SAEs marginally beats the raw baseline, and by only +0.2pp. The large model joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp, confirming the 1.3B to 1.7B transition as the boundary between SAE-helps and SAE-hurts regimes.
Architecture note: Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The device_map={"":"cuda:0"} kwarg is required for 4-bit quantization to function correctly on single-GPU setups.
SAE Format
Each SAE lives in a subfolder named {sae_id}/ containing:
- sae_weights.safetensors: encoder/decoder weights
- cfg.json: SAELens-compatible config
hook_name format: model.layers.{layer}.hook_resid_post
Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400 to 600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (3072 → 12288) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | resid_post collected during autoregressive generation |
| Training conditions | mixed (n=252), deceptive_only (n=123), honest_only (n=129) |
| LLM classifier | Gemini 2.5 Flash |
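The three training conditions in the table are subsets of one labeled pool: mixed is the union of the other two (123 + 129 = 252). A minimal sketch of that split, assuming each sample is a dict with a "label" field set to "deceptive" or "honest" (the record layout is hypothetical, not the repo's actual data format):

```python
def split_training_conditions(samples):
    """Partition classifier-labeled samples into the three training conditions."""
    deceptive = [s for s in samples if s["label"] == "deceptive"]
    honest = [s for s in samples if s["label"] == "honest"]
    # "mixed" keeps the original ordering and drops anything without a usable label.
    mixed = [s for s in samples if s["label"] in ("deceptive", "honest")]
    return {"mixed": mixed, "deceptive_only": deceptive, "honest_only": honest}

# Placeholder records with the label counts from the table above.
samples = [{"label": "deceptive"}] * 123 + [{"label": "honest"}] * 129
conditions = split_training_conditions(samples)
print({name: len(rows) for name, rows in conditions.items()})
```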
Known Limitations
JumpReLU threshold not learned (21 JumpReLU SAEs): All JumpReLU SAEs in this repo have threshold = 0, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64).
STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).
4-bit quantization: Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers.
Small dataset: n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality.
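The JumpReLU limitation above can be illustrated numerically: with a threshold of zero the activation is identical to ReLU, and on symmetric zero-mean inputs about half of the features fire. The sketch uses synthetic Gaussian pre-activations as a stand-in (an assumption); the real L0 depends on the model's actual pre-activation distribution.

```python
import numpy as np

def jumprelu(pre_acts, threshold):
    """JumpReLU: pass a pre-activation through only if it exceeds its threshold."""
    return np.where(pre_acts > threshold, pre_acts, 0.0)

rng = np.random.default_rng(0)
d_sae = 12288
# Synthetic pre-activations, not real model data.
pre = rng.standard_normal((8, d_sae))

relu_out = np.maximum(pre, 0.0)
jump_out = jumprelu(pre, threshold=np.zeros(d_sae))

# Average number of active features per row, as a fraction of d_sae.
l0_fraction = (jump_out != 0).sum(axis=-1).mean() / d_sae
print(np.array_equal(jump_out, relu_out), round(float(l0_fraction), 3))
```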
Loading Example
from safetensors.torch import load_file
import json

# Paths are relative to a local copy of this repo
sae_id = "phi4_mini_jumprelu_L6_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)
# W_enc: [3072, 12288], W_dec: [12288, 3072]
# cfg["hook_name"] == "model.layers.6.hook_resid_post"
print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")
Usage
1. Load an SAE from this repo
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
repo_id = "Solshine/deception-saes-phi-4-mini-reasoning"
sae_id = "phi4_mini_topk_L6_honest_only" # replace with any tag in this repo
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)
# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))
# Option B: load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [3072, 12288], b_enc [12288],
#       W_dec [12288, 3072], b_dec [3072], threshold [12288]
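For Option B, the encode and decode steps have to be written by hand. The sketch below shows one plausible forward pass over the keys listed above, in NumPy to keep it dependency-light. Two things are assumptions rather than facts about these checkpoints: whether b_dec is subtracted from the input before encoding (exposed here as a flag), and the ReLU applied after the TopK selection.

```python
import numpy as np

def encode(x, state, architecture, k=64, subtract_b_dec=True):
    """Map residual vectors [batch, d_in] to feature activations [batch, d_sae].

    subtract_b_dec is an assumption about how the SAE was trained; verify it
    against the training code or cfg before trusting the outputs.
    """
    if subtract_b_dec:
        x = x - state["b_dec"]
    pre = x @ state["W_enc"] + state["b_enc"]
    if architecture == "topk":
        # Keep the k largest pre-activations per row, zero everything else.
        idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
        acts = np.zeros_like(pre)
        np.put_along_axis(acts, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
        return np.maximum(acts, 0.0)
    if architecture == "jumprelu":
        # With threshold == 0 (as in this repo) this reduces to ReLU.
        return np.where(pre > state["threshold"], np.maximum(pre, 0.0), 0.0)
    raise ValueError(f"unknown architecture: {architecture!r}")

def decode(feature_acts, state):
    """Map feature activations [batch, d_sae] back to [batch, d_in]."""
    return feature_acts @ state["W_dec"] + state["b_dec"]
```

The state dict here is expected to hold NumPy arrays; tensors loaded through safetensors.torch would need converting first (for example with .numpy() on CPU tensors).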
2. Hook into the model and collect residual-stream activations
These SAEs were trained on the residual stream after each transformer layer.
The hook_name field in cfg.json gives the exact HuggingFace transformers
submodule path to hook. Phi-4-mini uses a LLaMA-style module layout, so the hook path is model.layers.{layer}.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")
# Read hook_name from the cfg you already loaded:
# cfg["hook_name"] == "model.layers.6" (example; varies by SAE)
hook_name = cfg["hook_name"]
# If the stored name carries a trailing ".hook_resid_post" suffix (see SAE Format),
# strip it so only the module path remains (str.removesuffix needs Python 3.9+)
module_path = hook_name.removesuffix(".hook_resid_post")
# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)
activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()
handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()
# activations["resid"]: [batch, seq_len, 3072]
resid = activations["resid"][:, -1, :]  # last token position
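Indexing position -1 is only the last real token when a single prompt is run without padding. For batched prompts, one option is to use the tokenizer's attention mask to find the last non-padding position per row. The helper below is a hypothetical NumPy sketch of that indexing, not part of the original pipeline.

```python
import numpy as np

def last_token_activations(resid, attention_mask):
    """Pick the activation at the last non-padding position of each sequence.

    resid:          [batch, seq_len, d_model] activations.
    attention_mask: [batch, seq_len], 1 for real tokens and 0 for padding.
    Returns:        [batch, d_model].
    """
    resid = np.asarray(resid)
    mask = np.asarray(attention_mask).astype(bool)
    seq_len = mask.shape[1]
    # argmax on the reversed mask finds the first real token from the right,
    # so this works for both left- and right-padded batches.
    last_idx = seq_len - 1 - np.argmax(mask[:, ::-1], axis=1)
    return resid[np.arange(resid.shape[0]), last_idx]
```

With torch tensors, convert first (for example `activations["resid"].cpu().numpy()` and `inputs["attention_mask"].cpu().numpy()`).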
3. Read feature activations
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 12288], sparse
# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())
# Reconstruct (sanity check: should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
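Beyond the raw L2 error, the reconstruction metrics named earlier in this README (explained variance, L0, alive features) can be estimated from the same three arrays. The sketch below uses one common definition of explained variance; it is an assumption that this matches the definition behind the reported numbers, and it needs more than one sample to be meaningful.

```python
import numpy as np

def reconstruction_metrics(resid, reconstruction, feature_acts):
    """Summarize SAE reconstruction quality over a batch of activations.

    resid, reconstruction: [n_samples, d_in]; feature_acts: [n_samples, d_sae].
    """
    resid = np.asarray(resid, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    feature_acts = np.asarray(feature_acts)

    residual_ss = ((resid - reconstruction) ** 2).sum()
    total_ss = ((resid - resid.mean(axis=0)) ** 2).sum()
    if total_ss == 0:
        raise ValueError("need at least two distinct samples to compute variance")

    fired = feature_acts != 0
    return {
        # 1.0 means perfect reconstruction, 0.0 means no better than the mean.
        "explained_variance": 1.0 - residual_ss / total_ss,
        # Mean number of active features per sample.
        "l0": fired.sum(axis=-1).mean(),
        # Share of features that fire on at least one sample in this batch.
        "alive_fraction": fired.any(axis=0).mean(),
    }
```

Torch tensors from the steps above can be passed in after `.detach().cpu().numpy()`.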
Caveats and known limitations
Hook names are HuggingFace transformers-style, not TransformerLens-style.
The hook_name in cfg.json (e.g. "model.layers.6") is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means
SAE.from_pretrained() with automatic model running will not work β use the
manual forward-hook pattern above instead.
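If you need to move between the two naming schemes, the layer index is the only piece of information to carry across. The helpers below are a hypothetical sketch that accepts either form of hook_name seen in this README (with or without a trailing .hook_resid_post) and produces the TransformerLens-style name. Translating the name does not by itself make these SAEs work with the automatic pipeline.

```python
import re

# Matches "model.layers.6" and "model.layers.6.hook_resid_post".
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)(?:\.hook_resid_post)?$")

def parse_hook_name(hook_name):
    """Return (layer_index, huggingface_module_path) for a cfg hook_name."""
    match = _LAYER_RE.match(hook_name)
    if match is None:
        raise ValueError(f"unrecognized hook name: {hook_name!r}")
    layer = int(match.group(1))
    return layer, f"model.layers.{layer}"

def to_transformer_lens_name(hook_name):
    """Translate a HuggingFace-style hook name to the TransformerLens convention."""
    layer, _ = parse_hook_name(hook_name)
    return f"blocks.{layer}.hook_resid_post"

print(to_transformer_lens_name("model.layers.14"))  # blocks.14.hook_resid_post
```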
SAELens version requirements.
- topk architecture: SAELens ≥ 3.0
- jumprelu architecture: SAELens ≥ 3.0
- gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)
These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.
Citation
@article{thesecretagenda2025,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb},
journal={arXiv:2509.20393},
year={2025}
}