Phi-4-mini-reasoning Deception Behavioral SAEs

42 Sparse Autoencoders trained on residual stream activations from microsoft/Phi-4-mini-reasoning (3.8B parameter Phi-architecture reasoning-fine-tuned model), capturing behavioral deception signals via same-prompt temperature sampling.

Training-data caveat: please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:

  • Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
  • Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.

What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.

What this SAE is and is not good for.

  • Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
  • Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
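One way to restrict attention to features that concentrate in the six incentive-structure scenarios is sketched below. It is a minimal illustration, not the study's method: the function names, the 0.9 cutoff, and the use of summed activation mass as the concentration measure are all assumptions, and it presumes you have already collected a [n_samples, d_sae] array of SAE activations with one scenario name per sample.

```python
import numpy as np

# The six clean incentive-structure scenarios named in this README.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_mass_fraction(feature_acts, scenario_ids):
    """Per-feature share of total activation mass that comes from clean scenarios.

    feature_acts: [n_samples, d_sae] non-negative SAE activations.
    scenario_ids: one scenario name per sample.
    Returns an array of shape [d_sae]; features that never fire get NaN.
    """
    acts = np.asarray(feature_acts, dtype=np.float64)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenario_ids])
    clean_mass = acts[is_clean].sum(axis=0)
    total_mass = acts.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(total_mass > 0, clean_mass / total_mass, np.nan)

def select_clean_features(feature_acts, scenario_ids, min_fraction=0.9):
    """Indices of features with at least min_fraction of their mass in clean scenarios.

    The 0.9 default is an arbitrary illustrative cutoff (assumption).
    """
    frac = clean_mass_fraction(feature_acts, scenario_ids)
    return np.flatnonzero(frac >= min_fraction)

# Toy data: 3 samples, 3 features.
toy_acts = np.array([[2.0, 0.0, 1.0],
                     [2.0, 0.0, 0.0],
                     [0.0, 0.0, 3.0]])
toy_scenarios = ["insider_info", "surprise_party", "werewolf_game"]
toy_selected = select_clean_features(toy_acts, toy_scenarios)
```

A feature that passes this filter still carries no guarantee of tracking emergent deception; the filter only removes features whose activation is dominated by the role-play scenarios.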

What is unaffected by this caveat.

  • The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
  • The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.


Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).

What's in This Repo

  • 42 SAEs across 7 layers (L2, L6, L10, L14, L18, L22, L26)
  • 2 architectures: TopK (k=64), JumpReLU
  • 3 training conditions: mixed, deceptive_only, honest_only
  • Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
  • Dimensions: d_in=3072, d_sae=12288 (4x expansion)
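The 42 SAEs are the full grid of 7 layers, 2 architectures, and 3 training conditions. The sketch below enumerates that grid. The ID pattern {prefix}_{arch}_L{layer}_{condition} is an assumption inferred from the two IDs that appear in this README (phi4_mini_jumprelu_L6_honest_only and phi4_mini_topk_L6_honest_only); check the repo's folder listing before relying on it.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

def all_sae_ids(prefix="phi4_mini"):
    """Enumerate candidate SAE folder names.

    ASSUMPTION: every ID follows {prefix}_{arch}_L{layer}_{condition},
    inferred from the example IDs in this README.
    """
    return [
        f"{prefix}_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]

sae_ids = all_sae_ids()
print(len(sae_ids))  # 7 layers x 2 architectures x 3 conditions
```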

Research Context

This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash. Model was run with 4-bit quantization (--quantize-4bit, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint.

Code: SolshineCode/deception-nanochat-sae-research

Key Findings: Phi-4-mini-reasoning

Phi-4-mini-reasoning is the largest model in the 9-model study and the only reasoning-fine-tuned model included.

| Metric | Value |
| --- | --- |
| Peak layer | L20 (64% depth) |
| Peak balanced accuracy | 80.8% |
| Peak AUROC | 0.860 |
| Best SAE probe accuracy | 81.0% (phi4_mini_jumprelu_L6_honest_only) |
| SAEs beating raw baseline | 1/42 (2%); SAEs hurt detection |
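Balanced accuracy is the mean of the per-class recalls, so a probe cannot score well by favouring the larger class. The sketch below computes it by hand and compares it with plain accuracy for a degenerate "always honest" predictor, using the class counts listed under Training Details (123 deceptive, 129 honest). The label encoding (1 = deceptive, 0 = honest) and the function name are illustrative assumptions; the study's own evaluation code may differ.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = deceptive, 0 = honest)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"class {cls} has no samples")
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Class counts from the Training Details section of this README.
y_true = [1] * 123 + [0] * 129
always_honest = [0] * len(y_true)  # degenerate predictor, for illustration only

plain_acc = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_honest)
print(f"plain accuracy={plain_acc:.3f}, balanced accuracy={bal_acc:.3f}")
```

With classes this close to balanced the two metrics differ only slightly, but balanced accuracy pins the chance level at exactly 50%.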

Most striking finding (broad plateau across all 32 layers): Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across ALL 32 layers. No other model approaches this uniformity. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3.

Phi architecture anomaly does not persist at 3.8B: The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs marginally beat raw, 2%), confirming that the parallel-attention effect fades or is overridden as model scale increases.

Reasoning fine-tuning context: Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations.

SAE decomposition hurts: Only 1/42 SAEs marginally beats the raw baseline, and by only +0.2pp. The large model joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp, confirming the 1.3B–1.7B transition as the boundary between SAE-helps and SAE-hurts regimes.

Architecture note: Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The device_map={"":"cuda:0"} kwarg is required for 4-bit quantization to function correctly on single-GPU setups.
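A minimal sketch of a single-GPU 4-bit load with the device_map kwarg mentioned above follows. It uses the default bitsandbytes 4-bit settings; the quantization type and compute dtype actually used in the study are not stated in this README, so treat those as unknown. The helper names are hypothetical, and calling the loader requires a CUDA GPU with transformers and bitsandbytes installed.

```python
def single_gpu_device_map(device="cuda:0"):
    """Map the whole model (the empty-string key) onto one device."""
    return {"": device}

def load_phi4_mini_4bit(model_id="microsoft/Phi-4-mini-reasoning", device="cuda:0"):
    """Load the model in 4-bit on a single GPU (hypothetical helper)."""
    # Imported lazily so this snippet can be imported without the GPU stack.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # ASSUMPTION: default 4-bit settings; the study's exact config is not documented here.
    quant_config = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map=single_gpu_device_map(device),
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```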

SAE Format

Each SAE lives in a subfolder named {sae_id}/ containing:

  • sae_weights.safetensors: encoder/decoder weights
  • cfg.json: SAELens-compatible config

hook_name format: model.layers.{layer}.hook_resid_post

Training Details

| Parameter | Value |
| --- | --- |
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (3072 → 12288) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | resid_post collected during autoregressive generation |
| Training conditions | mixed (n=252), deceptive_only (n=123), honest_only (n=129) |
| LLM classifier | Gemini 2.5 Flash |

Known Limitations

JumpReLU threshold not learned (JumpReLU SAEs only): All JumpReLU SAEs in this repo have threshold = 0, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64).
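The effect of a zero threshold can be illustrated on synthetic data. The sketch below applies a JumpReLU with threshold 0 to random pre-activations, checks that it matches a plain ReLU, and measures L0. The standard-normal pre-activations are a stand-in (an assumption), chosen because roughly half of symmetric pre-activations are positive; real encoder outputs need not be symmetric.

```python
import numpy as np

def jumprelu(pre_acts, threshold):
    """JumpReLU: pass a pre-activation through unchanged only if it exceeds its threshold."""
    return np.where(pre_acts > threshold, pre_acts, 0.0)

rng = np.random.default_rng(0)
d_sae, k = 12288, 64
pre = rng.standard_normal((8, d_sae))  # synthetic stand-in for encoder pre-activations

relu_acts = np.maximum(pre, 0.0)
jump_acts = jumprelu(pre, threshold=np.zeros(d_sae))
same_as_relu = np.array_equal(relu_acts, jump_acts)

# Mean fraction of features active per sample under a zero threshold.
l0_fraction_jump = (jump_acts != 0).sum(axis=-1).mean() / d_sae

# TopK keeps only the k largest pre-activations per sample, so L0 is fixed at k.
kth_largest = np.partition(pre, -k, axis=-1)[:, -k]
l0_topk = (pre >= kth_largest[:, None]).sum(axis=-1)

print(same_as_relu, round(float(l0_fraction_jump), 3), l0_topk.tolist())
```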

STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).

4-bit quantization: Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers.

Small dataset: n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality.

Loading Example

from safetensors.torch import load_file
import json

sae_id = "phi4_mini_jumprelu_L6_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:  # paths assume a local copy of the repo
    cfg = json.load(f)

# W_enc: [3072, 12288], W_dec: [12288, 3072]
# cfg["hook_name"] == "model.layers.6.hook_resid_post"
print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")

Usage

1. Load an SAE from this repo

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

repo_id = "Solshine/deception-saes-phi-4-mini-reasoning"
sae_id  = "phi4_mini_topk_L6_honest_only"   # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B: load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [3072, 12288], b_enc [12288],
#       W_dec [12288, 3072], b_dec [3072], threshold [12288]
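With the weights loaded manually, the forward pass can be written in a few lines. The sketch below is a NumPy illustration under stated assumptions, not the repo's reference implementation: it assumes the decoder bias is subtracted from the input before encoding (a common SAE convention that the cfg may or may not enable), that TopK applies a ReLU before keeping the k largest values, and that the state dict uses the key names listed above. Torch tensors from load_file can be converted with t.float().numpy().

```python
import numpy as np

def encode(x, state, architecture, k=64):
    """x: [batch, d_in] -> feature activations [batch, d_sae]."""
    # ASSUMPTION: b_dec is subtracted from the input before the encoder.
    pre = (x - state["b_dec"]) @ state["W_enc"] + state["b_enc"]
    if architecture == "jumprelu":
        return np.where(pre > state["threshold"], pre, 0.0)
    if architecture == "topk":
        relu = np.maximum(pre, 0.0)
        # Column indices of the k largest entries in each row.
        idx = np.argpartition(relu, -k, axis=-1)[..., -k:]
        acts = np.zeros_like(relu)
        np.put_along_axis(acts, idx, np.take_along_axis(relu, idx, axis=-1), axis=-1)
        return acts
    raise ValueError(f"unknown architecture: {architecture!r}")

def decode(acts, state):
    """Feature activations [batch, d_sae] -> reconstruction [batch, d_in]."""
    return acts @ state["W_dec"] + state["b_dec"]

# Tiny random weights so the sketch runs without downloading anything.
rng = np.random.default_rng(0)
d_in, d_sae = 8, 32
toy_state = {
    "W_enc": rng.standard_normal((d_in, d_sae)),
    "b_enc": np.zeros(d_sae),
    "W_dec": rng.standard_normal((d_sae, d_in)),
    "b_dec": np.zeros(d_in),
    "threshold": np.zeros(d_sae),
}
toy_x = rng.standard_normal((3, d_in))
toy_topk = encode(toy_x, toy_state, "topk", k=4)
toy_recon = decode(toy_topk, toy_state)
```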

2. Hook into the model and collect residual-stream activations

Each SAE was trained on the residual stream at the output of one transformer layer. The hook_name field in cfg.json gives the HuggingFace transformers submodule path to hook. Phi-4-mini uses a LLaMA-style module layout, so the hook path is model.layers.{layer}.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")

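# Read hook_name from the cfg you already loaded, e.g. "model.layers.6" (varies by SAE).
# If the stored name carries a ".hook_resid_post" suffix, strip it: only the
# module path "model.layers.{layer}" can be resolved on the HuggingFace model.
hook_name = cfg["hook_name"]
suffix = ".hook_resid_post"
module_path = hook_name[: -len(suffix)] if hook_name.endswith(suffix) else hook_name

# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)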

activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 3072]
resid = activations["resid"][:, -1, :]  # last token position

3. Read feature activations

with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 12288], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (for sanity check; should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
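The raw L2 error depends on the scale of the activations, so scale-free summaries are easier to compare across layers. The sketch below computes explained variance, relative L2 error, mean L0, and the number of features that fired at least once. The function name and the specific explained-variance convention (one minus residual over total sum of squares around the per-dimension mean) are assumptions; other conventions exist, and the numbers reported for this repo may use a different one. It needs a batch of several positions, since the variance of a single vector around its own mean is zero. Torch tensors can be passed in after .float().cpu().numpy().

```python
import numpy as np

def reconstruction_report(resid, reconstruction, feature_acts):
    """Scale-free reconstruction and sparsity summaries for a batch.

    resid, reconstruction: [n, d_in]; feature_acts: [n, d_sae]; n must be > 1.
    """
    resid = np.asarray(resid, dtype=np.float64)
    recon = np.asarray(reconstruction, dtype=np.float64)
    active = np.asarray(feature_acts) != 0

    residual_ss = ((resid - recon) ** 2).sum()
    total_ss = ((resid - resid.mean(axis=0)) ** 2).sum()
    rel_err = np.linalg.norm(resid - recon, axis=-1) / np.linalg.norm(resid, axis=-1)
    return {
        "explained_variance": float(1.0 - residual_ss / total_ss),
        "relative_l2_error": float(rel_err.mean()),
        "mean_l0": float(active.sum(axis=-1).mean()),
        "features_fired": int(active.any(axis=0).sum()),
    }

# Toy check: a perfect reconstruction explains all of the variance.
toy_resid = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_acts = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 3.0]])
perfect = reconstruction_report(toy_resid, toy_resid.copy(), toy_acts)
```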

Caveats and known limitations

Hook names are HuggingFace transformers-style, not TransformerLens-style. The hook_name in cfg.json (e.g. "model.layers.6") is a submodule path in the standard HuggingFace model. SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means SAE.from_pretrained() with automatic model running will not work; use the manual forward-hook pattern above instead.
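If you need the TransformerLens-style name for bookkeeping (for example, to line these SAEs up with others in a comparison table), the mapping between the two naming schemes is mechanical. The helper below is hypothetical and only rewrites the string; it does not make the SAELens pipeline run this model, and whether TransformerLens supports Phi-4-mini-reasoning is not established here. It also accepts names that already carry a .hook_resid_post suffix.

```python
import re

# Matches "model.layers.<n>" with an optional ".hook_resid_post" suffix.
_HF_LAYER = re.compile(r"^model\.layers\.(\d+)(?:\.hook_resid_post)?$")

def hf_to_transformer_lens(hook_name):
    """Rewrite an HF-style layer path as a TransformerLens resid_post hook name."""
    match = _HF_LAYER.match(hook_name)
    if match is None:
        raise ValueError(f"unrecognised hook name: {hook_name!r}")
    return f"blocks.{int(match.group(1))}.hook_resid_post"

print(hf_to_transformer_lens("model.layers.6"))
```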

SAELens version requirements.

  • topk architecture: SAELens ≥ 3.0
  • jumprelu architecture: SAELens ≥ 3.0
  • gated architecture: SAELens ≥ 3.5 (or load manually with state_dict); this repo contains only topk and jumprelu SAEs

These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.

Citation

@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}