# HGRN-1.3B Dense Baseline
A dense HGRN-1.3B model trained on 100B tokens. It serves as the reference checkpoint for the post-training sparsity experiments in:

**When Does One-Shot Pruning Beat Iterative Optimisation? Second-Order Correction for Sparse LLMs on Neuromorphic Hardware**

Kimia Gholami et al., NeurIPS 2026 submission
Sparse variants of this model, pruned with OBS-cancel-block, SparseGPT, Wanda, RIA and AWP, are available in the same organisation.
## Model details
| Property | Value |
|---|---|
| Architecture | HGRN (Hierarchically Gated Recurrent Neural Network) |
| Parameters | ~1.3B |
| Training tokens | 100B |
| Precision | bfloat16 |
| Dense WikiText-2 PPL | 14.18 |
HGRN is a gated linear recurrent (SSM-style) model with no attention layers. Its modelling code comes from the flash-linear-attention (FLA) library, which must be installed to load this checkpoint.
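For intuition, the core of an HGRN layer is an element-wise gated linear recurrence. The sketch below is a simplified, real-valued version written for illustration only, not FLA's kernel. It assumes the update `h_t = g_t * h_{t-1} + (1 - g_t) * x_t`, where the forget gate `g_t = gamma + (1 - gamma) * sigmoid(f_t)` is bounded below by `gamma`. In HGRN this lower bound increases with layer depth, so upper layers keep information for longer. The helper `gated_recurrence` and its arguments are hypothetical names.

```python
import torch


def gated_recurrence(x: torch.Tensor, f_logits: torch.Tensor, gamma: float) -> torch.Tensor:
    """Simplified HGRN-style recurrence (hypothetical helper, real-valued).

    x, f_logits: (seq_len, hidden) inputs and forget-gate logits.
    gamma: lower bound on the forget gate, in [0, 1).
    """
    # Forget gate squeezed into [gamma, 1): a higher gamma means longer memory
    g = gamma + (1.0 - gamma) * torch.sigmoid(f_logits)
    h = torch.zeros_like(x[0])  # constant-size state, no attention over the past
    outputs = []
    for t in range(x.shape[0]):
        # Element-wise linear update mixing the old state with the new input
        h = g[t] * h + (1.0 - g[t]) * x[t]
        outputs.append(h)
    return torch.stack(outputs)


# With zero gate logits and gamma = 0.5 the gate is 0.75 at every step,
# so a constant input of 1 gives states 0.25, then 0.4375.
out = gated_recurrence(torch.ones(2, 1), torch.zeros(2, 1), gamma=0.5)
```

The sequential loop is for readability; FLA computes the same kind of recurrence with fused, parallelised kernels.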
## Loading

Install FLA first, since it provides the HGRN model classes:

```bash
pip install flash-linear-attention
```

Then load the tokenizer and model:

```python
import torch
import fla  # FLA provides the HGRN implementation
from fla.models.hgrn import HGRNConfig, HGRNForCausalLM
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Register HGRN with the transformers Auto* classes (exist_ok avoids errors on re-registration)
AutoConfig.register("hgrn", HGRNConfig, exist_ok=True)
AutoModelForCausalLM.register(HGRNConfig, HGRNForCausalLM, exist_ok=True)

model_id = "ikimyaii/hgrn-1.3B-dense-baseline"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint precision
    trust_remote_code=True,
).cuda()
```
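The WikiText-2 perplexity in the table is the exponential of the mean next-token cross-entropy. The exact evaluation protocol (context length, stride, text preprocessing) is not stated here, so the following is only a hypothetical sketch using non-overlapping chunks of an assumed length `seq_len`; it is not guaranteed to reproduce 14.18. `chunked_perplexity` is an illustrative name, and `model` can be any causal LM whose forward pass returns an object with a `.logits` tensor, such as the model loaded above.

```python
import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def chunked_perplexity(model, input_ids: torch.Tensor, seq_len: int = 2048) -> float:
    """Perplexity of a 1-D token stream split into non-overlapping chunks.

    Hypothetical helper: seq_len is an assumed context length, not the paper's setting.
    input_ids must already be on the model's device.
    """
    total_nll, total_tokens = 0.0, 0
    # Every start index leaves at least two tokens, i.e. one (input, target) pair
    for start in range(0, input_ids.numel() - 1, seq_len):
        chunk = input_ids[start : start + seq_len].unsqueeze(0)  # (1, T)
        logits = model(chunk).logits.float()  # (1, T, vocab), upcast from bfloat16
        # Position t predicts token t + 1; sum the negative log-likelihoods
        nll = F.cross_entropy(logits[0, :-1], chunk[0, 1:], reduction="sum")
        total_nll += nll.item()
        total_tokens += chunk.shape[1] - 1
    return math.exp(total_nll / total_tokens)
```

With the loaded model, one would tokenize the WikiText-2 test text, move `input_ids[0]` to the GPU and call `chunked_perplexity(model, ids)`. Overlapping (strided) windows usually give a slightly lower number than non-overlapping chunks, because fewer tokens are predicted with a short context.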
## Sparsity structure
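All sparse variants use semi-structured (row-balanced) sparsity: at sparsity ratio `s`, exactly `k = floor(d_in * s)` weights are zeroed in every output row of each pruned weight matrix, where `d_in` is the matrix's input dimension.
This pattern maps directly onto Intel Loihi, where zero weights generate no spike events.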
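To make the per-row budget concrete: with an input dimension of, say, `d_in = 2048` (an illustrative value, not necessarily this model's) and `s = 0.5`, every row loses `k = floor(2048 * 0.5) = 1024` weights. The sketch below builds such a mask from an importance score. The score differs between the pruning methods listed above, so plain weight magnitude is used here only as a stand-in, and no update of the remaining weights is applied. `row_sparsity_mask` is a hypothetical helper.

```python
import math

import torch


def row_sparsity_mask(scores: torch.Tensor, s: float) -> torch.Tensor:
    """Boolean keep-mask that zeroes exactly floor(d_in * s) entries per output row.

    scores: (d_out, d_in) importance scores; the lowest are pruned first.
    Hypothetical helper: the scoring rule is up to the pruning method.
    """
    d_out, d_in = scores.shape
    k = math.floor(d_in * s)  # number of weights to zero in every row
    mask = torch.ones_like(scores, dtype=torch.bool)
    if k > 0:
        # Column indices of the k lowest-scoring weights in each row
        prune_idx = torch.topk(scores, k, dim=1, largest=False).indices
        rows = torch.arange(d_out).unsqueeze(1)  # (d_out, 1), broadcast over k
        mask[rows, prune_idx] = False
    return mask


# Example: magnitude as a stand-in score on a random (4, 10) weight matrix
W = torch.randn(4, 10)
mask = row_sparsity_mask(W.abs(), s=0.5)
W_sparse = W * mask
print((W_sparse == 0).sum(dim=1))  # 5 zeros in every row
```

Multiplying a layer's weight by such a mask yields the row-balanced pattern described above; the released variants differ in how they score weights and, for some methods, in how they update the weights that remain.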