DPMM-0.1B-MoE

A 124.5M-parameter Mixture-of-Experts language model trained from scratch, using architecture techniques drawn from current production LLMs.

Model Description

DPMM (Differentiable Probabilistic Mixture Model) is a custom Transformer + MoE architecture that implements techniques from DeepSeek-V3, Gemma 2, Qwen3, and Llama 3. It is built as an educational reference for the AI community, demonstrating that the entire LLM training pipeline (pre-training, SFT, alignment, safety) can be implemented from scratch on modest hardware.

Architecture

Component | Specification
Parameters | 124.5M total
Hidden Size | 512
Layers | 8
Attention | GQA (8 heads, 2 KV heads)
Head Dim | 64
FFN | SwiGLU (1408 intermediate)
Experts | 4 routed + 1 shared
Top-K | 2 experts per token
Routing | DeepSeek-V3 auxiliary-loss-free
Position | RoPE (theta=500K)
Norm | RMSNorm + QK-Norm
Vocab | 32,000 (SentencePiece)
Max Seq | 2,048 tokens
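
As a sanity check on the table, the parameter total can be approximated from the listed dimensions. The sketch below is not taken from the model code. It assumes bias-free linear layers, untied input and output embeddings, a single linear router per layer, two RMSNorms per layer, and QK-norm weights of size head_dim. Under those assumptions the count lands on the stated 124.5M; the real layout may differ in small ways.

```python
# Dimensions from the Architecture table.
VOCAB, D_MODEL, LAYERS = 32_000, 512, 8
Q_HEADS, KV_HEADS, HEAD_DIM = 8, 2, 64
FFN_DIM, ROUTED, SHARED = 1408, 4, 1


def attention_params():
    q = D_MODEL * Q_HEADS * HEAD_DIM        # query projection
    kv = 2 * D_MODEL * KV_HEADS * HEAD_DIM  # key and value projections (GQA)
    o = Q_HEADS * HEAD_DIM * D_MODEL        # output projection
    return q + kv + o


def expert_params():
    # SwiGLU expert: gate, up and down projections.
    return 3 * D_MODEL * FFN_DIM


def layer_params():
    router = D_MODEL * ROUTED               # assumed: one bias-free linear layer
    norms = 2 * D_MODEL + 2 * HEAD_DIM      # assumed: 2 RMSNorms + Q/K norm weights
    experts = (ROUTED + SHARED) * expert_params()
    return attention_params() + experts + router + norms


def total_params(tied_embeddings=False):
    embed = VOCAB * D_MODEL
    head = 0 if tied_embeddings else VOCAB * D_MODEL
    final_norm = D_MODEL
    return embed + head + LAYERS * layer_params() + final_norm


print(f"{total_params() / 1e6:.1f}M")  # 124.5M
```

With tied embeddings the same arithmetic gives about 108.2M, which suggests the published figure counts the input embedding and the output head separately.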

Key Techniques

  • Grouped Query Attention (GQA): a 4:1 query-to-KV head ratio shrinks the KV cache by 4x
  • QK-Norm: per-head RMS normalization prevents attention logit growth (Gemma 2, DeepSeek-V3)
  • Auxiliary-Loss-Free Routing: expert load balancing via bias adjustment rather than an auxiliary loss (DeepSeek-V3)
  • SwiGLU Activation: gate, up, and down projections (Llama/Mixtral/Qwen3)
  • Embedding Scaling: embeddings are multiplied by sqrt(d_model) (Gemma, Qwen3)
  • Residual Scaling: output projections are scaled by 1/sqrt(2L) for training stability
  • RoPE: Rotary Position Embeddings with a high theta (500K) for length extrapolation
  • DoRA + RS-LoRA: Weight-Decomposed, Rank-Stabilized adaptation for fine-tuning
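
To make the auxiliary-loss-free routing item concrete, here is a simplified sketch of the bias-adjustment idea in plain Python. It is not the model's router. The sigmoid affinities, the step size GAMMA, and the skewed toy logits are all assumptions chosen for illustration. The key point is that the per-expert bias affects only which experts are selected, while the gate weights come from the unbiased scores.

```python
import math
import random

NUM_EXPERTS = 4   # routed experts
TOP_K = 2         # experts selected per token
GAMMA = 0.01      # bias update step (hypothetical value)


def route(logits, bias):
    """Select TOP_K experts for one token and return {expert: gate weight}."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # sigmoid affinities
    # The bias is added for selection only.
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:TOP_K]
    # Gate weights use the unbiased scores, normalized over the chosen experts.
    norm = sum(scores[i] for i in chosen)
    return {i: scores[i] / norm for i in chosen}


def update_bias(bias, counts):
    """Raise the bias of under-loaded experts and lower it for over-loaded ones."""
    mean = sum(counts) / len(counts)
    return [b + GAMMA * ((c < mean) - (c > mean)) for b, c in zip(bias, counts)]


def simulate(steps=100, tokens_per_step=128, seed=0):
    """Toy run in which the raw router logits favour expert 0 over expert 3."""
    rng = random.Random(seed)
    skew = [1.0, 0.5, 0.0, -0.5]  # hypothetical router preference
    bias = [0.0] * NUM_EXPERTS
    counts = [0] * NUM_EXPERTS
    for _ in range(steps):
        counts = [0] * NUM_EXPERTS
        for _ in range(tokens_per_step):
            logits = [rng.gauss(skew[i], 1.0) for i in range(NUM_EXPERTS)]
            for expert in route(logits, bias):
                counts[expert] += 1
        bias = update_bias(bias, counts)
    return bias, counts


if __name__ == "__main__":
    final_bias, last_counts = simulate()
    print("bias:", [round(b, 2) for b in final_bias])
    print("tokens per expert in last step:", last_counts)
```

In this toy run the favoured expert ends up with a lower bias than the neglected one, which pushes selection toward an even load without adding a term to the training loss.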

Training

Phase 1: Combined SFT (~60 min on 2x A10)

Dataset | Examples | Purpose
Alpaca | 10,000 | General instruction following
Code/DevOps | 800 | Python, Kubernetes, Docker, CUDA, CI/CD
Customer Support | 800 | Ticket classification, troubleshooting
Legal | 800 | Contract analysis, compliance, IP
Finance | 800 | ROI, portfolio, risk analysis

Loss: 2.73 → 1.74 | LR: 1e-5 | 5 epochs

Phase 2: Balanced Alignment (~10 min on 2x A10)

Dataset | Examples | % of Total | Purpose
Guard/Safety | 800 | 29% | PII detection, injection blocking
Domain Replay | 1,120 | 40% | Preserve Phase 1 capabilities
Reasoning (CoT) | 480 | 17% | Chain-of-thought math
Constitutional AI | 400 | 14% | Harmful request refusal

Loss: 4.10 → 0.22 | LR: 3e-6 (cosine decay) | 4 epochs

Key technique: Domain Replay. Making 40% of the Phase 2 data a replay of Phase 1 domains guards against catastrophic forgetting, to which small models are especially prone.
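
The replay share can be worked out from the other Phase 2 counts. The sketch below shows one way to size and assemble such a mix. The function names and the idea of sampling replay examples from a Phase 1 pool are assumptions for illustration, not the author's pipeline.

```python
import random


def replay_count(new_examples, replay_fraction):
    """Replay examples needed so that they form replay_fraction of the final mix."""
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    return round(new_examples * replay_fraction / (1 - replay_fraction))


def build_mix(new_data, replay_pool, replay_fraction, seed=0):
    """Combine new alignment data with a random sample of earlier-phase data."""
    rng = random.Random(seed)
    pool = list(replay_pool)
    k = min(replay_count(len(new_data), replay_fraction), len(pool))
    mix = list(new_data) + rng.sample(pool, k)
    rng.shuffle(mix)
    return mix


# Guard/Safety + Reasoning (CoT) + Constitutional AI examples from the table.
new_total = 800 + 480 + 400
print(replay_count(new_total, 0.40))  # 1120
```

With 1,680 new examples, a 40% target gives 1,120 replay examples and a total of 2,800, which matches the Phase 2 table.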

Validation Results

16/16 tests passing (100%) across 9 capability categories:

Capability | Tests | Status
General Chat | 2 | PASS
Code/DevOps | 2 | PASS
Customer Support | 2 | PASS
Legal | 1 | PASS
Finance | 1 | PASS
Reasoning (CoT) | 2 | PASS
Multilingual | 2 | PASS
Guard/Safety | 2 | PASS
Constitutional AI | 2 | PASS

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepakdsoni/DPMM-0.1B-MoE"

# trust_remote_code=True loads the custom DPMM model code from the repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use one of the trained prompt templates.
prompt = "### Instruction:\nExplain what a REST API is.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=True is needed for temperature and top_p to take effect.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Prompt Formats

The model responds to these trained prompt templates:

### Instruction:\n{question}\n\n### Response:\n
### Programming Question:\n{question}\n\n### Solution:\n
### Support Ticket:\n{issue}\n\n### Agent Response:\n
### Legal Question:\n{question}\n\n### Legal Analysis:\n
### Finance Question:\n{question}\n\n### Analysis:\n
### Guard Classification:\n{input}\n\n### Classification:\n
### Constitutional Check:\n{request}\n\n### Response:\n
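
Since the model depends on these exact formats, it can help to build prompts from a single lookup rather than by hand. The helper below is a small sketch. The template strings are the ones listed above, while the short keys and the build_prompt function are hypothetical names introduced for illustration.

```python
# Template strings copied from the list above; the dictionary keys are
# hypothetical short names chosen for this example.
TEMPLATES = {
    "instruction": "### Instruction:\n{text}\n\n### Response:\n",
    "code": "### Programming Question:\n{text}\n\n### Solution:\n",
    "support": "### Support Ticket:\n{text}\n\n### Agent Response:\n",
    "legal": "### Legal Question:\n{text}\n\n### Legal Analysis:\n",
    "finance": "### Finance Question:\n{text}\n\n### Analysis:\n",
    "guard": "### Guard Classification:\n{text}\n\n### Classification:\n",
    "constitutional": "### Constitutional Check:\n{text}\n\n### Response:\n",
}


def build_prompt(kind, text):
    """Return the trained prompt format for the given task kind."""
    try:
        template = TEMPLATES[kind]
    except KeyError:
        known = ", ".join(sorted(TEMPLATES))
        raise ValueError(f"unknown prompt kind {kind!r}; expected one of: {known}") from None
    return template.format(text=text.strip())


print(build_prompt("code", "How do I list running Docker containers?"))
```

Rejecting unknown kinds keeps callers from sending a format the model was never trained on.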

Limitations

What 125M Parameters Can Do

  • Follow specific trained prompt formats
  • Produce domain-appropriate structured responses
  • Classify inputs (guard, safety, priority)
  • Perform simple mathematical reasoning with chain-of-thought
  • Refuse harmful requests

What 125M Parameters Cannot Do

  • Generalize to unseen prompt formats
  • Produce long coherent text (quality degrades after ~100 tokens)
  • Handle abstract reasoning or analogies
  • Generate creative or novel content

Hardware Requirements

  • Training: 2x NVIDIA A10 (23GB each), ~70 minutes total
  • Inference: Any GPU with 1GB+ VRAM, or CPU (slow)
  • GGUF quantized: Runs on consumer hardware (laptop CPU)
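
A rough estimate shows why the inference requirement is so small. The sketch below assumes 16-bit weights and a 16-bit KV cache, and it ignores activations and framework overhead, so treat the results as lower bounds. The dimensions come from the Architecture section.

```python
# Figures from the Architecture section; 2 bytes per value assumes fp16/bf16.
PARAMS = 124.5e6
LAYERS, KV_HEADS, HEAD_DIM, MAX_SEQ = 8, 2, 64, 2048
BYTES_PER_VALUE = 2
MIB = 1024 ** 2


def weight_bytes(params=PARAMS, bytes_per_param=BYTES_PER_VALUE):
    """Memory needed to hold the weights alone."""
    return int(params * bytes_per_param)


def kv_cache_bytes(seq_len=MAX_SEQ, batch=1, kv_heads=KV_HEADS):
    """Keys and values (factor 2) for every layer, KV head and position."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * seq_len * batch * BYTES_PER_VALUE


print(f"weights:                    {weight_bytes() / MIB:.0f} MiB")
print(f"KV cache (GQA, 2 KV heads): {kv_cache_bytes() / MIB:.0f} MiB")
print(f"KV cache (MHA, 8 KV heads): {kv_cache_bytes(kv_heads=8) / MIB:.0f} MiB")
```

Under these assumptions the weights take roughly 237 MiB and a full 2,048-token cache takes 8 MiB, compared with 32 MiB if every query head had its own KV head. That is consistent with the 1GB+ VRAM figure.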

Citation

@misc{dpmm-0.1b-moe-2025,
  title={DPMM-0.1B-MoE: A Small Mixture-of-Experts Language Model},
  author={Deepak Soni},
  year={2025},
  url={https://huggingface.co/deepakdsoni/DPMM-0.1B-MoE}
}

License

Apache 2.0
