DPMM-0.1B-MoE
A 124.5M-parameter Mixture-of-Experts language model trained from scratch with production-grade architecture techniques.
Model Description
DPMM (Differentiable Probabilistic Mixture Model) is a custom Transformer + MoE architecture implementing state-of-the-art techniques from DeepSeek-V3, Gemma 2, Qwen3, and Llama 3. It is built as an educational reference for the AI community, demonstrating that the entire LLM training pipeline (pre-training, SFT, alignment, safety) can be implemented from scratch on modest hardware.
Architecture
| Component | Specification |
|---|---|
| Parameters | 124.5M total |
| Hidden Size | 512 |
| Layers | 8 |
| Attention | GQA (8 heads, 2 KV heads) |
| Head Dim | 64 |
| FFN | SwiGLU (1408 intermediate) |
| Experts | 4 routed + 1 shared |
| Top-K | 2 experts per token |
| Routing | DeepSeek-V3 auxiliary-loss-free |
| Position | RoPE (theta=500K) |
| Norm | RMSNorm + QK-Norm |
| Vocab | 32,000 (SentencePiece) |
| Max Seq | 2,048 tokens |
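As a sanity check, the parameter total can be roughly reproduced from the table. The sketch below is a back-of-envelope estimate, not the model's actual code: it assumes untied input and output embeddings, no bias terms, one linear router per layer, and an MoE block in every layer, none of which the card states. Normalization weights are ignored because they add only a few thousand parameters.

```python
# Back-of-envelope parameter count from the architecture table.
# Assumptions (not stated in the model card): untied input/output embeddings,
# no bias terms, one linear router per layer, MoE in every layer.
VOCAB, D_MODEL, LAYERS = 32_000, 512, 8
HEADS, KV_HEADS, HEAD_DIM = 8, 2, 64
FFN_DIM, ROUTED, SHARED, TOP_K = 1408, 4, 1, 2


def attention_params() -> int:
    q = D_MODEL * HEADS * HEAD_DIM          # query projection
    kv = 2 * D_MODEL * KV_HEADS * HEAD_DIM  # key and value projections (GQA)
    o = HEADS * HEAD_DIM * D_MODEL          # output projection
    return q + kv + o


def expert_params() -> int:
    # SwiGLU uses three matrices: gate, up, and down.
    return 3 * D_MODEL * FFN_DIM


def layer_params(experts_used: int) -> int:
    router = D_MODEL * ROUTED
    return attention_params() + experts_used * expert_params() + router


def total_params() -> int:
    embeddings = 2 * VOCAB * D_MODEL  # input embedding + output head
    return embeddings + LAYERS * layer_params(ROUTED + SHARED)


def active_params() -> int:
    # Per token, only the top-k routed experts plus the shared expert run.
    embeddings = 2 * VOCAB * D_MODEL
    return embeddings + LAYERS * layer_params(TOP_K + SHARED)


print(f"total:  {total_params() / 1e6:.1f}M")   # prints total:  124.5M
print(f"active: {active_params() / 1e6:.1f}M")  # prints active: 89.9M
```

Under these assumptions the estimate lands on the reported 124.5M, which suggests (but does not confirm) that the embeddings are untied.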
Key Techniques
- Grouped Query Attention (GQA): 4:1 Q/KV ratio reduces KV cache by 4x
- QK-Norm: Per-head RMS normalization prevents attention logit growth (Gemma 2, DeepSeek-V3)
- Auxiliary-Loss-Free Routing: Expert load balancing via bias adjustment, not auxiliary loss (DeepSeek-V3)
- SwiGLU Activation: Gate + Up + Down projection (Llama/Mixtral/Qwen3)
- Embedding Scaling: Multiply embeddings by sqrt(d_model) (Gemma, Qwen3)
- Residual Scaling: Output projections scaled by 1/sqrt(2L) for training stability
- RoPE: Rotary Position Embeddings with high theta (500K) for length extrapolation
- DoRA + RS-LoRA: Weight-Decomposed Rank-Stabilized adaptation for fine-tuning
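To make the auxiliary-loss-free routing idea concrete, here is a minimal pure-Python sketch of bias-adjusted top-k routing. It is an illustration of the general technique, not this model's implementation: the function names, the step size, and the "compare against mean load" rule are assumptions, and scores are assumed to be positive affinities (for example sigmoid outputs).

```python
def route(scores, bias, top_k=2):
    """Pick top_k experts by (score + bias); weight them by the raw scores.

    The bias only influences which experts are selected. The mixing weights
    come from the unbiased scores, normalized over the selected experts.
    """
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i],
                    reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def update_bias(bias, load, step=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones.

    `load` is the number of tokens each expert received in the last batch.
    `step` is a hypothetical update speed, not a value from the model card.
    """
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


# Example: 4 routed experts, top-2 selection, as in the architecture table.
scores = [0.9, 0.1, 0.5, 0.2]
bias = [0.0, 0.0, 0.0, 0.0]
print(route(scores, bias))
print(update_bias(bias, load=[10, 2, 4, 4]))
```

Because the balancing signal enters only through the selection bias, no extra loss term competes with the language-modeling objective during training.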
Training
Phase 1: Combined SFT (~60 min on 2x A10)
| Dataset | Examples | Purpose |
|---|---|---|
| Alpaca | 10,000 | General instruction following |
| Code/DevOps | 800 | Python, Kubernetes, Docker, CUDA, CI/CD |
| Customer Support | 800 | Ticket classification, troubleshooting |
| Legal | 800 | Contract analysis, compliance, IP |
| Finance | 800 | ROI, portfolio, risk analysis |
Loss: 2.73 → 1.74 | LR: 1e-5 | 5 epochs
Phase 2: Balanced Alignment (~10 min on 2x A10)
| Dataset | Examples | % of Total | Purpose |
|---|---|---|---|
| Guard/Safety | 800 | 29% | PII detection, injection blocking |
| Domain Replay | 1,120 | 40% | Preserve Phase 1 capabilities |
| Reasoning (CoT) | 480 | 17% | Chain-of-thought math |
| Constitutional AI | 400 | 14% | Harmful request refusal |
Loss: 4.10 → 0.22 | LR: 3e-6 (cosine decay) | 4 epochs
Key technique: Domain Replay (40% of Phase 2 data) prevents catastrophic forgetting in small models.
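The Phase 2 proportions follow directly from the example counts, and a replay mix of this shape could be assembled roughly as sketched below. The helper names are hypothetical, and uniform random sampling of Phase 1 examples is an assumption; the card does not say how the replay examples were chosen.

```python
import random

# Example counts from the Phase 2 table.
PHASE2_COUNTS = {
    "guard_safety": 800,
    "domain_replay": 1120,
    "reasoning_cot": 480,
    "constitutional_ai": 400,
}


def mixture_shares(counts):
    """Return each dataset's share of the total as a fraction."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def build_phase2_mix(new_examples, phase1_examples, replay_fraction=0.40, seed=0):
    """Add replayed Phase 1 examples so they form `replay_fraction` of the mix.

    Assumption: replay examples are drawn uniformly at random without
    replacement from the Phase 1 data.
    """
    rng = random.Random(seed)
    n_replay = round(len(new_examples) * replay_fraction / (1 - replay_fraction))
    replay = rng.sample(phase1_examples, n_replay)
    mix = list(new_examples) + replay
    rng.shuffle(mix)
    return mix


for name, share in mixture_shares(PHASE2_COUNTS).items():
    print(f"{name}: {share:.0%}")

# 1,680 new alignment examples plus replay drawn from 13,200 Phase 1 examples.
new = [("new", i) for i in range(1680)]
phase1 = [("phase1", i) for i in range(13_200)]
print(len(build_phase2_mix(new, phase1)))
```

With 1,680 new examples and a 40% replay target, the helper draws 1,120 replay examples, matching the table.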
Validation Results
16/16 tests passing (100%) across 9 capability categories:
| Capability | Tests | Status |
|---|---|---|
| General Chat | 2 | PASS |
| Code/DevOps | 2 | PASS |
| Customer Support | 2 | PASS |
| Legal | 1 | PASS |
| Finance | 1 | PASS |
| Reasoning (CoT) | 2 | PASS |
| Multilingual | 2 | PASS |
| Guard/Safety | 2 | PASS |
| Constitutional AI | 2 | PASS |
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; trust_remote_code is needed because the architecture is custom.
model = AutoModelForCausalLM.from_pretrained(
    "deepakdsoni/DPMM-0.1B-MoE",
    trust_remote_code=True,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("deepakdsoni/DPMM-0.1B-MoE")

# Use one of the trained prompt templates (see Prompt Formats).
prompt = "### Instruction:\nExplain what a REST API is.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=True is required for temperature and top_p to take effect.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Prompt Formats
The model responds to these trained prompt templates:
### Instruction:\n{question}\n\n### Response:\n
### Programming Question:\n{question}\n\n### Solution:\n
### Support Ticket:\n{issue}\n\n### Agent Response:\n
### Legal Question:\n{question}\n\n### Legal Analysis:\n
### Finance Question:\n{question}\n\n### Analysis:\n
### Guard Classification:\n{input}\n\n### Classification:\n
### Constitutional Check:\n{request}\n\n### Response:\n
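Since the model is sensitive to the exact template, it can help to build prompts programmatically rather than by hand. The following is a small convenience sketch; the `TEMPLATES` keys and the `build_prompt` helper are hypothetical names introduced here, while the header and footer strings are copied from the list above.

```python
# Header/footer pairs copied from the trained prompt templates.
# The dictionary keys are illustrative names, not part of the model.
TEMPLATES = {
    "instruction": ("### Instruction:", "### Response:"),
    "programming": ("### Programming Question:", "### Solution:"),
    "support": ("### Support Ticket:", "### Agent Response:"),
    "legal": ("### Legal Question:", "### Legal Analysis:"),
    "finance": ("### Finance Question:", "### Analysis:"),
    "guard": ("### Guard Classification:", "### Classification:"),
    "constitutional": ("### Constitutional Check:", "### Response:"),
}


def build_prompt(kind: str, text: str) -> str:
    """Wrap `text` in the trained template identified by `kind`."""
    if kind not in TEMPLATES:
        raise ValueError(f"unknown template {kind!r}; expected one of {sorted(TEMPLATES)}")
    header, footer = TEMPLATES[kind]
    return f"{header}\n{text.strip()}\n\n{footer}\n"


print(build_prompt("support", "My invoice shows a duplicate charge."))
```

The returned string can be passed directly to the tokenizer in the Usage snippet.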
Limitations
What 125M Parameters Can Do
- Follow specific trained prompt formats
- Produce domain-appropriate structured responses
- Classify inputs (guard, safety, priority)
- Simple mathematical reasoning with chain-of-thought
- Refuse harmful requests
What 125M Parameters Cannot Do
- Generalize to unseen prompt formats
- Produce long coherent text (quality degrades after ~100 tokens)
- Handle abstract reasoning or analogies
- Generate creative or novel content
Hardware Requirements
- Training: 2x NVIDIA A10 (23GB each), ~70 minutes total
- Inference: Any GPU with 1GB+ VRAM, or CPU (slow)
- GGUF quantized: Runs on consumer hardware (laptop CPU)
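The inference figure can be cross-checked with simple arithmetic on the weights and the KV cache. This is a rough estimate under stated assumptions: 2 bytes per value (fp16/bf16), a single sequence, and no allowance for activations or framework overhead, which add to the real footprint.

```python
# Rough inference memory estimate from the architecture table.
# Assumptions: 2-byte (fp16/bf16) values, batch size 1, no runtime overhead.
PARAMS = 124_500_000
LAYERS, HEADS, KV_HEADS, HEAD_DIM, MAX_SEQ = 8, 8, 2, 64, 2048
MIB = 1024 * 1024


def weight_bytes(bytes_per_param: int = 2) -> int:
    """Memory needed to hold all weights."""
    return PARAMS * bytes_per_param


def kv_cache_bytes(seq_len: int, kv_heads: int = KV_HEADS, bytes_per_value: int = 2) -> int:
    """Keys and values for every layer, KV head, and cached position."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * seq_len * bytes_per_value


print(f"weights (fp16):        {weight_bytes() / MIB:.1f} MiB")
print(f"KV cache, GQA (2 KV):  {kv_cache_bytes(MAX_SEQ) / MIB:.0f} MiB")
print(f"KV cache, MHA (8 KV):  {kv_cache_bytes(MAX_SEQ, kv_heads=HEADS) / MIB:.0f} MiB")
```

At the full 2,048-token context the GQA cache is 8 MiB versus 32 MiB with one KV head per query head, which is the 4x reduction noted under Key Techniques. Both are small next to the weights, which leaves ample headroom within 1GB.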
Citation
@misc{dpmm-0.1b-moe-2025,
title={DPMM-0.1B-MoE: A Small Mixture-of-Experts Language Model},
author={Deepak Soni},
year={2025},
url={https://huggingface.co/deepakdsoni/DPMM-0.1B-MoE}
}
License
Apache 2.0