DPMM-0.1B-MoE

A 124.5M-parameter Mixture-of-Experts language model trained from scratch with production-grade architecture techniques.

Model Description

DPMM (Differentiable Probabilistic Mixture Model) is a custom Transformer + MoE architecture that implements techniques from DeepSeek-V3, Gemma 2, Qwen3, and Llama 3. It is built as an educational reference for the AI community, demonstrating that the entire LLM training pipeline (pre-training, SFT, alignment, safety) can be implemented from scratch on modest hardware.

Architecture

Component	Specification
Parameters	124.5M total
Hidden Size	512
Layers	8
Attention	GQA (8 heads, 2 KV heads)
Head Dim	64
FFN	SwiGLU (1408 intermediate)
Experts	4 routed + 1 shared
Top-K	2 experts per token
Routing	DeepSeek-V3 auxiliary-loss-free
Position	RoPE (theta=500K)
Norm	RMSNorm + QK-Norm
Vocab	32,000 (SentencePiece)
Max Seq	2,048 tokens
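
The 124.5M total can be roughly reproduced from the table above. The sketch below is a back-of-the-envelope count, not the model's actual code. It assumes untied input and output embeddings, no bias terms, a 1408-wide SwiGLU in every expert (the shared one included), one linear router per layer, and it ignores any small routing-bias buffers. The function name `count_params` and its arguments are illustrative.

```python
def count_params(vocab=32_000, d_model=512, n_layers=8, n_heads=8,
                 n_kv_heads=2, head_dim=64, d_ff=1408,
                 n_routed=4, n_shared=1, tied_embeddings=False):
    """Estimate the parameter count from the architecture table (assumptions above)."""
    # Attention: Q and O are d_model x (n_heads * head_dim); K and V use only the KV heads.
    attn = 2 * d_model * n_heads * head_dim + 2 * d_model * n_kv_heads * head_dim
    # One SwiGLU expert: gate, up and down projections.
    expert = 3 * d_model * d_ff
    # Routed + shared experts, plus a linear router over the routed experts.
    moe = (n_routed + n_shared) * expert + d_model * n_routed
    # Two RMSNorm weights per layer, plus QK-Norm weights over head_dim (assumed layout).
    norms = 2 * d_model + 2 * head_dim
    embed = vocab * d_model
    lm_head = 0 if tied_embeddings else embed
    final_norm = d_model
    return embed + lm_head + n_layers * (attn + moe + norms) + final_norm


total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate lands at about 124.5M, consistent with the table. With tied embeddings it would drop to about 108.2M, which suggests the output head is stored separately.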

Key Techniques

Grouped Query Attention (GQA) — 4:1 Q/KV ratio reduces KV cache by 4x
QK-Norm — Per-head RMS normalization prevents attention logit growth (Gemma 2, DeepSeek-V3)
Auxiliary-Loss-Free Routing — Expert load balancing via bias adjustment, not auxiliary loss (DeepSeek-V3)
SwiGLU Activation — Gate + Up + Down projection (Llama/Mixtral/Qwen3)
Embedding Scaling — Multiply embeddings by sqrt(d_model) (Gemma, Qwen3)
Residual Scaling — Output projections scaled by 1/sqrt(2L) for training stability
RoPE — Rotary Position Embeddings with high theta (500K) for length extrapolation
DoRA + RS-LoRA — Weight-Decomposed Rank-Stabilized adaptation for fine-tuning
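
As a rough illustration of the auxiliary-loss-free routing item in the list above, the sketch below follows the scheme described for DeepSeek-V3: a per-expert bias is added to the affinity scores only when picking the top-k experts, the gate weights come from the unbiased scores, and after each batch the bias is nudged down for overloaded experts and up for underloaded ones. It is plain Python for readability and is not DPMM's implementation. The names `route` and `update_bias` and the step size are hypothetical.

```python
def route(scores, bias, top_k=2):
    """Pick top_k experts by biased score; weight them by the unbiased score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(scores[i] for i in chosen)
    return {i: scores[i] / norm for i in chosen}


def update_bias(bias, tokens_per_expert, step=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


# Toy batch: 4 routed experts, 3 tokens, all of which prefer experts 0 and 1.
bias = [0.0, 0.0, 0.0, 0.0]
batch = [[0.9, 0.8, 0.1, 0.2], [0.7, 0.6, 0.3, 0.1], [0.8, 0.9, 0.2, 0.3]]
load = [0, 0, 0, 0]
for scores in batch:
    for expert in route(scores, bias):
        load[expert] += 1
bias = update_bias(bias, load)
print(load, bias)
```

Because the bias only affects which experts are selected, balancing the load this way adds no extra term to the training loss.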

Training

Phase 1 — Combined SFT (~60 min on 2x A10)

Dataset	Examples	Purpose
Alpaca	10,000	General instruction following
Code/DevOps	800	Python, Kubernetes, Docker, CUDA, CI/CD
Customer Support	800	Ticket classification, troubleshooting
Legal	800	Contract analysis, compliance, IP
Finance	800	ROI, portfolio, risk analysis

Loss: 2.73 → 1.74 | LR: 1e-5 | 5 epochs

Phase 2 — Balanced Alignment (~10 min on 2x A10)

Dataset	Examples	% of Total	Purpose
Guard/Safety	800	29%	PII detection, injection blocking
Domain Replay	1,120	40%	Preserve Phase 1 capabilities
Reasoning (CoT)	480	17%	Chain-of-thought math
Constitutional AI	400	14%	Harmful request refusal

Loss: 4.10 → 0.22 | LR: 3e-6 (cosine decay) | 4 epochs

Key technique: Domain Replay. Reusing Phase 1 examples for 40% of the Phase 2 data counters catastrophic forgetting of Phase 1 capabilities in a model this small.
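
The replay share determines how many Phase 1 examples are mixed back in. As a small worked example (the helper name `replay_count` is illustrative and not taken from the training code): with 1,680 new alignment examples and a 40% replay target, the replay set needs 1,680 × 0.4 / 0.6 = 1,120 examples, which matches the Phase 2 table.

```python
def replay_count(new_examples, replay_fraction):
    """Replay examples needed so they make up replay_fraction of the final mix."""
    return round(new_examples * replay_fraction / (1.0 - replay_fraction))


# New Phase 2 data, with counts taken from the table above.
new_sets = {"Guard/Safety": 800, "Reasoning (CoT)": 480, "Constitutional AI": 400}
n_new = sum(new_sets.values())
n_replay = replay_count(n_new, 0.40)

mix = {**new_sets, "Domain Replay": n_replay}
total = sum(mix.values())
for name, n in mix.items():
    print(f"{name}: {n} ({100 * n / total:.0f}%)")
```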

Validation Results

16/16 tests passing (100%) across 9 capability categories:

Capability	Tests	Status
General Chat	2	PASS
Code/DevOps	2	PASS
Customer Support	2	PASS
Legal	1	PASS
Finance	1	PASS
Reasoning (CoT)	2	PASS
Multilingual	2	PASS
Guard/Safety	2	PASS
Constitutional AI	2	PASS

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is needed because DPMM is a custom architecture
model = AutoModelForCausalLM.from_pretrained(
    "deepakdsoni/DPMM-0.1B-MoE",
    trust_remote_code=True,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("deepakdsoni/DPMM-0.1B-MoE")

# Use one of the trained prompt templates (see Prompt Formats)
prompt = "### Instruction:\nExplain what a REST API is.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required for temperature and top_p to take effect
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Prompt Formats

The model responds to these trained prompt templates:

### Instruction:\n{question}\n\n### Response:\n
### Programming Question:\n{question}\n\n### Solution:\n
### Support Ticket:\n{issue}\n\n### Agent Response:\n
### Legal Question:\n{question}\n\n### Legal Analysis:\n
### Finance Question:\n{question}\n\n### Analysis:\n
### Guard Classification:\n{input}\n\n### Classification:\n
### Constitutional Check:\n{request}\n\n### Response:\n
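
Since the model is sensitive to the exact template, a small helper can keep the strings consistent. The sketch below is one way to do it. The dictionary keys and the `build_prompt` name are made up for illustration, and each `\n` in the templates above is written as a real newline.

```python
# Template strings copied from the list above; the short keys are hypothetical.
TEMPLATES = {
    "instruction": "### Instruction:\n{text}\n\n### Response:\n",
    "programming": "### Programming Question:\n{text}\n\n### Solution:\n",
    "support": "### Support Ticket:\n{text}\n\n### Agent Response:\n",
    "legal": "### Legal Question:\n{text}\n\n### Legal Analysis:\n",
    "finance": "### Finance Question:\n{text}\n\n### Analysis:\n",
    "guard": "### Guard Classification:\n{text}\n\n### Classification:\n",
    "constitutional": "### Constitutional Check:\n{text}\n\n### Response:\n",
}


def build_prompt(kind, text):
    """Fill one of the trained templates with the user's text."""
    if kind not in TEMPLATES:
        raise KeyError(f"unknown template {kind!r}; choose from {sorted(TEMPLATES)}")
    return TEMPLATES[kind].format(text=text.strip())


prompt = build_prompt("support", "  My invoice shows the wrong billing address. ")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hand-written prompt in the Usage example.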

Limitations

What 125M Parameters Can Do

Follow specific trained prompt formats
Produce domain-appropriate structured responses
Classify inputs (guard, safety, priority)
Simple mathematical reasoning with chain-of-thought
Refuse harmful requests

What 125M Parameters Cannot Do

Generalize to unseen prompt formats
Produce long coherent text (quality degrades after ~100 tokens)
Handle abstract reasoning or analogies
Generate creative or novel content

Hardware Requirements

Training: 2x NVIDIA A10 (23GB each), ~70 minutes total
Inference: Any GPU with 1GB+ VRAM, or CPU (slow)
GGUF quantized: Runs on consumer hardware (laptop CPU)
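
To see why about 1GB of VRAM is enough, here is a rough memory estimate. It is a sketch that assumes 16-bit weights and a 16-bit KV cache, batch size 1, and the full 2,048-token context, and it ignores activations and framework overhead. The function name `inference_memory_mib` is illustrative.

```python
def inference_memory_mib(n_params=124_500_000, bytes_per_value=2, n_layers=8,
                         n_kv_heads=2, head_dim=64, seq_len=2048, batch=1):
    """Return (weights, KV cache) sizes in MiB under the stated assumptions."""
    weights = n_params * bytes_per_value
    # Keys and values (factor 2) for every layer, KV head, position and batch element.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value
    mib = 1024 * 1024
    return weights / mib, kv_cache / mib


weights_mib, kv_mib = inference_memory_mib()
# Same model with 8 KV heads (no GQA), for comparison.
_, mha_kv_mib = inference_memory_mib(n_kv_heads=8)
print(f"weights ~{weights_mib:.0f} MiB, KV cache {kv_mib:.0f} MiB (vs {mha_kv_mib:.0f} MiB without GQA)")
```

The weights dominate at roughly 240 MiB, and the comparison shows the 4x KV-cache saving from GQA noted under Key Techniques.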

Citation

@misc{dpmm-0.1b-moe-2025,
  title={DPMM-0.1B-MoE: A Small Mixture-of-Experts Language Model},
  author={Deepak Soni},
  year={2025},
  url={https://huggingface.co/deepakdsoni/DPMM-0.1B-MoE}
}

License

Apache 2.0

Safetensors

Model size: 0.1B params
Tensor type: F16

Evaluation results

Validation Pass Rate (self-reported): 100%