deepakdsoni committed 16 days ago

Commit b7f4634 (verified) · 1 parent: d7eb6ca

Initial upload: DPMM-0.1B-MoE (124.5M params, 16/16 validation pass)

Files changed (9)

README.md +177 -0
config.json +34 -0
configuration_dpmm.py +63 -0
generation_config.json +12 -0
model.safetensors +3 -0
modeling_dpmm.py +293 -0
special_tokens_map.json +6 -0
tokenizer.model +3 -0
tokenizer_config.json +12 -0

README.md ADDED

	@@ -0,0 +1,177 @@

+---
+language:
+  - en
+license: apache-2.0
+tags:
+  - mixture-of-experts
+  - moe
+  - causal-lm
+  - custom-architecture
+  - from-scratch
+  - gqa
+  - rope
+  - swiglu
+  - dora
+  - small-model
+  - educational
+library_name: transformers
+pipeline_tag: text-generation
+model-index:
+  - name: DPMM-0.1B-MoE
+    results:
+      - task:
+          type: text-generation
+        metrics:
+          - name: Validation Pass Rate
+            type: accuracy
+            value: 100
+            verified: false
+---
+# DPMM-0.1B-MoE
+A 124.5M parameter Mixture-of-Experts language model trained from scratch with production-grade architecture techniques.
+## Model Description
+DPMM (Differentiable Probabilistic Mixture Model) is a custom Transformer + MoE architecture implementing state-of-the-art techniques from DeepSeek-V3, Gemma 2, Qwen3, and Llama 3. It is built as an educational reference for the AI community, demonstrating that the **entire LLM training pipeline** (pre-training, SFT, alignment, safety) can be implemented from scratch on modest hardware.
+### Architecture
+| Component | Specification |
+|-----------|---------------|
+| Parameters | 124.5M total |
+| Hidden Size | 512 |
+| Layers | 8 |
+| Attention | GQA (8 heads, 2 KV heads) |
+| Head Dim | 64 |
+| FFN | SwiGLU (1408 intermediate) |
+| Experts | 4 routed + 1 shared |
+| Top-K | 2 experts per token |
+| Routing | DeepSeek-V3 auxiliary-loss-free |
+| Position | RoPE (theta=500K) |
+| Norm | RMSNorm + QK-Norm |
+| Vocab | 32,000 (SentencePiece) |
+| Max Seq | 2,048 tokens |
+### Key Techniques
+- **Grouped Query Attention (GQA)** — 4:1 Q/KV ratio reduces KV cache by 4x
+- **QK-Norm** — Per-head RMS normalization prevents attention logit growth (Gemma 2, DeepSeek-V3)
+- **Auxiliary-Loss-Free Routing** — Expert load balancing via bias adjustment, not auxiliary loss (DeepSeek-V3)
+- **SwiGLU Activation** — Gate + Up + Down projection (Llama/Mixtral/Qwen3)
+- **Embedding Scaling** — Multiply embeddings by sqrt(d_model) (Gemma, Qwen3)
+- **Residual Scaling** — Output projections scaled by 1/sqrt(2L) for training stability
+- **RoPE** — Rotary Position Embeddings with high theta (500K) for length extrapolation
+- **DoRA + RS-LoRA** — Weight-Decomposed Rank-Stabilized adaptation for fine-tuning
+## Training
+### Phase 1 — Combined SFT (~60 min on 2x A10)
+| Dataset | Examples | Purpose |
+|---------|----------|---------|
+| Alpaca | 10,000 | General instruction following |
+| Code/DevOps | 800 | Python, Kubernetes, Docker, CUDA, CI/CD |
+| Customer Support | 800 | Ticket classification, troubleshooting |
+| Legal | 800 | Contract analysis, compliance, IP |
+| Finance | 800 | ROI, portfolio, risk analysis |
+Loss: 2.73 → 1.74 | LR: 1e-5 | 5 epochs
+### Phase 2 — Balanced Alignment (~10 min on 2x A10)
+| Dataset | Examples | % of Total | Purpose |
+|---------|----------|------------|---------|
+| Guard/Safety | 800 | 29% | PII detection, injection blocking |
+| Domain Replay | 1,120 | 40% | Preserve Phase 1 capabilities |
+| Reasoning (CoT) | 480 | 17% | Chain-of-thought math |
+| Constitutional AI | 400 | 14% | Harmful request refusal |
+Loss: 4.10 → 0.22 | LR: 3e-6 (cosine decay) | 4 epochs
+**Key technique:** Domain Replay (40% of Phase 2 data) prevents catastrophic forgetting in small models.
+## Validation Results
+**16/16 tests passing (100%)** across 9 capability categories:
+| Capability | Tests | Status |
+|------------|-------|--------|
+| General Chat | 2 | PASS |
+| Code/DevOps | 2 | PASS |
+| Customer Support | 2 | PASS |
+| Legal | 1 | PASS |
+| Finance | 1 | PASS |
+| Reasoning (CoT) | 2 | PASS |
+| Multilingual | 2 | PASS |
+| Guard/Safety | 2 | PASS |
+| Constitutional AI | 2 | PASS |
+## Usage
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained(
+    "deepakdsoni/DPMM-0.1B-MoE",
+    trust_remote_code=True,
+    torch_dtype="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained("deepakdsoni/DPMM-0.1B-MoE")
+prompt = "### Instruction:\nExplain what a REST API is.\n\n### Response:\n"
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, top_p=0.9)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+### Prompt Formats
+The model responds to these trained prompt templates:
+```
+### Instruction:\n{question}\n\n### Response:\n
+### Programming Question:\n{question}\n\n### Solution:\n
+### Support Ticket:\n{issue}\n\n### Agent Response:\n
+### Legal Question:\n{question}\n\n### Legal Analysis:\n
+### Finance Question:\n{question}\n\n### Analysis:\n
+### Guard Classification:\n{input}\n\n### Classification:\n
+### Constitutional Check:\n{request}\n\n### Response:\n
+```
+## Limitations
+### What 125M Parameters Can Do
+- Follow specific trained prompt formats
+- Produce domain-appropriate structured responses
+- Classify inputs (guard, safety, priority)
+- Simple mathematical reasoning with chain-of-thought
+- Refuse harmful requests
+### What 125M Parameters Cannot Do
+- Generalize to unseen prompt formats
+- Produce long coherent text (quality degrades after ~100 tokens)
+- Handle abstract reasoning or analogies
+- Generate creative or novel content
+## Hardware Requirements
+- **Training:** 2x NVIDIA A10 (23GB each), ~70 minutes total
+- **Inference:** Any GPU with 1GB+ VRAM, or CPU (slow)
+- **GGUF quantized:** Runs on consumer hardware (laptop CPU)
+## Citation
+```bibtex
+@misc{dpmm-0.1b-moe-2025,
+  title={DPMM-0.1B-MoE: A Small Mixture-of-Experts Language Model},
+  author={Deepak Soni},
+  year={2025},
+  url={https://huggingface.co/deepakdsoni/DPMM-0.1B-MoE}
+}
+```
+## License
+Apache 2.0
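
The prompt templates listed under "Prompt Formats" in the README can be wrapped in a small helper so callers do not assemble the headers by hand. This is only an illustrative sketch: `PROMPT_TEMPLATES` and `build_prompt` are hypothetical names that do not exist in the uploaded repository, and the short keys (`"instruction"`, `"guard"`, and so on) are invented here. Only the template strings themselves are copied from the README.

```python
# Hypothetical helper (not part of the repository): fills the prompt
# templates that the README lists under "Prompt Formats".
PROMPT_TEMPLATES = {
    "instruction": "### Instruction:\n{text}\n\n### Response:\n",
    "programming": "### Programming Question:\n{text}\n\n### Solution:\n",
    "support": "### Support Ticket:\n{text}\n\n### Agent Response:\n",
    "legal": "### Legal Question:\n{text}\n\n### Legal Analysis:\n",
    "finance": "### Finance Question:\n{text}\n\n### Analysis:\n",
    "guard": "### Guard Classification:\n{text}\n\n### Classification:\n",
    "constitutional": "### Constitutional Check:\n{text}\n\n### Response:\n",
}


def build_prompt(kind: str, text: str) -> str:
    """Return the trained prompt format for `kind` with `text` filled in."""
    if kind not in PROMPT_TEMPLATES:
        raise KeyError(f"unknown prompt kind: {kind!r}")
    # str.replace avoids str.format failing on braces inside user text.
    return PROMPT_TEMPLATES[kind].replace("{text}", text.strip())


if __name__ == "__main__":
    print(build_prompt("instruction", "Explain what a REST API is."))
```

Because the README states that the model does not generalize to unseen prompt formats, routing every request through one table like this keeps the headers byte-for-byte identical to the trained ones.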

config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "architectures": ["DPMMForCausalLM"],
+  "model_type": "dpmm",
+  "auto_map": {
+    "AutoConfig": "configuration_dpmm.DPMMConfig",
+    "AutoModelForCausalLM": "modeling_dpmm.DPMMForCausalLM"
+  },
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 0,
+  "hidden_size": 512,
+  "intermediate_size": 1408,
+  "num_attention_heads": 8,
+  "num_key_value_heads": 2,
+  "head_dim": 64,
+  "num_hidden_layers": 8,
+  "vocab_size": 32000,
+  "max_position_embeddings": 2048,
+  "rope_theta": 500000.0,
+  "rms_norm_eps": 1e-6,
+  "tie_word_embeddings": true,
+  "embedding_scale": true,
+  "qk_norm": true,
+  "z_loss_weight": 1e-5,
+  "scale_residual": true,
+  "moe_num_experts": 4,
+  "moe_num_shared_experts": 1,
+  "moe_top_k": 2,
+  "moe_router_type": "aux_loss_free",
+  "moe_router_bias_lr": 0.01,
+  "hidden_act": "silu",
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.45.0"
+}
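
The README's statement that GQA reduces the KV cache by 4x can be checked against the values in `config.json`. The sketch below assumes a conventional cache layout (one key and one value vector per KV head, per layer, per token) stored in bfloat16 at 2 bytes per value. `kv_cache_bytes_per_token` is a hypothetical helper. Note that the uploaded `modeling_dpmm.py` returns `past_key_values=None`, so this is the footprint a cached implementation would have, not a measurement of this code.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one token across all layers."""
    # Factor 2 = one key vector plus one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Values from config.json: 8 layers, 8 query heads, 2 KV heads, head_dim 64.
gqa = kv_cache_bytes_per_token(num_layers=8, num_kv_heads=2, head_dim=64)
mha = kv_cache_bytes_per_token(num_layers=8, num_kv_heads=8, head_dim=64)

print(gqa, mha, mha // gqa)  # 4096 16384 4
print(gqa * 2048)            # bytes for a full 2,048-token context
```

The ratio depends only on `num_attention_heads / num_key_value_heads`, which is 8 / 2 = 4 for this configuration.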

configuration_dpmm.py ADDED

	@@ -0,0 +1,63 @@

+"""DPMM-0.1B-MoE configuration for Hugging Face Transformers."""
+from transformers import PretrainedConfig
+class DPMMConfig(PretrainedConfig):
+    model_type = "dpmm"
+    def __init__(
+        self,
+        hidden_size=512,
+        intermediate_size=1408,
+        num_attention_heads=8,
+        num_key_value_heads=2,
+        head_dim=64,
+        num_hidden_layers=8,
+        vocab_size=32000,
+        max_position_embeddings=2048,
+        rope_theta=500000.0,
+        rms_norm_eps=1e-6,
+        tie_word_embeddings=True,
+        embedding_scale=True,
+        qk_norm=True,
+        z_loss_weight=1e-5,
+        scale_residual=True,
+        moe_num_experts=4,
+        moe_num_shared_experts=1,
+        moe_top_k=2,
+        moe_router_type="aux_loss_free",
+        moe_router_bias_lr=0.01,
+        hidden_act="silu",
+        bos_token_id=1,
+        eos_token_id=2,
+        pad_token_id=0,
+        **kwargs,
+    ):
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.num_hidden_layers = num_hidden_layers
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.rope_theta = rope_theta
+        self.rms_norm_eps = rms_norm_eps
+        self.embedding_scale = embedding_scale
+        self.qk_norm = qk_norm
+        self.z_loss_weight = z_loss_weight
+        self.scale_residual = scale_residual
+        self.moe_num_experts = moe_num_experts
+        self.moe_num_shared_experts = moe_num_shared_experts
+        self.moe_top_k = moe_top_k
+        self.moe_router_type = moe_router_type
+        self.moe_router_bias_lr = moe_router_bias_lr
+        self.hidden_act = hidden_act
+        super().__init__(
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            pad_token_id=pad_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )

generation_config.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 0,
+  "do_sample": true,
+  "temperature": 0.7,
+  "top_p": 0.9,
+  "top_k": 50,
+  "repetition_penalty": 1.1,
+  "max_new_tokens": 256,
+  "transformers_version": "4.45.0"
+}
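
To make the sampling settings above concrete, here is a simplified, standard-library sketch of what `temperature`, `top_k` and `top_p` conventionally mean when applied to one step's logits. It is not the Transformers implementation, it omits `repetition_penalty`, and the function names `filter_probs` and `sample` are hypothetical.

```python
import math
import random


def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_id: prob} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax (values < 1 sharpen).
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring token ids.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in zip(order, probs):
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: p / mass for i, p in kept}


def sample(logits, rng=random, **settings):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **settings)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With `top_k=1` the filter keeps a single token, so `sample` becomes greedy decoding regardless of the other settings.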

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b5f32db5455515adb41b17dd61b59dfd9100ecebbc0f45b35d493cf35c5c0f6a
+size 249110448
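
The 124.5M figure in the README and the file size above can be cross-checked from `config.json` and the layer definitions in `modeling_dpmm.py`. The arithmetic below is a sketch: `dpmm_param_count` is a hypothetical helper, and it assumes every value is stored at 2 bytes (bfloat16, as `torch_dtype` suggests). Under those assumptions the total only reaches about 124.5M when the output head is counted separately from the tied embedding. Counting the shared matrix once gives about 108.2M.

```python
def dpmm_param_count(hidden=512, inter=1408, heads=8, kv_heads=2, head_dim=64,
                     layers=8, vocab=32000, experts=4, shared=1,
                     count_lm_head=True):
    """Count stored values implied by config.json and modeling_dpmm.py."""
    attn = (hidden * heads * head_dim            # wq
            + 2 * hidden * kv_heads * head_dim   # wk and wv
            + heads * head_dim * hidden)         # wo
    qk_norm = 2 * head_dim                       # q_norm and k_norm weights
    expert = 3 * hidden * inter                  # w_gate, w_up, w_down
    moe = ((experts + shared) * expert           # routed + shared experts
           + hidden * experts                    # router gate
           + experts)                            # expert_bias buffer
    norms = 2 * hidden                           # attn_norm and ffn_norm
    per_layer = attn + qk_norm + moe + norms
    total = layers * per_layer + vocab * hidden + hidden  # + tok_emb, final norm
    if count_lm_head:
        total += vocab * hidden                  # lm_head counted as its own matrix
    return total


with_head = dpmm_param_count()
without_head = dpmm_param_count(count_lm_head=False)
print(with_head, without_head)        # 124544544 108160544
print(249110448 - 2 * with_head)      # 21360 bytes left over
```

The roughly 21 KB left over would be consistent with a safetensors header, though that is an inference from the numbers rather than something the commit states.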

modeling_dpmm.py ADDED

	@@ -0,0 +1,293 @@

+"""DPMM-0.1B-MoE model implementation for Hugging Face Transformers.
+Architecture: Transformer + Mixture of Experts (Shared + Routed)
+- GQA (Grouped Query Attention) with RoPE
+- QK-Norm (Gemma 2 / DeepSeek-V3 style)
+- SwiGLU experts with DeepSeek-V3 auxiliary-loss-free routing
+- Embedding scaling (sqrt(d_model))
+- Residual output projection scaling (1/sqrt(2L))
+"""
+import math
+from typing import Optional, Tuple, List
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch import Tensor
+from transformers import PreTrainedModel
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from .configuration_dpmm import DPMMConfig
+class RMSNorm(nn.Module):
+    def __init__(self, dim: int, eps: float = 1e-6):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(dim))
+        self.eps = eps
+    def forward(self, x: Tensor) -> Tensor:
+        norm = x.float().pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
+        return (x * norm).to(x.dtype) * self.weight
+def precompute_rope_freqs(dim: int, max_seq_len: int, theta: float = 500000.0):
+    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+    t = torch.arange(max_seq_len, dtype=torch.float32)
+    angles = torch.outer(t, freqs)
+    return angles.cos(), angles.sin()
+def _rotate_half(x: Tensor) -> Tensor:
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+def apply_rope(x: Tensor, rope_cos: Tensor, rope_sin: Tensor) -> Tensor:
+    seq_len = x.shape[1]
+    cos = rope_cos[:seq_len].unsqueeze(0).unsqueeze(2)
+    sin = rope_sin[:seq_len].unsqueeze(0).unsqueeze(2)
+    cos = torch.cat([cos, cos], dim=-1)
+    sin = torch.cat([sin, sin], dim=-1)
+    return (x.float() * cos + _rotate_half(x.float()) * sin).to(x.dtype)
+def repeat_kv(x: Tensor, n_rep: int) -> Tensor:
+    if n_rep == 1:
+        return x
+    bs, seq, n_kv, d = x.shape
+    return x[:, :, :, None, :].expand(bs, seq, n_kv, n_rep, d).reshape(bs, seq, n_kv * n_rep, d)
+class HeadRMSNorm(nn.Module):
+    def __init__(self, d_head: int, eps: float = 1e-6):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(d_head))
+        self.eps = eps
+    def forward(self, x: Tensor) -> Tensor:
+        norm = x.float().pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
+        return (x * norm).to(x.dtype) * self.weight
+class GQAttention(nn.Module):
+    def __init__(self, config: DPMMConfig):
+        super().__init__()
+        self.n_heads = config.num_attention_heads
+        self.n_kv_heads = config.num_key_value_heads
+        self.d_head = config.head_dim
+        self.n_rep = self.n_heads // self.n_kv_heads
+        self.wq = nn.Linear(config.hidden_size, self.n_heads * self.d_head, bias=False)
+        self.wk = nn.Linear(config.hidden_size, self.n_kv_heads * self.d_head, bias=False)
+        self.wv = nn.Linear(config.hidden_size, self.n_kv_heads * self.d_head, bias=False)
+        self.wo = nn.Linear(self.n_heads * self.d_head, config.hidden_size, bias=False)
+        self.q_norm = HeadRMSNorm(self.d_head) if config.qk_norm else None
+        self.k_norm = HeadRMSNorm(self.d_head) if config.qk_norm else None
+    def forward(self, x: Tensor, rope_cos: Tensor, rope_sin: Tensor,
+                mask: Optional[Tensor] = None) -> Tensor:
+        bs, seq_len, _ = x.shape
+        q = self.wq(x).view(bs, seq_len, self.n_heads, self.d_head)
+        k = self.wk(x).view(bs, seq_len, self.n_kv_heads, self.d_head)
+        v = self.wv(x).view(bs, seq_len, self.n_kv_heads, self.d_head)
+        if self.q_norm is not None:
+            q = self.q_norm(q)
+            k = self.k_norm(k)
+        q = apply_rope(q, rope_cos, rope_sin)
+        k = apply_rope(k, rope_cos, rope_sin)
+        k = repeat_kv(k, self.n_rep)
+        v = repeat_kv(v, self.n_rep)
+        q = q.transpose(1, 2)
+        k = k.transpose(1, 2)
+        v = v.transpose(1, 2)
+        scale = 1.0 / math.sqrt(self.d_head)
+        scores = torch.matmul(q, k.transpose(-2, -1)) * scale
+        if mask is not None:
+            scores = scores + mask
+        attn = torch.softmax(scores, dim=-1)
+        out = torch.matmul(attn, v)
+        out = out.transpose(1, 2).contiguous()
+        return self.wo(out.reshape(bs, seq_len, -1))
+class SwiGLUExpert(nn.Module):
+    def __init__(self, d_model: int, d_ffn: int):
+        super().__init__()
+        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)
+        self.w_up = nn.Linear(d_model, d_ffn, bias=False)
+        self.w_down = nn.Linear(d_ffn, d_model, bias=False)
+    def forward(self, x: Tensor) -> Tensor:
+        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
+class MoERouter(nn.Module):
+    def __init__(self, config: DPMMConfig):
+        super().__init__()
+        self.n_experts = config.moe_num_experts
+        self.top_k = config.moe_top_k
+        self.gate = nn.Linear(config.hidden_size, config.moe_num_experts, bias=False)
+        self.register_buffer("expert_bias", torch.zeros(config.moe_num_experts))
+    def forward(self, x: Tensor) -> Tuple[Tensor, Tensor]:
+        logits = self.gate(x)
+        scores = F.softmax(logits, dim=-1)
+        adjusted = scores + self.expert_bias.detach()
+        top_k_vals, top_k_idx = torch.topk(adjusted, self.top_k, dim=-1)
+        top_k_weights = torch.gather(scores, 1, top_k_idx)
+        top_k_weights = top_k_weights / (top_k_weights.sum(dim=-1, keepdim=True) + 1e-8)
+        return top_k_weights, top_k_idx
+class MoELayer(nn.Module):
+    def __init__(self, config: DPMMConfig):
+        super().__init__()
+        self.n_experts = config.moe_num_experts
+        self.top_k = config.moe_top_k
+        self.shared_experts = nn.ModuleList([
+            SwiGLUExpert(config.hidden_size, config.intermediate_size)
+            for _ in range(config.moe_num_shared_experts)
+        ])
+        self.routed_experts = nn.ModuleList([
+            SwiGLUExpert(config.hidden_size, config.intermediate_size)
+            for _ in range(config.moe_num_experts)
+        ])
+        self.router = MoERouter(config)
+    def forward(self, x: Tensor) -> Tensor:
+        bs, seq_len, d = x.shape
+        flat_x = x.reshape(-1, d)
+        shared_out = sum(expert(flat_x) for expert in self.shared_experts)
+        weights, indices = self.router(flat_x)
+        routed_out = torch.zeros_like(flat_x)
+        for k in range(self.top_k):
+            expert_idx = indices[:, k]
+            expert_w = weights[:, k]
+            for e in range(self.n_experts):
+                mask = expert_idx == e
+                if mask.any():
+                    token_input = flat_x[mask]
+                    token_output = self.routed_experts[e](token_input)
+                    routed_out[mask] += expert_w[mask].unsqueeze(-1) * token_output
+        return (shared_out + routed_out).reshape(bs, seq_len, d)
+class TransformerBlock(nn.Module):
+    def __init__(self, config: DPMMConfig):
+        super().__init__()
+        self.attn_norm = RMSNorm(config.hidden_size, config.rms_norm_eps)
+        self.attention = GQAttention(config)
+        self.ffn_norm = RMSNorm(config.hidden_size, config.rms_norm_eps)
+        self.moe = MoELayer(config)
+    def forward(self, x: Tensor, rope_cos: Tensor, rope_sin: Tensor,
+                mask: Optional[Tensor] = None) -> Tensor:
+        h = x + self.attention(self.attn_norm(x), rope_cos, rope_sin, mask)
+        out = h + self.moe(self.ffn_norm(h))
+        return out
+class DPMMForCausalLM(PreTrainedModel):
+    config_class = DPMMConfig
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["TransformerBlock"]
+    def __init__(self, config: DPMMConfig):
+        super().__init__(config)
+        self.config = config
+        self.embed_scale = config.hidden_size ** 0.5 if config.embedding_scale else 1.0
+        self.tok_emb = nn.Embedding(config.vocab_size, config.hidden_size)
+        self.layers = nn.ModuleList([
+            TransformerBlock(config) for _ in range(config.num_hidden_layers)
+        ])
+        self.norm = RMSNorm(config.hidden_size, config.rms_norm_eps)
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        if config.tie_word_embeddings:
+            self.lm_head.weight = self.tok_emb.weight
+        rope_cos, rope_sin = precompute_rope_freqs(
+            config.head_dim, config.max_position_embeddings, config.rope_theta
+        )
+        self.register_buffer("rope_cos", rope_cos, persistent=False)
+        self.register_buffer("rope_sin", rope_sin, persistent=False)
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.tok_emb
+    def set_input_embeddings(self, value):
+        self.tok_emb = value
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[Tuple[torch.Tensor]]] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **kwargs,
+    ) -> CausalLMOutputWithPast:
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        bs, seq_len = input_ids.shape
+        x = self.tok_emb(input_ids) * self.embed_scale
+        mask = torch.full((seq_len, seq_len), float("-inf"), device=x.device, dtype=x.dtype)  # match activation dtype so bf16 scores are not promoted
+        mask = torch.triu(mask, diagonal=1)
+        mask = mask.unsqueeze(0).unsqueeze(0)
+        for layer in self.layers:
+            x = layer(x, self.rope_cos, self.rope_sin, mask)
+        x = self.norm(x)
+        logits = self.lm_head(x)
+        loss = None
+        if labels is not None:
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            loss = F.cross_entropy(
+                shift_logits.view(-1, self.config.vocab_size),
+                shift_labels.view(-1),
+                ignore_index=-100,
+            )
+        if not return_dict:
+            output = (logits,)
+            return (loss,) + output if loss is not None else output
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=None,
+            hidden_states=None,
+            attentions=None,
+        )
+    def prepare_inputs_for_generation(self, input_ids, **kwargs):
+        return {"input_ids": input_ids}
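
The routing in `MoERouter.forward` can be traced by hand for a single token. The first function below mirrors the selection logic in the file: softmax scores, a bias added only for choosing experts, and weights renormalized from the unbiased scores. The second function is an assumption. The commit contains `moe_router_bias_lr` but no code that updates `expert_bias`, so `update_bias` sketches the sign-based rule commonly described for auxiliary-loss-free balancing and may differ from what was actually used in training. Both function names are hypothetical.

```python
import math


def route_token(logits, expert_bias, top_k=2):
    """Pick top_k experts using biased scores, weight them with unbiased ones."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]             # softmax over experts
    adjusted = [s + b for s, b in zip(scores, expert_bias)]
    chosen = sorted(range(len(scores)),
                    key=lambda e: adjusted[e], reverse=True)[:top_k]
    norm = sum(scores[e] for e in chosen) + 1e-8
    weights = [scores[e] / norm for e in chosen]   # bias never enters the weights
    return chosen, weights


def update_bias(expert_bias, tokens_per_expert, lr=0.01):
    """ASSUMED rule (not in this commit): nudge biases toward balanced load."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for bias, load in zip(expert_bias, tokens_per_expert):
        if load < mean_load:
            new_bias.append(bias + lr)   # underloaded: make it more attractive
        elif load > mean_load:
            new_bias.append(bias - lr)   # overloaded: make it less attractive
        else:
            new_bias.append(bias)
    return new_bias


logits = [2.0, 1.0, 0.5, 0.0]
print(route_token(logits, [0.0, 0.0, 0.0, 0.0])[0])  # [0, 1]
print(route_token(logits, [0.0, 0.0, 0.2, 0.0])[0])  # [0, 2]
```

In the second call the bias moves expert 2 ahead of expert 1 during selection, yet the mixing weights still come from the original softmax scores, which is why the bias can rebalance load without adding a loss term.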

special_tokens_map.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "unk_token": "<unk>",
+  "pad_token": "<unk>"
+}

tokenizer.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
+size 493443

tokenizer_config.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "pad_token": "<unk>",
+  "unk_token": "<unk>",
+  "model_max_length": 2048,
+  "clean_up_tokenization_spaces": false,
+  "tokenizer_class": "LlamaTokenizerFast",
+  "chat_template": "{% for message in messages %}{% if message['role'] == 'user' %}### Instruction:\n{{ message['content'] }}\n\n### Response:\n{% elif message['role'] == 'assistant' %}{{ message['content'] }}{% endif %}{% endfor %}"
+}
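
The `chat_template` above is compact enough to paraphrase in plain Python, which makes its behavior easier to see. This sketch assumes the Jinja rendering leaves the literal newlines in the template untouched. `render_chat` is a hypothetical name, and the real rendering is done by the tokenizer, not by this function.

```python
def render_chat(messages):
    """Plain-Python paraphrase of the chat_template in tokenizer_config.json."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(
                f"### Instruction:\n{message['content']}\n\n### Response:\n"
            )
        elif message["role"] == "assistant":
            parts.append(message["content"])
        # Any other role (for example "system") matches neither branch
        # and contributes nothing to the prompt.
    return "".join(parts)


if __name__ == "__main__":
    print(render_chat([
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Explain what a REST API is."},
    ]))
```

Two consequences follow from the template as written: system messages are dropped, and every user turn is rendered with the general `### Instruction:` header, so the domain-specific formats from the README have to be built manually.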