SmolLM2
Model Summary
SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. They can solve a wide range of tasks while remaining lightweight enough to run on-device.
SmolLM2 demonstrates significant advances over its predecessor SmolLM1, particularly in instruction following, knowledge, and reasoning. The 135M model was trained on 2 trillion tokens using a diverse mix of datasets: FineWeb-Edu, DCLM, and The Stack, along with new filtered datasets we curated and will release soon. We developed the instruct version through supervised fine-tuning (SFT) on a combination of public datasets and our own curated datasets. We then applied Direct Preference Optimization (DPO) using UltraFeedback.
The instruct model additionally supports tasks such as text rewriting, summarization, and function calling (for the 1.7B model), thanks to datasets developed by Argilla such as Synth-APIGen-v0.1. The SFT dataset is available at https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk and the fine-tuning code at https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2
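To make the DPO step mentioned above more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not the training code used for SmolLM2: the function name `dpo_loss`, the example log-probabilities, and the value `beta=0.1` are illustrative assumptions. The inputs are the summed token log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    beta=0.1 is an assumed value for illustration, not a SmolLM2 setting.
    """
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities, chosen only to show the behaviour.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy == reference
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))   # policy favours the chosen answer
```

When the policy matches the reference the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the chosen response than the reference does.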
How to use
Transformers
pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-135M-Instruct"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
messages = [{"role": "user", "content": "What is gravity?"}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))
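The `generate` call above samples with `temperature=0.2` and `top_p=0.9`. The sketch below illustrates what those two settings do to a next-token distribution, using plain Python lists instead of tensors. It is a simplified illustration, not the Transformers implementation, and the helper names `top_p_filter` and `sample` are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=0.2, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and nucleus filtering."""
    # A temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=0.2, top_p=0.9, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
print(top_p_filter(toy_logits))                   # low temperature: one token survives
print(top_p_filter(toy_logits, temperature=1.0))  # higher temperature: two survive
```

With the low temperature used in the example, nearly all probability mass lands on the top token, so the output is close to greedy decoding while still allowing some variation on flatter distributions.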
Chat in TRL
You can also use the TRL CLI to chat with the model from the terminal:
pip install trl
trl chat --model_name_or_path HuggingFaceTB/SmolLM2-135M-Instruct --device cpu
Evaluation
In this section, we report the evaluation results of SmolLM2. All evaluations are zero-shot unless stated otherwise, and we use lighteval to run them.
Base pre-trained model
| Metric | SmolLM2-135M-8k | SmolLM-135M |
|---|---|---|
| HellaSwag | 42.1 | 41.2 |
| ARC (Average) | 43.9 | 42.4 |
| PIQA | 68.4 | 68.4 |
| MMLU (cloze) | 31.5 | 30.2 |
| CommonsenseQA | 33.9 | 32.7 |
| TriviaQA | 4.1 | 4.3 |
| Winogrande | 51.3 | 51.3 |
| OpenBookQA | 34.6 | 34.0 |
| GSM8K (5-shot) | 1.4 | 1.0 |
Instruction model
| Metric | SmolLM2-135M-Instruct | SmolLM-135M-Instruct |
|---|---|---|
| IFEval (Average prompt/inst) | 29.9 | 17.2 |
| MT-Bench | 19.8 | 16.8 |
| HellaSwag | 40.9 | 38.9 |
| ARC (Average) | 37.3 | 33.9 |
| PIQA | 66.3 | 64.0 |
| MMLU (cloze) | 29.3 | 28.3 |
| BBH (3-shot) | 28.2 | 25.2 |
| GSM8K (5-shot) | 1.4 | 1.4 |
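As a quick way to read the instruction-model table, the following sketch computes the per-benchmark gain of SmolLM2-135M-Instruct over SmolLM-135M-Instruct. The scores are copied from the table above; the variable names are illustrative. The gains are not averaged because the benchmarks use different scales.

```python
# (SmolLM2-135M-Instruct, SmolLM-135M-Instruct) scores copied from the table above.
instruct_scores = {
    "IFEval (Average prompt/inst)": (29.9, 17.2),
    "MT-Bench": (19.8, 16.8),
    "HellaSwag": (40.9, 38.9),
    "ARC (Average)": (37.3, 33.9),
    "PIQA": (66.3, 64.0),
    "MMLU (cloze)": (29.3, 28.3),
    "BBH (3-shot)": (28.2, 25.2),
    "GSM8K (5-shot)": (1.4, 1.4),
}

# Absolute gain in points, rounded to the table's one-decimal precision.
gains = {name: round(new - old, 1) for name, (new, old) in instruct_scores.items()}

# Print benchmarks from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {gain:+.1f}")
```

The largest gain is on IFEval, which is consistent with the summary's emphasis on instruction following, while GSM8K is unchanged.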
Limitations
SmolLM2 models primarily understand and generate content in English. They can produce text on a variety of topics, but the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.
Training
Model
- Architecture: Transformer decoder
- Pretraining tokens: 2T
- Precision: bfloat16
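For a rough sense of what the precision means for storage, the sketch below estimates the size of the weights from the parameter count and the bits per weight. It is a back-of-the-envelope estimate under stated assumptions: 135M is the nominal parameter count from the model name, and the 4.5 bits-per-weight figure is taken from the name of the quantized repository (SmolLM2-135M-Instruct-4.5bpw-exl2). Real files will differ somewhat, for example because some tensors may be stored at higher precision and files carry metadata.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


NOMINAL_PARAMS = 135_000_000  # assumption: nominal count from the model name

bf16_bytes = weight_bytes(NOMINAL_PARAMS, 16)   # bfloat16 uses 16 bits per weight
exl2_bytes = weight_bytes(NOMINAL_PARAMS, 4.5)  # assumption: 4.5 bpw from the repo name

print(f"bfloat16: {bf16_bytes / 1e6:.0f} MB")
print(f"4.5 bpw:  {exl2_bytes / 1e6:.1f} MB")
```

Activations and the KV cache add to these figures at inference time, so they are a lower bound on memory use rather than a full requirement.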
Hardware
- GPUs: 64 H100
Software
- Training Framework: nanotron
License
Citation
@misc{allal2024SmolLM2,
title={SmolLM2 - with great data, comes great performance},
author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Lewis Tunstall and Agustín Piqueres and Andres Marafioti and Cyril Zakka and Leandro von Werra and Thomas Wolf},
year={2024},
}
Model tree for lilmeaty/SmolLM2-135M-Instruct-4.5bpw-exl2
Base model
HuggingFaceTB/SmolLM2-135M