How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:
# Run inference directly in the terminal:
llama-cli -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:
# Run inference directly in the terminal:
llama-cli -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:
# Run inference directly in the terminal:
./llama-cli -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:
# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/Mistral-Nemo-Instruct-2407-GGUF:
Use Docker
docker model run hf.co/unsloth/Mistral-Nemo-Instruct-2407-GGUF:
Quick Links

GGUF uploads

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Mistral Nemo 12b here: https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

Unsloth supports Free Notebooks Performance Memory use
Llama-3 8b ▶️ Start on Colab 2.4x faster 58% less
Gemma 7b ▶️ Start on Colab 2.4x faster 58% less
Mistral 7b ▶️ Start on Colab 2.2x faster 62% less
Llama-2 7b ▶️ Start on Colab 2.2x faster 43% less
TinyLlama ▶️ Start on Colab 3.9x faster 74% less
CodeLlama 34b A100 ▶️ Start on Colab 1.9x faster 27% less
Mistral 7b 1xT4 ▶️ Start on Kaggle 5x faster* 62% less
DPO - Zephyr ▶️ Start on Colab 1.9x faster 19% less
Downloads last month
842
GGUF
Model size
12B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support