Instructions for using infil00p/R-4B-GGUF with libraries and local apps.

Libraries

How to use infil00p/R-4B-GGUF with llama-cpp-python:

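# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the GGUF file from the Hub (cached after the first run) and load it
llm = Llama.from_pretrained(
    repo_id="infil00p/R-4B-GGUF",
    filename="R-4B-F16.gguf",
)

# Send one user turn that combines a text prompt with an image URL
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
)

# Print the assistant's reply
print(response["choices"][0]["message"]["content"])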

Local Apps

llama.cpp

How to use infil00p/R-4B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf infil00p/R-4B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf infil00p/R-4B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf infil00p/R-4B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf infil00p/R-4B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf infil00p/R-4B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf infil00p/R-4B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf infil00p/R-4B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf infil00p/R-4B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/infil00p/R-4B-GGUF:Q4_K_M


vLLM

How to use infil00p/R-4B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "infil00p/R-4B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "infil00p/R-4B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/infil00p/R-4B-GGUF:Q4_K_M

Ollama
How to use infil00p/R-4B-GGUF with Ollama:
```
ollama run hf.co/infil00p/R-4B-GGUF:Q4_K_M
```

Unsloth Studio

How to use infil00p/R-4B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for infil00p/R-4B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for infil00p/R-4B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for infil00p/R-4B-GGUF to start chatting

Docker Model Runner
How to use infil00p/R-4B-GGUF with Docker Model Runner:
```
docker model run hf.co/infil00p/R-4B-GGUF:Q4_K_M
```

Lemonade

How to use infil00p/R-4B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull infil00p/R-4B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.R-4B-GGUF-Q4_K_M

List all available models

lemonade list

R-4B-GGUF

This repository contains GGUF quantized versions of YannQi/R-4B, a state-of-the-art multimodal large language model designed for general-purpose auto-thinking.

⚠️ Important Notice: R-4B support is currently only available in a custom llama.cpp branch. Please use baseweight/llama.cpp (support-r-model branch) until R-4B support is merged upstream.

About R-4B

R-4B is a breakthrough multimodal LLM that autonomously switches between step-by-step thinking and direct response generation based on task complexity. This enables high-quality responses while significantly improving inference efficiency.

Key achievements:

#1 rank on the OpenCompass Multi-modal Reasoning Leaderboard among all open-source models
#1 rank under 20B parameters on the OpenCompass Multi-modal Academic Leaderboard

All credit for this amazing model goes to YannQi and the research team. Please see the original repository and arXiv paper for more details.

Quantization Information

These GGUF files were created from the original R-4B model and load with the llama.cpp branch named in the notice above.

Available Files

| Filename | Quant Type | File Size | Description | Use Case |
|---|---|---|---|---|
| `R-4B-F16.gguf` | F16 | 8.3 GB | Original precision | Best quality, highest VRAM usage |
| `R-4B-Q8_0.gguf` | Q8_0 | 4.4 GB | Very high quality | Excellent quality/size balance |
| `R-4B-Q6_K.gguf` | Q6_K | 3.4 GB | High quality | Good quality, moderate size |
| `R-4B-Q5_K_M.gguf` | Q5_K_M | 3.0 GB | Medium quality | Recommended for most users |
| `R-4B-Q4_K_M.gguf` | Q4_K_M | 2.6 GB | Good quality | Best size/quality compromise |
| `mmproj-R-4b-F16.gguf` | F16 | 780 MB | Vision projector | Required for vision tasks |

Important: The mmproj-R-4b-F16.gguf file is required for all vision-language tasks. Download it along with your chosen model quantization.
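
A minimal sketch of fetching both files from Python, assuming the `huggingface_hub` package is installed. The helper names `files_for` and `download_pair` are illustrative and not part of this repository; the filenames are the ones listed under Available Files.

```python
REPO_ID = "infil00p/R-4B-GGUF"
MMPROJ_FILE = "mmproj-R-4b-F16.gguf"  # vision projector, needed for image input
QUANTS = ("F16", "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M")


def files_for(quant: str) -> tuple[str, str]:
    """Return (model filename, projector filename) for a quantization type."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quantization {quant!r}; expected one of {QUANTS}")
    return f"R-4B-{quant}.gguf", MMPROJ_FILE


def download_pair(quant: str = "Q5_K_M") -> tuple[str, str]:
    """Download the model and the projector; return their local cache paths."""
    # Imported lazily so files_for() works even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    model_name, mmproj_name = files_for(quant)
    model_path = hf_hub_download(repo_id=REPO_ID, filename=model_name)
    mmproj_path = hf_hub_download(repo_id=REPO_ID, filename=mmproj_name)
    return model_path, mmproj_path
```

Calling `download_pair("Q4_K_M")` would return two paths that can be passed to `-m` and `--mmproj` respectively.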

Quantization Recommendations

Q4_K_M: Smallest file with good quality; the best fit when memory is tight
Q5_K_M: Recommended for most users; better quality at a modest size increase
Q6_K: High quality with larger size
Q8_0: Near-original quality, moderate compression
F16: Original precision, largest size
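
The trade-off above can be sketched as a small helper that picks the largest quantization whose files fit a memory budget. This is only a rough guide: the sizes are the file sizes listed under Available Files, the function name `pick_quant` is hypothetical, and real memory use is higher than file size because of the context (KV cache) and runtime overhead.

```python
from typing import Optional

# File sizes in GB, copied from the Available Files section.
MODEL_SIZES_GB = {"F16": 8.3, "Q8_0": 4.4, "Q6_K": 3.4, "Q5_K_M": 3.0, "Q4_K_M": 2.6}
MMPROJ_SIZE_GB = 0.78  # mmproj-R-4b-F16.gguf (780 MB)


def pick_quant(budget_gb: float, with_vision: bool = True) -> Optional[str]:
    """Return the largest quantization whose files fit in budget_gb, or None."""
    extra = MMPROJ_SIZE_GB if with_vision else 0.0
    fitting = [
        (size, name)
        for name, size in MODEL_SIZES_GB.items()
        if size + extra <= budget_gb
    ]
    # Larger files mean less compression, so prefer the biggest one that fits.
    return max(fitting)[1] if fitting else None


print(pick_quant(4.0))   # Q5_K_M
print(pick_quant(10.0))  # F16
```

Leaving a generous margin below the actual VRAM or RAM available is advisable, since this estimate counts file sizes only.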

Usage with llama.cpp

Prerequisites

Clone and build the custom llama.cpp branch with R-4B support:

git clone https://github.com/baseweight/llama.cpp.git
cd llama.cpp
git checkout support-r-model
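cmake -B build
cmake --build build --config Release -j
# Binaries (llama-cli, llama-server) are written to build/bin/;
# adjust the ./llama-cli and ./llama-server paths below accordingly.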

Then download both the model file and mmproj-R-4b-F16.gguf from this repository.

Basic Usage

# Text + Image inference
./llama-cli \
  -m R-4B-Q5_K_M.gguf \
  --mmproj mmproj-R-4b-F16.gguf \
  --image path/to/your/image.jpg \
  -p "Describe this image in detail."

Advanced Options

# With custom parameters
./llama-cli \
  -m R-4B-Q5_K_M.gguf \
  --mmproj mmproj-R-4b-F16.gguf \
  --image image.jpg \
  -p "What is happening in this image?" \
  -c 4096 \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9

Server Mode

# Run as API server
./llama-server \
  -m R-4B-Q5_K_M.gguf \
  --mmproj mmproj-R-4b-F16.gguf \
  --host 0.0.0.0 \
  --port 8080
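
Once the server is running, it can be queried from Python. The following is a sketch under stated assumptions, not a confirmed interface for this branch: it assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route on port 8080 and accepts images as base64 data URLs inside `image_url` parts. The names `build_payload` and `ask` are illustrative.

```python
import base64
import json
import urllib.request

# Assumed endpoint: port taken from the server command, route from the OpenAI-style API.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build a chat request with one text part and one inline (base64) image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{encoded}"},
                    },
                ],
            }
        ],
        "max_tokens": 512,
    }


def ask(prompt: str, image_path: str) -> str:
    """POST the prompt and image to the local server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(prompt, f.read())
    request = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `ask("Describe this image in detail.", "path/to/your/image.jpg")` would send the same request as the Basic Usage command, but over HTTP.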

R-4B Features

Adaptive Thinking Modes

R-4B supports three modes of operation:

Auto-thinking Mode: Automatically decides when to use step-by-step reasoning
Thinking Mode: Explicitly uses reasoning for complex tasks
Non-thinking Mode: Direct responses for simple queries
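
When the model does reason step by step, it can be useful to separate the reasoning from the final answer. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags; that tag format is an assumption not stated in this card, so check the actual model output before relying on it. The function name `split_thinking` is hypothetical.

```python
import re

# Assumed format: reasoning is enclosed in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer)."""
    # Collect the non-empty contents of every think block.
    blocks = [part.strip() for part in THINK_BLOCK.findall(text)]
    reasoning = "\n".join(block for block in blocks if block)
    # Whatever remains after removing the think blocks is the answer.
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>2 + 2 is 4</think>The answer is 4.")
print(answer)  # The answer is 4.
```

A direct (non-thinking) response passes through unchanged, with an empty reasoning string.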

Key Capabilities

General-purpose visual question answering
Complex logical reasoning and mathematical problem-solving
Adaptive computational efficiency
State-of-the-art performance on multimodal benchmarks

Citation

If you use this model in your research, please cite the original work:

@misc{yang2025r4bincentivizinggeneralpurposeautothinking,
      title={R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning},
      author={Qi Yang and Bolin Ni and Shiming Xiang and Han Hu and Houwen Peng and Jie Jiang},
      year={2025},
      eprint={2508.21113},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.21113},
}

Acknowledgements

This quantization repository was created to make R-4B more accessible to llama.cpp users. All credit for the original model development goes to:

YannQi and the R-4B research team
Original model available at YannQi/R-4B

The base R-4B model was developed using:

License

This model is released under the Apache 2.0 license, following the original R-4B model.

GGUF
Model size: 4B params
Architecture: qwen3
Hardware compatibility: 4-bit, 5-bit, 6-bit, 8-bit, 16-bit

Model tree for infil00p/R-4B-GGUF
Base model: Qwen/Qwen3-4B-Base
Finetuned: Qwen/Qwen3-4B