NousCoder-14B-AWQ
Model Description
NousCoder-14B-AWQ is a 4-bit AWQ (Activation-aware Weight Quantization) quantized version of NousResearch/NousCoder-14B.
This model specializes in competitive programming and coding tasks, achieving 67.87% Pass@1 on LiveCodeBench v6. It has been post-trained on Qwen3-14B using reinforcement learning on 24k verifiable coding problems.
Key Features
- 🔥 Specialized for Coding: Trained with RL on competitive programming problems
- ⚡ 4-bit Quantized: 66% smaller (9.4GB vs 28GB) with minimal quality loss
- 🚀 Fast Inference: Optimized for the AWQ Marlin kernel (roughly 2x faster than the plain AWQ kernel in the Inference Speed table below)
- 💻 Production Ready: Tested and verified for deployment
Model Stats
| Metric | Value |
|---|---|
| Base Model | Qwen3-14B |
| Quantization | 4-bit AWQ |
| Size | 9.4 GB (from 28 GB) |
| VRAM | ~6GB per GPU (2x GPUs) |
| Context Length | 16,384 tokens |
| LiveCodeBench v6 Pass@1 | 67.87% |
| Training | 24k coding problems (RL) |
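The size reduction quoted above follows directly from the two checkpoint sizes in the table. A quick check of the arithmetic, using only the 28 GB and 9.4 GB figures from this card (the function name is illustrative):

```python
# Checkpoint sizes taken from the Model Stats table above.
FP16_SIZE_GB = 28.0
AWQ_SIZE_GB = 9.4


def size_reduction_percent(original_gb: float, quantized_gb: float) -> float:
    """Return how much smaller the quantized checkpoint is, in percent."""
    return (1.0 - quantized_gb / original_gb) * 100.0


reduction = size_reduction_percent(FP16_SIZE_GB, AWQ_SIZE_GB)
print(f"{reduction:.1f}% smaller")  # 66.4% smaller
```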
Usage
With AutoAWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load the 4-bit AWQ checkpoint and its tokenizer
model = AutoAWQForCausalLM.from_quantized(
    "froogai/NousCoder-14B-AWQ",
    device_map="auto",
    safetensors=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "froogai/NousCoder-14B-AWQ",
    trust_remote_code=True,
)

# Generate code
prompt = "Write a Python function to implement binary search:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # AWQ kernels run on CUDA devices
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature and top_p to take effect
    temperature=0.2,
    top_p=0.95,
)
# Decode only the newly generated tokens, skipping the prompt
code = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(code)
With vLLM (Recommended for Production)
python -m vllm.entrypoints.openai.api_server \
    --model froogai/NousCoder-14B-AWQ \
    --quantization awq_marlin \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code
With OpenAI-Compatible API
# Start vLLM server
vllm serve froogai/NousCoder-14B-AWQ \
    --quantization awq_marlin \
    --tensor-parallel-size 2

# Make API requests
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "froogai/NousCoder-14B-AWQ",
        "prompt": "def quicksort(arr):",
        "max_tokens": 512
    }'
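The same request can be made from Python. The sketch below uses only the standard library and assumes the vLLM server from the example above is listening on localhost:8000; the helper names `build_completion_request` and `complete` are illustrative and not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same host and port as the curl example
MODEL = "froogai/NousCoder-14B-AWQ"


def build_completion_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 512) -> str:
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(build_completion_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Example (requires the server to be running):
#   print(complete("def quicksort(arr):"))
```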
Quantization Details
This model was quantized using AutoAWQ with the following configuration:
{
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM",
}
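To make the fields concrete: w_bit 4 means each weight is stored as an integer from 0 to 15, q_group_size 128 means every run of 128 weights shares one scale, and zero_point enables an asymmetric offset per group. The sketch below is a plain round-to-nearest illustration of that storage scheme in pure Python. It is not AutoAWQ's implementation and omits the activation-aware scaling search that distinguishes AWQ; all function names are hypothetical.

```python
import math


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of float weights."""
    q_max = 2 ** w_bit - 1  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        scale = 1.0  # constant group: any scale works, avoid dividing by zero
    zero = min(q_max, max(0, round(-w_min / scale)))
    q = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map stored integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split a weight row into groups, each with its own scale and zero point."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


# Toy weight row of 256 values, so it splits into two groups of 128.
row = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_row(row)
print(len(groups), "groups")  # 2 groups
```

For groups whose range spans zero, the reconstruction error of each weight is at most half the group's scale, which is why smaller groups (more scales) trade storage for accuracy.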
Calibration
- Dataset: pileval (128 samples)
- Method: Activation-aware Weight Quantization
- Preservation: Coding intelligence maintained through careful calibration
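A minimal sketch of how a quantization run with this configuration might look using AutoAWQ's documented `from_pretrained`, `quantize`, and `save_quantized` calls. The exact script used for this checkpoint is not published here, so the source path, output directory, and the `quantize` wrapper function are assumptions; the calibration sample count is left at the library default.

```python
# Configuration copied from the Quantization Details section above.
QUANT_CONFIG = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}


def quantize(source_model: str = "NousResearch/NousCoder-14B",
             output_dir: str = "NousCoder-14B-AWQ") -> None:
    """Quantize the FP16 model with AutoAWQ and save the result (needs a GPU)."""
    # Imported lazily so the config can be inspected without AutoAWQ installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(source_model)
    tokenizer = AutoTokenizer.from_pretrained(source_model)

    # "pileval" is the calibration dataset named in the Calibration list.
    model.quantize(tokenizer, quant_config=QUANT_CONFIG, calib_data="pileval")

    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Calling `quantize()` downloads the full-precision weights, so it needs enough disk space and memory for the 28 GB checkpoint.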
Performance
Benchmarks
| Benchmark | Score |
|---|---|
| LiveCodeBench v6 Pass@1 | 67.87% |
| Base Model (Qwen3-14B) Pass@1 | 60.79% |
| Improvement | +7.08 percentage points |
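The improvement row is a difference between two percentages, so it is best read as percentage points. The short calculation below, using only the two scores from the table, also gives the relative gain for comparison:

```python
# Pass@1 scores from the Benchmarks table above.
NOUSCODER_PASS_AT_1 = 67.87
BASE_PASS_AT_1 = 60.79

absolute_gain = NOUSCODER_PASS_AT_1 - BASE_PASS_AT_1    # percentage points
relative_gain = absolute_gain / BASE_PASS_AT_1 * 100.0  # percent of the base score

print(f"+{absolute_gain:.2f} percentage points")  # +7.08 percentage points
print(f"+{relative_gain:.1f}% relative")          # +11.6% relative
```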
Inference Speed
| Hardware | Speed (tokens/sec) |
|---|---|
| 2x RTX 5060 Ti (awq_marlin) | 15-25 |
| 2x RTX 5060 Ti (awq) | 8-12 |
| Single A100 (awq_marlin) | 40-60 |
Memory Usage
| Configuration | VRAM Usage |
|---|---|
| 2x RTX 5060 Ti (TP=2) | ~6GB per GPU |
| Single RTX 5060 Ti | ~12GB |
| Single A100 | ~12GB |
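A back-of-the-envelope estimate helps when sizing other configurations. The sketch below adds the 9.4 GB of weights to an FP16 KV cache for one full 16,384-token sequence and divides by the tensor-parallel degree. The attention layout (40 layers, 8 KV heads, head dimension 128) is an assumption about Qwen3-14B that should be checked against the model's config.json, and the estimate ignores activations and runtime overhead.

```python
WEIGHTS_GB = 9.4         # checkpoint size from this card
CONTEXT_TOKENS = 16_384  # context length from this card

# Assumed Qwen3-14B attention layout; verify against config.json.
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 2   # FP16/BF16 cache


def kv_cache_gb(tokens: int) -> float:
    """KV cache size in GB for one sequence of the given length."""
    # Factor 2 covers keys and values.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE
    return per_token * tokens / 1e9


def vram_per_gpu_gb(tensor_parallel: int, tokens: int = CONTEXT_TOKENS) -> float:
    """Weights plus KV cache, split evenly across tensor-parallel GPUs."""
    return (WEIGHTS_GB + kv_cache_gb(tokens)) / tensor_parallel


for tp in (1, 2):
    print(f"TP={tp}: ~{vram_per_gpu_gb(tp):.1f} GB per GPU")
# TP=1: ~12.1 GB per GPU
# TP=2: ~6.0 GB per GPU
```

Serving engines such as vLLM preallocate cache up to the configured memory utilization, so observed usage on a running server is typically higher than this lower bound.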
Best Use Cases
This model excels at:
- ✅ Competitive programming problems
- ✅ Algorithm implementation
- ✅ Data structure design
- ✅ Code debugging and optimization
- ✅ Technical interview preparation
- ✅ LeetCode-style challenges
Recommended Generation Parameters
For coding tasks, use these settings:
- temperature: 0.1-0.3 (low, for more consistent code output)
- top_p: 0.95
- max_tokens: 2048+ (for complete solutions)
- presence_penalty: 0.0
- frequency_penalty: 0.0
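These settings map directly onto the fields of an OpenAI-compatible chat completion request. A small sketch of a payload builder that applies them; the `coding_request` helper is hypothetical, and the range check simply mirrors the recommendation above:

```python
def coding_request(prompt: str, temperature: float = 0.2, max_tokens: int = 2048) -> dict:
    """Build a chat completion payload using the recommended coding settings."""
    if not 0.1 <= temperature <= 0.3:
        raise ValueError("temperature outside the recommended 0.1-0.3 range")
    return {
        "model": "froogai/NousCoder-14B-AWQ",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": 0.95,
        "max_tokens": max_tokens,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
    }


payload = coding_request("Implement Dijkstra's algorithm in Python.")
print(payload["temperature"], payload["top_p"], payload["max_tokens"])  # 0.2 0.95 2048
```

The resulting dictionary can be serialized with `json.dumps` and posted to the server's `/v1/chat/completions` endpoint.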
Hardware Requirements
Minimum Requirements
- VRAM: 12GB (single GPU) or 6GB per GPU (2 GPUs with tensor parallelism)
- RAM: 24GB
- Storage: 10GB
Recommended Requirements
- GPUs: 2x NVIDIA RTX 5060 Ti 16GB (32GB total VRAM)
- RAM: 128GB
- Storage: 20GB (for model + cache)
Compatible Hardware
- NVIDIA GPUs with compute capability 8.0+ (Ampere or newer) for AWQ Marlin; the plain AWQ kernels accept 7.5+ (Turing or newer)
- CUDA 11.8 or newer
- Python 3.10+
Limitations
- Quantized model may have slight accuracy degradation compared to FP16
- Requires AWQ-compatible libraries (AutoAWQ or vLLM)
- Best performance on NVIDIA GPUs (CPU inference slower)
Training Details
Base Model
- Architecture: Qwen3-14B
- Parameters: 14B
- License: Apache 2.0
Post-Training
- Method: Reinforcement Learning
- Dataset: 24k verifiable coding problems
- Hardware: 48 B200 GPUs
- Duration: 4 days
- Framework: Atropos (NousResearch training system)
Acknowledgments
- Original Model: NousResearch/NousCoder-14B
- Base Model: Qwen/Qwen3-14B
- Quantization: AutoAWQ library
- Training Team: Joe Li (@JoeLi5050) at NousResearch
Citation
If you use this model, please cite:
@misc{nouscoder_14b_2025,
title={NousCoder-14B: Competitive Programming AI Model},
author={Li, Joe},
organization={NousResearch},
year={2025},
month={January},
url={https://huggingface.co/NousResearch/NousCoder-14B}
}
License
This model is licensed under the Apache 2.0 License. See the LICENSE file for details.
Model Card Authors
Quantized by: froogai
For questions or issues, please:
- Open an issue on the model repository
- Contact: HuggingFace profile
Note: This is a quantized version of the original model. For best performance, use the vLLM inference engine with the awq_marlin quantization backend.