# Qwen3-4B PokerBench GRPO
Qwen3-4B fine-tuned for poker decision-making using GRPO (Group Relative Policy Optimization) reinforcement learning.
## Model Description
This model was trained in two stages:
- SFT Stage: Fine-tuned on high-quality reasoning traces.
- GRPO Stage: Further refined with reinforcement learning, using LLM-as-judge rewards on PokerBench scenarios.

The GRPO stage optimizes the model to produce better poker decisions by sampling multiple responses per scenario and reinforcing those rated higher by a judge model.
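The core of this comparison can be sketched as a group-relative advantage computation: judge rewards for a group of sampled responses are normalized against the group's own mean and standard deviation (a minimal illustrative sketch, not the exact training code; the reward values are hypothetical).

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Function name and reward values are illustrative, not from the training code.

def group_relative_advantages(rewards):
    """Normalize judge rewards within one group of sampled responses.

    Responses scoring above the group mean get positive advantages
    (reinforced); those below the mean get negative advantages.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Example: judge scores for 4 generations of the same poker scenario
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```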
## Training Details
- Base Model: Qwen/Qwen3-4B-Thinking-2507
- SFT Data: High-quality reasoning traces on PokerBench
- GRPO Checkpoint: 1650 steps
- Reward Signal: LLM-as-judge (poker action correctness)
- Method: LoRA (r=64) with GRPO
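A LoRA adapter at the reported rank could be configured with `peft` roughly as follows (a sketch: only `r=64` comes from the details above; the alpha, dropout, and target modules are assumptions, not the actual training config).

```python
# Hypothetical LoRA configuration; only r=64 is reported above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # rank reported in Training Details
    lora_alpha=128,     # assumed 2*r scaling
    lora_dropout=0.05,  # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```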
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("YiPz/qwen3-4b-pokerbench-grpo")
tokenizer = AutoTokenizer.from_pretrained("YiPz/qwen3-4b-pokerbench-grpo")

messages = [
    {"role": "system", "content": "You are an expert poker player. Analyze the situation and provide your action."},
    {"role": "user", "content": "Your poker scenario..."},
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# do_sample=True is required for temperature/top_p to take effect
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Output Format
The model outputs structured reasoning followed by an action:
```
<think>
1. Position analysis: We are on the button...
2. Hand strength: AKo is a premium hand...
3. Stack considerations: With 100bb effective...
4. Action recommendation: We should 3-bet...
</think>
<action>raise 15</action>
```
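For downstream use, the final action can be pulled out of the generated text with a small helper (a sketch assuming the `<action>verb [amount]</action>` format shown above):

```python
import re

def parse_action(text):
    """Extract the model's action from its <action>...</action> tag.

    Returns a (verb, amount) tuple, e.g. ("raise", 15.0) or ("fold", None),
    or None if no action tag is found.
    """
    m = re.search(r"<action>\s*(\w+)(?:\s+([\d.]+))?\s*</action>", text)
    if not m:
        return None
    verb, amount = m.group(1), m.group(2)
    return verb, float(amount) if amount else None

print(parse_action("...</think>\n<action>raise 15</action>"))  # ('raise', 15.0)
```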
## GGUF Versions
Quantized GGUF versions for llama.cpp/Ollama: YiPz/qwen3-4b-pokerbench-grpo-gguf
## Related Models
- GGUF: YiPz/qwen3-4b-pokerbench-grpo-gguf - Quantized versions