# Qwen3-4B PokerBench GRPO
Qwen3-4B fine-tuned for poker decision-making using GRPO (Group Relative Policy Optimization) reinforcement learning.
## Model Description
This model was trained in two stages:
- SFT Stage: Fine-tuned on high-quality reasoning traces.
- GRPO Stage: Further refined with reinforcement learning, using LLM-as-judge rewards on PokerBench scenarios.

The GRPO stage optimizes the model to produce better poker decisions by sampling multiple responses per scenario and reinforcing those rated higher by a judge model.
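The core of this comparison can be sketched as a group-relative advantage computation: judge rewards for a group of sampled responses are normalized against the group's own mean and standard deviation (a minimal illustrative sketch, not the exact training code; the reward values are hypothetical).

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Function name and reward values are illustrative, not from the training code.

def group_relative_advantages(rewards):
    """Normalize judge rewards within one group of sampled responses.

    Responses scoring above the group mean get positive advantages
    (reinforced); those below the mean get negative advantages.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Example: judge scores for 4 generations of the same poker scenario
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```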
## Training Details
- Base Model: Qwen/Qwen3-4B-Thinking-2507
- SFT Data: High-quality reasoning traces on PokerBench
- GRPO Checkpoint: 1650 steps
- Reward Signal: LLM-as-judge (poker action correctness)
- Method: LoRA (r=64) with GRPO
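A LoRA adapter at the reported rank could be configured with `peft` roughly as follows (a sketch: only `r=64` comes from the details above; the alpha, dropout, and target modules are assumptions, not the actual training config).

```python
# Hypothetical LoRA configuration; only r=64 is reported above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # rank reported in Training Details
    lora_alpha=128,     # assumed 2*r scaling
    lora_dropout=0.05,  # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```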
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("YiPz/qwen3-4b-pokerbench-grpo")
tokenizer = AutoTokenizer.from_pretrained("YiPz/qwen3-4b-pokerbench-grpo")

messages = [
    {"role": "system", "content": "You are an expert poker player. Analyze the situation and provide your action."},
    {"role": "user", "content": "Your poker scenario..."},
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# do_sample=True is required for temperature/top_p to take effect
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Output Format
The model outputs structured reasoning followed by an action:
```
<think>
1. Position analysis: We are on the button...
2. Hand strength: AKo is a premium hand...
3. Stack considerations: With 100bb effective...
4. Action recommendation: We should 3-bet...
</think>
<action>raise 15</action>
```
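For downstream use, the final action can be pulled out of the generated text with a small helper (a sketch assuming the `<action>verb [amount]</action>` format shown above):

```python
import re

def parse_action(text):
    """Extract the model's action from its <action>...</action> tag.

    Returns a (verb, amount) tuple, e.g. ("raise", 15.0) or ("fold", None),
    or None if no action tag is found.
    """
    m = re.search(r"<action>\s*(\w+)(?:\s+([\d.]+))?\s*</action>", text)
    if not m:
        return None
    verb, amount = m.group(1), m.group(2)
    return verb, float(amount) if amount else None

print(parse_action("...</think>\n<action>raise 15</action>"))  # ('raise', 15.0)
```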
## GGUF Versions
Quantized GGUF versions for llama.cpp/Ollama: YiPz/qwen3-4b-pokerbench-grpo-gguf
## Related Models
- GGUF: YiPz/qwen3-4b-pokerbench-grpo-gguf - Quantized versions