Qwen3-4B PokerBench GRPO

Qwen3-4B fine-tuned for poker decision-making using GRPO (Group Relative Policy Optimization) reinforcement learning.

Model Description

This model was trained in two stages:

  1. SFT Stage: Fine-tuned on high-quality reasoning traces.
  2. GRPO Stage: Further refined using reinforcement learning with LLM-as-judge rewards on PokerBench scenarios.

The GRPO training optimizes the model to produce better poker decisions by comparing multiple response generations and reinforcing those rated higher by a judge model.
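
The group comparison can be made concrete with a small sketch. It assumes the commonly described GRPO formulation, in which each response's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, the `eps` term, and the example judge scores are illustrative and are not taken from this model's training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize judge rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps (an assumed stabilizer) avoids division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical judge scores for four sampled responses to one poker scenario.
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses scored above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. When every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.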

Training Details

  • Base Model: Qwen/Qwen3-4B-thinking-2507
  • SFT Data: High-quality reasoning traces on PokerBench
  • GRPO Checkpoint: 1650 steps
  • Reward Signal: LLM-as-judge (poker action correctness)
  • Method: LoRA (r=64) with GRPO

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YiPz/qwen3-4b-pokerbench-grpo"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are an expert poker player. Analyze the situation and provide your action."},
    {"role": "user", "content": "Your poker scenario..."}
]

# return_dict=True returns input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# temperature and top_p only take effect when sampling is enabled.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Output Format

The model outputs structured reasoning followed by an action:

<think>
1. Position analysis: We are on the button...
2. Hand strength: AKo is a premium hand...
3. Stack considerations: With 100bb effective...
4. Action recommendation: We should 3-bet...
</think>

<action>raise 15</action>
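
To use the decision programmatically, the `<action>` tag can be extracted from the generated text. The helper below is a minimal sketch: `parse_action` is a hypothetical name, and it assumes the tag always holds an action word optionally followed by a numeric amount, as in the example above. The full set of actions and amount formats the model emits is not documented here.

```python
import re

# Assumed shape: <action>WORD [NUMBER]</action>, e.g. "raise 15" or "fold".
ACTION_RE = re.compile(
    r"<action>\s*(\w+)(?:\s+(\d+(?:\.\d+)?))?\s*</action>", re.IGNORECASE
)

def parse_action(text):
    """Return (action, amount) from the last <action> tag, or None if absent."""
    matches = ACTION_RE.findall(text)
    if not matches:
        return None
    name, amount = matches[-1]  # take the final tag in case reasoning quotes one
    return (name.lower(), float(amount) if amount else None)

result = parse_action("<think>...</think>\n\n<action>raise 15</action>")
```

Returning `None` when no tag is found lets the caller decide whether to retry generation, for instance when `max_new_tokens` cut the response off before the action was written.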

GGUF Versions

Quantized GGUF versions for llama.cpp/Ollama: YiPz/qwen3-4b-pokerbench-grpo-gguf
