Spaces: williyam/agentic-rag-gym
williyam committed on Apr 24

Commit 7e07285
1 Parent(s): 5c82f3d

feat: GRPO fine-tuning pipeline + trained model for aerospace RAG

- Add Jupyter notebook (agentic-rag-for-aerospace-research.ipynb) with
full GRPO training pipeline: dataset collection, baseline eval, training,
post-training eval, plots, and HF Hub push
- Add training/ package (config, dataset, reward, evaluate modules)
- Add train.py standalone training script
- Add training plots: training_curves.png, baseline_vs_trained.png,
score_distribution.png
- GRPO-trained Qwen2.5-0.5B with LoRA (r=16, alpha=32)
- Baseline: 0.558 -> Trained: 0.586 (+0.028 improvement)
- Model pushed to williyam/agentic-rag-aerospace-grpo on HF Hub
- Update README with training section, results, and plot embeds

Files changed (13)

.gitignore +4 -0
README.md +78 -0
agentic-rag-for-aerospace-research.ipynb +3 -0
plots/baseline_vs_trained.png +3 -0
plots/eval_results.json +84 -0
plots/score_distribution.png +3 -0
plots/training_curves.png +3 -0
train.py +227 -0
training/__init__.py +1 -0
training/config.py +40 -0
training/dataset.py +118 -0
training/evaluate.py +192 -0
training/reward.py +87 -0

.gitignore CHANGED

@@ -25,3 +25,7 @@ data/uploads/
 .DS_Store
 Thumbs.db
 node_modules/

+# Training checkpoints (large files)
+checkpoints/
+.venv-1/

README.md CHANGED

@@ -291,6 +291,80 @@ Update `server/app.py` to use your domain config instead of `AerospaceDomainConf
 ---
+## GRPO Fine-Tuning (Reinforcement Learning)
+We fine-tune **Qwen2.5-0.5B-Instruct** using **Group Relative Policy Optimization (GRPO)** from TRL,
+with LoRA adapters and the **real domain graders** as the reward signal — no proxy rewards.
+### Training Results
+| Metric | Baseline | GRPO-Trained | Improvement |
+|--------|----------|-------------|-------------|
+| **Mean Score** | 0.5580 | 0.5860 | **+0.0280** |
+| Propulsion Comparison | 0.508 | 0.562 | +0.053 |
+| Debris Mitigation | 0.633 | 0.689 | +0.056 |
+| Hypersonic Vehicle | 0.482 | 0.521 | +0.039 |
+| Mars EDL | 0.574 | 0.568 | -0.006 |
+| Life Support | 0.592 | 0.590 | -0.002 |
+### Training Curves
+![Training Curves](plots/training_curves.png)
+### Baseline vs. GRPO-Trained
+![Baseline vs Trained](plots/baseline_vs_trained.png)
+### Score Distribution
+![Score Distribution](plots/score_distribution.png)
+### Run Training (Notebook)
+The primary training interface is the Jupyter notebook:
+```bash
+jupyter notebook agentic-rag-for-aerospace-research.ipynb
+```
+### Run Training (Script)
+For headless/CI environments:
+```bash
+python train.py
+```
+### Configuration
+| Parameter | Value |
+|-----------|-------|
+| Base Model | `Qwen/Qwen2.5-0.5B-Instruct` |
+| Method | GRPO (Group Relative Policy Optimization) |
+| LoRA | r=16, α=32, targets=q/k/v/o_proj |
+| Optimizer | AdamW (torch) |
+| Learning Rate | 5e-6 |
+| Epochs | 2 |
+| Group Size (G) | 4 |
+| Max Completion | 512 tokens |
+| Hardware | Apple M1 Pro (MPS) |
+| Training Time | ~116 min |
+### Fine-Tuned Model
+The GRPO-trained model is available on Hugging Face:
+**[williyam/agentic-rag-aerospace-grpo](https://huggingface.co/williyam/agentic-rag-aerospace-grpo)**
+```python
+from peft import AutoPeftModelForCausalLM
+from transformers import AutoTokenizer
+model = AutoPeftModelForCausalLM.from_pretrained("williyam/agentic-rag-aerospace-grpo")
+tokenizer = AutoTokenizer.from_pretrained("williyam/agentic-rag-aerospace-grpo")
+```
+---
 ## Testing
 ```bash
@@ -344,9 +418,13 @@ agentic-rag-gym/
 ├── server/                  # FastAPI + Gradio server
 ├── domains/aerospace/       # Aerospace research domain
 ├── domains/legal_research/  # Legal research domain (stub)
+├── training/                # GRPO training package
 ├── tests/                   # Unit & integration tests (102+)
 ├── .github/workflows/       # CI pipeline
 ├── documents/               # Architecture & design docs
+├── plots/                   # Training curves & evaluation plots
+├── agentic-rag-for-aerospace-research.ipynb  # GRPO training notebook
+├── train.py                 # Standalone training script
 ├── inference.py             # Baseline inference script
 ├── openenv.yaml             # OpenEnv specification
 ├── Dockerfile               # Container definition

agentic-rag-for-aerospace-research.ipynb ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e8b90675048b0211601e7e8283ad1a2c5e41ff8f6fe5e0f5620b1be16d30993e
+size 10531022

plots/baseline_vs_trained.png ADDED

Git LFS Details

SHA256: 62a092c9597ce70ea59b6dd87462ba53552c0f31504f3be615fedf03a0f219e3
Pointer size: 130 Bytes
Size of remote file: 85.1 kB

plots/eval_results.json ADDED

	@@ -0,0 +1,84 @@

+{
+  "model": "Qwen/Qwen2.5-0.5B-Instruct",
+  "finetuned": "williyam/agentic-rag-aerospace-grpo",
+  "training_time_min": 116.41443806489309,
+  "baseline": [
+    {
+      "task_id": "aero_hard_hypersonic_vehicle",
+      "task_name": "Reusable Hypersonic Space Access Vehicle",
+      "difficulty": "hard",
+      "answer": "### Conceptual Design Overview\n\n#### Introduction\nReusability of a hypersonic space access vehicle (SAV) with a combined-cycle propulsion (CCP) system is crucial for reducing operational costs, increasing payload capacity, and enhancing safety. The proposed SAV utilizes a combination of scramjet, turbine, and rocket engines, designed to achieve high-speed, long-duration hypersonic travel. The propulsion mode transitions between scramjets and rockets, allowing for efficient use of both technologi",
+      "score": 0.48219047619047617
+    },
+    {
+      "task_id": "aero_easy_propulsion_comparison",
+      "task_name": "Compare Propulsion Technologies",
+      "difficulty": "easy",
+      "answer": "### Comparison of Propulsion Technologies for a Mars Transit Mission\n\n#### Specific Impulse and Thrust Levels\n\n**Ion Propulsion (IP):**\n- **Specific Impulse:** 4,190 seconds (NASA)\n- **Thrust Level:** Very Low (0.5 N for NEXT)\n\n**Nuclear Thermal Propulsion (NTP):**\n- **Specific Impulse:** 850-1,000 seconds (NASA)\n- **Thrust Level:** Extremely High (approximately 1,000 seconds)\n\n**Comparison:**\n- **Specific Impulse:** IP > NTP\n- **Thrust Level:** IP < NTP\n\n**Impact on Mission Duration:**\n- **IP:*",
+      "score": 0.5082857142857143
+    },
+    {
+      "task_id": "aero_medium_mars_edl",
+      "task_name": "Mars EDL Architecture Design",
+      "difficulty": "medium",
+      "answer": "### Overview of the EDL Architecture Design\n\n#### 1. **Entry Phase**\n   - **Aerocapture**: \n     - **Objective**: To capture the spacecraft into the Martian atmosphere.\n     - **Data**: The spacecraft must be designed to achieve a high velocity relative to the Martian atmosphere to avoid atmospheric drag. This typically involves using a combination of parachutes, airbags, and other aerodynamic devices.\n     - **Quantitative Evidence**: The aerocapture trajectory should be optimized to minimize e",
+      "score": 0.5742857142857143
+    },
+    {
+      "task_id": "aero_medium_life_support",
+      "task_name": "Deep Space Life Support Design",
+      "difficulty": "medium",
+      "answer": "### Deep Space Life Support Design for a 6-Person Crew on a 2.5-Year Mars Mission\n\n#### Overview\nThis design integrates physicochemical and bioregenerative systems to maximize closure ratios and address various aspects of life support on a 2.5-year Mars mission. The goal is to ensure the survival and health of the crew while minimizing resource consumption and environmental impact.\n\n#### System Components\n1. **Physicochemical Life Support (PLS)**\n   - **Thermophilic Bacteria for Waste Liquefacti",
+      "score": 0.5917142857142857
+    },
+    {
+      "task_id": "aero_easy_debris_mitigation",
+      "task_name": "Space Debris Mitigation Strategies",
+      "difficulty": "easy",
+      "answer": "### Space Debris Mitigation Strategies\n\n#### Current Challenges and Active Debris Removal Technologies\n\n**Current Challenges:**\n1. **Kessler Syndrome Risk:** Over 36,500 objects >10 cm in LEO reach critical densities, leading to significant collisions that threaten orbit stability.\n2. **Tracking Capabilities:** Limited ground-based and orbital tracking capabilities, especially for larger objects like satellites and asteroids.\n3. **Proposed Removal Methods:**\n   - **Electrodynamic Tethers:** Gene",
+      "score": 0.6334285714285715
+    }
+  ],
+  "trained": [
+    {
+      "task_id": "aero_hard_hypersonic_vehicle",
+      "task_name": "Reusable Hypersonic Space Access Vehicle",
+      "difficulty": "hard",
+      "answer": "### Conceptual Design Overview\n\n#### Introduction\nReusability of a hypersonic space access vehicle (SAV) with a combined-cycle propulsion (CCP) system is crucial for reducing operational costs, increasing payload capacity, and enhancing safety. The proposed SAV utilizes a combination of scramjet, turbine, and rocket propulsion modes, aiming to achieve high-speed, long-duration flights within the hypersonic regime. This approach addresses key aerothermodynamic challenges such as shock-boundary la",
+      "score": 0.5214285714285715
+    },
+    {
+      "task_id": "aero_easy_propulsion_comparison",
+      "task_name": "Compare Propulsion Technologies",
+      "difficulty": "easy",
+      "answer": "### Comparison of Propulsion Technologies for a Mars Transit Mission\n\n#### Specific Impulse\n**Ion Propulsion (IP):**\n- **Specific Impulse:** 4,190 seconds\n- **Thrust Level:** Extremely Low (0.5 N)\n- **Impact on Mission Duration:** Minimal impact on mission duration as IP is designed for short-duration missions.\n- **Technology Readiness:** High due to its ability to achieve high specific impulses without significant technological hurdles.\n\n**Nuclear Thermal Propulsion (NTP):**\n- **Specific Impuls",
+      "score": 0.5616190476190477
+    },
+    {
+      "task_id": "aero_medium_mars_edl",
+      "task_name": "Mars EDL Architecture Design",
+      "difficulty": "medium",
+      "answer": "### Overview of the EDL Architecture Design\n\n#### Introduction\nThe EDL architecture design addresses the challenges posed by the Mars atmospheric environment, including the high peak heat flux and the need for effective aerodynamic deceleration of heavy payloads. This design integrates multiple phases into a single, cohesive approach to ensure efficient and safe mission execution. The proposed architecture includes:\n\n1. **Aerocapture**: The first phase involves capturing the payload using aerobr",
+      "score": 0.5678095238095239
+    },
+    {
+      "task_id": "aero_medium_life_support",
+      "task_name": "Deep Space Life Support Design",
+      "difficulty": "medium",
+      "answer": "### Overview of the Integrated Life Support System Design\n\n#### Introduction\nDeep space exploration presents unique challenges due to the extreme conditions encountered, including low gravity, radiation exposure, and the presence of hazardous materials like water and carbon dioxide. To ensure long-term survival and minimize environmental impact, a sophisticated life support system must be designed. This design integrates various biological processes and energy sources to maximize closure ratios ",
+      "score": 0.5897142857142857
+    },
+    {
+      "task_id": "aero_easy_debris_mitigation",
+      "task_name": "Space Debris Mitigation Strategies",
+      "difficulty": "easy",
+      "answer": "**Task: Space Debris Mitigation Strategies**\n\n**Current Challenges and Evaluation of Active Debris Removal Technologies**\n\nActive debris removal (ADR) technologies have been developed to mitigate the effects of space debris on orbiting satellites and other spacecraft. These technologies aim to prevent collisions that could lead to catastrophic damage to spacecraft and disrupt satellite communications. However, the effectiveness of these technologies varies widely depending on several factors suc",
+      "score": 0.6894285714285716
+    }
+  ],
+  "summary": {
+    "baseline_mean": 0.5579809523809524,
+    "trained_mean": 0.5860000000000001,
+    "improvement": 0.02801904761904772
+  }
+}
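
The `summary` block can be cross-checked against the per-task scores. The following is a minimal sketch, assuming the scores are copied from the JSON above and rounded to four decimals; `BASELINE`, `TRAINED`, and `summarize` are illustrative names, not part of the commit.

```python
from statistics import mean

# Per-task scores copied from plots/eval_results.json, rounded to 4 decimals.
BASELINE = {
    "aero_hard_hypersonic_vehicle": 0.4822,
    "aero_easy_propulsion_comparison": 0.5083,
    "aero_medium_mars_edl": 0.5743,
    "aero_medium_life_support": 0.5917,
    "aero_easy_debris_mitigation": 0.6334,
}
TRAINED = {
    "aero_hard_hypersonic_vehicle": 0.5214,
    "aero_easy_propulsion_comparison": 0.5616,
    "aero_medium_mars_edl": 0.5678,
    "aero_medium_life_support": 0.5897,
    "aero_easy_debris_mitigation": 0.6894,
}


def summarize(baseline, trained):
    """Return (baseline_mean, trained_mean, per-task deltas)."""
    deltas = {task: trained[task] - baseline[task] for task in baseline}
    return mean(baseline.values()), mean(trained.values()), deltas


if __name__ == "__main__":
    b, t, deltas = summarize(BASELINE, TRAINED)
    print(f"baseline={b:.4f} trained={t:.4f} improvement={t - b:+.4f}")
    # List tasks from largest regression to largest gain.
    for task, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
        print(f"{task:35s} {delta:+.4f}")
```

Two of the five tasks (Mars EDL and life support) score slightly lower after training. Because the evaluation grades one sampled answer per task at temperature 0.1, deltas of this size may be within sampling noise.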

plots/score_distribution.png ADDED

Git LFS Details

SHA256: d692edf8c4b2b23fbb12efc4ac98580e4f0b1cd526c22472a59625247e613778
Pointer size: 130 Bytes
Size of remote file: 36.9 kB

plots/training_curves.png ADDED

Git LFS Details

SHA256: 7d2645fe02ed361c359ca922be193ba33723e89d23162b260d1452eed27cc361
Pointer size: 131 Bytes
Size of remote file: 124 kB

train.py ADDED

	@@ -0,0 +1,227 @@

+#!/usr/bin/env python3
+"""
+train.py — GRPO Fine-Tuning for Agentic RAG Gym (Aerospace Domain)
+====================================================================
+End-to-end training script that:
+1. Connects to the live gym environment to collect prompts
+2. Loads Qwen2.5-0.5B-Instruct with LoRA
+3. Trains with GRPO (Group Relative Policy Optimization) via TRL
+4. Evaluates baseline vs. trained model with domain graders
+5. Generates publication-quality plots
+6. Pushes the fine-tuned model to Hugging Face Hub
+Usage:
+    # Start the environment first:
+    python main.py &
+    # Then run training:
+    python train.py
+Environment variables (loaded from .env):
+    HF_TOKEN          Hugging Face token (for model push)
+    HF_USERNAME       Hugging Face username (default: williyam)
+    ENV_URL           Gym environment URL (default: http://localhost:7860)
+    BASE_MODEL_ID     Base model (default: Qwen/Qwen2.5-0.5B-Instruct)
+"""
+from __future__ import annotations
+import sys
+import time
+import numpy as np
+import torch
+from dotenv import load_dotenv
+from peft import LoraConfig, TaskType
+load_dotenv()
+from training.config import (
+    BASE_MODEL_ID,
+    CHECKPOINTS_DIR,
+    FINETUNED_MODEL_ID,
+    HF_TOKEN,
+    PLOTS_DIR,
+)
+from training.dataset import SYSTEM_PROMPT, build_grpo_dataset
+from training.evaluate import (
+    evaluate_model_on_tasks,
+    plot_baseline_vs_trained,
+    plot_reward_distribution,
+    plot_training_curves,
+    save_eval_results,
+)
+from training.reward import grade_answer_sync
+# ── Device ─────────────────────────────────────────────────────────────
+if torch.backends.mps.is_available():
+    DEVICE = "mps"
+elif torch.cuda.is_available():
+    DEVICE = "cuda"
+else:
+    DEVICE = "cpu"
+print(f"Device: {DEVICE}  |  PyTorch {torch.__version__}")
+# ── Build dataset ──────────────────────────────────────────────────────
+print("\n[1/6] Collecting prompts from environment...")
+dataset = build_grpo_dataset(num_per_task=8, seed=42)
+task_ids_map = {}
+for row in dataset:
+    task_ids_map[row["task_id"]] = row["task_name"]
+# Format for TRL: prompt column must be list of message dicts
+def format_for_trl(example):
+    return {
+        "prompt": [
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content": example["prompt"]},
+        ],
+    }
+train_dataset = dataset.map(format_for_trl, remove_columns=["task_name", "difficulty"])
+# ── Load model ─────────────────────────────────────────────────────────
+print(f"\n[2/6] Loading model: {BASE_MODEL_ID}")
+from transformers import AutoModelForCausalLM, AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID, trust_remote_code=True, padding_side="left")
+if tokenizer.pad_token is None:
+    tokenizer.pad_token = tokenizer.eos_token
+model = AutoModelForCausalLM.from_pretrained(
+    BASE_MODEL_ID, torch_dtype=torch.float32, trust_remote_code=True,
+)
+peft_config = LoraConfig(
+    task_type=TaskType.CAUSAL_LM,
+    r=16, lora_alpha=32, lora_dropout=0.05,
+    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+    bias="none",
+)
+# ── Baseline evaluation ───────────────────────────────────────────────
+print("\n[3/6] Evaluating baseline (before training)...")
+from peft import get_peft_model
+eval_model = get_peft_model(model.to(DEVICE), peft_config)
+eval_model.print_trainable_parameters()
+eval_prompts = []
+seen = set()
+for row in dataset:
+    if row["task_id"] not in seen:
+        eval_prompts.append(row)
+        seen.add(row["task_id"])
+baseline_results = evaluate_model_on_tasks(
+    eval_model, tokenizer, eval_prompts, max_new_tokens=512, temperature=0.1,
+)
+baseline_mean = np.mean([r["score"] for r in baseline_results])
+print(f"  Baseline mean score: {baseline_mean:.4f}")
+del eval_model
+model = model.to("cpu")
+# ── Reward function ────────────────────────────────────────────────────
+def reward_fn(completions, **kwargs):
+    """Score completions using domain graders."""
+    rewards = []
+    prompts = kwargs.get("prompts", [])
+    for i, completion in enumerate(completions):
+        text = completion[0]["content"] if isinstance(completion, list) else str(completion)
+        text = text.strip()
+        if len(text) < 10:
+            rewards.append(0.01)
+            continue
+        task_id = None
+        if i < len(prompts):
+            p = prompts[i]
+            if isinstance(p, list):
+                p = " ".join(m.get("content", "") for m in p)
+            for tid, name in task_ids_map.items():
+                if name in str(p):
+                    task_id = tid
+                    break
+        if task_id is None:
+            task_id = list(task_ids_map.keys())[0]
+        try:
+            rewards.append(float(grade_answer_sync(task_id, text)))
+        except Exception:
+            rewards.append(0.01)
+    return rewards
+# ── GRPO Training ──────────────────────────────────────────────────────
+print("\n[4/6] Starting GRPO training...")
+from trl import GRPOConfig, GRPOTrainer
+training_args = GRPOConfig(
+    output_dir=str(CHECKPOINTS_DIR / "grpo"),
+    num_train_epochs=2,
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=4,
+    learning_rate=5e-6,
+    warmup_ratio=0.1,
+    max_grad_norm=1.0,
+    logging_steps=1,
+    save_steps=50,
+    save_total_limit=2,
+    bf16=False,
+    fp16=False,
+    seed=42,
+    remove_unused_columns=False,
+    num_generations=4,
+    max_completion_length=512,
+    temperature=0.7,
+    use_vllm=False,
+    report_to="none",
+    optim="adamw_torch",
+    gradient_checkpointing=True,
+    log_completions=True,
+    num_completions_to_print=1,
+)
+trainer = GRPOTrainer(
+    model=BASE_MODEL_ID,
+    reward_funcs=reward_fn,
+    args=training_args,
+    train_dataset=train_dataset,
+    peft_config=peft_config,
+    processing_class=tokenizer,
+)
+t0 = time.time()
+trainer.train()
+elapsed = time.time() - t0
+print(f"\nTraining completed in {elapsed/60:.1f} min")
+# ── Post-training evaluation ──────────────────────────────────────────
+print("\n[5/6] Evaluating trained model...")
+trained_results = evaluate_model_on_tasks(
+    trainer.model, tokenizer, eval_prompts, max_new_tokens=512, temperature=0.1,
+)
+trained_mean = np.mean([r["score"] for r in trained_results])
+print(f"  Trained mean score: {trained_mean:.4f}")
+print(f"  Improvement: {trained_mean - baseline_mean:+.4f}")
+# ── Plots + save ──────────────────────────────────────────────────────
+print("\n[6/6] Generating plots and saving...")
+log_history = trainer.state.log_history if hasattr(trainer, "state") else []
+plot_training_curves(log_history)
+plot_baseline_vs_trained(baseline_results, trained_results)
+plot_reward_distribution(baseline_results, trained_results)
+save_eval_results(baseline_results, trained_results)
+# Push
+if HF_TOKEN:
+    print(f"\nPushing to HF Hub: {FINETUNED_MODEL_ID}")
+    trainer.model.push_to_hub(FINETUNED_MODEL_ID, token=HF_TOKEN, private=False)
+    tokenizer.push_to_hub(FINETUNED_MODEL_ID, token=HF_TOKEN, private=False)
+    print("Model pushed successfully")
+print(f"\n{'='*50}")
+print(f"  Base model:     {BASE_MODEL_ID}")
+print(f"  Training time:  {elapsed/60:.1f} min")
+print(f"  Baseline score: {baseline_mean:.4f}")
+print(f"  Trained score:  {trained_mean:.4f}")
+print(f"  Delta:          {trained_mean - baseline_mean:+.4f}")
+print(f"{'='*50}")
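
`reward_fn` in this script recovers the task by searching each prompt for a task name and falls back to the first task when nothing matches. Since `train_dataset` keeps the `task_id` column and `remove_unused_columns=False` is set, TRL's `GRPOTrainer` should forward that column to reward functions as a keyword argument; this is an assumption to verify against the installed TRL version. A sketch under that assumption follows. `make_reward_fn` and `fake_grader` are hypothetical names, with `fake_grader` standing in for `grade_answer_sync`.

```python
from typing import Callable, List, Optional, Sequence

MIN_REWARD = 0.01  # same floor the script uses for short or failed completions


def _completion_text(completion) -> str:
    """Handle conversational (list of message dicts) and plain-string completions."""
    if isinstance(completion, list):
        return str(completion[0]["content"]).strip()
    return str(completion).strip()


def make_reward_fn(grader: Callable[[str, str], float]):
    """Build a reward function that reads task ids from the dataset column."""

    def reward_fn(
        completions,
        task_id: Optional[Sequence[str]] = None,
        **kwargs,
    ) -> List[float]:
        rewards: List[float] = []
        for i, completion in enumerate(completions):
            text = _completion_text(completion)
            tid = task_id[i] if task_id is not None and i < len(task_id) else None
            # Unlike the script, an unknown task gets the floor reward instead
            # of being graded against an arbitrary task.
            if len(text) < 10 or tid is None:
                rewards.append(MIN_REWARD)
                continue
            try:
                rewards.append(float(grader(tid, text)))
            except Exception:
                rewards.append(MIN_REWARD)
        return rewards

    return reward_fn


def fake_grader(task_id: str, answer: str) -> float:
    """Hypothetical stand-in for grade_answer_sync: longer answers score higher."""
    if not task_id.startswith("aero_"):
        raise ValueError(f"unknown task: {task_id}")
    return min(1.0, len(answer) / 100)
```

Passing the id explicitly avoids a silent mismatch when a task name does not appear verbatim in the prompt text.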

training/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Training utilities for Agentic RAG Gym GRPO fine-tuning."""

training/config.py ADDED

	@@ -0,0 +1,40 @@

+"""
+Configuration for GRPO training of Agentic RAG Gym.
+All secrets and tunables are loaded from environment variables / .env file.
+"""
+from __future__ import annotations
+import os
+from pathlib import Path
+from dotenv import load_dotenv
+load_dotenv()
+# ---------------------------------------------------------------------------
+# Paths
+# ---------------------------------------------------------------------------
+PROJECT_ROOT = Path(__file__).resolve().parent.parent
+PLOTS_DIR = PROJECT_ROOT / "plots"
+PLOTS_DIR.mkdir(exist_ok=True)
+CHECKPOINTS_DIR = PROJECT_ROOT / "checkpoints"
+CHECKPOINTS_DIR.mkdir(exist_ok=True)
+# ---------------------------------------------------------------------------
+# Environment / secrets
+# ---------------------------------------------------------------------------
+HF_TOKEN: str = os.getenv("HF_TOKEN", "")
+HF_USERNAME: str = os.getenv("HF_USERNAME", "williyam")
+# Environment server (our FastAPI gym)
+ENV_URL: str = os.getenv("ENV_URL", "http://localhost:7860")
+# ---------------------------------------------------------------------------
+# Model configuration
+# ---------------------------------------------------------------------------
+BASE_MODEL_ID: str = os.getenv("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
+FINETUNED_MODEL_ID: str = os.getenv(
+    "FINETUNED_MODEL_ID",
+    f"{HF_USERNAME}/agentic-rag-aerospace-grpo",
+)
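
The defaults in this module chain together: `FINETUNED_MODEL_ID` falls back to a repo name built from `HF_USERNAME`. The sketch below mirrors that resolution order against an explicit mapping so it can be exercised without touching `os.environ`; `resolve_model_ids` is a hypothetical helper, not part of the commit.

```python
from typing import Dict, Mapping


def resolve_model_ids(env: Mapping[str, str]) -> Dict[str, str]:
    """Mirror the fallback order used in training/config.py."""
    username = env.get("HF_USERNAME", "williyam")
    return {
        "base": env.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct"),
        # An explicit FINETUNED_MODEL_ID wins; otherwise derive it from the username.
        "finetuned": env.get(
            "FINETUNED_MODEL_ID", f"{username}/agentic-rag-aerospace-grpo"
        ),
    }


if __name__ == "__main__":
    print(resolve_model_ids({}))
    print(resolve_model_ids({"HF_USERNAME": "alice"}))
```

In the module itself these values are read once at import time, so any overrides need to be in the environment (or `.env`) before `training.config` is first imported.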

training/dataset.py ADDED

	@@ -0,0 +1,118 @@

+"""
+Dataset builder for GRPO training.
+Connects to the live Agentic RAG Gym environment to build training prompts.
+Each prompt = system instruction + task description + retrieved documents.
+"""
+from __future__ import annotations
+import asyncio
+import re
+from typing import Any, Dict, List
+import httpx
+from datasets import Dataset
+from training.config import ENV_URL
+SYSTEM_PROMPT = (
+    "You are an expert aerospace research analyst with deep knowledge of "
+    "propulsion systems, orbital mechanics, materials science, thermal protection, "
+    "life support systems, and space mission design. When analyzing aerospace topics:\n"
+    "- Cite specific data points and numerical values from provided documents\n"
+    "- Structure your analysis with clear sections\n"
+    "- Compare alternatives with quantitative evidence\n"
+    "- Provide actionable recommendations grounded in engineering constraints\n"
+)
+def _format_prompt(task: Dict[str, Any], docs: List[Dict[str, Any]]) -> str:
+    """Build a user message from a task + retrieved docs."""
+    doc_text = ""
+    for i, doc in enumerate(docs, 1):
+        src = doc.get("source", "unknown")
+        doc_text += f"\n[Document {i} -- {src}]\n{doc['content']}\n"
+    return (
+        f"## Task: {task['name']}\n\n"
+        f"{task['description']}\n\n"
+        f"### Retrieved Reference Documents\n{doc_text}\n"
+        "### Instructions\n"
+        "Provide a comprehensive, well-structured answer to the task above. "
+        "Cite specific data from the reference documents. "
+        "Include quantitative evidence and clear recommendations."
+    )
+async def _fetch_tasks(client: httpx.AsyncClient) -> List[Dict[str, Any]]:
+    resp = await client.get(f"{ENV_URL}/tasks")
+    resp.raise_for_status()
+    return resp.json()["tasks"]
+def _query_variants(task: Dict[str, Any]) -> List[str]:
+    """Generate diverse retrieval queries from a task description."""
+    desc = task["description"]
+    sentences = [s.strip() for s in re.split(r'[.!?]+', desc) if len(s.strip()) > 20]
+    variants = [desc]
+    variants.extend(sentences[:4])
+    name_words = task["name"].lower()
+    variants.append(name_words)
+    return variants
+async def _collect_one_prompt(
+    client: httpx.AsyncClient,
+    task: Dict[str, Any],
+    query: str,
+) -> Dict[str, Any] | None:
+    """Reset env, retrieve docs, build a prompt."""
+    resp = await client.post(f"{ENV_URL}/reset", json={"task_id": task["task_id"]})
+    if resp.status_code != 200:
+        return None
+    resp = await client.post(
+        f"{ENV_URL}/step", json={"type": "retrieve", "query": query}
+    )
+    if resp.status_code != 200:
+        return None
+    docs = resp.json()["observation"]["retrieved_docs"]
+    user_msg = _format_prompt(task, docs)
+    return {
+        "task_id": task["task_id"],
+        "task_name": task["name"],
+        "difficulty": task.get("difficulty", "easy"),
+        "prompt": user_msg,
+    }
+async def _build_dataset_async(num_per_task: int = 8) -> List[Dict[str, Any]]:
+    records: List[Dict[str, Any]] = []
+    async with httpx.AsyncClient(timeout=120.0) as client:
+        tasks = await _fetch_tasks(client)
+        for task in tasks:
+            variants = _query_variants(task)
+            for i in range(num_per_task):
+                query = variants[i % len(variants)]
+                rec = await _collect_one_prompt(client, task, query)
+                if rec:
+                    records.append(rec)
+    return records
+def build_grpo_dataset(num_per_task: int = 8, seed: int = 42) -> Dataset:
+    """
+    Build a HuggingFace Dataset of prompts for GRPO training.
+    Each row: prompt (str), task_id, task_name, difficulty.
+    Prompts are collected from the LIVE environment (reset + retrieve).
+    """
+    records = asyncio.run(_build_dataset_async(num_per_task))
+    if not records:
+        raise RuntimeError(f"No prompts collected. Is the environment running at {ENV_URL}?")
+    ds = Dataset.from_list(records)
+    ds = ds.shuffle(seed=seed)
+    print(f"Built GRPO dataset: {len(ds)} prompts across {len(ds.unique('task_id'))} tasks")
+    return ds
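
`_query_variants` yields at most six queries per task (the full description, up to four sentences, and the lowercased name), while `_build_dataset_async` requests `num_per_task=8` by cycling with `i % len(variants)`. Some queries are therefore reused within a task. The sketch below reproduces that selection logic offline; `EXAMPLE_TASK` is a hypothetical task, and `query_variants` / `queries_for_task` are illustrative copies rather than the module's functions.

```python
import re
from typing import Any, Dict, List

# Hypothetical task used only for illustration.
EXAMPLE_TASK: Dict[str, Any] = {
    "name": "Compare Propulsion Technologies",
    "description": (
        "Compare ion propulsion and nuclear thermal propulsion for a Mars "
        "transit mission. Discuss specific impulse and thrust levels. "
        "Recommend one option with quantitative evidence."
    ),
}


def query_variants(task: Dict[str, Any]) -> List[str]:
    """Same rule as _query_variants: description, up to 4 sentences, task name."""
    desc = task["description"]
    sentences = [s.strip() for s in re.split(r"[.!?]+", desc) if len(s.strip()) > 20]
    return [desc, *sentences[:4], task["name"].lower()]


def queries_for_task(task: Dict[str, Any], num_per_task: int = 8) -> List[str]:
    """Cycle through the variants the way the dataset builder does."""
    variants = query_variants(task)
    return [variants[i % len(variants)] for i in range(num_per_task)]


if __name__ == "__main__":
    queries = queries_for_task(EXAMPLE_TASK)
    print(f"{len(queries)} queries, {len(set(queries))} unique")
```

Whether a repeated query produces an identical prompt depends on how deterministic the environment's retrieval is, which this diff does not show. Note also that the sentence split breaks on any period, including decimal points in a description such as "2.5-year".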

training/evaluate.py ADDED

	@@ -0,0 +1,192 @@

+"""
+Evaluation & plotting utilities for GRPO training.
+"""
+from __future__ import annotations
+import json
+from collections import defaultdict
+from pathlib import Path
+from typing import Any, Dict, List
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+from training.config import PLOTS_DIR
+from training.reward import grade_answer_sync
+# ── Model evaluation ────────────────────────────────────────────────────
+def evaluate_model_on_tasks(
+    model,
+    tokenizer,
+    prompts: List[Dict[str, Any]],
+    max_new_tokens: int = 512,
+    temperature: float = 0.1,
+) -> List[Dict[str, Any]]:
+    """Generate answers for each prompt and grade them with domain graders."""
+    import torch
+    results: List[Dict[str, Any]] = []
+    device = next(model.parameters()).device
+    for item in prompts:
+        messages = [
+            {"role": "system", "content": "You are an expert aerospace research analyst. Provide comprehensive answers citing specific data."},
+            {"role": "user", "content": item["prompt"]},
+        ]
+        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
+        inputs = {k: v.to(device) for k, v in inputs.items()}
+        with torch.no_grad():
+            output_ids = model.generate(
+                **inputs,
+                max_new_tokens=max_new_tokens,
+                temperature=max(temperature, 0.01),
+                do_sample=temperature > 0,
+                top_p=0.9,
+                pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
+            )
+        generated = output_ids[0][inputs["input_ids"].shape[1]:]
+        answer = tokenizer.decode(generated, skip_special_tokens=True).strip()
+        score = grade_answer_sync(item["task_id"], answer)
+        results.append({
+            "task_id": item["task_id"],
+            "task_name": item["task_name"],
+            "difficulty": item["difficulty"],
+            "answer": answer[:500],
+            "score": score,
+        })
+        print(f"  [{item['task_id']}] score={score:.3f}  len={len(answer)}")
+    return results
+# ── Plotting ─────────────────────────────────────────────────────────────
+def plot_training_curves(log_history: List[Dict[str, Any]], out_dir: Path = PLOTS_DIR) -> Path:
+    """Plot training loss and reward curves. Returns path to saved figure."""
+    steps = [e["step"] for e in log_history if "loss" in e]
+    losses = [e["loss"] for e in log_history if "loss" in e]
+    reward_steps = [e["step"] for e in log_history if "reward" in e or "reward/mean" in e]
+    rewards = [e.get("reward/mean", e.get("reward")) for e in log_history if "reward" in e or "reward/mean" in e]
+    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+    ax = axes[0]
+    if steps and losses:
+        ax.plot(steps, losses, color="#D4AF37", linewidth=2)
+    ax.set_xlabel("Training Step", fontsize=12)
+    ax.set_ylabel("Loss", fontsize=12)
+    ax.set_title("GRPO Training Loss", fontsize=14, fontweight="bold")
+    ax.grid(True, alpha=0.3)
+    ax = axes[1]
+    if reward_steps and rewards:
+        ax.plot(reward_steps, rewards, color="#4CAF50", linewidth=2)
+    ax.set_xlabel("Training Step", fontsize=12)
+    ax.set_ylabel("Mean Reward (grader score)", fontsize=12)
+    ax.set_title("GRPO Mean Reward", fontsize=14, fontweight="bold")
+    ax.grid(True, alpha=0.3)
+    plt.tight_layout()
+    path = out_dir / "training_curves.png"
+    fig.savefig(path, dpi=150, bbox_inches="tight")
+    plt.close(fig)
+    print(f"Saved training curves -> {path}")
+    return path
+def plot_baseline_vs_trained(
+    baseline_results: List[Dict[str, Any]],
+    trained_results: List[Dict[str, Any]],
+    out_dir: Path = PLOTS_DIR,
+) -> Path:
+    """Bar chart comparing baseline vs trained scores per task."""
+    def _agg(results):
+        sums: Dict[str, List[float]] = defaultdict(list)
+        for r in results:
+            sums[r["task_id"]].append(r["score"])
+        return {k: float(np.mean(v)) for k, v in sums.items()}
+    baseline_agg = _agg(baseline_results)
+    trained_agg = _agg(trained_results)
+    tasks = sorted(set(baseline_agg) | set(trained_agg))
+    short_names = [t.replace("aero_", "").replace("_", " ").title() for t in tasks]
+    x = np.arange(len(tasks))
+    width = 0.35
+    fig, ax = plt.subplots(figsize=(12, 6))
+    bars1 = ax.bar(x - width / 2, [baseline_agg.get(t, 0) for t in tasks],
+                   width, label="Baseline (untrained)", color="#8B0000", alpha=0.85, edgecolor="black")
+    bars2 = ax.bar(x + width / 2, [trained_agg.get(t, 0) for t in tasks],
+                   width, label="GRPO-trained", color="#D4AF37", alpha=0.85, edgecolor="black")
+    ax.set_xlabel("Task", fontsize=12)
+    ax.set_ylabel("Grader Score (0-1)", fontsize=12)
+    ax.set_title("Baseline vs. GRPO-Trained Model - Task Scores", fontsize=14, fontweight="bold")
+    ax.set_xticks(x)
+    ax.set_xticklabels(short_names, rotation=20, ha="right", fontsize=10)
+    ax.legend(fontsize=11)
+    ax.set_ylim(0, 1.0)
+    ax.grid(axis="y", alpha=0.3)
+    for bar in list(bars1) + list(bars2):
+        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.02,
+                f"{bar.get_height():.2f}", ha="center", fontsize=9)
+    plt.tight_layout()
+    path = out_dir / "baseline_vs_trained.png"
+    fig.savefig(path, dpi=150, bbox_inches="tight")
+    plt.close(fig)
+    print(f"Saved comparison chart -> {path}")
+    return path
+def plot_reward_distribution(
+    baseline_results: List[Dict[str, Any]],
+    trained_results: List[Dict[str, Any]],
+    out_dir: Path = PLOTS_DIR,
+) -> Path:
+    """Histogram of score distributions."""
+    fig, ax = plt.subplots(figsize=(10, 5))
+    bins = np.linspace(0, 1, 21)
+    ax.hist([r["score"] for r in baseline_results], bins=bins, alpha=0.6,
+            label="Baseline", color="#8B0000", edgecolor="black")
+    ax.hist([r["score"] for r in trained_results], bins=bins, alpha=0.6,
+            label="GRPO-trained", color="#D4AF37", edgecolor="black")
+    ax.set_xlabel("Grader Score", fontsize=12)
+    ax.set_ylabel("Frequency", fontsize=12)
+    ax.set_title("Score Distribution - Baseline vs. GRPO-Trained", fontsize=14, fontweight="bold")
+    ax.legend(fontsize=11)
+    ax.grid(axis="y", alpha=0.3)
+    plt.tight_layout()
+    path = out_dir / "score_distribution.png"
+    fig.savefig(path, dpi=150, bbox_inches="tight")
+    plt.close(fig)
+    print(f"Saved distribution plot -> {path}")
+    return path
+def save_eval_results(
+    baseline_results: List[Dict[str, Any]],
+    trained_results: List[Dict[str, Any]],
+    out_dir: Path = PLOTS_DIR,
+) -> Path:
+    """Save evaluation results as JSON."""
+    # Compute each mean once; fall back to 0.0 when a result list is empty.
+    baseline_mean = float(np.mean([r["score"] for r in baseline_results])) if baseline_results else 0.0
+    trained_mean = float(np.mean([r["score"] for r in trained_results])) if trained_results else 0.0
+    data = {
+        "baseline": baseline_results,
+        "trained": trained_results,
+        "summary": {
+            "baseline_mean": baseline_mean,
+            "trained_mean": trained_mean,
+            # The difference is only meaningful when both runs produced results.
+            "improvement": trained_mean - baseline_mean if baseline_results and trained_results else 0.0,
+        },
+    }
+    path = out_dir / "eval_results.json"
+    path.write_text(json.dumps(data, indent=2))
+    print(f"Saved eval results -> {path}")
+    return path
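
The aggregation used by `plot_baseline_vs_trained` and the summary written by `save_eval_results` can be tried in isolation. What follows is a minimal sketch, not the committed code: it assumes each result is a dict with `task_id` and `score` keys, as in the functions above. The helper names `per_task_mean` and `summarize` are hypothetical, the standard-library `fmean` stands in for `np.mean`, and the sample records are made up for illustration rather than taken from `eval_results.json`.

```python
from collections import defaultdict
from statistics import fmean
from typing import Any, Dict, List

Result = Dict[str, Any]


def per_task_mean(results: List[Result]) -> Dict[str, float]:
    """Average the scores of each task_id (same idea as `_agg` above)."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for r in results:
        scores[r["task_id"]].append(float(r["score"]))
    return {task: fmean(vals) for task, vals in scores.items()}


def summarize(baseline: List[Result], trained: List[Result]) -> Dict[str, float]:
    """Overall means and their difference; 0.0 whenever a list is empty."""
    baseline_mean = fmean([r["score"] for r in baseline]) if baseline else 0.0
    trained_mean = fmean([r["score"] for r in trained]) if trained else 0.0
    # Only report a difference when both runs actually produced results.
    improvement = trained_mean - baseline_mean if baseline and trained else 0.0
    return {
        "baseline_mean": baseline_mean,
        "trained_mean": trained_mean,
        "improvement": improvement,
    }


# Made-up records for illustration only (hypothetical task ids and scores).
baseline = [
    {"task_id": "aero_a", "score": 0.5},
    {"task_id": "aero_a", "score": 0.7},
    {"task_id": "aero_b", "score": 0.2},
]
trained = [
    {"task_id": "aero_a", "score": 0.8},
    {"task_id": "aero_b", "score": 0.4},
]

print(per_task_mean(baseline))
print(summarize(baseline, trained))
```

Note that the overall mean is taken over individual results, not over per-task means, so a task with more evaluation episodes weighs more heavily in the summary than in the bar chart.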

training/reward.py ADDED

@@ -0,0 +1,87 @@

+"""
+Reward functions for GRPO training.
+Uses the real domain graders from the Agentic RAG Gym so the RL signal
+matches the actual evaluation rubric rather than a proxy.
+"""
+from __future__ import annotations
+import asyncio
+from datetime import datetime, timezone
+from typing import Any, Dict, List
+from domains.aerospace.graders import GRADER_REGISTRY
+from rag_master.models import EpisodeState, StepRecord, Trajectory
+from rag_master.rewards import _SCORE_MIN
+def _make_dummy_state(task_id: str, answer: str) -> EpisodeState:
+    """Minimal EpisodeState for offline grading."""
+    from domains.aerospace.config import AerospaceDomainConfig
+    domain = AerospaceDomainConfig()
+    tasks = {t.task_id: t for t in domain.get_tasks()}
+    task = tasks.get(task_id)
+    if task is None:
+        raise ValueError(f"Unknown task_id: {task_id}")
+    return EpisodeState(
+        episode_id="grpo-eval",
+        task=task,
+        current_step=5,
+        query_history=["query"],
+        retrieved_docs=[],
+        agent_messages=[],
+        generated_answer=answer,
+        intermediate_rewards=[0.5] * 5,
+        done=True,
+        info={},
+    )
+def _make_dummy_trajectory(task_id: str) -> Trajectory:
+    """Minimal trajectory with a good action sequence for process scoring."""
+    now = datetime.now(timezone.utc)
+    steps = [
+        StepRecord(step_index=0, action_type="plan", action_payload={},
+                   observation_summary="planned", intermediate_reward=0.5,
+                   reasoning_trace="Planning approach.", timestamp=now),
+        StepRecord(step_index=1, action_type="retrieve", action_payload={},
+                   observation_summary="retrieved", intermediate_reward=0.6,
+                   reasoning_trace="Retrieving.", timestamp=now),
+        StepRecord(step_index=2, action_type="reason", action_payload={},
+                   observation_summary="reasoned", intermediate_reward=0.5,
+                   reasoning_trace="Analyzing because data is relevant.", timestamp=now),
+        StepRecord(step_index=3, action_type="answer", action_payload={},
+                   observation_summary="answered", intermediate_reward=0.6,
+                   reasoning_trace="Final answer.", timestamp=now),
+        StepRecord(step_index=4, action_type="verify", action_payload={},
+                   observation_summary="verified", intermediate_reward=0.5,
+                   reasoning_trace="Verifying.", timestamp=now),
+    ]
+    return Trajectory(
+        episode_id="grpo-eval", task_id=task_id, steps=steps,
+        total_reward=0.0, final_score=0.0, completed=True, metadata={},
+    )
+def grade_answer_sync(task_id: str, answer: str) -> float:
+    """Grade a single answer using the domain grader (synchronous)."""
+    grader_cls = GRADER_REGISTRY.get(task_id)
+    if grader_cls is None:
+        return float(_SCORE_MIN)
+    grader = grader_cls()
+    state = _make_dummy_state(task_id, answer)
+    trajectory = _make_dummy_trajectory(task_id)
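+    # asyncio.run() raises if this thread already has a running event loop
+    # (for example inside Jupyter), so detect that case first.
+    try:
+        asyncio.get_running_loop()
+    except RuntimeError:
+        # No running loop in this thread: run the coroutine directly.
+        return asyncio.run(grader.grade(state, trajectory))
+    # A loop is already running: grade on a worker thread with its own loop.
+    import concurrent.futures
+    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
+        return pool.submit(asyncio.run, grader.grade(state, trajectory)).result()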