Live · Cluster healthy
Live cluster · 100 GPU nodes

Autonomous SRE for the AI infrastructure era.

A high-fidelity reinforcement-learning environment where agents read structured telemetry, choose the right remediation, and recover from failures within an SLA budget — across 22 specialised tasks.

Reward signal
0.989
22
Mission Scenarios
100
GPU Nodes Simulated
0.989
Best Avg. Reward
15
Step SLA Budget

Mission catalogue

22 specialised SRE scenarios across three difficulty tiers. Click one to launch it in the playground.

T01
Node Offline Triage
Easy · node_offline
Step 0 / 15
Playground preview: system topology (100 GPU nodes · 0 faulting), node health legend (Healthy / Faulting / Degraded), live gradient flux, VRAM allocation across 8 sub-clusters

MODEL ANALYTICS

Real performance data from this SRE environment · Trained heuristic policy vs. Qwen2.5-7B zero-shot via HF Inference Router · per-task scores from leaderboard_results.json

🏅
TESTED MODELS LEADERBOARD
🤖
GRPO TRAINING RUNS
🕸️
RADAR: TOP 5
📈
SCORE DISTRIBUTION
EFFICIENCY: SCORE / PARAM
🎯
TASK DIFFICULTY RANKING
🤖
GRPO TRAINING PORTFOLIO — Δ (after − before) per run
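The numbers behind these panels come straight from leaderboard_results.json. As a rough sketch only, assuming a hypothetical schema (per-run "before"/"after" maps of per-task scores plus a parameter count; the real file layout may differ), the leaderboard average, the per-run Δ, and score-per-parameter efficiency could be derived like this:

import json
from statistics import mean

# Hypothetical schema (the real leaderboard_results.json may differ):
# {"runs": [{"name": str, "params_b": float,
#            "before": {task_id: score}, "after": {task_id: score}}]}
with open("leaderboard_results.json") as f:
    board = json.load(f)

for run in board["runs"]:
    avg = mean(run["after"].values())            # "avg. reward" on the leaderboard
    delta = avg - mean(run["before"].values())   # Δ (after − before) per run
    efficiency = avg / run["params_b"]           # score per parameter count (billions)
    print(f"{run['name']:<24} avg={avg:.3f}  Δ={delta:+.3f}  eff={efficiency:.3f}")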
The story behind the model

An autonomous SRE trained to keep a 100-node cluster alive.

NetWeaver is not a chatbot dressed up as an engineer. It's a stateful environment where agents diagnose real failures, execute real remediations, and get scored by a deterministic rubric — across 22 hand-crafted incidents, three difficulty tiers, and a 15-step SLA budget. We ran 18+ training experiments. We logged every collapse and every breakthrough. This page is the receipts.

22
Hand-crafted missions
100
GPU nodes simulated
18
GRPO experiments run
0.989
Best avg. reward
The mission

Three things we wanted to prove.

⚙️

Real environments > vibes

Most agent demos let the model "say something plausible." Ours forces it to make a tool call, change cluster state, and live with the consequence. Every step is a state transition.

📐

Verifiable rewards

OpenEnv-compatible rubrics in graders.py. No vibes-based scoring. No reward hacking — we wrote test_reward_hacking.py to seal the side doors.

🏁

Small models can win

Our trained heuristic policy beats Qwen2.5-7B zero-shot. Our v14 SmolLM2-1.7B leapt 0.701 → 0.927 with GRPO. Compute isn't destiny.

How it works

Agent → Environment → Grader → Reward.

The full feedback loop runs server-side. The model never sees raw logs — it sees structured observations and emits structured tool calls.
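For illustration only, with hypothetical field names and a hypothetical RESTART_NODE tool (only RUN_MINI_ITERATION is confirmed elsewhere on this page), an observation and the tool call a policy emits in reply might look like:

# Illustrative observation; not the environment's actual schema.
observation = {
    "step": 3,
    "sla_budget_remaining": 12,
    "nodes_faulting": ["gpu-node-042"],
    "gradient_flux": {"subcluster_5": "nan"},
    "recent_events": ["node_offline: gpu-node-042"],
}

# Illustrative structured tool call emitted in response.
action = {"tool": "RESTART_NODE", "args": {"node_id": "gpu-node-042"}}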

AGENT
Policy πθ
Reads obs → emits tool call
ENV
100-node cluster
netweaver_sre_environment.py
GRADER
Rubric stack
graders.py + rubrics.py
REWARD
Shaped scalar
reward_shaper.py · 0–1
// inner loop: obs = env.reset(task_id) → action = π(obs) → obs', r, done = env.step(action) → r' = shape(r, rubric)
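Spelled out in Python, the loop reads roughly like the sketch below. The real interfaces live in netweaver_sre_environment.py, graders.py/rubrics.py and reward_shaper.py; the names here (env, policy, shape) are stand-ins, not the project's actual API.

def shape(reward, rubric=None):
    # Stand-in for reward_shaper.py: clamp the grader's scalar to [0, 1].
    return max(0.0, min(1.0, reward))

def run_episode(env, policy, task_id, max_steps=15):
    obs = env.reset(task_id)                   # structured telemetry, never raw logs
    score = 0.0
    for _ in range(max_steps):                 # 15-step SLA budget
        action = policy(obs)                   # structured tool call
        obs, reward, done = env.step(action)   # state transition in the simulated cluster
        score = shape(reward, rubric=getattr(env, "rubric", None))
        if done:                               # resolved, or budget exhausted
            break
    return score                               # shaped scalar in [0, 1] for this task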
The training saga

18 GRPO runs. Two collapses. Three winners.

We did not get to the leaderboard on the first try. Here is the actual log of every run, ranked. Wins are green, regressions are red.

The showdown

Heuristic ε-decay vs Qwen2.5-7B (zero-shot).

Same 22 tasks. Same grader. The trained heuristic — far smaller — wins on every difficulty tier.

#1

Heuristic ε-decay

Trained · OUR RL POLICY

0.989
  • 22 / 22 resolved
  • 30 eval episodes
  • ≈ 100 k training steps
#2

Qwen2.5-7B-Instruct

Zero-shot · HF Inference Router

0.924
  • 21 / 22 resolved
  • 1 eval episode each
  • Task hints enabled
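The name "Heuristic ε-decay" suggests an ε-greedy action selector whose exploration rate decays over training. The sketch below shows that generic pattern; the linear schedule and the 100 k-step horizon (echoing the ≈ 100 k training steps above) are assumptions, not the policy that produced the 0.989 score.

import random

def epsilon_decay_policy(heuristic, actions, eps_start=1.0, eps_end=0.05,
                         decay_steps=100_000):
    # Generic ε-greedy with linear decay; illustrative, not the project's policy.
    state = {"step": 0}

    def act(obs):
        frac = min(state["step"] / decay_steps, 1.0)
        eps = eps_start + (eps_end - eps_start) * frac
        state["step"] += 1
        if random.random() < eps:
            return random.choice(actions)      # explore: random tool call
        return heuristic(obs)                  # exploit: learned / heuristic choice

    return act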
The lessons

Three things we learned the hard way.

01

Compute isn't destiny.

SmolLM2-1.7B (v14) leapt from 0.701 to 0.927 with GRPO. Two 7B+ runs collapsed. Architecture, recipe, and reward shaping mattered more than parameter count.

02

Anti-hacking is half the battle.

Our first reward function had a loophole — the model learned to spam RUN_MINI_ITERATION for free credit. We added test_reward_hacking.py and the leaderboard got real (see the sketch after these lessons).

03

Hard mode breaks "guess" agents.

For T15+ tasks, faulting nodes don't pulse red. The agent must analyse gradient flux and call RUN_MINI_ITERATION to find silent NaN contagion. Most zero-shot LLMs cannot do this.
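Back to lesson 02: a regression test in the spirit of test_reward_hacking.py could lock in that spamming RUN_MINI_ITERATION earns nothing. The class name, constructor, and action format below are assumptions; only the tool name and the 15-step budget come from this page.

from netweaver_sre_environment import NetWeaverEnv  # class name is an assumption

def test_mini_iteration_spam_earns_nothing():
    env = NetWeaverEnv()                       # hypothetical constructor
    env.reset("T01")
    total = 0.0
    for _ in range(15):                        # burn the whole SLA budget on spam
        _, reward, done = env.step({"tool": "RUN_MINI_ITERATION", "args": {}})
        total += reward
        if done:
            break
    assert total <= 0.05, "diagnostic spam should not earn meaningful reward"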

Ready to play on-call?

Pick a mission. Diagnose the cluster. Beat 1.000.