A high-fidelity reinforcement-learning environment where agents read structured telemetry, choose the right remediation, and recover from failures within an SLA budget — across 22 specialised tasks.
22 specialised SRE scenarios across three difficulty tiers. Click one to launch it in the playground.
Real performance data from this SRE environment · Trained heuristic policy vs. Qwen2.5-7B zero-shot via HF Inference Router · per-task scores from leaderboard_results.json
NetWeaver is not a chatbot dressed up as an engineer. It's a stateful environment where agents diagnose real failures, execute real remediations, and get scored by a deterministic rubric — across 22 hand-crafted incidents, three difficulty tiers, and a 15-step SLA budget. We ran 18+ training experiments. We logged every collapse and every breakthrough. This page is the receipts.
Most agent demos let the model "say something plausible." Ours forces it to make a tool call, change cluster state, and live with the consequences. Every step is a state transition.
OpenEnv-compatible rubrics in graders.py. No vibes-based scoring. No reward hacking — we wrote test_reward_hacking.py to seal the side doors.
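A minimal sketch of what a deterministic, OpenEnv-style rubric could look like. The actual rubric lives in graders.py; the field names and weighting scheme here are illustrative assumptions, not the real grader.

```python
# Illustrative sketch only -- the real rubric lives in graders.py and its
# fields/weights may differ. The point: scoring is a deterministic,
# checkable function of final cluster state, not an LLM judge.
from dataclasses import dataclass

@dataclass(frozen=True)
class RubricItem:
    name: str          # e.g. "root_cause_identified" (hypothetical check name)
    weight: float      # contribution to the final score
    passed: bool       # evaluated against final cluster state

def grade(items: list[RubricItem]) -> float:
    """Deterministic score in [0, 1]: weighted fraction of checks passed."""
    total = sum(i.weight for i in items)
    earned = sum(i.weight for i in items if i.passed)
    return earned / total if total else 0.0
```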
Our trained heuristic policy beats Qwen2.5-7B zero-shot. Our v14 SmolLM2-1.7B leapt 0.701 → 0.927 with GRPO. Compute isn't destiny.
The full feedback loop runs server-side. The model never sees raw logs — it sees structured observations and emits structured tool calls.
obs = env.reset(task_id) → action = π(obs) → obs', r, done = env.step(action) → r' = shape(r, rubric)
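In Python, that loop looks roughly like the sketch below. Only the reset/step/shape flow and the 15-step SLA budget come from this page; rollout, policy, and the trivial shape body are placeholders for illustration.

```python
# Minimal sketch of the server-side loop diagrammed above. `rollout`,
# `policy`, and the body of `shape` are placeholders; only the
# reset/step/shape flow and the three-tuple step return are from the text.
MAX_STEPS = 15  # the SLA budget

def shape(r, rubric):
    # Placeholder: the real shaping is rubric-driven (see graders.py).
    return r

def rollout(env, policy, task_id, rubric):
    obs = env.reset(task_id)                # structured telemetry, never raw logs
    total = 0.0
    for _ in range(MAX_STEPS):
        action = policy(obs)                # a structured tool call
        obs, r, done = env.step(action)     # state transition + raw reward
        total += shape(r, rubric)           # rubric-shaped reward r'
        if done:
            break
    return total
```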
We did not get to the leaderboard on the first try. Here is the actual log of every run, ranked. Wins are green, regressions are red.
Same 22 tasks. Same grader. The trained heuristic — far smaller — wins on every difficulty tier.
Trained · OUR RL POLICY
Zero-shot · HF Inference Router
SmolLM2-1.7B (v14) leapt from 0.701 → 0.927 with GRPO. Two 7B+ runs collapsed. Architecture, recipe, and reward shaping mattered more than parameter count.
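The GRPO recipe behind that jump baselines each rollout against its own sampling group rather than a learned critic. Below is a minimal sketch of that group-relative advantage normalisation, not the project's training code; tensor shapes and the epsilon are assumptions.

```python
# Minimal sketch of GRPO's group-relative advantage normalisation,
# not the project's training code. Each task prompt gets G sampled
# rollouts; each reward is baselined against its own group.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_tasks, G) rubric scores for G rollouts per task."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```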
Our first reward function had a loophole — the model learned to spam RUN_MINI_ITERATION for free credit. We added test_reward_hacking.py and the leaderboard got real.
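test_reward_hacking.py is real; the sketch below is our guess at the kind of invariant such a test enforces. The task id, env fixture, and action format are hypothetical.

```python
# Illustrative sketch of the invariant test_reward_hacking.py exists to
# enforce; the actual test and env API may differ. Task id "T01" and the
# action dict format are hypothetical.
def test_mini_iteration_spam_earns_nothing(env):
    obs = env.reset("T01")
    total = 0.0
    for _ in range(15):  # burn the whole SLA budget on the exploit
        obs, r, done = env.step({"tool": "RUN_MINI_ITERATION"})
        total += r
        if done:
            break
    # Spamming a diagnostic tool without remediating must not score.
    assert total == 0.0
```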
For T15+ tasks, faulting nodes don't pulse red. The agent must analyse gradient flux and call RUN_MINI_ITERATION to find silent NaN contagion. Most zero-shot LLMs cannot do this.
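Roughly, the diagnosis reduces to scanning per-node gradient statistics from a mini-iteration report for the first node whose values went non-finite. A toy sketch under that assumption; the report format and field semantics are ours, not the env's internals.

```python
# Toy sketch of the T15+ diagnosis: no node pulses red, so the agent must
# inspect per-node gradient stats returned by RUN_MINI_ITERATION and find
# where NaNs first appear. The report format is an assumption.
import math

def find_nan_source(grad_flux: dict[str, float]) -> str | None:
    """grad_flux: node id -> gradient norm from a mini-iteration report."""
    for node, norm in grad_flux.items():
        if math.isnan(norm) or math.isinf(norm):
            return node  # first node whose gradients went bad
    return None
```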
Pick a mission. Diagnose the cluster. Beat 1.000.