Live · Cluster healthy
Live cluster · 100 GPU nodes

Autonomous SRE for the AI infrastructure era.

A high-fidelity reinforcement-learning environment where agents read structured telemetry, choose the right remediation, and recover from failures within an SLA budget — across 22 specialised tasks.

Reward signal
0.989
22
Mission Scenarios
100
GPU Nodes Simulated
0.989
Best Avg. Reward
15
Step SLA Budget

Mission catalogue

22 specialised SRE scenarios across three difficulty tiers. Click one to launch it in the playground.

T01
Node Offline Triage
Easy · node_offline
Step 0 / 15
Playground preview: system topology (100 GPU nodes · 0 faulting), node health legend (Healthy / Faulting / Degraded), live gradient flux, VRAM allocation across 8 sub-clusters

MODEL ANALYTICS

Real performance data from this SRE environment · Trained heuristic policy vs. Qwen2.5-7B zero-shot via HF Inference Router · per-task scores from leaderboard_results.json

🏅
TESTED MODELS LEADERBOARD
🤖
GRPO TRAINING RUNS
🕸️
RADAR: TOP 5
📈
SCORE DISTRIBUTION
EFFICIENCY: SCORE / PARAM
🎯
TASK DIFFICULTY RANKING
🤖
GRPO TRAINING PORTFOLIO — Δ (after − before) per run
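The numbers behind these panels come straight from leaderboard_results.json. As a rough sketch only, assuming a hypothetical schema (per-run "before"/"after" maps of per-task scores plus a parameter count; the real file layout may differ), the leaderboard average, the per-run Δ, and score-per-parameter efficiency could be derived like this:

import json
from statistics import mean

# Hypothetical schema (the real leaderboard_results.json may differ):
# {"runs": [{"name": str, "params_b": float,
#            "before": {task_id: score}, "after": {task_id: score}}]}
with open("leaderboard_results.json") as f:
    board = json.load(f)

for run in board["runs"]:
    avg = mean(run["after"].values())            # "avg. reward" on the leaderboard
    delta = avg - mean(run["before"].values())   # Δ (after − before) per run
    efficiency = avg / run["params_b"]           # score per parameter count (billions)
    print(f"{run['name']:<24} avg={avg:.3f}  Δ={delta:+.3f}  eff={efficiency:.3f}")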
The story behind the model

An autonomous SRE trained to keep a 100-node cluster alive.

NetWeaver is not a chatbot dressed up as an engineer. It's a stateful environment where agents diagnose real failures, execute real remediations, and get scored by a deterministic rubric — across 22 hand-crafted incidents, three difficulty tiers, and a 15-step SLA budget. We ran 18+ training experiments. We logged every collapse and every breakthrough. This page is the receipts.

22
Hand-crafted missions
100
GPU nodes simulated
18
GRPO experiments run
0.989
Best avg. reward
The mission

Three things we wanted to prove.

⚙️

Real environments > vibes

Most agent demos let the model "say something plausible." Ours forces it to make a tool call, change cluster state, and live with the consequence. Every step is a state transition.

📐

Verifiable rewards

OpenEnv-compatible rubrics in graders.py. No vibes-based scoring. No reward hacking — we wrote test_reward_hacking.py to seal the side doors.

🏁

Small models can win

Our trained heuristic policy beats Qwen2.5-7B zero-shot. Our v14 SmolLM2-1.7B leapt 0.701 → 0.927 with GRPO. Compute isn't destiny.

How it works

Agent → Environment → Grader → Reward.

The full feedback loop runs server-side. The model never sees raw logs — it sees structured observations and emits structured tool calls.
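For illustration only, with hypothetical field names and a hypothetical RESTART_NODE tool (only RUN_MINI_ITERATION is confirmed elsewhere on this page), an observation and the tool call a policy emits in reply might look like:

# Illustrative observation; not the environment's actual schema.
observation = {
    "step": 3,
    "sla_budget_remaining": 12,
    "nodes_faulting": ["gpu-node-042"],
    "gradient_flux": {"subcluster_5": "nan"},
    "recent_events": ["node_offline: gpu-node-042"],
}

# Illustrative structured tool call emitted in response.
action = {"tool": "RESTART_NODE", "args": {"node_id": "gpu-node-042"}}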

AGENT
Policy πθ
Reads obs → emits tool call
ENV
100-node cluster
netweaver_sre_environment.py
GRADER
Rubric stack
graders.py + rubrics.py
REWARD
Shaped scalar
reward_shaper.py · 0–1
// inner loop: obs = env.reset(task_id) → action = π(obs) → obs', r, done = env.step(action) → r' = shape(r, rubric)
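Spelled out in Python, the loop reads roughly like the sketch below. The real interfaces live in netweaver_sre_environment.py, graders.py/rubrics.py and reward_shaper.py; the names here (env, policy, shape) are stand-ins, not the project's actual API.

def shape(reward, rubric=None):
    # Stand-in for reward_shaper.py: clamp the grader's scalar to [0, 1].
    return max(0.0, min(1.0, reward))

def run_episode(env, policy, task_id, max_steps=15):
    obs = env.reset(task_id)                   # structured telemetry, never raw logs
    score = 0.0
    for _ in range(max_steps):                 # 15-step SLA budget
        action = policy(obs)                   # structured tool call
        obs, reward, done = env.step(action)   # state transition in the simulated cluster
        score = shape(reward, rubric=getattr(env, "rubric", None))
        if done:                               # resolved, or budget exhausted
            break
    return score                               # shaped scalar in [0, 1] for this task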
The training saga

18 GRPO runs. Two collapses. Three winners.

We did not get to the leaderboard on the first try. Here is the actual log of every run, ranked. Wins are green, regressions are red.

The showdown

Heuristic ε-decay vs Qwen2.5-7B (zero-shot).

Same 22 tasks. Same grader. The trained heuristic — far smaller — wins on every difficulty tier.

#1

Heuristic ε-decay

Trained · OUR RL POLICY

0.989
  • 22 / 22 resolved
  • 30 eval episodes
  • ≈ 100 k training steps
#2

Qwen2.5-7B-Instruct

Zero-shot · HF Inference Router

0.924
  • 21 / 22 resolved
  • 1 eval episode each
  • Task hints enabled
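The name "Heuristic ε-decay" suggests an ε-greedy action selector whose exploration rate decays over training. The sketch below shows that generic pattern; the linear schedule and the 100 k-step horizon (echoing the ≈ 100 k training steps above) are assumptions, not the policy that produced the 0.989 score.

import random

def epsilon_decay_policy(heuristic, actions, eps_start=1.0, eps_end=0.05,
                         decay_steps=100_000):
    # Generic ε-greedy with linear decay; illustrative, not the project's policy.
    state = {"step": 0}

    def act(obs):
        frac = min(state["step"] / decay_steps, 1.0)
        eps = eps_start + (eps_end - eps_start) * frac
        state["step"] += 1
        if random.random() < eps:
            return random.choice(actions)      # explore: random tool call
        return heuristic(obs)                  # exploit: learned / heuristic choice

    return act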
The lessons

Three things we learned the hard way.

01

Compute isn't destiny.

SmolLM2-1.7B (v14) leapt from 0.701 to 0.927 with GRPO. Two 7B+ runs collapsed. Architecture, recipe, and reward shaping mattered more than parameter count.

02

Anti-hacking is half the battle.

Our first reward function had a loophole — the model learned to spam RUN_MINI_ITERATION for free credit. We added test_reward_hacking.py and the leaderboard got real (see the sketch after these lessons).

03

Hard mode breaks "guess" agents.

For T15+ tasks, faulting nodes don't pulse red. The agent must analyse gradient flux and call RUN_MINI_ITERATION to find silent NaN contagion. Most zero-shot LLMs cannot do this.
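Back to lesson 02: a regression test in the spirit of test_reward_hacking.py could lock in that spamming RUN_MINI_ITERATION earns nothing. The class name, constructor, and action format below are assumptions; only the tool name and the 15-step budget come from this page.

from netweaver_sre_environment import NetWeaverEnv  # class name is an assumption

def test_mini_iteration_spam_earns_nothing():
    env = NetWeaverEnv()                       # hypothetical constructor
    env.reset("T01")
    total = 0.0
    for _ in range(15):                        # burn the whole SLA budget on spam
        _, reward, done = env.step({"tool": "RUN_MINI_ITERATION", "args": {}})
        total += reward
        if done:
            break
    assert total <= 0.05, "diagnostic spam should not earn meaningful reward"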

Ready to play on-call?

Pick a mission. Diagnose the cluster. Beat 1.000.