Running AI Sub-Agents for Free: How We Cut the "Thinking Tax" to Zero
There's a cost problem hiding inside every multi-agent AI system, and most people don't notice it until they get the invoice.
When you run a sophisticated AI agent stack (one where the main model spins up sub-agents to handle research, code review, content drafts, data lookups) every one of those sub-tasks costs money. And if you've defaulted to routing those sub-tasks through a premium reasoning model, you're paying reasoning-model prices for work that doesn't need reasoning-model intelligence.
I call it the thinking tax: the overhead you pay when you send every errand to your most expensive employee.
At Hawkwork, we run OpenClaw as our AI operations backbone. After evaluating the sub-agent cost structure, we made a simple infrastructure change: route all sub-agent calls through OpenRouter's free tier, using NVIDIA's Nemotron 3 Super as the primary model with automatic fallbacks. The result is a system that runs more tasks, at zero marginal cost, without meaningfully compromising quality on the work sub-agents actually do.
Here's how it works.
The Setup: What Sub-Agents Do and Why Model Selection Matters
OpenClaw is an AI agent runtime that handles orchestration, memory, tool access, and multi-agent routing. The main agent (the one talking to you, reading context, making decisions) runs on a premium model. That's appropriate. Orchestration is complex and benefits from a capable model.
Sub-agents are different. They handle discrete, bounded tasks: summarize this document, check this URL, draft this section, run this search. These tasks are real work, but they don't require the same depth of reasoning as the orchestrator. Paying orchestrator prices for sub-agent work is like hiring a senior architect to hang shelves.
OpenClaw lets you define a model chain for sub-agents in openclaw.json under agents.defaults.subagents.model. Set a primary, set fallbacks. Every sub-agent spawn hits the primary first; if it's unavailable, it falls through the chain automatically. Plug in free models, and your sub-agent costs drop to zero.
Why OpenRouter Free Models
OpenRouter is a unified API layer that routes to dozens of model providers. Any model with the :free suffix in its ID is genuinely always-free. Not a trial, not a limited preview. Rate limits apply, but there's no per-token cost.
As of March 2026, there are 25 models available on the free tier. The quality has gone up significantly as providers like NVIDIA, Google, and Meta have started posting capable models there to drive developer adoption.
How capable? One Reddit thread in r/vibecoding documented a developer running 831 requests, 20.5 million tokens, in 24 hours at $0.00. That's not a toy workload. That's production-adjacent usage running entirely on the free tier.
The :free suffix is the key. openrouter/nvidia/nemotron-3-super-120b-a12b:free is a different routing target than the same model without it. OpenRouter's provider routing docs explain the distinction clearly. The :free variant is explicitly capped at free, so there are no surprise charges.
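To make the routing target concrete, here is a minimal sketch of calling a :free model directly through OpenRouter's OpenAI-compatible chat completions endpoint. One assumption to flag: the openrouter/ prefix used in OpenClaw's config is OpenClaw's provider prefix, and the model ID sent to OpenRouter itself drops it; verify that against your own logs.

```python
# Sketch: one chat-completion request to a :free model on OpenRouter.
# Assumes OPENROUTER_API_KEY is set in the environment.
import json
import os
from urllib import request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(model: str, prompt: str) -> dict:
    """Build a chat-completion body. The :free suffix in the model ID
    pins routing to the always-free variant, so no per-token charges."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def call_free_model(prompt: str) -> str:
    payload = build_payload("nvidia/nemotron-3-super-120b-a12b:free", prompt)
    req = request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Dropping the :free suffix from the model string is the one-character mistake that turns this into a paid call, which is exactly why the suffix matters.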
Why Nemotron 3 Super
Not all free models are created equal. We evaluated 25 and landed on NVIDIA's Nemotron 3 Super as the primary for one specific reason: it was built for agentic work.
NVIDIA's announcement describes a model trained via reinforcement learning across 21 agentic environments. That's not a general-purpose LLM that happens to be available via API. It's a model specifically trained on the kinds of tasks sub-agents perform: tool use, multi-step reasoning, instruction following in constrained contexts.
The architecture is worth understanding briefly:
- 120B total parameters, 12B active via Mixture of Experts (MoE). You get the parameter depth of a large model with the inference cost of a smaller one.
- Hybrid Mamba-Transformer: linear attention mechanisms that handle long sequences more efficiently than pure Transformer architectures
- Native 1M token context (262K in the OpenRouter deployment): sub-agents working with long documents don't get truncated
- 5x throughput improvement over its predecessor
- 85.6% on PinchBench: a benchmark specifically designed for agentic task performance
For sub-agent work, this profile is close to ideal. High context, high throughput, purpose-built for tool use and multi-step execution.
The Fallback Chain
Rate limits are real. Any free model will occasionally be unavailable or throttled under load. A single-model setup will fail; a chain handles it gracefully.
We set two fallbacks:
- Gemma 3 27B (openrouter/google/gemma-3-27b-it:free): a strong general-purpose model that also supports vision (text + image input). Nemotron 3 Super is text-only. If a sub-agent task involves image analysis, Gemma 3 27B is the right model, and having it as the first fallback means vision-capable tasks have a path.
- Llama 3.3 70B Instruct (openrouter/meta-llama/llama-3.3-70b-instruct:free): Meta's instruction-tuned 70B model, well-established and widely used for production workloads.
The config in openclaw.json:
"subagents": {
"model": {
"primary": "openrouter/nvidia/nemotron-3-super-120b-a12b:free",
"fallbacks": [
"openrouter/google/gemma-3-27b-it:free",
"openrouter/meta-llama/llama-3.3-70b-instruct:free"
]
}
}
OpenClaw handles the fallback logic automatically. If the primary hits a rate limit or returns an error, it moves to the next model in the chain without intervention. From the orchestrator's perspective, the sub-agent just returns a result.
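OpenClaw's internal fallback code isn't published here, but the behavior described reduces to a simple loop: try each model in order, move on when a call fails. A minimal Python sketch (RateLimited is a stand-in for whatever error a throttled or unavailable provider actually surfaces):

```python
# Sketch of chain-style fallback: first model that succeeds wins.
from typing import Callable

MODEL_CHAIN = [
    "openrouter/nvidia/nemotron-3-super-120b-a12b:free",
    "openrouter/google/gemma-3-27b-it:free",
    "openrouter/meta-llama/llama-3.3-70b-instruct:free",
]

class RateLimited(Exception):
    """Stand-in for a 429 or provider outage."""

def run_with_fallbacks(task: Callable[[str], str], chain=MODEL_CHAIN) -> str:
    last_err = None
    for model in chain:
        try:
            return task(model)  # first success short-circuits the chain
        except RateLimited as err:
            last_err = err  # throttled: fall through to the next model
    raise RuntimeError("all models in the chain failed") from last_err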
The Practical Result
The economics are straightforward. Sub-agents handle the bulk of the operational load in any active agent system: background research, document processing, content drafts, structured lookups. Routing that work to free models means the per-task cost is zero, and the main orchestrator's API spend is reserved for the decisions that actually require it.
Reliability improves too. A three-model chain is more fault-tolerant than any single endpoint. If one provider is having issues, the work continues.
How to Do It Yourself
Step 1: Get an OpenRouter API key
Sign up at openrouter.ai. Free account, no credit card required to use free-tier models.
Step 2: Add it to your environment
export OPENROUTER_API_KEY=sk-or-...
Or add it to your .env file in your OpenClaw workspace.
Step 3: Update openclaw.json
Find or create the agents.defaults block and add the subagents.model config:
{
  "agents": {
    "defaults": {
      "subagents": {
        "model": {
          "primary": "openrouter/nvidia/nemotron-3-super-120b-a12b:free",
          "fallbacks": [
            "openrouter/google/gemma-3-27b-it:free",
            "openrouter/meta-llama/llama-3.3-70b-instruct:free"
          ]
        }
      }
    }
  }
}
Step 4: Restart OpenClaw and verify
Trigger a sub-agent task and confirm the model routing in your logs. You should see the OpenRouter endpoint being hit for sub-agent spawns.
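If you want a check that doesn't depend on reading logs, one option is to query OpenRouter's public model catalog and confirm all three chain models are listed as :free variants. This assumes the /api/v1/models endpoint's response shape (a data array of objects with an id field); the filtering itself is pure.

```python
# Sanity check: are the three chain models live on OpenRouter's free tier?
import json
from urllib import request

CHAIN = {
    "nvidia/nemotron-3-super-120b-a12b:free",
    "google/gemma-3-27b-it:free",
    "meta-llama/llama-3.3-70b-instruct:free",
}

def free_model_ids(model_ids):
    """Keep only IDs pinned to the free tier via the :free suffix."""
    return {mid for mid in model_ids if mid.endswith(":free")}

def missing_chain_models():
    with request.urlopen("https://openrouter.ai/api/v1/models") as resp:
        catalog = json.load(resp)
    available = free_model_ids(m["id"] for m in catalog["data"])
    return CHAIN - available  # empty set means all three are listed

if __name__ == "__main__":
    missing = missing_chain_models()
    print("all chain models available" if not missing else f"missing: {missing}")
```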
Ten minutes of config work.
The Bigger Picture
The firms that will run sustainable AI operations over the next few years aren't necessarily the ones with the biggest model budgets. They're the ones that understand where expensive intelligence is actually required and route everything else appropriately.
This is an infrastructure decision, not a product decision. Most teams don't think about it until they're scaling and the invoice arrives. Getting the model routing right early is the kind of unglamorous work that compounds over time.
Get a free copy of The Architect's Advantage shipped to your door. Just cover shipping.
Rob Johnson is the founder of Hawkwork, an AI operations consultancy based in Tacoma, WA.