Why we build on Claude, GPT, and open-source.
We are not model-agnostic. We are model-portfolio. Every agent we ship routes work across at least two of the following: Claude, a GPT-4-class model, and an open-source model running on our own infrastructure. It's more work than picking a favorite. It's also the only way we've found to build a system that survives a single provider's bad week.
The case for portfolio
Three reasons we don't pick one.
Capability varies by task. Claude is our default for long-context, voice-matched writing, and reasoning over messy inputs. GPT-4 class is our default for classification, tool-calling, and structured output. Open-source models (we run Llama 3.1 70B and Qwen2.5-Coder 32B on dedicated infra) are our default for high-volume simple work where per-token cost dominates. Using one model for all three leaves quality on the table.
Provider risk is real. Outages happen. Rate limits bite. TOS change. A client's production outbound should not go dark because a provider had a bad hour. Route around it.
Cost arbitrage is real. The same task run on Claude Sonnet, GPT-4o, and our self-hosted Llama can differ in cost by 10x. For volume work, the 10x matters. For quality work, it doesn't; you pick for quality.
FIG. 01 / A ROUTER THAT ASSIGNS MODEL BY TASK SHAPE, NOT BY VENDOR PREFERENCE
How the router works
The router is a fifty-line Python module. It takes a task descriptor (task type, context length, latency requirement, cost ceiling) and returns a model and endpoint. The rules are explicit, documented, and overridable per call. No clever ML. No "it learned."
Rules, roughly:
- Voice-matched writing → Claude
- Long-context synthesis → Claude
- Classification + tool-calling → GPT-4o
- Structured data extraction → GPT-4o-mini
- High-volume, simple prompt → self-hosted Llama or Qwen
- Anything we're unsure about → Claude as default
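A minimal sketch of what a rule table like this could look like, not the actual module. The `Task` and `Route` types, the task-type strings, the route keys, and the `example` endpoints are all hypothetical. The sketch routes on task type alone, because the rules above don't say how context length, latency, or cost ceiling are weighed; those fields are carried on the descriptor but unused here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Task descriptor. Field names are illustrative assumptions."""
    task_type: str
    context_tokens: int = 0                    # carried, not used in this sketch
    max_latency_ms: Optional[int] = None       # carried, not used in this sketch
    cost_ceiling_usd: Optional[float] = None   # carried, not used in this sketch


@dataclass(frozen=True)
class Route:
    model: str
    endpoint: str


# Hypothetical endpoints; swap in whatever gateway or SDK config you use.
ROUTES = {
    "claude": Route("claude", "https://llm-gateway.example/anthropic"),
    "gpt-4o": Route("gpt-4o", "https://llm-gateway.example/openai"),
    "gpt-4o-mini": Route("gpt-4o-mini", "https://llm-gateway.example/openai"),
    "self-hosted": Route("llama-3.1-70b", "http://inference.internal.example/v1"),
}

# Explicit rules: task type -> route key. No learned behavior.
RULES = {
    "voice_writing": "claude",
    "long_context_synthesis": "claude",
    "classification": "gpt-4o",
    "tool_calling": "gpt-4o",
    "structured_extraction": "gpt-4o-mini",
    "high_volume_simple": "self-hosted",
}

DEFAULT_ROUTE = "claude"  # anything we're unsure about


def route(task: Task, override: Optional[str] = None) -> Route:
    """Return the route for a task; a per-call override wins over the rules.

    An override that is not a key of ROUTES raises KeyError.
    """
    key = override or RULES.get(task.task_type, DEFAULT_ROUTE)
    return ROUTES[key]
```

Keeping the rules in a plain dictionary means a reviewer can read the whole routing policy in one screen, and a per-call override is a single argument rather than a config change.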
Where each model actually shines
Claude. Long context (reads a 50-page Q4 plan and extracts commitments), nuanced tone, analytical writing, refusal handling. It is our default model for any draft that goes out in a founder's voice.
GPT-4 class. Structured-output fidelity, ecosystem tooling, speed on short prompts. When an agent needs to emit valid JSON for the 10,000th time, we reach for OpenAI first.
Open-source (self-hosted). Where the task is simple, the volume is high, and the marginal per-run cost of a frontier API would compound badly. Think: classifying 100k emails as inbound/outbound, scoring sentiment on a week of replies, drafting short variants of an opener where quality variance is acceptable.
Weekly model-spend breakdown
Percentage of spend and tokens routed to each model family last week.
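For readers who want to compute the same breakdown from their own logs, here is a rough sketch. It assumes usage records shaped as `(model_family, tokens, cost_usd)` tuples. That shape, the `spend_breakdown` name, and the sample numbers are made up for illustration and are not our real figures.

```python
from collections import defaultdict


def spend_breakdown(records):
    """Return {family: {"spend_pct": ..., "token_pct": ...}} for usage records.

    records: iterable of (model_family, tokens, cost_usd) tuples (assumed shape).
    """
    tokens = defaultdict(int)
    spend = defaultdict(float)
    for family, n_tokens, cost_usd in records:
        tokens[family] += n_tokens
        spend[family] += cost_usd

    total_tokens = sum(tokens.values())
    total_spend = sum(spend.values())
    return {
        family: {
            # Guard against an empty or zero-cost week.
            "spend_pct": 100 * spend[family] / total_spend if total_spend else 0.0,
            "token_pct": 100 * tokens[family] / total_tokens if total_tokens else 0.0,
        }
        for family in tokens
    }


# Made-up sample data, purely illustrative.
sample = [
    ("claude", 100, 6.0),
    ("gpt", 100, 3.0),
    ("open-source", 800, 1.0),
]
report = spend_breakdown(sample)
```

Reporting spend share and token share side by side is what exposes the arbitrage: a family carrying most of the tokens for a small slice of the spend is the volume leg doing its job.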
What this gets us
When Claude had a rough hour on a March morning, our writer agent automatically failed over to GPT-4o with a different prompt template. Quality dropped 5%; clients didn't notice. When we noticed a classification task was burning $800/mo on GPT-4o-mini, we moved it to our self-hosted Llama; the monthly bill dropped to $40 and quality held.
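The failover pattern in that first example can be sketched as an ordered list of legs, each with its own prompt template. This is an illustration under assumptions, not our production code: `ProviderError`, `run_with_failover`, and the stub provider functions are hypothetical, and a real version would wrap actual SDK calls and their specific exceptions.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error a provider wrapper raises on outage or rate limit."""


# One leg = (model name, prompt template for that model, function that calls it).
Leg = Tuple[str, str, Callable[[str], str]]


def run_with_failover(legs: Sequence[Leg], **fields: str) -> Tuple[str, str]:
    """Try each leg in order; return (model, output) from the first that succeeds."""
    errors = []
    for model, template, call in legs:
        prompt = template.format(**fields)  # each model gets its own template
        try:
            return model, call(prompt)
        except ProviderError as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stubs standing in for real provider calls.
def flaky_claude(prompt: str) -> str:
    raise ProviderError("503 from provider")


def stub_gpt(prompt: str) -> str:
    return "ok:" + prompt


legs = [
    ("claude", "Write in the founder's voice about {topic}.", flaky_claude),
    ("gpt-4o", "Topic: {topic}", stub_gpt),
]
model, output = run_with_failover(legs, topic="pricing")
# model == "gpt-4o", output == "ok:Topic: pricing"
```

Pairing each model with its own template in the leg definition is the point: a prompt tuned for one model rarely transfers unchanged, so the fallback prompt has to be written and tested before the outage, not during it.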
The portfolio isn't perfect. Prompts that work beautifully on Claude sometimes need a rewrite for GPT. Self-hosted inference needs an on-call rotation. We accept those costs because the alternative ("we're an OpenAI shop" or "we're an Anthropic shop") is a durability risk we've watched other firms pay for dearly.
When NOT to do this
If you're at day one, ship with one provider. Pick the model that feels best to you, build the agent, measure the outcomes. Don't front-load portfolio complexity for a system that isn't yet working. Portfolio is a month-three concern.
If your volume is small and your team has zero appetite for infra, the open-source leg can wait indefinitely. Run Claude + GPT and call it done.
On "betting on the winner"
We get asked this quarterly. "Aren't you just going to pick one in the end?" The honest answer is no. Even if one provider pulls clearly ahead on every axis, the governance and continuity argument for portfolio doesn't go away. Single-vendor dependency is a posture you earn, not a posture you assume.
We build on Claude, GPT, and open-source because the tasks are different, the costs are different, and the providers are different. Routing across all three isn't strategic hedging. It's just engineering the stack the way it actually wants to be built.
We'll design your model routing too.
Thirty-minute intake. We map your agents to the right models, not the loudest ones.
- Free
- No deck
- Written scope in 48 hrs