Architecture · Technical Reference

NBA model routing: tier-aware inference architecture.

Under the Next Best Action product surface is a tier-aware model routing layer that decides which Claude model handles each inference call. This is the technical reference, with the decision logic, latency budgets, cost numbers, and failure modes we have learned to expect.

For engineering, architecture, and platform leaders

A real-time personalization product cannot afford to route every inference call to the most capable model. The cost compounds too fast, and most real-time decisions do not need the capability. NBA solves this with a model router that makes a routing decision per call based on the workload class, the stakes, and the latency budget. Here is the reference.

The three model tiers in the NBA stack

Tier 1: Claude Haiku 4.5 — the workhorse

Haiku 4.5 handles real-time scoring, classification, extraction, and simple transformations. The model is fast enough to fit inside a 200ms latency budget on most calls and cheap enough to run thousands of times per day per client. In a typical NBA deployment, Haiku 4.5 handles roughly 95% of all inference calls.

What runs on Haiku 4.5: event scoring, sentiment classification, entity extraction, campaign tag assignment, lifecycle stage detection, support ticket triage, draft summary generation, language detection.

Tier 2: Claude Fable 5 — the reasoning model

Fable 5 handles recommendation generation, multi-source synthesis, agentic workflows, and strategic reasoning. It is the model that produces the next-best-action recommendations the marketing leader sees in the NBA Approve interface. In a typical deployment, Fable 5 handles roughly 4% of inference calls and a much higher share of total token spend.

What runs on Fable 5: next-best-action recommendation generation, customer intelligence synthesis, campaign performance reasoning, multi-segment reallocation analysis, agentic loop orchestration, customer journey design.

Tier 3: Claude Opus 4.8 — sensitive-domain fallback

Opus 4.8 is the explicit choice for regulated content and the automatic fallback for the ~5% of Fable 5 calls that Anthropic’s safety classifiers route away from. In a typical NBA deployment, Opus 4.8 handles less than 1% of inference calls.

What runs on Opus 4.8: regulated industry messaging (healthcare, financial services), compliance-sensitive recommendation generation, audit-trail reasoning where predictable safety behavior is required.

The routing decision logic

The router evaluates each inference call against four criteria, in order.

Decision 1: Latency budget

If the call needs to return in under 250ms, it routes to Haiku 4.5 by default. Real-time scoring, in-page personalization, and event-based triggers all sit here. Fable 5 cannot reliably hit this latency budget for non-trivial reasoning.

Decision 2: Workload class

Calls are classified by what they produce. Single-output classifications and extractions go to Haiku 4.5. Multi-step reasoning, recommendation generation, and tool-using agentic loops go to Fable 5.

Decision 3: Sensitive-domain detection

An upstream classifier (running on Haiku 4.5) tags calls by domain. Sensitive domains (healthcare, finance, regulated industries) route directly to Opus 4.8 rather than relying on Fable 5’s automatic fallback. The reasoning: predictable model behavior is worth the slight capability tradeoff in regulated workflows.

Decision 4: Cost ceiling check

Per-client cost ceilings are enforced at the router layer. If a client’s monthly Fable 5 spend is approaching the ceiling, the router downgrades borderline calls to Haiku 4.5 and flags them for review, and the marketing operator running NBA receives an alert. The goal is no surprise bills.
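The four checks above can be sketched as a single function. This is a minimal illustration, not the production router: the `Call` fields, the tier labels, the `borderline` flag, and the 90% threshold for "approaching the ceiling" are all assumptions made for the example, and the sketch assumes the first matching rule wins.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Illustrative labels only, not real API model IDs.
    HAIKU = "haiku-4.5"
    FABLE = "fable-5"
    OPUS = "opus-4.8"


# Hypothetical workload classes that produce a single output.
SINGLE_OUTPUT = {"classification", "extraction"}


@dataclass
class Call:
    latency_budget_ms: int
    workload: str                   # e.g. "classification", "recommendation", "agentic"
    sensitive_domain: bool = False  # set by the upstream domain classifier
    borderline: bool = False        # hypothetical: Haiku could plausibly handle it


@dataclass
class Decision:
    tier: Tier
    reason: str
    flagged_for_review: bool = False


def route(call: Call, fable_spend: float, fable_ceiling: float,
          near_ceiling: float = 0.9) -> Decision:
    # Decision 1: tight latency budgets go to Haiku by default.
    if call.latency_budget_ms < 250:
        return Decision(Tier.HAIKU, "latency_budget")
    # Decision 2: single-output work stays on Haiku.
    if call.workload in SINGLE_OUTPUT:
        return Decision(Tier.HAIKU, "workload_class")
    # Decision 3: sensitive domains go straight to Opus.
    if call.sensitive_domain:
        return Decision(Tier.OPUS, "sensitive_domain")
    # Decision 4: near the ceiling, downgrade borderline calls and flag them.
    if call.borderline and fable_spend >= near_ceiling * fable_ceiling:
        return Decision(Tier.HAIKU, "cost_ceiling", flagged_for_review=True)
    # Everything else is reasoning, recommendation, or agentic work.
    return Decision(Tier.FABLE, "workload_class")
```

Returning the reason alongside the tier keeps each decision explainable, which is what a routing audit log would need.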

State management inside the router

The router maintains a per-customer state object that travels with each inference call. The state holds the customer’s relevant history, the current campaign context, recent interactions, and the active recommendation thread. State size is actively managed:

Rolling summarization. Older interactions get summarized every 30 days to keep state under a token budget.
Relevance pruning. State elements that have not been touched by a recommendation in 90 days drop out unless flagged as durable.
Just-in-time loading. Some state (full transcript history, detailed campaign performance) loads only when a Fable 5 reasoning call needs it.

The state object structure is documented and stable. Different model tiers see different views of state. Haiku 4.5 sees a compressed view sized for fast scoring. Fable 5 sees the full reasoning view. This is the optimization that lets the same architecture run cheap on routine work and rich on the work that matters.
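Relevance pruning and tier-specific views can be sketched as follows. The `StateElement` fields and both function names are hypothetical, and treating the compressed view as "the most recently touched N elements" is an assumption made for illustration; the actual compression strategy is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StateElement:
    key: str
    text: str
    last_touched: datetime  # last time a recommendation used this element
    durable: bool = False   # durable elements are never pruned


def prune(elements, now, max_idle_days=90):
    """Drop elements idle for longer than max_idle_days unless durable."""
    cutoff = now - timedelta(days=max_idle_days)
    return [e for e in elements if e.durable or e.last_touched >= cutoff]


def view_for(tier, elements, compressed_limit=3):
    """Return the slice of state a given tier sees."""
    if tier == "haiku":
        # Assumed compression: keep only the most recently touched elements.
        recent = sorted(elements, key=lambda e: e.last_touched, reverse=True)
        return recent[:compressed_limit]
    # Reasoning tiers see the full view.
    return list(elements)
```

Pruning before building the view means the compressed view is always drawn from state that is still relevant.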

Failure modes and how we handle them

Three failure modes show up often enough to design around.

Fable 5 unexpected fallback to Opus 4.8

Anthropic’s safety classifiers route some queries to Opus 4.8 automatically. The router detects this via the response model metadata and logs it. Most workflows are unaffected. If the workflow requires Fable 5’s capability specifically, the router can retry with prompt rephrasing or escalate to a human operator.
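A minimal sketch of the detection step, assuming the response metadata is available as a dictionary with a `model` key. The key name, the model labels, the `requires_fable` flag, and the returned action strings are hypothetical; the real metadata shape depends on the client library in use.

```python
def handle_fallback(requested: str, metadata: dict, requires_fable: bool) -> str:
    """Compare the requested model with the one that served the call."""
    served = metadata.get("model", requested)
    if served == requested:
        return "accept"
    # A fallback happened: always log it (print stands in for a real logger).
    print(f"fallback: requested={requested} served={served}")
    # Only workflows that need Fable 5 specifically get special handling.
    return "retry_or_escalate" if requires_fable else "accept"
```

For example, `handle_fallback("fable-5", {"model": "opus-4.8"}, requires_fable=True)` returns `"retry_or_escalate"`, while the same call with `requires_fable=False` logs the fallback and accepts the response.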

Latency outliers on Haiku 4.5

Haiku 4.5 occasionally exceeds the latency budget on complex calls. The router uses circuit-breaker logic: if Haiku 4.5 exceeds the budget on three consecutive calls in a given workload class, then for the next minute the router either serves that workload from a cached response or escalates it to Fable 5. The pattern resets automatically once latency normalizes.
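The circuit-breaker behavior can be sketched like this. The class name, method names, and the injectable `clock` are assumptions for the example, and the sketch simplifies "resets once latency normalizes" to a fixed one-minute cooldown.

```python
import time


class LatencyBreaker:
    """Opens per workload class after N consecutive latency misses."""

    def __init__(self, budget_ms, max_misses=3, cooldown_s=60.0,
                 clock=time.monotonic):
        self.budget_ms = budget_ms
        self.max_misses = max_misses
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._misses = {}      # workload class -> consecutive misses
        self._open_until = {}  # workload class -> time the breaker closes

    def record(self, workload, latency_ms):
        """Record one call's latency for a workload class."""
        if latency_ms > self.budget_ms:
            misses = self._misses.get(workload, 0) + 1
            if misses >= self.max_misses:
                # Trip the breaker and start counting from zero again.
                self._open_until[workload] = self.clock() + self.cooldown_s
                misses = 0
            self._misses[workload] = misses
        else:
            # Any in-budget call breaks the streak.
            self._misses[workload] = 0

    def is_open(self, workload):
        """True while the workload should be rerouted away from Haiku."""
        return self.clock() < self._open_until.get(workload, float("-inf"))
```

While `is_open` returns true for a workload class, the caller would serve a cached response or escalate; once the cooldown passes, calls flow back to Haiku 4.5 without manual intervention.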

Cost ceiling breach

If Fable 5 spend approaches the per-client monthly ceiling, the router progressively downgrades borderline calls and emits an alert. The operator and the client are notified. Hard ceilings prevent surprise overages.

Observability requirements

Every inference call emits a structured log entry with: workload class, routing decision, model used, latency, token counts, cost, and response quality signal (when available). These flow into a per-client dashboard that surfaces:

Cost trajectory against monthly ceiling
Latency distribution per workload class
Routing decision distribution (% Haiku vs Fable vs Opus)
Fallback frequency and triggers
Quality signals (approval rate on Fable 5 recommendations)

The observability layer is non-negotiable. Production NBA deployments do not run without it.
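As an illustration of the log entry and one of the dashboard aggregates, here is a minimal sketch. The field names, units, and helper functions are assumptions chosen to mirror the list of logged attributes above, not the actual schema.

```python
import json
from collections import Counter
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class InferenceLog:
    workload_class: str
    routing_decision: str
    model_used: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    quality_signal: Optional[float] = None  # only set when available


def to_json(entry: InferenceLog) -> str:
    """Serialize one entry as a structured (JSON) log line."""
    return json.dumps(asdict(entry), sort_keys=True)


def routing_distribution(entries) -> dict:
    """Percentage of calls handled by each model."""
    counts = Counter(e.model_used for e in entries)
    total = sum(counts.values())
    return {model: 100.0 * n / total for model, n in counts.items()}
```

The same entries can feed the other dashboard views, such as latency percentiles grouped by `workload_class` or cumulative `cost_usd` against the monthly ceiling.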

What this costs at typical scale

For a mid-market client running NBA across ~50,000 contacts with 5 active campaigns:

Haiku 4.5: ~150K calls per month, ~$30-80 per month
Fable 5: ~2.5K calls per month, ~$300-500 per month
Opus 4.8: ~50 calls per month, ~$10-30 per month

Total monthly model spend lands in the $350-$600 range for the typical mid-market deployment. The full NBA implementation cost adds orchestration, integrations, and operator overlay on top of this number.
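As a quick check on the total, summing the per-tier ranges gives $340-$610, which the figure above rounds to $350-$600. The dictionary below simply restates the cost ranges listed for each tier.

```python
# Monthly cost ranges (low, high) in USD, copied from the figures above.
monthly_cost = {
    "Haiku 4.5": (30, 80),
    "Fable 5": (300, 500),
    "Opus 4.8": (10, 30),
}

low = sum(lo for lo, _ in monthly_cost.values())
high = sum(hi for _, hi in monthly_cost.values())
print(f"${low}-${high}")  # prints "$340-$610"
```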

Frequently asked questions

Why not just route everything to the most capable model?

Cost. At Fable 5 pricing, running 150,000 monthly calls through Fable 5 instead of Haiku 4.5 would multiply the model cost line by roughly 10x with no proportional quality improvement on those workloads. The router exists because the cost difference at scale is meaningful.

Does the router add latency?

The routing decision itself takes under 5ms. The total latency impact is negligible compared to the inference call. The bigger latency consideration is model choice: routing to Haiku 4.5 instead of Fable 5 saves 800-2000ms on most calls.

Can clients see the routing decisions?

Yes. The NBA admin surface includes a routing audit log that shows which model handled each call, why it was routed there, and what the per-call cost was. This is part of the operational transparency we ship by default.

How does this handle model deprecation?

The router maps workload classes to model IDs through a configuration layer. When Anthropic ships Fable 6 or deprecates Opus 4.8, we update the mapping. No code changes in the orchestration or product layers. We ran this same pattern through the Sonnet → Opus 4.0 → Opus 4.8 → Fable 5 transitions.
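A minimal sketch of what such a configuration layer might look like. The workload class names, the model labels, and both helper functions are hypothetical; the point is only that a model swap touches the mapping and nothing else.

```python
# Hypothetical mapping; class names and model labels are illustrative,
# not real API model IDs.
ROUTING_CONFIG = {
    "scoring": "haiku-4.5",
    "recommendation": "fable-5",
    "agentic": "fable-5",
    "regulated_messaging": "opus-4.8",
}


def model_for(workload_class: str, config: dict = ROUTING_CONFIG) -> str:
    """Look up the model label for a workload class."""
    return config[workload_class]


def with_replacement(config: dict, old: str, new: str) -> dict:
    """Return a new config where every workload on `old` now uses `new`."""
    return {wc: (new if model == old else model) for wc, model in config.items()}
```

Under these assumptions, moving to a successor model is one call such as `with_replacement(ROUTING_CONFIG, "fable-5", "fable-6")`, and the orchestration code that calls `model_for` stays unchanged.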

Can we use a different model provider with this architecture?

Yes. The router abstraction is provider-agnostic. We have run identical architectures with OpenAI models and Google models. The decision logic in this article uses Anthropic because the cost-to-capability ratio is currently strongest there. We re-evaluate quarterly.