
ADR-005: LiteLLM Gateway for Model Abstraction

Status: Accepted
Date: 2025-10-25
Author: Spencer Fuller

The OpenClaw platform runs multiple AI agents that make LLM API calls — chat completions, embeddings, function calling — across different providers (Anthropic Claude, OpenAI GPT, Google Gemini). Each agent has different model requirements: the main agent needs high-capability reasoning (Claude Opus), subagents can use cheaper models (Claude Sonnet, GPT-4o-mini), and lightweight tasks (summarization, classification) could run on local models via Ollama on the ARM64 nodes.

Without abstraction, each agent embeds provider-specific SDK calls, API keys, and retry logic. Switching from Claude to GPT for a specific agent means code changes. Cost tracking requires instrumenting every agent individually. Fallback behavior (if Anthropic is down, try OpenAI) must be implemented per-agent.

Key requirements:

  • Provider portability — swap models without changing agent code
  • Cost tracking — per-agent, per-session cost visibility across all providers
  • Fallback chains — automatic failover when a provider is unavailable
  • Local inference — route lightweight tasks to Ollama running on-cluster for cost savings and latency reduction
  • Kubernetes-native — deploy as a service with autoscaling based on request load

Deploy LiteLLM as a centralized model gateway in the llm-gateway namespace, sitting between all agents and LLM providers. Agents make OpenAI-compatible API calls to LiteLLM’s endpoint, which routes requests to the configured provider based on model name, applies fallback chains, and logs cost/usage metrics. Autoscaling is handled by KEDA (Kubernetes Event-Driven Autoscaling) based on request queue depth.
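
To make the shape of this concrete, here is a minimal agent-side sketch, assuming the openai Python SDK pointed at an in-cluster gateway Service; the hostname, port, env var name, and model alias are illustrative, not the actual deployment values.

```python
# Agent-side view of the gateway: one OpenAI-compatible endpoint, one client.
# The Service hostname, port, env var name, and model alias are assumptions
# for illustration, not the real deployment values.
import os

from openai import OpenAI

client = OpenAI(
    base_url="http://litellm.llm-gateway.svc.cluster.local:4000/v1",
    api_key=os.environ["LITELLM_API_KEY"],  # per-agent key issued by the gateway
)

response = client.chat.completions.create(
    model="claude-opus",  # logical alias the gateway maps to a provider model
    messages=[{"role": "user", "content": "Summarize today's cluster alerts."}],
)
print(response.choices[0].message.content)
```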

  1. OpenAI-compatible API as the universal interface. LiteLLM exposes a single OpenAI-compatible endpoint (/chat/completions, /embeddings). Agents code against one API regardless of the backing provider. Switching an agent from Claude Opus to GPT-4o means changing a model string in configuration, not rewriting API calls. This is provider portability without abstraction layers in application code (a configuration sketch follows this list).

  2. Built-in cost tracking per API key and model. LiteLLM logs token usage and estimated cost for every request, tagged by API key (mapped to agent identity) and model. This gives per-agent cost breakdowns without instrumenting each agent — the gateway observes all traffic. For a homelab running multiple agents, knowing that “the main agent spent $12 on Claude Opus this week while subagents spent $3 on Sonnet” informs model selection decisions (a per-agent key sketch follows this list).

  3. Fallback chains for provider resilience. LiteLLM supports model fallback configuration: if Claude is rate-limited or down, automatically try GPT-4o, then Gemini Pro. This is configured at the gateway level — agents don’t know or care about failover logic. Provider outages (which happen more often than you’d expect) stop being agent-visible incidents (a fallback-chain sketch follows this list).

  4. Local inference routing via Ollama. The cluster’s Orange Pi 5 nodes (RK3588 with 6 TOPS NPU) run Ollama for lightweight models. LiteLLM routes requests for local models (e.g., ollama/llama3.2) to the on-cluster Ollama instance. Tasks like text classification, simple summarization, or embedding generation run locally — zero API cost, sub-100ms latency, and no external network dependency (a local-routing sketch follows this list).

  5. KEDA autoscaling based on demand. KEDA scales the LiteLLM deployment based on request metrics. During quiet periods (overnight), the gateway scales down to minimal replicas. When multiple agents are active (daytime, batch processing), it scales up. This keeps resource usage proportional to actual demand on resource-constrained SBC nodes.
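
For point 1, the only agent-facing knob is the model string, so model selection can live entirely in configuration. A hypothetical per-agent model map is sketched below; the agent role names and env var scheme are invented for illustration.

```python
import os

# Hypothetical per-agent model configuration. Moving the main agent from
# Claude Opus to GPT-4o is a one-line edit here (or an env override such as
# OPENCLAW_MODEL_MAIN=gpt-4o); no call sites change because every agent
# speaks the same OpenAI-compatible API to the gateway.
DEFAULT_MODELS = {
    "main": "claude-opus",
    "subagent": "claude-sonnet",
    "summarizer": "ollama/llama3.2",
}

def model_for(agent: str) -> str:
    return os.environ.get(f"OPENCLAW_MODEL_{agent.upper()}", DEFAULT_MODELS[agent])
```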
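
For point 2, attribution works because each agent authenticates with its own gateway-issued key; the standard OpenAI user field can add finer-grained grouping. The key name, Service URL, and tag below are illustrative.

```python
import os

from openai import OpenAI

# Each agent uses its own gateway-issued key, so the gateway attributes token
# usage and estimated cost to that agent with no agent-side instrumentation.
# Env var name, Service URL, and the "user" tag are illustrative.
subagent = OpenAI(
    base_url="http://litellm.llm-gateway.svc.cluster.local:4000/v1",
    api_key=os.environ["SUBAGENT_LITELLM_KEY"],
)

subagent.chat.completions.create(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "Draft release notes for the latest changes."}],
    user="subagent:session-42",  # optional tag the gateway can log for per-session grouping
)
```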
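
For point 3, the fallback order lives in the gateway's own configuration, so no agent code is involved; the sketch below expresses the same chain with LiteLLM's Python Router purely to show the semantics. Model identifiers and env var names are placeholders.

```python
import os

from litellm import Router

# Fallback chain: try Claude Opus first; on rate limits or provider errors,
# retry the same request on GPT-4o, then Gemini Pro. In the gateway
# deployment this ordering is declared in the proxy's configuration; the
# Router here only illustrates the semantics. Model IDs are placeholders.
router = Router(
    model_list=[
        {"model_name": "claude-opus",
         "litellm_params": {"model": "anthropic/claude-opus-4-1",
                            "api_key": os.environ["ANTHROPIC_API_KEY"]}},
        {"model_name": "gpt-4o",
         "litellm_params": {"model": "openai/gpt-4o",
                            "api_key": os.environ["OPENAI_API_KEY"]}},
        {"model_name": "gemini-pro",
         "litellm_params": {"model": "gemini/gemini-1.5-pro",
                            "api_key": os.environ["GEMINI_API_KEY"]}},
    ],
    fallbacks=[{"claude-opus": ["gpt-4o", "gemini-pro"]}],
)

response = router.completion(
    model="claude-opus",
    messages=[{"role": "user", "content": "ping"}],
)
```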
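
For point 4, a local call is indistinguishable from a cloud call on the agent side; only the model name differs. The Service URL, key, and model tag below are illustrative.

```python
import os

from openai import OpenAI

# The "ollama/" model name tells the gateway to route this request to the
# on-cluster Ollama instance rather than a cloud provider. Service URL,
# env var name, and model tag are illustrative.
client = OpenAI(
    base_url="http://litellm.llm-gateway.svc.cluster.local:4000/v1",
    api_key=os.environ["LITELLM_API_KEY"],
)

label = client.chat.completions.create(
    model="ollama/llama3.2",
    messages=[{
        "role": "user",
        "content": "Classify this log line as INFO, WARN, or ERROR: 'disk 87% full on node orangepi-03'",
    }],
).choices[0].message.content
print(label)
```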

Alternatives considered, and why not:

  • Direct provider SDKs: Each agent imports Anthropic’s SDK, OpenAI’s SDK, or Google’s SDK directly. This works but creates vendor lock-in at the code level — switching providers means changing import statements, request formats, and error handling in every agent. No centralized cost tracking. No fallback chains without per-agent implementation. Model selection is hardcoded rather than configurable.

  • Custom proxy: Build a bespoke model routing proxy with exactly the features needed. But LiteLLM already exists, is actively maintained, and supports 100+ providers. Building a custom proxy means maintaining provider-specific API adapters, token counting logic, cost calculation tables, and fallback orchestration — all of which LiteLLM handles. The maintenance burden isn’t justified when an open-source solution fits.

  • vLLM: vLLM is a high-performance inference engine for self-hosting large models. It’s designed for GPU clusters running 70B+ parameter models with continuous batching and PagedAttention. On a 4-node ARM64 SBC cluster with no discrete GPUs, vLLM is both overkill and impractical. The cluster’s local inference needs are served by Ollama running small models; heavy inference is delegated to cloud providers via LiteLLM.

Positive consequences:

  • Agents are provider-agnostic — model selection is a configuration change, not a code change
  • Centralized cost dashboard shows per-agent, per-model spending without per-agent instrumentation
  • Provider outages are handled transparently via fallback chains; agents see retries, not failures
  • Lightweight tasks run on-cluster via Ollama, saving API costs and reducing external dependencies

Negative consequences and trade-offs:

  • Additional network hop. Every LLM request traverses agent → LiteLLM → provider instead of agent → provider directly. For streaming chat completions, the gateway adds minimal latency (first-token time increases by a few milliseconds). For high-throughput batch operations, the gateway becomes a bottleneck if not scaled appropriately
  • Gateway complexity. LiteLLM is a substantial Python application with its own configuration, database, and API. It adds operational surface area — another service to monitor, upgrade, and debug when things go wrong
  • Token counting accuracy. LiteLLM estimates token counts and costs based on model pricing tables that may lag behind provider changes. Cost tracking is directionally accurate but not authoritative — it’s a planning tool, not an accounting system
  • KEDA scaling latency. Cold-start scaling (0 → 1 replicas) takes 10-30 seconds as the pod starts and loads configuration. During this window, agent requests queue or fail. Mitigated by maintaining a minimum of 1 replica, but this trades resource efficiency for availability