
Cost, ops & shipping
A 5-step agent makes 5 model calls, so cost scales with the number of steps. Model choice and tool design are your main cost levers.
Model selection by role:
| Role | Model |
|---|---|
| Tool routing / classification | gpt-4o-mini, claude-haiku-4-5 |
| Final answer synthesis | gpt-4o, claude-sonnet-4-6 |
| Air-gapped / no egress | Local 70B on GPU |
Consider a smaller model for tool selection and a larger one only for the final answer.
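To make the split above concrete, here is a minimal sketch that estimates the cost of a multi-step run when only the last call uses the larger model. The names `ModelSpec`, `MODEL_BY_ROLE`, and `estimate_run_cost`, the model labels, and all prices are hypothetical placeholders rather than real rates; substitute your provider's current pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    usd_per_mtok_in: float   # hypothetical price per 1M input tokens
    usd_per_mtok_out: float  # hypothetical price per 1M output tokens


# Placeholder models and prices, for illustration only.
MODEL_BY_ROLE = {
    "routing": ModelSpec("small-model", 0.10, 0.40),
    "synthesis": ModelSpec("large-model", 2.00, 8.00),
}


def call_cost(spec: ModelSpec, tokens_in: int, tokens_out: int) -> float:
    """Cost in USD of one model call."""
    return (tokens_in * spec.usd_per_mtok_in
            + tokens_out * spec.usd_per_mtok_out) / 1_000_000


def estimate_run_cost(steps: int, tokens_in: int, tokens_out: int,
                      split: bool = True) -> float:
    """Cost of a run with `steps` model calls.

    The last call is the final answer (large model). With split=True the
    earlier calls use the small routing model; otherwise every call uses
    the large model.
    """
    large = MODEL_BY_ROLE["synthesis"]
    early = MODEL_BY_ROLE["routing"] if split else large
    return ((steps - 1) * call_cost(early, tokens_in, tokens_out)
            + call_cost(large, tokens_in, tokens_out))
```

Comparing `estimate_run_cost(5, 2000, 500)` with `estimate_run_cost(5, 2000, 500, split=False)` shows how much of the bill comes from the routing steps under whatever prices you plug in.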
Cost guardrails:
- Daily token ceiling: Settings → AI Settings → Quotas
- `max_tokens` per call (start at 1024); cap tool/sub-agent calls per turn
- Cache aggressively: identical queries in a session re-use previous results
- Monitor per-tool metrics: `GET /api/chatai/tools/metrics` (counts, latency, p95, error rate)
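The guardrails above can also be enforced client-side. The following is a rough sketch under stated assumptions: the `Guardrails` class, `BudgetExceeded`, the default of 5 tool calls per turn, and the `run_tool` callback signature are all hypothetical and not part of the product's API.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a guardrail would be violated."""


@dataclass
class Guardrails:
    daily_token_ceiling: int
    max_tokens_per_call: int = 1024    # starting point suggested above
    max_tool_calls_per_turn: int = 5   # hypothetical default
    tokens_used_today: int = 0
    tool_calls_this_turn: int = 0
    _cache: dict = field(default_factory=dict, init=False, repr=False)

    def start_turn(self) -> None:
        self.tool_calls_this_turn = 0

    def clamp_max_tokens(self, requested: int) -> int:
        """Never let a caller ask for more than the per-call cap."""
        return min(requested, self.max_tokens_per_call)

    def charge(self, tokens: int) -> None:
        if self.tokens_used_today + tokens > self.daily_token_ceiling:
            raise BudgetExceeded("daily token ceiling reached")
        self.tokens_used_today += tokens

    def call_tool(self, name, query, run_tool):
        """Run a tool through the cache and the per-turn cap.

        `run_tool(name, query)` is assumed to return (result, tokens_used).
        """
        key = (name, query)
        if key in self._cache:
            # Identical query in this session: reuse the result, no charge.
            return self._cache[key]
        if self.tool_calls_this_turn >= self.max_tool_calls_per_turn:
            raise BudgetExceeded("per-turn tool call cap reached")
        result, tokens = run_tool(name, query)
        # Token usage is only known after the call, so the ceiling is
        # checked here; a result that breaks the ceiling is not cached.
        self.charge(tokens)
        self.tool_calls_this_turn += 1
        self._cache[key] = result
        return result
```

A cache keyed on `(tool name, query)` only makes sense for read-only tools; anything with side effects should bypass it.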
Most agents fall into two shapes — start from the nearest and change the schema:
| Agent | Tools | Model | Shape |
|---|---|---|---|
| Ticket triage | none | small | Single-shot classify → enum + action |
| KB Q&A | search + fetch | mid | RAG → grounded answer with [1] citations |
| Lead qualifier | graph snapshots | mid | RAG → numeric score + reasons |
| Document enricher | none | small | Per-node enrich → summary, tags, sentiment |
- Single-shot extractors — no tools, one model call, one structured output. Small model, big batch (a code index can run one against every node of a type).
- RAG agents — 2–4 focused tools, a larger model, a schema with explicit citation fields.
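The two shapes differ mainly in their output schema. As an illustration only, the sketch below models a single-shot triage result (enum + action) and a RAG answer with explicit citation fields. The category values, field names, and parser functions are hypothetical, not the product's schema format.

```python
import re
from dataclasses import dataclass
from enum import Enum


class TicketCategory(str, Enum):  # hypothetical categories
    BILLING = "billing"
    BUG = "bug"
    OTHER = "other"


@dataclass(frozen=True)
class TriageResult:
    """Single-shot extractor output: an enum plus an action."""
    category: TicketCategory
    action: str


@dataclass(frozen=True)
class Citation:
    index: int       # the [n] marker used in the answer text
    source_id: str


@dataclass(frozen=True)
class GroundedAnswer:
    """RAG agent output: answer text plus explicit citation fields."""
    answer: str
    citations: tuple


def parse_triage(raw: dict) -> TriageResult:
    # TicketCategory(...) raises ValueError for values outside the enum.
    return TriageResult(TicketCategory(raw["category"]), str(raw["action"]))


def parse_grounded(raw: dict) -> GroundedAnswer:
    cites = tuple(Citation(int(c["index"]), str(c["source_id"]))
                  for c in raw["citations"])
    # Every [n] marker in the answer must point at a listed citation.
    markers = {int(m) for m in re.findall(r"\[(\d+)\]", raw["answer"])}
    missing = markers - {c.index for c in cites}
    if missing:
        raise ValueError(f"uncited markers: {sorted(missing)}")
    return GroundedAnswer(str(raw["answer"]), cites)
```

Rejecting an answer whose `[n]` markers have no matching citation is one cheap way to catch ungrounded output before a downstream consumer sees it.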
Shipping checklist:
- Pin a model (`ChatTaskUID`): don't leave it to the caller's default in production.
- Smallest tool set that does the job: overlapping tools degrade routing.
- `OutputSchema` whenever a downstream consumes the result.
- `CurrentUser` on every run and every tool: never the system identity for user-facing work; treat LLM-supplied parameters as untrusted.
- Destructive actions: propose → confirm, never auto-execute.
- Export the agent as code and promote dev → staging → production like an endpoint.
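Two items on the checklist, validating LLM-supplied parameters and the propose → confirm flow for destructive actions, can be sketched together. This is a minimal illustration, not the platform's mechanism: `ProposalStore`, the action names, and the `execute` callback are hypothetical.

```python
import uuid
from dataclasses import dataclass

# Hypothetical allow-list of destructive actions the agent may propose.
ALLOWED_ACTIONS = {"delete_node", "archive_node"}


@dataclass(frozen=True)
class Proposal:
    proposal_id: str
    user: str
    action: str
    target_id: str


class ProposalStore:
    def __init__(self):
        self._pending = {}

    def propose(self, user: str, action: str, target_id: str) -> Proposal:
        """Record a destructive action without executing it."""
        # `action` and `target_id` come from the model: treat as untrusted.
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"action not allowed: {action!r}")
        if not isinstance(target_id, str) or not target_id.isalnum():
            raise ValueError("target_id must be a plain alphanumeric id")
        proposal = Proposal(uuid.uuid4().hex, user, action, target_id)
        self._pending[proposal.proposal_id] = proposal
        return proposal

    def confirm(self, proposal_id: str, user: str, execute):
        """Execute a pending proposal once, as the user who requested it."""
        proposal = self._pending.get(proposal_id)
        if proposal is None:
            raise KeyError("unknown or already confirmed proposal")
        if proposal.user != user:
            raise PermissionError("only the proposing user may confirm")
        del self._pending[proposal_id]
        return execute(proposal)
```

The agent only ever calls `propose`; `confirm` is wired to an explicit user action, so nothing destructive runs on the model's say-so alone.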