Curiosity - Cost, ops & shipping

A 5-step agent makes 5 model calls, so cost grows with every step the agent takes. Model choice and tool design are your main cost levers.

Model selection by role:

Role	Model
Tool routing / classification	`gpt-4o-mini`, `claude-haiku-4-5`
Final answer synthesis	`gpt-4o`, `claude-sonnet-4-6`
Air-gapped / no egress	Local 70B on GPU

Consider routing tool selection to a smaller model and reserving the larger one for the final answer: in a multi-step run, routing calls typically outnumber the single synthesis call.
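The effect of splitting models by role can be estimated with simple arithmetic. The sketch below is illustrative only: the prices, token counts, and the names `ModelPrice`, `call_cost`, and `run_cost` are assumptions made up for the example, not real provider rates or part of any API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per 1M tokens. All figures here are placeholders, not real rates."""
    input_per_million: float
    output_per_million: float


def call_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single model call."""
    return (input_tokens * price.input_per_million
            + output_tokens * price.output_per_million) / 1_000_000


def run_cost(calls) -> float:
    """Total cost of one agent run; `calls` is a list of (price, in, out) tuples."""
    return sum(call_cost(price, tokens_in, tokens_out)
               for price, tokens_in, tokens_out in calls)


# Hypothetical prices and token counts, chosen only to make the arithmetic visible.
SMALL = ModelPrice(input_per_million=0.10, output_per_million=0.40)
LARGE = ModelPrice(input_per_million=2.00, output_per_million=8.00)
ROUTING = (2_000, 100)     # (input, output) tokens per tool-routing call
SYNTHESIS = (4_000, 500)   # (input, output) tokens for the final answer

# A 5-step run: 4 routing calls followed by 1 synthesis call.
all_large = run_cost([(LARGE, *ROUTING)] * 4 + [(LARGE, *SYNTHESIS)])
split = run_cost([(SMALL, *ROUTING)] * 4 + [(LARGE, *SYNTHESIS)])

print(f"all large: ${all_large:.5f}  split: ${split:.5f}")
```

Under these made-up numbers the split run costs roughly $0.013 against $0.031 for the all-large run. Substitute your provider's actual rates and your own measured token counts before drawing conclusions.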

Cost guardrails:

Daily token ceiling: Settings → AI Settings → Quotas
Set max_tokens per call (start at 1024) and cap tool/sub-agent calls per turn
Cache aggressively: identical queries in a session reuse previous results
Monitor per-tool metrics: GET /api/chatai/tools/metrics (counts, latency, p95, error rate)
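Two of the guardrails above, the per-turn tool-call cap and the session cache, can be sketched in a few lines. This is a minimal client-side illustration, not the platform's own mechanism: `SessionGuard`, `BudgetExceeded`, and `fake_runner` are hypothetical names introduced for the example.

```python
import json


class BudgetExceeded(RuntimeError):
    """Raised when a turn tries to exceed its tool-call cap."""


class SessionGuard:
    """Caps tool calls per turn and caches identical tool calls per session."""

    def __init__(self, max_tool_calls_per_turn: int = 4):
        self.max_tool_calls_per_turn = max_tool_calls_per_turn
        self._calls_this_turn = 0
        self._cache = {}

    def start_turn(self) -> None:
        """Reset the per-turn counter; the cache survives for the whole session."""
        self._calls_this_turn = 0

    def call_tool(self, name: str, args: dict, run_tool):
        # Canonical JSON makes {"a": 1, "b": 2} and {"b": 2, "a": 1} the same key.
        key = (name, json.dumps(args, sort_keys=True))
        if key in self._cache:
            return self._cache[key]  # cache hit: no budget consumed
        if self._calls_this_turn >= self.max_tool_calls_per_turn:
            raise BudgetExceeded(
                f"tool-call cap of {self.max_tool_calls_per_turn} reached")
        self._calls_this_turn += 1
        result = run_tool(name, args)
        self._cache[key] = result
        return result


# Usage with a stand-in tool runner (hypothetical; replace with a real dispatcher).
executed = []


def fake_runner(name, args):
    executed.append((name, args))
    return f"{name} result"


guard = SessionGuard(max_tool_calls_per_turn=2)
guard.start_turn()
guard.call_tool("search", {"q": "refund policy"}, fake_runner)
guard.call_tool("search", {"q": "refund policy"}, fake_runner)  # served from cache
```

Cache hits do not count against the cap here, which is one possible design choice: a repeated query costs nothing, so only new work consumes the budget.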

Most agents fall into one of two shapes, single-shot extractors or RAG agents. Start from the nearest example below and change the schema:

Agent	Tools	Model	Shape
Ticket triage	none	small	Single-shot classify → enum + action
KB Q&A	search + fetch	mid	RAG → grounded answer with `[1]` citations
Lead qualifier	graph snapshots	mid	RAG → numeric score + reasons
Document enricher	none	small	Per-node enrich → summary, tags, sentiment

Single-shot extractors — no tools, one model call, one structured output. Small model, big batch (a code index can run one against every node of a type).
RAG agents — 2–4 focused tools, a larger model, a schema with explicit citation fields.
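The two shapes differ mainly in their output schema. As a rough sketch, a single-shot classifier can return a closed enum plus an action, while a RAG agent can return an answer with explicit citation fields that are checked against its `[n]` markers. All type and field names below (`TriageLabel`, `Citation`, `source_id`, and so on) are assumptions for illustration, not the product's schema format.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class TriageLabel(Enum):
    """Closed label set for a single-shot classifier (labels are illustrative)."""
    BUG = "bug"
    BILLING = "billing"
    OTHER = "other"


@dataclass
class TriageResult:
    label: TriageLabel
    action: str  # e.g. the queue to route the ticket to


@dataclass
class Citation:
    index: int      # the n in a "[n]" marker
    source_id: str  # identifier of the retrieved document (hypothetical field)
    quote: str      # the passage that supports the claim


@dataclass
class GroundedAnswer:
    answer: str
    citations: list = field(default_factory=list)


def parse_triage(raw: dict) -> TriageResult:
    """Reject any label outside the enum instead of passing it downstream."""
    return TriageResult(label=TriageLabel(raw["label"]), action=raw["action"])


def unresolved_markers(result: GroundedAnswer) -> set:
    """Return citation markers used in the answer that have no matching citation."""
    used = {int(n) for n in re.findall(r"\[(\d+)\]", result.answer)}
    known = {c.index for c in result.citations}
    return used - known
```

A non-empty result from `unresolved_markers` signals that the model cited a source it never returned, which is a cheap check to run before showing the answer to a user.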

Shipping checklist:

Pin a model (ChatTaskUID) — don't leave it to the caller's default in production.
Smallest tool set that does the job — overlapping tools degrade routing.
Define an OutputSchema whenever a downstream system consumes the result.
CurrentUser on every run and every tool — never the system identity for user-facing work; treat LLM-supplied parameters as untrusted.
Destructive actions: propose → confirm, never auto-execute.
Export the agent as code and promote dev → staging → production like an endpoint.
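The propose → confirm rule and the advice to treat LLM-supplied parameters as untrusted can be combined in one small gate. The following is a sketch under stated assumptions: `ConfirmationGate`, `Proposal`, the alphanumeric `target_id` format, and the string user identity are all hypothetical and stand in for whatever your deployment actually uses.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    token: str
    action: str
    target_id: str
    requested_by: str


class ConfirmationGate:
    """Holds destructive actions until the requesting user confirms them."""

    def __init__(self, allowed_actions):
        self._allowed = frozenset(allowed_actions)
        self._pending = {}

    def propose(self, action: str, target_id: str, current_user: str) -> Proposal:
        # `action` and `target_id` come from the model: validate them as untrusted.
        if action not in self._allowed:
            raise ValueError(f"action not allowed: {action!r}")
        # Assumed ID format for this sketch; adapt to your real identifiers.
        if not (isinstance(target_id, str) and target_id.isalnum()):
            raise ValueError("target_id must be an alphanumeric string")
        proposal = Proposal(secrets.token_urlsafe(16), action, target_id, current_user)
        self._pending[proposal.token] = proposal
        return proposal  # nothing has been executed yet

    def confirm(self, token: str, current_user: str, execute):
        proposal = self._pending.get(token)
        if proposal is None:
            raise KeyError("unknown or already used token")
        if proposal.requested_by != current_user:
            raise PermissionError("only the requesting user may confirm")
        del self._pending[token]  # single use: a token cannot be replayed
        return execute(proposal)
```

The agent only ever calls `propose`; `confirm` is reached through an explicit user action, so a model that hallucinates or is prompt-injected cannot execute the destructive step by itself.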