# Production deployment checklist
A tutorial-shaped take on the Deployment reference page — same content, but ordered as a sequence you can walk for a single workspace going to production. Use it as the "yes / no, did we do this?" page during a launch.
If you're operating multiple workspaces, the canonical reference is Workspace deployment. This tutorial is the abridged developer-facing path through it.
## Goals
| Goal | Why |
|---|---|
| Reliability | Predictable uptime, fast recovery. |
| Security | TLS, scoped tokens, ReBAC, secrets discipline. |
| Scalability | Handle data growth and query load. |
| Reproducibility | Dev → staging → prod is mechanical, not heroic. |
## Phase 1 — environment shape
Mirror the production shape in staging; right-size dev.
| Environment | Purpose | Notes |
|---|---|---|
| Dev | Engineer-owned, may be on a laptop | Local Docker run, default ports, generated admin password. |
| Staging | Pre-production validation | Prod-shaped manifest, smaller capacity, isolated secrets. |
| Production | Real users | Restricted access, change control, no shell access by default. |
The promotion path goes: code/config lands in git → tested in dev → deployed to staging → validated → promoted to prod. No exceptions.
## Phase 2 — what's in git
Everything below should be a tracked, versioned artifact:
- Connector code
- Custom endpoint and AI tool code (exported from the workspace UI)
- Custom interface bundles (Tesserae / H5)
- Schema migrations and ingestion pipeline definitions
- Search index configuration (indexed fields, boosts, facets)
- NLP pipeline configuration (entity capture, embeddings field selection)
- The deployment manifest (Docker Compose / Kubernetes / Helm / Terraform)
The workspace stores UI-managed configuration inside the graph, so a configuration export + import lets you snapshot and promote workspaces. See Backup and restore.
## Phase 3 — the checklist
### Image and runtime
- Versioned image tag (`curiosityai/curiosity:vX.Y.Z`), not `:latest`.
- Container memory and CPU sized for embeddings (start at 16 GB / 8 vCPU; bigger for large corpora).
- Healthcheck on `/api/login/check`.
- `terminationGracePeriodSeconds` ≥ 60 so the workspace can flush before being killed.
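The runtime items above can be sketched as a single `docker run` invocation. The image tag, sizing, grace period, and healthcheck path come from the checklist; the container port (8080, matching the Phase 4 proxy) and the exact flag layout are assumptions about your setup:

```shell
# Sketch only; adjust sizing and port to your environment.
docker run -d \
  --name workspace \
  --memory 16g --cpus 8 \
  --stop-timeout 60 \
  --health-cmd 'curl -fsS http://localhost:8080/api/login/check || exit 1' \
  --health-interval 30s --health-retries 3 \
  -p 127.0.0.1:8080:8080 \
  curiosityai/curiosity:vX.Y.Z   # pinned tag, never :latest
```

`--stop-timeout 60` is Docker's analogue of Kubernetes' `terminationGracePeriodSeconds`; binding to `127.0.0.1` keeps the container reachable only through the reverse proxy.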
### Storage
- Persistent volume on SSD-backed block storage attached to `MSK_GRAPH_STORAGE`.
- Separate volume (or directory) for `MSK_GRAPH_BACKUP_FOLDER`.
- Backups scheduled, off-host, and tested by restoring to a sandbox.
- Volume expansion enabled so you can grow without downtime.
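A minimal sketch of the storage wiring. Only the `MSK_*` variable names come from the checklist; the host paths and container-side mount points are placeholders of our choosing:

```shell
docker run -d \
  -v /mnt/ssd/graph:/data/graph \           # SSD-backed persistent volume
  -v /mnt/backup/graph:/data/backup \       # separate backup volume
  -e MSK_GRAPH_STORAGE=/data/graph \
  -e MSK_GRAPH_BACKUP_FOLDER=/data/backup \
  curiosityai/curiosity:vX.Y.Z
```

Keeping the backup mount on a different device (or shipping it off-host) is what makes the "backups tested by restoring to a sandbox" item meaningful.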
### Networking and TLS
- TLS terminated at the proxy or inside the container; HSTS enabled.
- `MSK_PUBLIC_ADDRESS` set to the user-facing URL.
- No `0.0.0.0` exposure without an authenticating front-end.
- Egress allowlist documented (Docker registry, NuGet, your LLM provider).
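Two quick checks for the TLS and exposure items, using the hostname from the Phase 4 example (the `PUBLIC_IP` placeholder is yours to fill in):

```shell
# HSTS header should be present on every HTTPS response.
curl -sI https://workspace.example.com | grep -i strict-transport-security

# The container port should NOT answer on the public interface.
curl -m 5 -s http://PUBLIC_IP:8080/ >/dev/null && echo "EXPOSED" || echo "not reachable"
```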
### Identity and secrets
- `MSK_ADMIN_PASSWORD` set explicitly (default `admin`/`admin` never used).
- `MSK_JWT_KEY` set explicitly so tokens survive restarts.
- `MSK_GRAPH_MASTER_KEY` set explicitly and backed up — losing it means losing encrypted content.
- All secrets injected from a secret manager (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, Vault).
- At least one SSO provider configured.
- Admin sign-in via SSO only; the local `admin` account disabled after onboarding.
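A sketch of the secret-injection item using AWS Secrets Manager; the secret IDs are placeholders, and the same pattern works with Vault or the Azure/GCP equivalents:

```shell
# Pull secrets at deploy time; nothing lands in the manifest or in git.
export MSK_ADMIN_PASSWORD="$(aws secretsmanager get-secret-value \
  --secret-id workspace/admin-password --query SecretString --output text)"
export MSK_JWT_KEY="$(aws secretsmanager get-secret-value \
  --secret-id workspace/jwt-key --query SecretString --output text)"
export MSK_GRAPH_MASTER_KEY="$(aws secretsmanager get-secret-value \
  --secret-id workspace/graph-master-key --query SecretString --output text)"

# Passing -e VAR with no value forwards the exported value without echoing it.
docker run -d \
  -e MSK_ADMIN_PASSWORD -e MSK_JWT_KEY -e MSK_GRAPH_MASTER_KEY \
  curiosityai/curiosity:vX.Y.Z
```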
### Permissions and tokens
- Connectors run on dedicated tokens with `ingestion` scope only.
- External integrations use endpoint tokens scoped to specific endpoints.
- Token rotation documented and scheduled.
- Every user-facing endpoint uses `CreateSearchAsUserAsync` (not `CreateSearchAsync`).
### Observability
- Stdout logs routed to your aggregator; audit log forwarded to your SIEM.
- Alerts on liveness, latency regressions, ingestion failures, container restart rate.
- Per-endpoint and per-tool metrics scraped into your monitoring system.
### Disaster recovery
- Documented RPO and RTO targets.
- Restore drill completed within the past quarter.
- Secrets manager backups verified.
## Phase 4 — reverse proxy
A minimal NGINX block, ready to drop in:
```nginx
server {
    listen 443 ssl http2;
    server_name workspace.example.com;

    ssl_certificate     /etc/letsencrypt/live/workspace.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/workspace.example.com/privkey.pem;

    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

    client_max_body_size 100m;
    proxy_read_timeout 300s;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
Set `MSK_PUBLIC_ADDRESS=https://workspace.example.com` so generated links use the proxy's hostname.
## Phase 5 — rolling out a change
Recommended sequence:
- Take a backup of the graph volume.
- Apply the change in staging; walk the post-restore validation in Backup and restore.
- Promote to production during a low-traffic window.
- Watch Monitoring for 30 minutes.
- Be prepared to roll back: revert image tag + config, restart, restore if data shape changed.
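The five steps above, sketched for a Docker Compose deployment. The service name, file paths, and the cold-copy backup are assumptions; substitute your real backup procedure from Backup and restore:

```shell
# 1. Backup the graph volume (cold copy shown; use your real backup procedure).
docker compose stop workspace
cp -a /mnt/ssd/graph "/mnt/backup/graph-$(date +%F)"

# 2-3. Promote the staging-validated tag and redeploy during the low-traffic window.
sed -i 's|curiosity:v1.2.3|curiosity:v1.2.4|' docker-compose.yml
docker compose up -d workspace

# 4. Watch health for the 30-minute window.
watch -n 30 'docker inspect --format "{{.State.Health.Status}}" CONTAINER_NAME'

# 5. Rollback: revert the tag in git, redeploy, restore the copy if the data shape changed.
```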
## Phase 6 — application-level checklist
The platform checklist gets the workspace healthy; the app-level checklist makes sure the app is right.
- Connectors have idempotency tests (re-running is a no-op for unchanged records).
- Schema migrations are committed to git and replayable.
- Every endpoint that runs on behalf of a user calls `CreateSearchAsUserAsync`.
- Eval suite green for retrieval and AI tools (see Evaluation framework).
- Audit nodes (for chat, AI tools, sensitive endpoints) are write-only for non-admins.
- Token budgets and rate limits set on the LLM provider account.
- On-call runbook covers: workspace down, ingestion stuck, LLM provider outage, sudden ACL leak.
## Cross-links
- Deployment reference — same checklist, organized as a reference
- Backup and restore
- Monitoring
- Permission-aware search
- Full-stack RAG app — wires the above into a working app