# Production deployment checklist
A tutorial-shaped take on the Deployment reference page — same content, but ordered as a sequence you can walk for a single workspace going to production. Use it as the "yes / no, did we do this?" page during a launch.
If you're operating multiple workspaces, the canonical reference is Workspace deployment. This tutorial is the abridged developer-facing path through it.
## Goals
| Goal | Why |
|---|---|
| Reliability | Predictable uptime, fast recovery. |
| Security | TLS, scoped tokens, ReBAC, secrets discipline. |
| Scalability | Handle data growth and query load. |
| Reproducibility | Dev → staging → prod is mechanical, not heroic. |
## Phase 1 — environment shape
Mirror the production shape in staging; right-size dev.
| Environment | Purpose | Notes |
|---|---|---|
| Dev | Engineer-owned, may be on a laptop | Local Docker run, default ports, generated admin password. |
| Staging | Pre-production validation | Prod-shaped manifest, smaller capacity, isolated secrets. |
| Production | Real users | Restricted access, change control, no shell access by default. |
The promotion path goes: code/config lands in git → tested in dev → deployed to staging → validated → promoted to prod. No exceptions.
## Phase 2 — what's in git
Everything below should be a tracked, versioned artifact:
- Connector code
- Custom endpoint and AI tool code (exported from the workspace UI)
- Custom interface bundles (Tesserae / H5)
- Schema migrations and ingestion pipeline definitions
- Search index configuration (indexed fields, boosts, facets)
- NLP pipeline configuration (entity capture, embeddings field selection)
- The deployment manifest (Docker Compose / Kubernetes / Helm / Terraform)
The workspace stores UI-managed configuration inside the graph, so a configuration export + import lets you snapshot and promote workspaces. See Backup and restore.
## Phase 3 — the checklist
### Image and runtime
- Versioned image tag (`curiosityai/curiosity:vX.Y.Z`), not `:latest`.
- Container memory and CPU sized for embeddings (start at 16 GB / 8 vCPU; bigger for large corpora).
- Healthcheck on `/api/login/check`.
- `terminationGracePeriodSeconds` ≥ 60 so the workspace can flush before being killed.
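The runtime items above can be sketched as a single `docker run` invocation. The image tag, sizing, grace period, and healthcheck path come from the checklist; the container port (8080, matching the Phase 4 proxy) and the exact flag layout are assumptions about your setup:

```shell
# Sketch only; adjust sizing and port to your environment.
docker run -d \
  --name workspace \
  --memory 16g --cpus 8 \
  --stop-timeout 60 \
  --health-cmd 'curl -fsS http://localhost:8080/api/login/check || exit 1' \
  --health-interval 30s --health-retries 3 \
  -p 127.0.0.1:8080:8080 \
  curiosityai/curiosity:vX.Y.Z   # pinned tag, never :latest
```

`--stop-timeout 60` is Docker's analogue of Kubernetes' `terminationGracePeriodSeconds`; binding to `127.0.0.1` keeps the container reachable only through the reverse proxy.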
### Storage
- Persistent volume on SSD-backed block storage attached to `MSK_GRAPH_STORAGE`.
- Separate volume (or directory) for `MSK_GRAPH_BACKUP_FOLDER`.
- Backups scheduled, off-host, and tested by restoring to a sandbox.
- Volume expansion enabled so you can grow without downtime.
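A minimal sketch of the storage wiring. Only the `MSK_*` variable names come from the checklist; the host paths and container-side mount points are placeholders of our choosing:

```shell
docker run -d \
  -v /mnt/ssd/graph:/data/graph \           # SSD-backed persistent volume
  -v /mnt/backup/graph:/data/backup \       # separate backup volume
  -e MSK_GRAPH_STORAGE=/data/graph \
  -e MSK_GRAPH_BACKUP_FOLDER=/data/backup \
  curiosityai/curiosity:vX.Y.Z
```

Keeping the backup mount on a different device (or shipping it off-host) is what makes the "backups tested by restoring to a sandbox" item meaningful.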
### Networking and TLS
- TLS terminated at the proxy or inside the container; HSTS enabled.
- `MSK_PUBLIC_ADDRESS` set to the user-facing URL.
- No `0.0.0.0` exposure without an authenticating front-end.
- Egress allowlist documented (Docker registry, NuGet, your LLM provider).
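Two quick checks for the TLS and exposure items, using the hostname from the Phase 4 example (the `PUBLIC_IP` placeholder is yours to fill in):

```shell
# HSTS header should be present on every HTTPS response.
curl -sI https://workspace.example.com | grep -i strict-transport-security

# The container port should NOT answer on the public interface.
curl -m 5 -s http://PUBLIC_IP:8080/ >/dev/null && echo "EXPOSED" || echo "not reachable"
```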
### Identity and secrets
- `MSK_ADMIN_PASSWORD` set explicitly (default `admin`/`admin` never used).
- `MSK_JWT_KEY` set explicitly so tokens survive restarts.
- `MSK_GRAPH_MASTER_KEY` set explicitly and backed up — losing it means losing encrypted content.
- All secrets injected from a secret manager (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, Vault).
- At least one SSO provider configured.
- Admin sign-in via SSO only; the local `admin` account disabled after onboarding.
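A sketch of the secret-injection item using AWS Secrets Manager; the secret IDs are placeholders, and the same pattern works with Vault or the Azure/GCP equivalents:

```shell
# Pull secrets at deploy time; nothing lands in the manifest or in git.
export MSK_ADMIN_PASSWORD="$(aws secretsmanager get-secret-value \
  --secret-id workspace/admin-password --query SecretString --output text)"
export MSK_JWT_KEY="$(aws secretsmanager get-secret-value \
  --secret-id workspace/jwt-key --query SecretString --output text)"
export MSK_GRAPH_MASTER_KEY="$(aws secretsmanager get-secret-value \
  --secret-id workspace/graph-master-key --query SecretString --output text)"

# Passing -e VAR with no value forwards the exported value without echoing it.
docker run -d \
  -e MSK_ADMIN_PASSWORD -e MSK_JWT_KEY -e MSK_GRAPH_MASTER_KEY \
  curiosityai/curiosity:vX.Y.Z
```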
### Permissions and tokens
- Connectors run on dedicated tokens with `ingestion` scope only.
- External integrations use endpoint tokens scoped to specific endpoints.
- Token rotation documented and scheduled.
- Every user-facing endpoint uses `CreateSearchAsUserAsync` (not `CreateSearchAsync`).
### Observability
- Stdout logs routed to your aggregator; audit log forwarded to your SIEM.
- Alerts on liveness, latency regressions, ingestion failures, container restart rate.
- Per-endpoint and per-tool metrics scraped into your monitoring system.
### Disaster recovery
- Documented RPO and RTO targets.
- Restore drill completed within the past quarter.
- Secrets manager backups verified.
## Phase 4 — reverse proxy
A minimal NGINX block, ready to drop in:
```nginx
server {
    listen 443 ssl http2;
    server_name workspace.example.com;

    ssl_certificate     /etc/letsencrypt/live/workspace.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/workspace.example.com/privkey.pem;

    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

    client_max_body_size 100m;
    proxy_read_timeout 300s;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
Set `MSK_PUBLIC_ADDRESS=https://workspace.example.com` so generated links use the proxy's hostname.
## Phase 5 — rolling out a change
Recommended sequence:
- Take a backup of the graph volume.
- Apply the change in staging; walk the post-restore validation in Backup and restore.
- Promote to production during a low-traffic window.
- Watch Monitoring for 30 minutes.
- Be prepared to roll back: revert image tag + config, restart, restore if data shape changed.
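The five steps above, sketched for a Docker Compose deployment. The service name, file paths, and the cold-copy backup are assumptions; substitute your real backup procedure from Backup and restore:

```shell
# 1. Backup the graph volume (cold copy shown; use your real backup procedure).
docker compose stop workspace
cp -a /mnt/ssd/graph "/mnt/backup/graph-$(date +%F)"

# 2-3. Promote the staging-validated tag and redeploy during the low-traffic window.
sed -i 's|curiosity:v1.2.3|curiosity:v1.2.4|' docker-compose.yml
docker compose up -d workspace

# 4. Watch health for the 30-minute window.
watch -n 30 'docker inspect --format "{{.State.Health.Status}}" CONTAINER_NAME'

# 5. Rollback: revert the tag in git, redeploy, restore the copy if the data shape changed.
```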
## Phase 6 — application-level checklist
The platform checklist gets the workspace healthy; the app-level checklist makes sure the app is right.
- Connectors have idempotency tests (re-running is a no-op for unchanged records).
- Schema migrations are committed to git and replayable.
- Every endpoint that runs on behalf of a user calls `CreateSearchAsUserAsync`.
- Eval suite green for retrieval and AI tools (see Evaluation framework).
- Audit nodes (for chat, AI tools, sensitive endpoints) are write-only for non-admins.
- Token budgets and rate limits set on the LLM provider account.
- On-call runbook covers: workspace down, ingestion stuck, LLM provider outage, sudden ACL leak.
## Cross-links
- Deployment reference — same checklist, organized as a reference
- Backup and restore
- Monitoring
- Permission-aware search
- Full-stack RAG app — wires the above into a working app