Curiosity

Production deployment checklist

A tutorial-shaped take on the Deployment reference page — same content, but ordered as a sequence you can walk for a single workspace going to production. Use it as the "yes / no, did we do this?" page during a launch.

If you're operating multiple workspaces, the canonical reference is Workspace deployment. This tutorial is the abridged developer-facing path through it.

Goals

Goal             Why
Reliability      Predictable uptime, fast recovery.
Security         TLS, scoped tokens, ReBAC, secrets discipline.
Scalability      Handle data growth and query load.
Reproducibility  Dev → staging → prod is mechanical, not heroic.

Phase 1 — environment shape

Mirror the production shape in staging; right-size dev.

Environment  Purpose                             Notes
Dev          Engineer-owned, may be on a laptop  Local Docker run, default ports, generated admin password.
Staging      Pre-production validation           Prod-shaped manifest, smaller capacity, isolated secrets.
Production   Real users                          Restricted access, change control, no shell access by default.

The promotion path goes: code/config lands in git → tested in dev → deployed to staging → validated → promoted to prod. No exceptions.

Phase 2 — what's in git

Everything below should be a tracked, versioned artifact:

  • Connector code
  • Custom endpoint and AI tool code (exported from the workspace UI)
  • Custom interface bundles (Tesserae / H5)
  • Schema migrations and ingestion pipeline definitions
  • Search index configuration (indexed fields, boosts, facets)
  • NLP pipeline configuration (entity capture, embeddings field selection)
  • The deployment manifest (Docker Compose / Kubernetes / Helm / Terraform)

The workspace stores UI-managed configuration inside the graph, so a configuration export + import lets you snapshot and promote workspaces. See Backup and restore.

Phase 3 — the checklist

Image and runtime

  • Versioned image tag (curiosityai/curiosity:vX.Y.Z), not :latest.
  • Container memory and CPU sized for embeddings (start at 16 GB / 8 vCPU; bigger for large corpora).
  • Healthcheck on /api/login/check.
  • terminationGracePeriodSeconds ≥ 60 so the workspace can flush before being killed.
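
The pinned-tag item is easy to check mechanically in CI. The following is a minimal sketch, assuming the deployment manifest's image reference has already been extracted as a string; the function name `is_pinned` and the decision to also accept digest references are illustrative assumptions, not part of the product.

```python
import re

# Matches the vX.Y.Z tag shape named in the checklist, e.g. v1.2.3.
PINNED_TAG = re.compile(r"^v\d+\.\d+\.\d+$")


def is_pinned(image_ref: str) -> bool:
    """Return True when image_ref carries an explicit vX.Y.Z tag or a digest."""
    # A digest reference (name@sha256:...) is immutable, so treat it as pinned.
    if "@sha256:" in image_ref:
        return True
    # Look only at the last path segment so a registry port
    # (registry.example.com:5000/...) is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag at all: Docker would fall back to :latest
    tag = last_segment.rsplit(":", 1)[1]
    return bool(PINNED_TAG.match(tag))
```

A CI job could call `is_pinned` on every image reference in the manifest and fail the build when any call returns False.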

Storage

  • Persistent volume on SSD-backed block storage attached to MSK_GRAPH_STORAGE.
  • Separate volume (or directory) for MSK_GRAPH_BACKUP_FOLDER.
  • Backups scheduled, off-host, and tested by restoring to a sandbox.
  • Volume expansion enabled so you can grow without downtime.

Networking and TLS

  • TLS terminated at the proxy or inside the container; HSTS enabled.
  • MSK_PUBLIC_ADDRESS set to the user-facing URL.
  • Workspace port not bound to 0.0.0.0 unless an authenticating front end sits in front of it.
  • Egress allowlist documented (Docker registry, NuGet, your LLM provider).

Identity and secrets

  • MSK_ADMIN_PASSWORD set explicitly (default admin/admin never used).
  • MSK_JWT_KEY set explicitly so tokens survive restarts.
  • MSK_GRAPH_MASTER_KEY set explicitly and backed up — losing it means losing encrypted content.
  • All secrets injected from a secret manager (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, Vault).
  • At least one SSO provider configured.
  • Admin sign-in via SSO only; the local admin account disabled after onboarding.
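
The environment variables named in the storage, networking, and identity sections can be verified before the container starts. This is a minimal preflight sketch, assuming the variables are visible in the process environment; the function name `preflight` and the specific checks (non-empty value, no literal `admin` password, `https://` scheme) are illustrative assumptions rather than rules enforced by the workspace.

```python
# Variable names are taken from the checklist above.
REQUIRED = [
    "MSK_ADMIN_PASSWORD",
    "MSK_JWT_KEY",
    "MSK_GRAPH_MASTER_KEY",
    "MSK_PUBLIC_ADDRESS",
    "MSK_GRAPH_STORAGE",
    "MSK_GRAPH_BACKUP_FOLDER",
]


def preflight(env):
    """Return a list of human-readable problems found in env (a mapping)."""
    problems = []
    for name in REQUIRED:
        # Treat missing and whitespace-only values the same way.
        if not env.get(name, "").strip():
            problems.append(f"{name} is not set")
    # Assumption: a password equal to the default "admin" counts as unset.
    if env.get("MSK_ADMIN_PASSWORD") == "admin":
        problems.append("MSK_ADMIN_PASSWORD is the default value")
    address = env.get("MSK_PUBLIC_ADDRESS", "")
    if address and not address.startswith("https://"):
        problems.append("MSK_PUBLIC_ADDRESS is not an https:// URL")
    return problems
```

Calling `preflight(os.environ)` in an entrypoint wrapper and refusing to start on a non-empty result keeps a misconfigured container from ever serving traffic.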

Permissions and tokens

  • Connectors run on dedicated tokens with ingestion scope only.
  • External integrations use endpoint tokens scoped to specific endpoints.
  • Token rotation documented and scheduled.
  • Every user-facing endpoint uses CreateSearchAsUserAsync (not CreateSearchAsync).
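
The last item can be backed by a simple source scan over the endpoint code tracked in git. The sketch below assumes the endpoint source is available as text; `find_unscoped_searches` is a hypothetical helper, and it flags every `CreateSearchAsync(` call, including ones that are legitimately not user-facing, so treat its output as a review list rather than a verdict.

```python
import re

# Matches a call to CreateSearchAsync( but not CreateSearchAsUserAsync(.
UNSCOPED_CALL = re.compile(r"\bCreateSearchAsync\s*\(")


def find_unscoped_searches(source):
    """Return 1-based line numbers of lines that call CreateSearchAsync(."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if UNSCOPED_CALL.search(line)
    ]
```

Running this over each exported endpoint file in a pre-merge check gives reviewers an explicit list of call sites to justify.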

Observability

  • Stdout logs routed to your aggregator; audit log forwarded to your SIEM.
  • Alerts on liveness, latency regressions, ingestion failures, container restart rate.
  • Per-endpoint and per-tool metrics scraped into your monitoring system.

Disaster recovery

  • Documented RPO and RTO targets.
  • Restore drill completed within the past quarter.
  • Secrets manager backups verified.
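
RPO and restore-drill targets are only useful if something compares them against reality. A minimal sketch of that arithmetic follows, assuming you can obtain the timestamp of the last successful backup and of the last restore drill from your own tooling; both function names are hypothetical, and reading "the past quarter" as 90 days is an assumption.

```python
from datetime import datetime, timedelta, timezone


def backup_within_rpo(last_backup, rpo, now=None):
    """Return True when the newest backup is no older than the RPO target."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup <= rpo


def drill_is_current(last_drill, now=None, max_age=timedelta(days=90)):
    """Return True when the last restore drill is within max_age.

    Assumption: 90 days stands in for "the past quarter".
    """
    now = now or datetime.now(timezone.utc)
    return now - last_drill <= max_age
```

For example, with an RPO of four hours, a backup taken three hours ago passes and one taken five hours ago should raise an alert.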

Phase 4 — reverse proxy

A minimal NGINX block, ready to drop in:

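server {
    # On nginx 1.25.1 and later the "http2" listen parameter is deprecated;
    # use a separate "http2 on;" directive there instead.
    listen 443 ssl http2;
    server_name workspace.example.com;

    # Certificate paths follow the Let's Encrypt layout.
    ssl_certificate     /etc/letsencrypt/live/workspace.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/workspace.example.com/privkey.pem;
    # HSTS: browsers use HTTPS only for one year, subdomains included.
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

    # Accept request bodies up to 100 MB; wait up to 5 minutes between
    # reads from the workspace before timing out.
    client_max_body_size 100m;
    proxy_read_timeout    300s;

    location / {
        # Forward to the workspace listening on localhost only.
        proxy_pass         http://127.0.0.1:8080;
        # Pass the original host, client IP, and scheme through to the workspace.
        proxy_set_header   Host              $host;
        proxy_set_header   X-Real-IP         $remote_addr;
        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto $scheme;
    }
}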

Set MSK_PUBLIC_ADDRESS=https://workspace.example.com so generated links use the proxy's hostname.

Phase 5 — rolling out a change

flowchart LR
    A[Git: code/config] --> B[Deploy to staging]
    B --> C{Smoke + restore drill OK?}
    C -->|no| Rollback1[Hold; debug; reapply]
    C -->|yes| D[Take prod backup]
    D --> E[Promote to prod]
    E --> F[Watch monitoring 30 min]
    F --> G{Healthy?}
    G -->|no| Rollback2[Revert image tag + config<br/>restore backup if schema changed]
    G -->|yes| Done[Done]

Recommended sequence:

  1. Take a backup of the production graph volume; it must be complete before step 3.
  2. Apply the change in staging; walk the post-restore validation in Backup and restore.
  3. Promote to production during a low-traffic window.
  4. Watch Monitoring for 30 minutes.
  5. Be prepared to roll back: revert the image tag and config, restart, and restore the backup if the schema or data shape changed.
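
The 30-minute watch in step 4 can be backed by a small polling loop alongside your dashboards. This is a sketch only: it assumes the healthcheck path `/api/login/check` answers with a 2xx status when the workspace is healthy, and the helper names `http_probe` and `watch`, along with the threshold of three consecutive failures, are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True when url answers with a 2xx status (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def watch(probe, checks, interval_s, max_consecutive_failures=3, sleep=time.sleep):
    """Call probe() up to `checks` times; return False once too many fail in a row."""
    failures = 0
    for i in range(checks):
        if probe():
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= max_consecutive_failures:
                return False
        if i < checks - 1:
            sleep(interval_s)
    return True
```

With `checks=60` and `interval_s=30` the loop covers 30 minutes; a False result from `watch(lambda: http_probe("https://workspace.example.com/api/login/check"), 60, 30)` is the signal to start the rollback in step 5.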

Phase 6 — application-level checklist

The platform checklist gets the workspace healthy; the app-level checklist makes sure the application built on it behaves correctly.

  • Connectors have idempotency tests (re-running is a no-op for unchanged records).
  • Schema migrations are committed to git and replayable.
  • Every endpoint that runs on behalf of a user calls CreateSearchAsUserAsync.
  • Eval suite green for retrieval and AI tools (see Evaluation framework).
  • Audit nodes (for chat, AI tools, sensitive endpoints) are write-only for non-admins.
  • Token budgets and rate limits set on the LLM provider account.
  • On-call runbook covers: workspace down, ingestion stuck, LLM provider outage, sudden ACL leak.
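
The connector idempotency item lends itself to a unit test. The sketch below illustrates the property with content hashing against an in-memory dictionary; it is a stand-in, not the Curiosity connector API, and `fingerprint`, `sync`, and the `id` field are hypothetical names chosen for the example.

```python
import hashlib
import json


def fingerprint(record):
    """Return a stable SHA-256 hash of a record's content."""
    # sort_keys makes the hash independent of dictionary key order.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def sync(records, store):
    """Upsert records into store (id -> hash); return how many writes happened."""
    writes = 0
    for record in records:
        digest = fingerprint(record)
        # Skip the write when the stored hash already matches.
        if store.get(record["id"]) != digest:
            store[record["id"]] = digest
            writes += 1
    return writes
```

An idempotency test then runs `sync` twice over the same input and asserts that the second run reports zero writes, and that changing a single record produces exactly one.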

© 2026 Curiosity. All rights reserved.