Curiosity

Monitoring

A Curiosity Workspace deployment needs visibility into four things:

  • Availability — is the workspace reachable, is it serving requests?
  • Performance — query latency, ingestion throughput, embedding throughput.
  • Background work — ingestion runs, scheduled tasks, index/embedding rebuilds.
  • Security signals — auth failures, token usage, admin actions, ACL changes.

This page covers what's built in, how to wire it to your existing observability stack, and which alerts are worth their pager.

Built-in dashboards

The Workspace UI ships a Monitoring view at /admin/monitoring that shows the most critical real-time signals:

  • ingestion throughput (nodes and edges per minute);
  • indexing and file-processing queue depth;
  • search latency percentiles;
  • CPU, RAM, disk I/O on the workspace host;
  • error rates by category.

[Screenshot: Curiosity Workspace Monitoring Dashboard]

Operational metrics API

Two admin-authenticated JSON endpoints expose per-operation metrics for export to external monitoring systems:

  • GET /api/endpoints/metrics — counts, latencies, and error rate per custom endpoint.
  • GET /api/chatai/tools/metrics — counts, latencies, and error rate per AI tool.

Both require a bearer token with admin scope. Sample call:

curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  https://workspace.example.com/api/endpoints/metrics | jq .

You can scrape these into a self-hosted Prometheus instance, for example through the community json_exporter, which maps fields of a JSON response into metrics. A native Prometheus exposition endpoint (/metrics) is on the roadmap but not currently shipped; until then, the JSON endpoints above are the supported integration point.
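If you prefer not to run an extra exporter, a small relay can poll the JSON endpoint and rewrite the payload into Prometheus exposition format. This is only a sketch: the field names used below (name, count, error_rate) are assumptions about the payload shape, so check them against a real response from your workspace before relying on it.

```python
import json

def to_prometheus(payload: str) -> str:
    """Convert the JSON metrics payload into Prometheus exposition lines.

    Assumes each entry carries "name", "count", and "error_rate" fields;
    adjust to the actual payload of your workspace.
    """
    lines = []
    for entry in json.loads(payload):
        label = entry["name"]
        lines.append(f'curiosity_endpoint_calls_total{{endpoint="{label}"}} {entry["count"]}')
        lines.append(f'curiosity_endpoint_error_rate{{endpoint="{label}"}} {entry["error_rate"]}')
    return "\n".join(lines) + "\n"

# Example with a fabricated single-entry payload:
sample = '[{"name": "search", "count": 120, "error_rate": 0.01}]'
print(to_prometheus(sample))
```

Write the output to a file read by node_exporter's textfile collector, or serve it over HTTP for Prometheus to scrape directly.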

Log routing

By default the workspace writes structured logs to stdout, which works with every container log collector (CloudWatch, Stackdriver, Azure Monitor, ELK, Loki).

To write logs to a mounted volume — useful on a Windows-installer deployment or when the orchestrator's log collection isn't sufficient — set:

MSK_LOG_PATH=/var/log/curiosity
MSK_LOG_LEVEL=Information     # Debug only while diagnosing

Recommended log categories to forward:

  • ingestion connector logs;
  • endpoint invocation logs;
  • admin configuration changes;
  • authentication and authorization events.

Health probes

The HTTP endpoint /api/login/check returns:

  • 401 once the workspace is accepting traffic;
  • 5xx while it's still booting or in a degraded state.

For readiness/liveness probes, treat 401 as success — see the examples in Docker and Kubernetes.
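For a probe outside Docker or Kubernetes (a cron job, an external uptime checker with custom logic), the 401-as-success rule can be sketched like this; the URL is an assumption, and your deployment's scheme and port may differ:

```python
import urllib.error
import urllib.request

def is_healthy_status(code: int) -> bool:
    # 401 means the workspace is up and enforcing auth; 5xx means it is
    # still booting or degraded.
    return code == 401

def probe(url: str, timeout: float = 2.0) -> bool:
    """Probe /api/login/check, treating HTTP 401 as success."""
    try:
        resp = urllib.request.urlopen(url, timeout=timeout)
        return is_healthy_status(resp.status)
    except urllib.error.HTTPError as e:
        return is_healthy_status(e.code)
    except OSError:
        return False  # unreachable or timed out: not ready
```

Note that urllib raises HTTPError for 401, so the expected healthy response arrives through the exception path.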

High-signal alerts

Start with a small set; expand as you learn the workload.

  • Workspace unavailable. Trigger: probe to /api/login/check fails for 2 consecutive minutes. Why: the single most important signal.
  • Container restart loop. Trigger: more than 3 restarts in 15 min. Why: usually OOM, a panic on bad input, or a storage issue.
  • Search latency P95 regression. Trigger: P95 > 2× the trailing 7-day baseline for 10 min. Why: often follows a bad index change or insufficient RAM.
  • Ingestion failure rate. Trigger: errors / total > 5% for 30 min on any active task. Why: catches expired source credentials and schema drift.
  • Index rebuild stuck. Trigger: rebuild task running > 6 h with no progress event. Why: likely a slow embedding provider or a saturated disk.
  • Auth failures spike. Trigger: more than 10× the trailing 1-hour baseline. Why: brute-force, credential stuffing, or a broken IdP.
  • Disk usage on graph volume. Trigger: > 80% used. Why: plan a resize before you wedge ingestion.
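The baseline-relative rules above all reduce to the same comparison. A minimal sketch of the two check shapes, with thresholds mirroring the table (how you collect the current value and the trailing baseline depends on your alerting system):

```python
def breaches_baseline(current: float, baseline: float, factor: float) -> bool:
    """True when the current value exceeds factor times the trailing baseline."""
    return baseline > 0 and current > factor * baseline

def failure_rate_alert(errors: int, total: int, threshold: float = 0.05) -> bool:
    """True when the error fraction over the window exceeds the threshold."""
    return total > 0 and errors / total > threshold

# P95 regression: 450 ms now vs. a 200 ms trailing 7-day baseline.
p95_fires = breaches_baseline(current=450.0, baseline=200.0, factor=2.0)
# Ingestion failures: 6 errors out of 100 attempts in the window.
ingestion_fires = failure_rate_alert(errors=6, total=100)
```

Guarding on baseline > 0 and total > 0 keeps a freshly deployed workspace, which has no history yet, from paging on its first data point.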

What to monitor on the underlying infrastructure

The workspace metrics tell you what the application is doing; the infrastructure metrics tell you whether it has the resources to do it.

  • CPU: sustained utilization above 80% across all cores is a sign you should size up before adding embedding workloads.
  • Memory: the graph engine maps its indexes into memory, so investigate any swap activity.
  • Disk IOPS: watch peak IOPS during ingestion and rebuilds; undersized disks throttle the whole system.
  • Network: outbound bandwidth to your LLM provider is a common bottleneck.
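"Sustained" is the operative word for the CPU signal: a single spike should not page. One way to express that is a windowed check that fires only when every sample in the window breaches the threshold; the window length and threshold below are illustrative, not prescribed values.

```python
from collections import deque

class SustainedThreshold:
    """Fire only when every sample in a full window exceeds the threshold."""

    def __init__(self, window: int, threshold: float):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        self.samples.append(value)
        return (len(self.samples) == self.samples.maxlen
                and all(s > self.threshold for s in self.samples))

# Five consecutive samples above 80% are required before firing;
# the dip to 0.70 resets the condition.
cpu = SustainedThreshold(window=5, threshold=0.80)
readings = [0.85, 0.90, 0.70, 0.95, 0.88, 0.91, 0.84, 0.89]
fired = [cpu.observe(r) for r in readings]  # only the last sample fires
```

The same class works for the disk and network signals with different thresholds.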

Audit trails

Sensitive admin actions are written to the workspace audit log:

  • token creation and revocation;
  • user/team membership changes;
  • schema changes;
  • endpoint and tool publication;
  • SSO configuration changes.

Forward the audit log to your SIEM. Retention follows your platform's log policy; pick a value that satisfies your compliance window.

© 2026 Curiosity. All rights reserved.