Monitoring
A Curiosity Workspace deployment needs visibility into four things:
- Availability — is the workspace reachable, is it serving requests?
- Performance — query latency, ingestion throughput, embedding throughput.
- Background work — ingestion runs, scheduled tasks, index/embedding rebuilds.
- Security signals — auth failures, token usage, admin actions, ACL changes.
This page covers what's built in, how to wire it to your existing observability stack, and which alerts are worth their pager.
Built-in dashboards
The Workspace UI ships a Monitoring view at /admin/monitoring that shows the most critical real-time signals:
- ingestion throughput (nodes and edges per minute);
- indexing and file-processing queue depth;
- search latency percentiles;
- CPU, RAM, disk I/O on the workspace host;
- error rates by category.

Operational metrics API
Two admin-authenticated JSON endpoints expose per-operation metrics for ingestion into external systems:
- GET /api/endpoints/metrics — counts, latencies, and error rate per custom endpoint.
- GET /api/chatai/tools/metrics — counts, latencies, and error rate per AI tool.
Both require a bearer token with admin scope. Sample call:
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
https://workspace.example.com/api/endpoints/metrics | jq .
You can scrape these into a self-hosted Prometheus instance with the community json_exporter, which maps fields from each JSON entry onto metrics. A public Prometheus exposition (/metrics) is on the roadmap but not currently shipped — until then, the JSON endpoints above are the supported integration point.
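A minimal json_exporter module sketch. The endpoints key and the name, count, and errorRate fields are assumptions about the response shape; inspect a sample response and adjust the JSONPaths to match:
modules:
  workspace:
    http_client_config:
      authorization:
        type: Bearer
        credentials_file: /etc/json_exporter/admin-token
    metrics:
      - name: workspace_endpoint_requests
        type: object
        help: Per-endpoint request counts from the metrics API
        path: '{ .endpoints[*] }'      # assumed response shape
        labels:
          endpoint: '{ .name }'        # assumed field name
        values:
          count: '{ .count }'          # assumed field name
          error_rate: '{ .errorRate }' # assumed field name
On the Prometheus side, point a /probe scrape job at the exporter with the metrics URL as the target parameter, using the same relabeling pattern as the blackbox exporter.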
Log routing
By default the workspace writes structured logs to stdout, which works with every container log collector (CloudWatch, Stackdriver, Azure Monitor, ELK, Loki).
To write logs to a mounted volume — useful on a Windows-installer deployment or when the orchestrator's log collection isn't sufficient — set:
MSK_LOG_PATH=/var/log/curiosity
MSK_LOG_LEVEL=Information # Debug only while diagnosing
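On a container deployment that translates to mounting a host directory over the log path. A sketch, assuming the settings above; the image name is a placeholder for whatever you already deploy:
docker run -d \
  -e MSK_LOG_PATH=/var/log/curiosity \
  -e MSK_LOG_LEVEL=Information \
  -v /srv/curiosity/logs:/var/log/curiosity \
  your-registry/curiosity-workspace:latest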
Recommended log categories to forward (see the filter sketch after this list):
- ingestion connector logs;
- endpoint invocation logs;
- admin configuration changes;
- authentication and authorization events.
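Because the logs are structured, pulling a category out of the stdout stream is a one-liner. A sketch with jq, assuming a container named workspace and that each line is a JSON object with a Category field; check your own output for the actual field names:
docker logs --since 1h workspace 2>&1 \
  | jq -c 'select(.Category == "Authentication" or .Category == "Ingestion")'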
Health probes
The HTTP endpoint /api/login/check returns:
- 401 once the workspace is accepting traffic;
- 5xx while it's still booting or in a degraded state.
For readiness/liveness probes, treat 401 as success — see the examples in Docker and Kubernetes.
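Kubernetes httpGet probes only count 2xx and 3xx as success, so one way to accept the 401 is an exec probe. A sketch, assuming curl is available in the image and the workspace listens on port 8080:
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - >-
        code="$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/api/login/check)";
        [ "$code" = "200" ] || [ "$code" = "401" ]
  initialDelaySeconds: 15
  periodSeconds: 10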
High-signal alerts
Start with a small set and expand as you learn the workload; a concrete rule for the first alert follows the table.
| Alert | Trigger | Why |
|---|---|---|
| Workspace unavailable | Probe to /api/login/check fails for 2 consecutive minutes | The single most important signal. |
| Container restart loop | More than 3 restarts in 15 min | Usually OOM, panic on bad input, or storage issue. |
| Search latency P95 regression | P95 > 2× the trailing 7-day baseline for 10 min | Often follows a bad index change or insufficient RAM. |
| Ingestion failure rate | Errors / total > 5% for 30 min on any active task | Catches expired source credentials and schema drift. |
| Index rebuild stuck | Rebuild task running > 6h with no progress event | Likely a slow embedding provider or a saturated disk. |
| Auth failures spike | More than 10× the trailing 1-hour baseline | Brute-force, credential stuffing, or a broken IdP. |
| Disk usage on graph volume | > 80% | Plan a resize before you wedge ingestion. |
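If you probe /api/login/check with the Prometheus blackbox exporter, the first alert translates to a rule like the following. The blackbox module must accept 401 as valid, and the job name is an assumption:
modules:
  http_login_check:
    prober: http
    http:
      valid_status_codes: [200, 401]
and the alerting rule:
groups:
  - name: curiosity-workspace
    rules:
      - alert: WorkspaceUnavailable
        expr: probe_success{job="curiosity-login-check"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Workspace is not answering /api/login/check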
What to monitor on the underlying infrastructure
The workspace metrics tell you what the application is doing; the infrastructure metrics tell you whether it has the resources to do it. A few spot-check commands follow the list.
- CPU: sustained usage above 80% across all cores is a sign you should size up before adding embeddings.
- Memory: the graph engine maps indexes into memory. Investigate any swap activity.
- Disk IOPS: watch peak IOPS during ingestion and index rebuilds; undersized disks throttle the whole system.
- Network: outbound bandwidth to your LLM provider is a common bottleneck.
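For ad-hoc spot checks on a Linux host (iostat ships in the sysstat package; the volume path is an assumption):
# swap-in/swap-out per second; the si and so columns should stay at 0
vmstat 5
# per-device IOPS (r/s, w/s) and utilization during ingestion or a rebuild
iostat -x 5
# headroom on the graph volume
df -h /var/lib/curiosity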
Audit trails
Sensitive admin actions are written to the workspace audit log:
- token creation and revocation;
- user/team membership changes;
- schema changes;
- endpoint and tool publication;
- SSO configuration changes.
Forward the audit log to your SIEM. Retention follows your platform's log policy; pick a value that satisfies your compliance window.
Next steps
- Security — the surrounding operational controls.
- Backup and restore — what monitoring should catch before you need backups.
- Troubleshooting — symptom-first responses to monitoring alerts.