Monitoring
A Curiosity Workspace deployment needs visibility into four things:
- Availability — is the workspace reachable, is it serving requests?
- Performance — query latency, ingestion throughput, embedding throughput.
- Background work — ingestion runs, scheduled tasks, index/embedding rebuilds.
- Security signals — auth failures, token usage, admin actions, ACL changes.
This page covers what's built in, how to wire it to your existing observability stack, and which alerts are worth their pager.
Built-in dashboards
The Workspace UI ships a Monitoring view at /admin/monitoring that shows the most critical real-time signals:
- ingestion throughput (nodes and edges per minute);
- indexing and file-processing queue depth;
- search latency percentiles;
- CPU, RAM, disk I/O on the workspace host;
- error rates by category.

Operational metrics API
Two admin-authenticated JSON endpoints expose per-operation metrics for ingestion into external systems:
- GET /api/endpoints/metrics: counts, latencies, and error rate per custom endpoint.
- GET /api/chatai/tools/metrics: counts, latencies, and error rate per AI tool.
Both require a bearer token with admin scope. Sample call:
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
https://workspace.example.com/api/endpoints/metrics | jq .
Prometheus does not scrape arbitrary JSON directly, so to feed these endpoints into a self-hosted Prometheus instance, put a small exporter in front of them (for example the community json_exporter, or a short script of your own) that maps each entry into a metric. A public Prometheus exposition (/metrics) is on the roadmap but not currently shipped; until then, the JSON endpoints above are the supported integration point.
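As a rough illustration of mapping each entry into a metric, the sketch below converts a metrics payload into the Prometheus text exposition format, which a small sidecar could then serve for scraping. The payload shape is an assumption: the field names `name`, `count`, `errorCount`, and `avgLatencyMs`, and the `curiosity_endpoint_*` metric names, are hypothetical and not taken from the API. Check the real response before adapting it.

```python
import json

# Hypothetical payload shape; the real /api/endpoints/metrics response may differ.
SAMPLE = json.loads("""
[
  {"name": "search-similar", "count": 120, "errorCount": 3, "avgLatencyMs": 41.5},
  {"name": "export-report", "count": 8, "errorCount": 0, "avgLatencyMs": 310.0}
]
""")


def escape_label(value: str) -> str:
    """Escape a label value per the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_exposition(entries: list[dict]) -> str:
    """Render one sample per entry for each assumed field."""
    # (metric name, assumed JSON field, Prometheus type); all names are illustrative.
    # Treating the counts as counters assumes they are cumulative, which is unverified.
    mapping = [
        ("curiosity_endpoint_calls_total", "count", "counter"),
        ("curiosity_endpoint_errors_total", "errorCount", "counter"),
        ("curiosity_endpoint_latency_avg_ms", "avgLatencyMs", "gauge"),
    ]
    lines = []
    for metric, field, kind in mapping:
        lines.append(f"# TYPE {metric} {kind}")
        for entry in entries:
            label = escape_label(str(entry["name"]))
            lines.append(f'{metric}{{endpoint="{label}"}} {entry[field]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_exposition(SAMPLE), end="")
```

Emitting one labelled series per endpoint keeps the metric names stable as endpoints are added or removed, which makes dashboards and alert rules easier to maintain.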
Log routing
By default the workspace writes structured logs to stdout, which works with every container log collector (CloudWatch, Stackdriver, Azure Monitor, ELK, Loki).
To write logs to a mounted volume — useful on a Windows-installer deployment or when the orchestrator's log collection isn't sufficient — set:
MSK_LOG_PATH=/var/log/curiosity
MSK_LOG_LEVEL=Information # Debug only while diagnosing
Recommended log categories to forward:
- ingestion connector logs;
- endpoint invocation logs;
- admin configuration changes;
- authentication and authorization events.
Health probes
The workspace exposes an unauthenticated ASP.NET Core health-check endpoint at /health, registered by MosaikStartup against both the root pipeline and any configured ServerBasePath. Concretely, that means a default deployment serves the endpoint at:
https://<workspace>/health, and at https://<workspace>/{ServerBasePath}/health when a base path is configured.
The endpoint runs three checks on every request:
| Check | What it asserts |
|---|---|
| memory | Process memory is below the threshold set by MSK_UnhealthyMemoryThreshold (bytes). Unset = check is informational only. |
| disk | Free disk space on the graph volume is above MSK_UnhealthyDiskThreshold (bytes). Unset = check is informational only. |
| ready | The graph store has finished loading and the process is not in the middle of a shutdown. |
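Both thresholds are expressed in bytes, which is easy to get wrong by a factor of 1024. The helper below is a small sketch for deriving the values, assuming you size hosts in GiB. The 8 GiB and 10 GiB figures are illustrative only (they match the sample response later on this page) and are not recommended defaults; `gib_to_bytes` is a hypothetical helper name.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def gib_to_bytes(gib: float) -> int:
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * GIB)


if __name__ == "__main__":
    # Illustrative values only; choose thresholds that fit your host.
    print(f"MSK_UnhealthyMemoryThreshold={gib_to_bytes(8)}")
    print(f"MSK_UnhealthyDiskThreshold={gib_to_bytes(10)}")
```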
HTTP status codes
This is the standard ASP.NET Core UseHealthChecks contract:
| Aggregate status | HTTP status | Use as |
|---|---|---|
| Healthy | 200 OK | Liveness and readiness |
| Degraded | 200 OK | Liveness: workspace is serving traffic with reduced capacity (e.g. memory pressure). |
| Unhealthy | 503 Service Unavailable | Fail liveness/readiness. Returned while the graph is still loading, during shutdown, or when a threshold-based check trips. |
For Kubernetes-style probes, treat 200 as success and any other code as failure — there is no longer a need for the old "401 means up" workaround. The endpoint is reachable before authentication is wired up, so it is safe for orchestrator probes.
Response body
The endpoint always responds with Content-Type: application/json. The body is produced by WriteHealthReportResponseAsJson and has the following shape:
{
"status": "Healthy",
"results": {
"memory": {
"status": "Healthy",
"description": null,
"data": {
"AllocatedBytes": 1234567890,
"Threshold": 8589934592
}
},
"disk": {
"status": "Healthy",
"description": null,
"data": {
"FreeBytes": 53687091200,
"Threshold": 10737418240
}
},
"ready": {
"status": "Healthy",
"description": null,
"data": {}
}
}
}
| Field | Type | Description |
|---|---|---|
| status | string | Aggregate status: Healthy, Degraded, or Unhealthy. Mirrors the HTTP code per the table above. |
| results | object | Map of check name → per-check result. Always contains memory, disk, and ready. |
| results.<check>.status | string | Per-check status, one of Healthy, Degraded, Unhealthy. |
| results.<check>.description | string \| null | Human-readable detail, populated when a check is not Healthy. |
| results.<check>.data | object | Free-form per-check telemetry (e.g. observed value and configured threshold). Safe to ignore; useful for alert annotations. |
When a check fails, the corresponding entry switches to Unhealthy and the top-level status follows. Example during boot, before the graph has finished loading:
{
"status": "Unhealthy",
"results": {
"memory": { "status": "Healthy", "description": null, "data": { ... } },
"disk": { "status": "Healthy", "description": null, "data": { ... } },
"ready": { "status": "Unhealthy", "description": "Graph is still loading", "data": {} }
}
}
See the orchestrator-specific examples in Docker and Kubernetes for probe wiring.
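For ad hoc checks or alert annotations, the response can be reduced to the checks that are not Healthy. The following is a minimal sketch using only the Python standard library and the body shape documented above; `fetch_health` and `failing_checks` are hypothetical helper names, not part of any Curiosity SDK.

```python
import json
import urllib.error
import urllib.request


def fetch_health(url: str, timeout: float = 5.0) -> tuple[int, dict]:
    """Return (HTTP status, parsed JSON body) for a /health URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # urllib raises on 503, but the body is still the JSON health report.
        return err.code, json.load(err)


def failing_checks(report: dict) -> dict[str, str]:
    """Map each non-Healthy check name to its status and description."""
    problems = {}
    for name, result in report.get("results", {}).items():
        if result.get("status") != "Healthy":
            detail = result.get("description") or "no description"
            problems[name] = f'{result.get("status")}: {detail}'
    return problems
```

Calling `fetch_health` against your workspace's /health URL returns the status code to gate on (200 is success, anything else is failure) together with the report, and passing that report to `failing_checks` yields, for the boot-time example above, a single `ready` entry reading "Unhealthy: Graph is still loading".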
Checking from CI/CD with curiosity-cli
For pipeline use, such as waiting for a freshly deployed workspace before running smoke tests or imports, use the CLI's wait-for command. It polls the readiness probe with backoff and exits 0 as soon as the workspace is live, so there is no need to write your own retry loop around /health:
# Gate the next step on the workspace being ready (default: up to 30 min)
curiosity-cli wait-for --server https://workspace.example.com/
# Tighten the timeout for short-lived CI runners
curiosity-cli wait-for --server https://workspace.example.com/ --max-timeout 300
wait-for does not require a token. Pair it with test if you also want to assert that a token is valid once the workspace is up. The full flag set (TLS bypass, alternate timeout semantics, environment variables) is documented on the CLI wait-for page.
High-signal alerts
Start with a small set; expand as you learn the workload.
| Alert | Trigger | Why |
|---|---|---|
| Workspace unavailable | Probe to /health returns non-200 for 2 consecutive minutes | The single most important signal. |
| Container restart loop | More than 3 restarts in 15 min | Usually OOM, panic on bad input, or storage issue. |
| Search latency P95 regression | P95 > 2× the trailing 7-day baseline for 10 min | Often follows a bad index change or insufficient RAM. |
| Ingestion failure rate | Errors / total > 5% for 30 min on any active task | Catches expired source credentials and schema drift. |
| Index rebuild stuck | Rebuild task running > 6h with no progress event | Likely a slow embedding provider or a saturated disk. |
| Auth failures spike | More than 10× the trailing 1-hour baseline | Brute-force, credential stuffing, or a broken IdP. |
| Disk usage on graph volume | > 80% | Plan a resize before you wedge ingestion. |
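In practice these triggers belong in your monitoring system's own rule language, but the logic behind three of them can be sketched in a few lines. The sketch assumes one sample per minute and a baseline that has already been computed elsewhere; all function names are hypothetical.

```python
def sustained(flags: list[bool], minutes: int) -> bool:
    """True if the last `minutes` one-per-minute samples are all True."""
    return len(flags) >= minutes and all(flags[-minutes:])


def probe_down(status_codes: list[int], minutes: int = 2) -> bool:
    """Workspace unavailable: non-200 for `minutes` consecutive minutes."""
    return sustained([code != 200 for code in status_codes], minutes)


def latency_regressed(p95_ms: list[float], baseline_ms: float,
                      factor: float = 2.0, minutes: int = 10) -> bool:
    """Search latency: P95 above factor x baseline for `minutes` minutes."""
    return sustained([v > factor * baseline_ms for v in p95_ms], minutes)


def ingestion_failing(errors: int, total: int, threshold: float = 0.05) -> bool:
    """Ingestion failure rate over the window; no traffic means no alert."""
    return total > 0 and errors / total > threshold


if __name__ == "__main__":
    print(probe_down([200, 503, 503]))
    print(latency_regressed([90.0] * 10, baseline_ms=40.0))
    print(ingestion_failing(6, 100))
```

Requiring the condition to hold across the whole window, rather than on a single sample, is what keeps these alerts from paging on transient blips.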
What to monitor on the underlying infrastructure
The workspace metrics tell you what the application is doing; the infrastructure metrics tell you whether it has the resources to do it.
- CPU: sustained above 80% across the box is a sign you should size up before adding embeddings.
- Memory: the graph engine maps indexes into memory. Investigate any swap activity.
- Disk IOPS: watch peak IOPS during ingestion and rebuilds; undersized disks throttle the whole system.
- Network: outbound to your LLM provider is a common bottleneck.
Audit trails
Sensitive admin actions are written to the workspace audit log:
- token creation and revocation;
- user/team membership changes;
- schema changes;
- endpoint and tool publication;
- SSO configuration changes.
Forward the audit log to your SIEM. Retention follows your platform's log policy; pick a value that satisfies your compliance window.
Next steps
- Security — the surrounding operational controls.
- Backup and restore — what monitoring should catch before you need backups.
- Troubleshooting — symptom-first responses to monitoring alerts.