Storage and indexing
How Curiosity Workspace persists data on disk, how indexes are laid out alongside it, and what that means for capacity planning, backups, and recovery.
On-disk layout
A running workspace writes to a single root directory pointed at by `MSK_GRAPH_STORAGE`. Underneath it, the workspace organizes data into several logical areas:
$MSK_GRAPH_STORAGE/
├── graph/ # nodes, edges, properties (the graph database)
├── text-index/ # text search index
├── vector-index/ # vector / embedding index
├── parsers/ # parsed documents (intermediate Document nodes)
├── audit/ # audit log
└── journal/ # write journal (also writable to MSK_GRAPH_JOURNAL_FOLDER)
$MSK_GRAPH_BACKUP_FOLDER/ # rolling backups, if configured
$MSK_LOG_PATH/ # logs, if file-based logging is configured
The directory structure may evolve between releases; treat the whole `MSK_GRAPH_STORAGE` directory as the unit of backup and restore.
What's in memory vs on disk
| Data | In memory | On disk |
|---|---|---|
| Graph nodes and edges (working set) | Yes (memory-mapped) | Yes (persistent storage) |
| Active text-index segments | Yes (mmap) | Yes |
| Active vector-index segments | Yes (mmap) | Yes |
| Parsed file content | No (streamed from disk on demand) | Yes |
| Backups | No | Yes |
| Journal entries | Buffered | Yes (durable on commit) |
The engine memory-maps graph and index segments and relies on the OS page cache to keep hot data resident. More RAM means a larger working set stays in memory, which means lower query latency. That's the primary scaling lever before you reach for sharding.
Persistence guarantees
- A successful `await graph.CommitPendingAsync()` durably writes the change to the journal before returning. A crash immediately after the call cannot lose the change.
- Index updates land after the commit returns, so there is a brief window where a node is in the graph but not yet in the search index. Application code that needs read-your-writes should query the graph directly, not the index (see the sketch after this list).
- The journal is replayed on every startup. A workspace that boots without errors is consistent.
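A minimal sketch of that read-your-writes pattern, in C#. Only `graph.CommitPendingAsync()` appears on this page; the `graph` and `search` handles and the `AddNode`, `GetNode`, and `SearchAsync` names are hypothetical placeholders, so substitute the actual write, read, and search calls from your workspace SDK.

```csharp
// Read-your-writes sketch. graph and search stand in for your workspace's graph
// and search handles; AddNode, GetNode and SearchAsync are hypothetical
// placeholder names, not the actual SDK surface.
var node = graph.AddNode("Person", new { Name = "Ada" });  // stage a change (placeholder call)

await graph.CommitPendingAsync();                          // durable in the journal once this returns

// Safe immediately after the commit: read the committed node back from the graph itself.
var fromGraph = graph.GetNode(node.Id);                    // placeholder graph read

// Not guaranteed yet: the text/vector index may still be catching up, so a
// search for the new node can briefly come back empty.
var hits = await search.SearchAsync("Ada");                // placeholder index query
```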
Sizing
Rough estimates to budget storage at design time:
| Workload | Indicative size |
|---|---|
| Graph (nodes + edges) | ~ 1.5 × the raw property bytes you commit |
| Text index | ~ 1× the sum of indexed text bytes |
| Vector index | ~ (embedded text bytes ÷ chunk size in bytes) × (embedding dims × 4 bytes) |
| Journal headroom | 5 GB minimum; ~20% of graph size in steady state |
| Backups (rolling) | 1× the live graph for the most recent snapshot |
A starter PVC of 200 GB is appropriate for hundreds of thousands of documents with embeddings; scale up before disk utilization reaches 80%.
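As a worked example, here is a back-of-envelope estimate in C# that applies the rules of thumb from the table. Every input value (property bytes, text volume, embedding dimensions, chunk size) is an illustrative assumption; substitute your own figures.

```csharp
// Back-of-envelope storage estimate using the sizing table above.
// Every input below is an illustrative assumption, not a measurement.
const double GiB = 1024d * 1024d * 1024d;

double rawPropertyBytes  = 30 * GiB;  // property bytes committed to the graph (assumed)
double indexedTextBytes  = 25 * GiB;  // text routed to the text index (assumed)
double embeddedTextBytes = 25 * GiB;  // text routed to the embedding pipeline (assumed)
int    embeddingDims     = 768;       // depends on your embedding model (assumed)
double chunkBytes        = 2048;      // average chunk size sent to the embedder (assumed)

double graphBytes   = 1.5 * rawPropertyBytes;                               // ~1.5x raw properties
double textIdxBytes = 1.0 * indexedTextBytes;                               // ~1x indexed text
double vectorBytes  = embeddedTextBytes / chunkBytes * embeddingDims * 4;   // chunks x dims x 4 bytes
double journalBytes = Math.Max(5 * GiB, 0.20 * graphBytes);                 // 5 GB floor, ~20% of graph
double backupBytes  = graphBytes;                                           // most recent rolling snapshot

double totalGiB = (graphBytes + textIdxBytes + vectorBytes + journalBytes + backupBytes) / GiB;
Console.WriteLine($"Estimated storage: ~{totalGiB:N0} GiB");                // about 160 GiB with these inputs
```

With these assumed inputs the estimate lands around 160 GB, which fits under the 200 GB starter PVC above while still leaving growth headroom.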
Storage class recommendations
| Platform | Recommended | Notes |
|---|---|---|
| Linux host with local disk | NVMe SSD | Fastest. |
| Linux host with attached disk | gp3 (AWS), Premium SSD (Azure), pd-ssd (GCP) | Block storage, low latency. |
| Kubernetes | ReadWriteOnce SSD-class | Always single-writer. |
| Shared filesystems (NFS / EFS) | Tolerated for non-prod | Slower; the index files are sensitive to latency. |
| Object storage (S3, GCS, Azure Blob) | Not supported as primary | Use only for backups via a sync sidecar. |
Backups
- Snapshot the volume that hosts `MSK_GRAPH_STORAGE`. Because reads are lock-free, a snapshot taken while the workspace is running is consistent.
- For platforms without native snapshots, set `MSK_GRAPH_BACKUP_FOLDER` and schedule a backup task that writes consistent point-in-time copies into it; then ship the folder off-host (a minimal sketch follows below).
- Always back up the secrets (`MSK_JWT_KEY`, `MSK_GRAPH_MASTER_KEY`, `MSK_ADMIN_PASSWORD`, `MSK_LICENSE`) separately. A graph backup is useless if you can't decrypt it.
See Backup and restore.
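A minimal sketch, in C#, of the "ship the folder off-host" step: copy the most recent snapshot out of `MSK_GRAPH_BACKUP_FOLDER` to another mount. The `/backups` fallback, the `/mnt/offhost-backups` destination, and the one-subfolder-per-snapshot layout are all assumptions; adapt them to how your backup task actually writes its copies.

```csharp
using System;
using System.IO;
using System.Linq;

// Copy the newest snapshot folder from MSK_GRAPH_BACKUP_FOLDER to an off-host mount.
// Paths and the one-subfolder-per-snapshot layout are assumptions for illustration.
var backupRoot = Environment.GetEnvironmentVariable("MSK_GRAPH_BACKUP_FOLDER") ?? "/backups";
var offHost    = "/mnt/offhost-backups";   // e.g. an NFS or replicated mount (assumed)

// Pick the most recently written snapshot folder.
var newest = new DirectoryInfo(backupRoot)
    .GetDirectories()
    .OrderByDescending(d => d.LastWriteTimeUtc)
    .FirstOrDefault();

if (newest is not null)
{
    var target = Path.Combine(offHost, newest.Name);
    foreach (var file in newest.GetFiles("*", SearchOption.AllDirectories))
    {
        var dest = Path.Combine(target, Path.GetRelativePath(newest.FullName, file.FullName));
        Directory.CreateDirectory(Path.GetDirectoryName(dest)!);
        file.CopyTo(dest, overwrite: true);
    }
}
```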
Re-creating indexes
If a text or vector index ever needs rebuilding (because you changed the recipe, switched embedding providers, or restored an older backup onto a newer workspace), the engine does this in the background:
- A new index is built alongside the old one.
- Queries continue against the old index until the new one finishes.
- The engine swaps atomically — no downtime.
See Reindexing and re-embedding for the operational details.
Storage on different platforms
- Docker host: bind-mount a local SSD directory at `/data`. See Docker.
- Kubernetes: `volumeClaimTemplates` provisioning a `ReadWriteOnce` block-storage PVC. See Kubernetes.
- AWS: EBS `gp3`, snapshot via DLM. See AWS.
- Azure: Premium SSD managed disk. See Azure.
- GCP: SSD Persistent Disk. See GCP.
- OpenShift: ODF / Ceph RBD / vSphere CSI / platform default. See OpenShift.
- Windows: NTFS volume on a dedicated SSD. See Windows.
See also
- Configuration reference — every `MSK_GRAPH_*` variable.
- Backup and restore — what to snapshot and how to restore.
- Reindexing and re-embedding — when and how to rebuild indexes.
- Scaling — capacity planning.