Caching
A connector that walks a million-row source from scratch every run wastes time, network, and indexing work on records the workspace has already seen. HashCache is a small on-disk cache (LiteDB-backed) that fingerprints each source record and short-circuits ingestion when nothing has changed.
The pattern is: hash the immutable shape of a record, check the cache, ingest only on a miss, then enqueue the hash. After the graph commit succeeds, commit the cache. On the next run the same record is a one-lookup no-op.
When to use it
Use HashCache when |
Skip it when |
|---|---|
| The source has no usable "modified since" cursor, or you can't trust it. | The source exposes a reliable watermark — paginate by updated_at. |
| Records are immutable once written (events, log lines, finalized PDFs). | Records mutate often and you want the latest version on every run. |
| The cost of deciding to skip is much lower than the cost of ingesting. | The mapper is trivial and the source is cheap to re-read. |
| You re-ingest the same full export repeatedly (daily CSV drop, S3 sync). | You already de-duplicate upstream. |
HashCache is a complement to Idempotency, not a replacement. AddOrUpdate keyed on a stable [Key] already prevents duplicates; HashCache prevents the work of re-ingesting unchanged records.
The API
using Curiosity.Library;
using var cache = HashCache.Initialize("cache/csv-employees.db");
// Sync
cache.IfNotCached(connectorVersion: 1, data: row, action: r =>
{
graph.AddOrUpdate(Map(r));
});
// Async
await cache.IfNotCachedAsync(connectorVersion: 1, data: row, action: async r =>
{
await graph.AddOrUpdateAsync(Map(r));
});
// Persist the cache once the graph commit has succeeded.
await graph.CommitPendingAsync();
await cache.CommitPendingHashesAsync();
The connectorVersion is an integer you bump whenever the mapper changes shape (new property, different edge layout, schema migration). When it changes, every cached hash is implicitly invalidated and the next run re-ingests everything — without you needing to delete the cache file.
IfNotCached is the convenience wrapper. Internally it calls Hash, ContainsHash, action, and EnqueueHash. Use the lower-level pieces directly if you want to batch the hash computation or run the action conditionally on something else.
How the hash is computed
Hash<T>(data) serializes data to compact JSON with Newtonsoft.Json and takes a UID128 (128-bit) hash. Two implications:
- What you hash, you trust. Pass the exact shape that drives ingestion — the source DTO, not the source-plus-derived-fields-plus-current-timestamp. If you stir in
DateTime.UtcNow, every run is a miss. - Property order matters only as much as Newtonsoft's serializer ordering matters — which is to say: it's stable for a given type, but if you switch from a
classto arecordwith reordered properties, every existing hash becomes a miss.
For records where only a subset of fields drive ingestion, hash a projection:
// Only these fields decide whether the node needs re-writing.
cache.IfNotCached(1, new { row.EmployeeId, row.Name, row.Department, row.HiredAt }, _ =>
{
graph.AddOrUpdate(MapEmployee(row));
});
Commit ordering matters
The cache must only persist hashes for work that actually landed in the graph. The correct order on every run is:
If the graph commit fails, do not call CommitPendingHashesAsync — drop the in-memory pending hashes by disposing the cache or restarting the run, so the next attempt re-ingests the failed batch.
try
{
foreach (var row in source) cache.IfNotCached(1, row, r => graph.AddOrUpdate(Map(r)));
await graph.CommitPendingAsync();
await cache.CommitPendingHashesAsync(); // only on success
}
catch
{
// pending hashes are discarded when `cache` is disposed without a commit
throw;
}
Versioning the cache
Bump connectorVersion whenever a re-run with the same source must re-process every record. Typical triggers:
- The
[Node]POCO gained a new[Property]you now populate. - The mapper started emitting an extra
Linkper record. - A schema migration changed the
[Key]derivation.
const int ConnectorVersion = 3; // bump when Map() changes shape
cache.IfNotCached(ConnectorVersion, row, r => graph.AddOrUpdate(Map(r)));
Old entries stay on disk but never match again. You can leave them; LiteDB compacts on commit. Delete the file if you want a clean slate.
Where to store the cache file
var cachePath = Path.Combine(AppContext.BaseDirectory, "cache", "csv-employees.db");
Directory.CreateDirectory(Path.GetDirectoryName(cachePath)!);
using var cache = HashCache.Initialize(cachePath);
Treat the cache file as per-connector state, not part of the source data. Keep it next to the connector binary, in a Docker volume, or in CI's cache directory — anywhere durable between runs. Losing the file is not a correctness problem; the next run just re-ingests everything (every record is a AddOrUpdate, so the graph stays consistent).
Do not check the cache file into source control. Add *.db to .gitignore next to your connector.
Concurrency
HashCache serializes access internally — ContainsHash, EnqueueHash, and CommitPendingHashesAsync are safe to call from multiple threads. You can share one cache instance across parallel ingest workers:
using var cache = HashCache.Initialize(cachePath);
await Parallel.ForEachAsync(source, async (row, _) =>
{
await cache.IfNotCachedAsync(1, row, async r =>
{
await graph.AddOrUpdateAsync(Map(r));
});
});
await graph.CommitPendingAsync();
await cache.CommitPendingHashesAsync();
The same caveat as Performance applies: two workers ingesting the same key race on the graph side. The cache itself is safe.
Measuring effectiveness
Track hits and misses to make sure the cache is actually doing work:
var hits = 0;
var misses = 0;
foreach (var row in source)
{
var hash = cache.Hash(row);
if (cache.ContainsHash(ConnectorVersion, hash)) { hits++; continue; }
graph.AddOrUpdate(Map(row));
cache.EnqueueHash(ConnectorVersion, hash);
misses++;
}
logger.LogInformation("HashCache: {Hits} hits / {Misses} misses", hits, misses);
On the first run every record is a miss; the second run against an unchanged source should be all hits. If steady-state hit rate is low, the hashed shape is probably picking up volatile fields — narrow it down to immutable ones.
Cross-links
- Idempotency — the underlying guarantee that makes skipping safe.
- Performance — auto-commit, pause indexing, parallel ingest.
- Examples → CSV — a baseline CSV connector. The
CsvCachedSamplerecipe layersHashCacheon top of it.