Performance
HNSW-Sharp is already fast out of the box. This page covers the levers that make a meaningful difference once you've outgrown the defaults.
Pick the right distance function
The biggest single win for float[] embeddings: switch to CosineDistance.SIMDForUnits and pre-normalise your vectors.
| Distance | Relative throughput on 768-d vectors |
|---|---|
CosineDistance.NonOptimized |
1× |
CosineDistance.ForUnits |
~2× |
CosineDistance.SIMD |
~3–4× |
CosineDistance.SIMDForUnits |
~5–6× |
The exact multipliers depend on the CPU, vector width, and runtime — but SIMDForUnits is consistently the fastest available cosine.
If you can't normalise (the vectors must survive in non-unit form for some other reason), CosineDistance.SIMD is the next-best.
The distance cache
During graph construction, every insertion needs many distance comparisons against existing nodes — and the same pairs come up repeatedly. The library caches them when EnableDistanceCacheForConstruction = true (the default).
var parameters = new SmallWorldParameters
{
EnableDistanceCacheForConstruction = true,
InitialDistanceCacheSize = 8 * 1024 * 1024, // 8M entries
};
InitialDistanceCacheSize is pre-allocated upfront. For very large builds, pre-sizing avoids resize-driven GC pressure. For very small builds, it's wasted memory — set to 0 for sub-1k-item indexes.
After the graph is built, the cache is no longer useful. Free it with:
graph.DisableDistanceCache();
Do this before serialising or running search workloads — it's pure memory savings.
Optimise the graph layout
OptimizeIfNeeded (and OptimizeIfNeeded(force: true)) flatten the layer-zero connection arrays into a single contiguous block. Two effects:
- Cache locality improves — search walks now touch contiguous memory rather than chasing per-node arrays.
- GC pressure drops on long-lived processes — one big allocation instead of N small ones.
graph.OptimizeIfNeeded();
The library calls this automatically before serialisation. Calling it manually after a big batch of AddItems and before a sustained query workload is sometimes worthwhile.
Memory layout — what to expect
For an index with N items built with M = 16:
- Layer 0: ~
32 × 4 × Nbytes = 128 N bytes for edges. - Higher layers: roughly
1/Mas many nodes per level, so about 10 N bytes total for all higher layers. - Items themselves: not stored in the graph, but you typically keep them in memory anyway.
- Distance cache: optional; can dominate during build.
Rule of thumb: a 1M × 768-d float index is ~3 GB including the vectors, ~150 MB excluding them.
Single-threaded mode
If your application serialises access to the graph itself, skip the internal RW-lock:
var graph = new SmallWorld<float[], float>(
distance, generator, parameters,
threadSafe: false);
This shaves the lock cost off every KNNSearch and AddItems call — typically a few percent, more on tiny graphs where the lock cost is a larger fraction of total work.
Don't do this unless you can guarantee external synchronisation. The lock is cheap insurance.
Build-time tuning checklist
- Pre-normalise vectors and use
CosineDistance.SIMDForUnits. - Pick
Mfor your accuracy target (16is a good default for 100–1000-d embeddings). - Set
InitialItemsSizeto your expected dataset size to avoid resizing the items list. - Build in chunks of 10k–100k items.
- Call
graph.OptimizeIfNeeded()andgraph.DisableDistanceCache()after the lastAddItems. - Serialise the result so you don't pay the build cost again.
Query-time tuning checklist
- Pick a recall target on a held-out query set with known ground truth.
- Sweep
EfSearchover{32, 64, 128, 256, 512}and record latency + recall. - Pick the smallest
EfSearchthat meets the target. - If filtering is involved, re-run the sweep with the predicate active — selective filters can change the sweet spot.
When HNSW is not the bottleneck
For low-N indexes (< a few thousand items), bruteforce is often faster than HNSW — and gives exact results. Check before you tune. A naive OrderBy(distance) loop over a few thousand items typically runs in microseconds.