HNSW-Sharp

Performance

HNSW-Sharp is already fast out of the box. This page covers the levers that make a meaningful difference once you've outgrown the defaults.

Pick the right distance function

The biggest single win for float[] embeddings: switch to CosineDistance.SIMDForUnits and pre-normalise your vectors.

Distance Relative throughput on 768-d vectors
CosineDistance.NonOptimized
CosineDistance.ForUnits ~2×
CosineDistance.SIMD ~3–4×
CosineDistance.SIMDForUnits ~5–6×

The exact multipliers depend on the CPU, vector width, and runtime — but SIMDForUnits is consistently the fastest available cosine.

If you can't normalise (the vectors must survive in non-unit form for some other reason), CosineDistance.SIMD is the next-best.

The distance cache

During graph construction, every insertion needs many distance comparisons against existing nodes — and the same pairs come up repeatedly. The library caches them when EnableDistanceCacheForConstruction = true (the default).

var parameters = new SmallWorldParameters
{
    EnableDistanceCacheForConstruction = true,
    InitialDistanceCacheSize = 8 * 1024 * 1024,  // 8M entries
};

InitialDistanceCacheSize is pre-allocated upfront. For very large builds, pre-sizing avoids resize-driven GC pressure. For very small builds, it's wasted memory — set to 0 for sub-1k-item indexes.

After the graph is built, the cache is no longer useful. Free it with:

graph.DisableDistanceCache();

Do this before serialising or running search workloads — it's pure memory savings.

Optimise the graph layout

OptimizeIfNeeded (and OptimizeIfNeeded(force: true)) flatten the layer-zero connection arrays into a single contiguous block. Two effects:

  • Cache locality improves — search walks now touch contiguous memory rather than chasing per-node arrays.
  • GC pressure drops on long-lived processes — one big allocation instead of N small ones.
graph.OptimizeIfNeeded();

The library calls this automatically before serialisation. Calling it manually after a big batch of AddItems and before a sustained query workload is sometimes worthwhile.

Memory layout — what to expect

For an index with N items built with M = 16:

  • Layer 0: ~32 × 4 × N bytes = 128 N bytes for edges.
  • Higher layers: roughly 1/M as many nodes per level, so about 10 N bytes total for all higher layers.
  • Items themselves: not stored in the graph, but you typically keep them in memory anyway.
  • Distance cache: optional; can dominate during build.

Rule of thumb: a 1M × 768-d float index is ~3 GB including the vectors, ~150 MB excluding them.

Single-threaded mode

If your application serialises access to the graph itself, skip the internal RW-lock:

var graph = new SmallWorld<float[], float>(
    distance, generator, parameters,
    threadSafe: false);

This shaves the lock cost off every KNNSearch and AddItems call — typically a few percent, more on tiny graphs where the lock cost is a larger fraction of total work.

Don't do this unless you can guarantee external synchronisation. The lock is cheap insurance.

Build-time tuning checklist

  1. Pre-normalise vectors and use CosineDistance.SIMDForUnits.
  2. Pick M for your accuracy target (16 is a good default for 100–1000-d embeddings).
  3. Set InitialItemsSize to your expected dataset size to avoid resizing the items list.
  4. Build in chunks of 10k–100k items.
  5. Call graph.OptimizeIfNeeded() and graph.DisableDistanceCache() after the last AddItems.
  6. Serialise the result so you don't pay the build cost again.

Query-time tuning checklist

  1. Pick a recall target on a held-out query set with known ground truth.
  2. Sweep EfSearch over {32, 64, 128, 256, 512} and record latency + recall.
  3. Pick the smallest EfSearch that meets the target.
  4. If filtering is involved, re-run the sweep with the predicate active — selective filters can change the sweet spot.

When HNSW is not the bottleneck

For low-N indexes (< a few thousand items), bruteforce is often faster than HNSW — and gives exact results. Check before you tune. A naive OrderBy(distance) loop over a few thousand items typically runs in microseconds.

Referenced by

© 2026 HNSW-Sharp. All rights reserved.