Reproducibility
By default, Umap is non-deterministic: running the same input twice produces embeddings that look similar but have different coordinates. That is fine for production visualisation but inconvenient for unit tests, regression checks, and reproducible research. The fix is an IProvideRandomValues implementation that is seeded and reports IsThreadSafe as false, which forces single-threaded optimisation.
The IProvideRandomValues interface
public interface IProvideRandomValues
{
    bool IsThreadSafe { get; }
    int Next(int minValue, int maxValue);
    float NextFloat();
    void NextFloats(Span<float> buffer);
}
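UMAP-Sharp routes all of its randomness through this interface, so supplying your own implementation is the only extension point you need. It controls both the random sequence and, through IsThreadSafe, whether the optimisation runs in parallel.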
A deterministic generator
The unit-test project that ships with UMAP-Sharp uses Prando — a small seedable pseudo-random number generator — wrapped in an IProvideRandomValues. The implementation is short enough to drop into your own code:
using System;

public sealed class DeterministicRandomGenerator : IProvideRandomValues
{
    private readonly Prando _rnd;

    public DeterministicRandomGenerator(int seed) => _rnd = new Prando(seed);

    // Force single-threaded SGD: multi-threaded execution is non-deterministic
    // even with a seeded RNG, because thread interleaving is not reproducible.
    public bool IsThreadSafe => false;

    // The "- 1" implies that maxValue is exclusive here while Prando's NextInt
    // includes its upper bound.
    public int Next(int minValue, int maxValue) => _rnd.NextInt(minValue, maxValue - 1);

    public float NextFloat() => (float)_rnd.NextDouble();

    public void NextFloats(Span<float> buffer)
    {
        for (var i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (float)_rnd.NextDouble();
        }
    }
}
Use it the same way you would the default generator:
var umap = new Umap(random: new DeterministicRandomGenerator(seed: 42));
var epochs = umap.InitializeFit(vectors);
for (var i = 0; i < epochs; i++) umap.Step();
var embedding = umap.GetEmbedding();
// Re-running with seed 42 on the same vectors, machine, and runtime gives the same embedding.
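If you would rather not depend on Prando, the same idea can be sketched on top of System.Random. This is an illustration, not code from UMAP-Sharp: the class name SystemRandomGenerator is hypothetical, and the interface is copied into the snippet only so that it compiles on its own. It also assumes that Next treats maxValue as exclusive, which is the System.Random convention; verify that against the library before relying on it.

```csharp
using System;

// Copied from the interface listing earlier on this page so the sketch compiles
// standalone. In a real project, delete this copy and use the library's interface.
public interface IProvideRandomValues
{
    bool IsThreadSafe { get; }
    int Next(int minValue, int maxValue);
    float NextFloat();
    void NextFloats(Span<float> buffer);
}

// Hypothetical class, not part of UMAP-Sharp.
public sealed class SystemRandomGenerator : IProvideRandomValues
{
    private readonly Random _rnd;

    public SystemRandomGenerator(int seed) => _rnd = new Random(seed);

    // Report false so that the optimisation stays on the sequential path.
    public bool IsThreadSafe => false;

    // Assumption: maxValue is exclusive, matching Random.Next(min, max).
    public int Next(int minValue, int maxValue) => _rnd.Next(minValue, maxValue);

    public float NextFloat() => (float)_rnd.NextDouble();

    public void NextFloats(Span<float> buffer)
    {
        for (var i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (float)_rnd.NextDouble();
        }
    }
}
```

Two instances built with the same seed return the same sequence of values. The .NET documentation does not promise that the sequence for a given seed stays the same across major .NET versions, so expected values recorded with this generator are best regenerated when you change runtime.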
IsThreadSafe must be false
A seeded RNG by itself is not enough. Umap runs the SGD step in parallel only when IsThreadSafe is true, and parallel execution makes the order of SGD updates non-deterministic regardless of the RNG. For reproducibility you need both a seeded generator and IsThreadSafe => false.
Why parallelism breaks reproducibility
The SGD inner loop visits sample pairs in parallel and updates shared embedding coordinates without locks. The exact interleaving of those updates — and therefore the resulting embedding — depends on thread scheduling, not the RNG. Even with a seed, two parallel runs diverge.
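One ingredient of that divergence can be shown without any threads at all: floating-point addition is not associative, so applying the same updates in a different order can give a different result. The sketch below is a standalone illustration with hypothetical names, not code from UMAP-Sharp. It adds the same three floats in two orders.

```csharp
using System;

// Hypothetical demo class, not part of UMAP-Sharp.
public static class SummationOrderDemo
{
    // (a + b) + c, rounding to single precision after each addition.
    public static float LeftToRight(float a, float b, float c)
        => (float)((float)(a + b) + c);

    // a + (b + c), rounding to single precision after each addition.
    public static float RightToLeft(float a, float b, float c)
        => (float)(a + (float)(b + c));
}
```

With a = 1e8f, b = -1e8f, and c = 1f, LeftToRight returns 1 while RightToLeft returns 0. Near 1e8 adjacent float values are 8 apart, so b + c rounds back to -1e8 and the 1 is lost. In a parallel SGD run the order of updates is decided by the scheduler, and the updates also read coordinates that other threads are changing, so the effect compounds over the epochs.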
IsThreadSafe => false triggers the sequential fallback path in Umap.Step():
if (_random.IsThreadSafe)
{
    Parallel.For(0, _optimizationState.EpochsPerSample.Length, Iterate);
}
else
{
    for (var i = 0; i < _optimizationState.EpochsPerSample.Length; i++)
    {
        Iterate(i);
    }
}
Sequential execution makes the run deterministic at the cost of multi-core utilisation.
Unit-test pattern
The pattern used by UMAP-Sharp's own test suite:
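// Requires: using System; using Xunit;
// TestData and ExpectedEmbedding are assumed to be fixtures defined elsewhere in
// the test class: the input vectors and the embedding the test expects for seed 42.
[Fact]
public void EmbeddingIsReproducible()
{
    var umap = new Umap(random: new DeterministicRandomGenerator(42));
    var epochs = umap.InitializeFit(TestData);
    for (var i = 0; i < epochs; i++) umap.Step();
    var embedding = umap.GetEmbedding();
    AssertNestedFloatArraysEquivalent(ExpectedEmbedding, embedding);
}

// Compares two embeddings element by element with an absolute tolerance of 1e-5.
static void AssertNestedFloatArraysEquivalent(float[][] expected, float[][] actual)
{
    Assert.Equal(expected.Length, actual.Length);
    for (var i = 0; i < expected.Length; i++)
    {
        Assert.Equal(expected[i].Length, actual[i].Length);
        for (var j = 0; j < expected[i].Length; j++)
        {
            // Report the failing index and values instead of a bare "expected true".
            Assert.True(
                Math.Abs(expected[i][j] - actual[i][j]) < 1e-5,
                $"Mismatch at [{i}][{j}]: expected {expected[i][j]}, actual {actual[i][j]}");
        }
    }
}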
A small floating-point tolerance (1e-5) absorbs the tiny differences that creep in from FMA / SIMD reordering across CPUs.
Cross-platform caveats
Even with a seeded RNG and single-threaded SGD, the output is only guaranteed to be deterministic on the same CPU architecture and runtime version. Differences in System.Numerics.Vector<float> width (SSE vs. AVX vs. ARM NEON) can cause sub-1e-5 drift in the SIMD dot-product. Pin your tests to a single configuration, or use a tolerance.
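Before settling on a tolerance, it can help to measure how much drift a given platform actually produces. The helper below is a hypothetical sketch, not part of UMAP-Sharp or its test suite. It returns the largest absolute coordinate difference between two embeddings of the same shape.

```csharp
using System;

// Hypothetical helper, not part of UMAP-Sharp.
public static class EmbeddingDrift
{
    // Returns the largest |expected[i][j] - actual[i][j]| over all coordinates,
    // or NaN if any coordinate difference is NaN.
    public static float MaxAbsDifference(float[][] expected, float[][] actual)
    {
        if (expected.Length != actual.Length)
            throw new ArgumentException("Embeddings have different numbers of points.");

        var max = 0f;
        for (var i = 0; i < expected.Length; i++)
        {
            if (expected[i].Length != actual[i].Length)
                throw new ArgumentException($"Point {i} has a different dimension.");

            for (var j = 0; j < expected[i].Length; j++)
            {
                var diff = Math.Abs(expected[i][j] - actual[i][j]);
                // A NaN would never compare greater than max, so surface it explicitly.
                if (float.IsNaN(diff)) return float.NaN;
                if (diff > max) max = diff;
            }
        }
        return max;
    }
}
```

Running it against embeddings produced on each target configuration shows whether 1e-5 leaves enough headroom or whether the test should be pinned to one configuration instead.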
When you do not need full determinism
A common middle ground: seed the RNG so that you get the same "shape" of embedding across runs, but keep multi-threaded SGD on. The result is not bit-identical but the cluster structure is stable enough for golden-image visual regression tests:
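public sealed class SeededParallelRandom : IProvideRandomValues
{
    private readonly Prando _rnd;
    private readonly object _gate = new object();
    public SeededParallelRandom(int seed) => _rnd = new Prando(seed);
    public bool IsThreadSafe => true; // accept non-determinism for parallelism
    // Returning true means Umap may call this instance from several threads at once.
    // Prando holds mutable state and is assumed not to be safe for concurrent use,
    // so each member takes a lock around the shared instance:
    public float NextFloat() { lock (_gate) { return (float)_rnd.NextDouble(); } }
    // ... Next and NextFloats as in DeterministicRandomGenerator, wrapped in the same lock
}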
This gives you reproducible fit (the part that depends most on the RNG seed) while letting optimization fan out to all cores.
Next
- Parallelization — the full story on threading.
- Progress Reporting — observe deterministic runs without affecting them.