UMAP

Reproducibility

By default, Umap is non-deterministic: running the same input twice produces embeddings that look similar but have different coordinates. This is fine for production visualisation but inconvenient for unit tests, regression checks, and reproducible research. The fix is a seeded IProvideRandomValues implementation that also forces single-threaded execution.

The IProvideRandomValues interface

public interface IProvideRandomValues
{
    bool IsThreadSafe { get; }
    int Next(int minValue, int maxValue);
    float NextFloat();
    void NextFloats(Span<float> buffer);
}

UMAP-Sharp draws all of its randomness through this interface, so supplying your own implementation is the only hook needed to control it.

A deterministic generator

The unit-test project that ships with UMAP-Sharp uses Prando — a small seedable pseudo-random number generator — wrapped in an IProvideRandomValues. The implementation is short enough to drop into your own code:

using System;

public sealed class DeterministicRandomGenerator : IProvideRandomValues
{
    private readonly Prando _rnd;

    public DeterministicRandomGenerator(int seed) => _rnd = new Prando(seed);

    // Force single-threaded SGD — multi-threaded execution is non-deterministic
    // even with a seeded RNG, because thread interleaving is not reproducible.
    public bool IsThreadSafe => false;

    public int Next(int minValue, int maxValue) => _rnd.NextInt(minValue, maxValue - 1);

    public float NextFloat() => (float)_rnd.NextDouble();

    public void NextFloats(Span<float> buffer)
    {
        for (var i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (float)_rnd.NextDouble();
        }
    }
}

Use it the same way you would the default generator:

var umap = new Umap(random: new DeterministicRandomGenerator(seed: 42));
var epochs = umap.InitializeFit(vectors);
for (var i = 0; i < epochs; i++) umap.Step();

var embedding = umap.GetEmbedding();
// Re-running with seed 42 on the same vectors gives exactly this embedding.

IsThreadSafe must be false

A seeded RNG by itself is not enough. UMAP only multi-threads when IsThreadSafe is true, and parallel execution introduces non-deterministic ordering of SGD updates regardless of the RNG. For reproducibility you need both a seeded generator and IsThreadSafe => false.

Why parallelism breaks reproducibility

The SGD inner loop visits sample pairs in parallel and updates shared embedding coordinates without locks. The exact interleaving of those updates, and therefore the resulting embedding, depends on thread scheduling rather than on the RNG, so two parallel runs can diverge even with the same seed.

IsThreadSafe => false triggers the sequential fallback path in Umap.Step():

if (_random.IsThreadSafe)
{
    Parallel.For(0, _optimizationState.EpochsPerSample.Length, Iterate);
}
else
{
    for (var i = 0; i < _optimizationState.EpochsPerSample.Length; i++)
    {
        Iterate(i);
    }
}

Sequential execution makes the run deterministic at the cost of multi-core utilisation.
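
To see why update order alone is enough to change the result, consider a toy model. It is not UMAP-Sharp's actual update rule: two SGD-style updates simply pull one shared coordinate toward different targets. Because each update reads the current value, applying them in a different order lands on a different number. The names OrderDemo and Pull and all constants are hypothetical.

```csharp
using System;

public static class OrderDemo
{
    // Toy update: move x a fraction of the way toward target.
    // Illustrative stand-in only, not UMAP-Sharp's gradient step.
    public static float Pull(float x, float target, float learningRate)
        => x + learningRate * (target - x);

    public static (float AThenB, float BThenA) Run()
    {
        const float lr = 0.5f;
        float start = 0f, targetA = 1f, targetB = 3f;

        var aThenB = Pull(Pull(start, targetA, lr), targetB, lr); // 0 -> 0.5 -> 1.75
        var bThenA = Pull(Pull(start, targetB, lr), targetA, lr); // 0 -> 1.5 -> 1.25
        return (aThenB, bThenA);
    }

    public static void Main()
    {
        var (ab, ba) = Run();
        Console.WriteLine($"A then B: {ab}, B then A: {ba}");
    }
}
```

With parallel SGD the thread scheduler decides which of the two orders happens, which is why a seed cannot pin the outcome.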

Unit-test pattern

The pattern used by UMAP-Sharp's own test suite:

[Fact]
public void EmbeddingIsReproducible()
{
    var umap = new Umap(random: new DeterministicRandomGenerator(42));

    var epochs = umap.InitializeFit(TestData);
    for (var i = 0; i < epochs; i++) umap.Step();

    var embedding = umap.GetEmbedding();

    AssertNestedFloatArraysEquivalent(ExpectedEmbedding, embedding);
}

static void AssertNestedFloatArraysEquivalent(float[][] expected, float[][] actual)
{
    Assert.Equal(expected.Length, actual.Length);
    for (var i = 0; i < expected.Length; i++)
    {
        Assert.Equal(expected[i].Length, actual[i].Length);
        for (var j = 0; j < expected[i].Length; j++)
        {
            Assert.True(Math.Abs(expected[i][j] - actual[i][j]) < 1e-5);
        }
    }
}

A small floating-point tolerance (1e-5) absorbs the tiny differences that creep in from FMA / SIMD reordering across CPUs.
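
When a tolerance check fails, it helps to know how far off the embedding was. The following is a minimal sketch of a helper that reports the largest per-coordinate difference; EmbeddingDiff and MaxAbsDifference are hypothetical names and not part of UMAP-Sharp.

```csharp
using System;

public static class EmbeddingDiff
{
    // Returns the largest absolute per-coordinate difference between two
    // embeddings of identical shape; throws if the shapes differ.
    public static float MaxAbsDifference(float[][] expected, float[][] actual)
    {
        if (expected.Length != actual.Length)
            throw new ArgumentException("Row counts differ.");

        var max = 0f;
        for (var i = 0; i < expected.Length; i++)
        {
            if (expected[i].Length != actual[i].Length)
                throw new ArgumentException($"Row {i} has a different length.");

            for (var j = 0; j < expected[i].Length; j++)
            {
                max = Math.Max(max, Math.Abs(expected[i][j] - actual[i][j]));
            }
        }
        return max;
    }
}
```

In a test this could be used as Assert.True(EmbeddingDiff.MaxAbsDifference(expected, actual) < 1e-5f), and the returned value can be included in the failure message to show how much drift occurred.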

Cross-platform caveats

Even with a seeded RNG and single-threaded SGD, the output is only guaranteed to be deterministic on the same CPU architecture and runtime version. Differences in System.Numerics.Vector<float> width (SSE vs. AVX vs. ARM NEON) can cause sub-1e-5 drift in the SIMD dot-product. Pin your tests to a single configuration, or use a tolerance.
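
To know which configuration a golden embedding was produced on, it can help to record the relevant runtime facts alongside it. The sketch below uses only standard .NET APIs; the SimdConfig class name and the output format are hypothetical.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class SimdConfig
{
    // Summarises the factors named above: CPU architecture,
    // runtime version, and Vector<float> width.
    public static string Describe() =>
        $"arch={RuntimeInformation.ProcessArchitecture}; " +
        $"runtime={RuntimeInformation.FrameworkDescription}; " +
        $"simd={Vector.IsHardwareAccelerated}; " +
        $"vectorFloatCount={Vector<float>.Count}";

    public static void Main() => Console.WriteLine(Describe());
}
```

Logging this string at the start of a test run makes it easier to tell whether a mismatch comes from a code change or from a different machine.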

When you do not need full determinism

A common middle ground: seed the RNG so that you get the same "shape" of embedding across runs, but keep multi-threaded SGD on. The result is not bit-identical but the cluster structure is stable enough for golden-image visual regression tests:

public sealed class SeededParallelRandom : IProvideRandomValues
{
    private readonly Prando _rnd;
    public SeededParallelRandom(int seed) => _rnd = new Prando(seed);

    public bool IsThreadSafe => true;  // accept non-determinism for parallelism

    // ... rest identical to DeterministicRandomGenerator
}

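This gives you reproducible fit (the part that depends most on the RNG seed) while letting optimization fan out to all cores. Because IsThreadSafe => true means the generator is called from several threads at once, the wrapped Prando instance must either tolerate concurrent calls or be guarded, for example with a lock; a stateful PRNG shared across threads without synchronisation is a data race.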
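
One way to make such a generator safe for concurrent calls is to serialise access with a lock. The following is a minimal sketch under stated assumptions: System.Random stands in for Prando, the interface is redeclared so the block compiles on its own, maxValue is assumed to be exclusive, and LockedSeededRandom is a hypothetical name.

```csharp
using System;

// Copied from the interface listing above so the sketch compiles stand-alone;
// remove this declaration when referencing UMAP-Sharp itself.
public interface IProvideRandomValues
{
    bool IsThreadSafe { get; }
    int Next(int minValue, int maxValue);
    float NextFloat();
    void NextFloats(Span<float> buffer);
}

// Hypothetical name. System.Random is used here in place of Prando.
public sealed class LockedSeededRandom : IProvideRandomValues
{
    private readonly Random _rnd;
    private readonly object _gate = new object();

    public LockedSeededRandom(int seed) => _rnd = new Random(seed);

    // Concurrent calls are safe because every access takes the lock.
    public bool IsThreadSafe => true;

    // Assumes maxValue is exclusive, as with System.Random.Next.
    public int Next(int minValue, int maxValue)
    {
        lock (_gate) return _rnd.Next(minValue, maxValue);
    }

    public float NextFloat()
    {
        lock (_gate) return (float)_rnd.NextDouble();
    }

    public void NextFloats(Span<float> buffer)
    {
        lock (_gate)
        {
            for (var i = 0; i < buffer.Length; i++)
            {
                buffer[i] = (float)_rnd.NextDouble();
            }
        }
    }
}
```

The lock costs some throughput on the RNG calls, and the order in which threads acquire it still varies between runs, so this keeps the generator's state consistent without making the embedding bit-identical.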
