# Sentence Embeddings
The SentenceEmbeddingsIndex reads a text field off each node, runs it through a transformer encoder, and stores the resulting vector in an HNSW index. It's the default choice for semantic search and text-based recommendations.
The class lives in `Mosaik.GraphDB.Indexes.SentenceEmbeddingsIndex`. It implements `ITextSimilarityIndex`, so anything that consumes embeddings — `IQuery.StartAtSimilarTextAsync`, hybrid search, the similarity engine — works against it.
## When to use it
Reach for sentence embeddings when the content of the field is what makes two nodes similar: product names, support case summaries, article bodies, code snippets, chat messages. If the words don't matter and only the graph structure does, use Graph Embeddings instead. If you already have vectors from a domain-specific model, use Raw Embeddings.
## Built-in models
The index ships with three encoder choices (plus a `None` sentinel), selected by `SentenceEncoderModel`:
| Model | Source | Runs | Max chunk | Notes |
|---|---|---|---|---|
| MiniLM | all-MiniLM-L6-v2 ONNX | In-process (CPU/GPU) | ~256 tokens | Fast, low-RAM. Good default for short to medium fields. |
| ArcticXS | snowflake-arctic-embed-xs ONNX | In-process (CPU/GPU) | ~512 tokens | Default for new indexes. Higher recall than MiniLM at comparable cost. |
| External | Any OpenAI-compatible embeddings endpoint (OpenAI, Azure OpenAI, Cohere, Google, custom URL) | Remote HTTP | 4096 tokens (`ExternalSentenceEncoder.DefaultMaxChunkLength`) | Use when you need a hosted model — set provider, URL, model name, API key on the options. |
| None | — | — | — | Sentinel; disables embedding. |
ArcticXS is `SentenceEmbeddingsIndex.SentenceEncoderModelDefaultModel`, the default selected by the AI Search controller and most UI flows.
The External model is configured via `SentenceEmbeddingsIndexOptions`:

```csharp
opts.SentenceEncoderModel = SentenceEncoderModel.External;
opts.ExternalProviderName = "OpenAi";  // "OpenAi" | "AzureOpenAi" | "Cohere" | "Google"
opts.ExternalProviderUrl = "";         // override base URL if needed
opts.ExternalProviderModel = "text-embedding-3-small";
opts.ExternalProviderApiKey = "sk-…";
```
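For context on what the External encoder sends over the wire, an OpenAI-compatible embeddings call is a plain JSON POST to the provider's `/v1/embeddings` route. A minimal Python sketch of the request body (illustrative only; the actual HTTP plumbing lives inside `ExternalSentenceEncoder`, and the function name here is hypothetical):

```python
import json

def build_embeddings_request(model: str, texts: list[str]) -> str:
    """Build the JSON body for an OpenAI-compatible POST /v1/embeddings call."""
    return json.dumps({"model": model, "input": texts})

# The API key travels in an "Authorization: Bearer …" header, not in the body.
payload = build_embeddings_request("text-embedding-3-small",
                                   ["battery drains overnight"])
```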
## How indexing happens
Each indexed node turns into either one vector (no chunking) or many vectors (chunked, with the parent UID stored alongside each chunk). At query time, chunk hits dedupe back to the parent node.
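The chunking and dedupe behaviour described above can be sketched in a few lines of Python (a language-neutral illustration of the idea, not the index's actual C# code):

```python
def chunk_tokens(tokens: list, window: int, overlap: int) -> list:
    """Split a token sequence into overlapping windows (the ChunkText idea)."""
    if len(tokens) <= window:
        return [tokens]          # short values produce a single vector
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break                # last window already covers the tail
    return chunks

def dedupe_to_parents(chunk_hits: list) -> list:
    """Collapse chunk hits back to parent nodes, keeping each parent's best score.
    chunk_hits: (parent_uid, score) pairs as returned by the vector search."""
    best = {}
    for uid, score in chunk_hits:
        if score > best.get(uid, float("-inf")):
            best[uid] = score
    return sorted(best.items(), key=lambda kv: -kv[1])
```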
## Registering the index

Use the `Graph.Indexes` extension method. Two common cases:
1. Default settings — short field.
```csharp
var index = await Graph.Indexes.AddSentenceEmbeddingsIndexAsync(
    nodeType: N.Product.Type,
    fieldName: N.Product.Name,
    model: SentenceEncoderModel.ArcticXS);
```
2. Long-form content with chunking + AI Search enabled.
```csharp
var settings = new SettingsHolder();
settings.ManuallySet(nameof(SentenceEmbeddingsIndexOptions.ChunkText), "True");
settings.ManuallySet(nameof(SentenceEmbeddingsIndexOptions.ChunkOverlap), "50");
settings.ManuallySet(nameof(SentenceEmbeddingsIndexOptions.MaximumChunks), "200");
settings.ManuallySet(nameof(SentenceEmbeddingsIndexOptions.MinimumLength), "20");
settings.ManuallySet(nameof(SentenceEmbeddingsIndexOptions.EnableAISearch), "True");
settings.ManuallySet(nameof(SentenceEmbeddingsIndexOptions.InjectResultCutoff), "0.50");
settings.ManuallySet(nameof(SentenceEmbeddingsIndexOptions.RerankResultCutoff), "0.40");

var index = await Graph.Indexes.AddSentenceEmbeddingsIndexAsync(
    nodeType: N.SupportCase.Type,
    fieldName: N.SupportCase.Content,
    model: SentenceEncoderModel.ArcticXS,
    setting: settings);
```
You can also register the same configuration through the admin UI under Settings → Indexes → Code Indexes → Sentence Embeddings, or via a migration. The UI sets the same SentenceEmbeddingsIndexOptions fields under the hood.
## Index options reference

The `SentenceEmbeddingsIndexOptions` class drives every knob:
| Option | Type | Default | Effect |
|---|---|---|---|
| `SentenceeEncoderModel` | enum | `ArcticXS` | Which encoder runs. Changing it rebuilds the index. |
| `MinimumLength` | int | 20 | Skip values shorter than this; small strings hurt precision. |
| `InferencingCores` | int | 1 | ONNX session intra-op threads. |
| `ParallelInferencing` | int | 1 | How many texts encode concurrently. |
| `ChunkText` | bool | false | Split long values into overlapping windows before encoding. |
| `ChunkOverlap` | int | 50 | Tokens shared between adjacent chunks (only used when `ChunkText` is true). |
| `MaximumChunks` | int | 100 | Cap chunks per value to bound work on very long docs. |
| `EnableAISearch` | bool | false | Allow the search controller to inject these vectors into hybrid queries. |
| `InjectResultCutoff` | float | 0.50 | Min cosine similarity for a vector hit to enter the search result set. |
| `RerankResultCutoff` | float | 0.40 | Min similarity to reorder an existing BM25 hit. |
| `ResultsToExpand` | int | 100 | Top-N pulled from HNSW before cutoff is applied. |
| `Binary` | bool | false | Store vectors as 1-bit values (smaller, faster, less accurate). |
| `Buckets` | int | 1 | Shards the HNSW across N searchers — increase for very large corpora. |
| `FileTypes` | enum[] | null | Restrict to files matching these `FilesType` values (used for `_FileEntry`). |
| `SourcesToIndex` | string[] | null | Restrict to nodes ingested by specific connectors. |
| `ExternalProviderName` / `…Url` / `…Model` / `…ApiKey` | string | — | External encoder configuration (only used with `SentenceEncoderModel.External`). |
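To make the interplay of `ResultsToExpand`, `InjectResultCutoff`, and `RerankResultCutoff` concrete, here is a hedged Python sketch of how the hybrid merge plausibly behaves (the real controller logic is more involved, and BM25 and cosine scores are not directly comparable in practice, but the cutoff roles match the table above):

```python
def merge_vector_hits(bm25_hits: dict, vector_hits: list,
                      results_to_expand: int = 100,
                      inject_cutoff: float = 0.50,
                      rerank_cutoff: float = 0.40) -> list:
    """bm25_hits: uid -> score; vector_hits: (uid, cosine) pairs, best first.
    Returns uids ranked best-first after applying both cutoffs."""
    merged = dict(bm25_hits)
    for uid, cos in vector_hits[:results_to_expand]:   # top-N pulled from HNSW
        if uid in bm25_hits:
            if cos >= rerank_cutoff:
                merged[uid] = cos          # reorder an existing BM25 hit
        elif cos >= inject_cutoff:
            merged[uid] = cos              # inject a pure vector hit
    return sorted(merged, key=lambda uid: -merged[uid])
```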
## Consuming the vectors

Once the index is built, you query it like any other text similarity index. From an endpoint:
```csharp
// 1. Pure semantic retrieval (text → similar nodes).
var query = await Q()
    .StartAtSimilarTextAsync(
        text: "battery drains overnight",
        count: 20,
        nodeTypes: new[] { N.SupportCase.Type });
var hits = query.EmitWithScores();

// 2. From a seed node — find neighbors of an existing UID using only this index.
var index = Graph.Indexes
    .OfType<SentenceEmbeddingsIndex>(N.Product.Type)
    .First(i => i.FieldName == N.Product.Name);

return Q().StartAt(productUID)
    .Similar(IndexTypes.SentenceEmbeddingsIndex, index.UID, count: 20)
    .EmitWithScores();
```
See IQuery Similarity Search for the full set of IQuery methods that consume sentence embeddings, including how to filter by index UID.
## Inspecting an index in code
```csharp
foreach (var ix in Graph.Indexes.OfType<SentenceEmbeddingsIndex>())
{
    Logger.LogInformation("{Field} on {Type}: {Vectors} vectors, model={Model}",
        ix.FieldName, ix.NodeType, ix.VectorsCount, ix.SentenceEncoderModel);
}
```
`ITextSimilarityIndex.PredictVectorAsync(text, ct)` returns the raw vector for a string — useful when you want to feed the encoder's output into a non-Curiosity store, or to debug similarity scores.
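For instance, once you have two vectors back from `PredictVectorAsync`, the score the index ranks by is plain cosine similarity, which is easy to recompute while debugging (Python sketch):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```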
## See also
- AI Search — operator-facing tuning of the same options.
- IQuery Similarity Search — calling the index from queries and endpoints.
- Similarity Engine — combining sentence embeddings with graph signals.
- Raw Embeddings — when you have vectors from elsewhere.
- Indexes overview — how the indexing queue works.