Curiosity - Built-in Embedding Models

Curiosity Workspace ships with three embedding models that run locally on CPU, plus an External mode that proxies to any OpenAI-compatible embedding endpoint. A model is chosen per index under Settings → AI Indexing, so the same field can be embedded by more than one model by creating one index per model.

This page describes each built-in model and the trade-offs between them, and explains how to plug in an external embedding provider when none of the built-ins fit.

Quick comparison

Model	Dimensions	Max tokens	Languages	Bundled?	Use when
MiniLM	384	256	English	Bundled with the workspace	Lowest memory and fastest throughput; short English text.
Arctic XS	384	512	English	Bundled with the workspace	Slightly higher retrieval quality than MiniLM on the same hardware budget.
Harrier	640	32 768	Multilingual	Downloaded on first use	Multilingual corpora, long passages, or any case where retrieval quality dominates throughput.
External	depends on provider	depends on provider	depends on provider	Requires API credentials	Hosted models (OpenAI, Azure OpenAI, Cohere, Google, …), or a self-hosted OpenAI-compatible server.

The two bundled models (MiniLM, Arctic XS) ship inside the workspace package and are available immediately. Harrier downloads its safetensors weights (~540 MB) into the workspace's Models/Embeddings/ folder on first use; you can pre-download from Settings → AI Indexing → Embedding Models to avoid the wait when the first index is created. In Docker images the Harrier weights are pre-bundled under /app/models/harrier/, so the model is ready without a download.
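As a rough sizing aid, the dimensions in the comparison table translate directly into raw vector storage. The sketch below assumes vectors are stored as 32-bit floats with no compression, index, or metadata overhead. This page does not state the storage format, so treat the result as a ballpark figure rather than a description of how the workspace stores vectors.

```python
# Assumption: 4 bytes per dimension (float32), no index or metadata overhead.
BYTES_PER_FLOAT32 = 4

# Dimensions taken from the comparison table.
MODEL_DIMENSIONS = {"MiniLM": 384, "Arctic XS": 384, "Harrier": 640}


def raw_vector_bytes(model: str, chunk_count: int) -> int:
    """Return the raw size in bytes of `chunk_count` vectors for `model`."""
    if chunk_count < 0:
        raise ValueError("chunk_count must be non-negative")
    return MODEL_DIMENSIONS[model] * BYTES_PER_FLOAT32 * chunk_count


if __name__ == "__main__":
    for name in MODEL_DIMENSIONS:
        mib = raw_vector_bytes(name, 1_000_000) / (1024 ** 2)
        print(f"{name}: {mib:.0f} MiB per million chunks")
```

Under this assumption a Harrier vector takes 2 560 bytes against 1 536 bytes for the two 384-dim models, about 1.67 times as much per chunk.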

Pros and cons

MiniLM (all-MiniLM-L6-v2)

Reference: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Architecture: 6-layer MiniLM (BERT-style encoder), mean-pooled, 22 M parameters, 384-dim vectors.

Pros

Smallest memory footprint of the built-ins (model fits in a few tens of MB).
Highest throughput on CPU — the fastest of the built-ins.
Ships embedded in the package; nothing to download or pre-bundle.
Mature, widely benchmarked baseline.

Cons

English-only training data — multilingual queries degrade quickly.
256-token context — long fields require chunking with significant overlap.
384-dim vectors give less retrieval headroom than higher-dim models on hard queries.
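The chunking that a 256-token window forces can be sketched as a sliding window over a token sequence. This is an illustration only: the function name, the 64-token overlap, and the plain list standing in for tokenizer output are all assumptions, and the workspace's own chunker may behave differently.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], window: int = 256, overlap: int = 64) -> List[List[str]]:
    """Split `tokens` into windows of at most `window` items.

    Consecutive windows share `overlap` items, so text cut at a boundary
    is more likely to appear whole in one of the neighbouring chunks.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks
```

With these example settings, a 1 000-token field yields five chunks at a 256-token window and three at a 512-token window.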

Arctic XS (Snowflake/snowflake-arctic-embed-xs)

Reference: https://huggingface.co/Snowflake/snowflake-arctic-embed-xs

Architecture: 6-layer transformer (~22 M parameters) tuned for retrieval, CLS-token pooled, 384-dim vectors.

Pros

Tuned specifically for retrieval (rather than generic sentence similarity), so it tends to rank relevant passages higher than MiniLM on the same corpus.
Same 384-dim vector layout as MiniLM — drop-in replacement on the storage side.
Bundled with the workspace.
512-token context — twice the window of MiniLM, fewer chunks per long field.

Cons

English-only.
Modestly slower than MiniLM on CPU (roughly 15 % in the benchmark below).
Trained for retrieval; "general similarity" (e.g., paraphrase clustering on prose) can be marginally weaker than MiniLM on some tasks.

Harrier (microsoft/harrier-oss-v1-270m)

Reference: https://huggingface.co/microsoft/harrier-oss-v1-270m

Architecture: Gemma3-based decoder, last-token pooled, 270 M parameters, 640-dim vectors. Curiosity Workspace ships the pure-managed variant (SentenceTransformers.Harrier.Small.Pure) — no native ONNX runtime is loaded; the forward pass runs entirely on System.Numerics.Tensors.TensorPrimitives, so it works in AOT and trimmed deployments.
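The difference between mean pooling (as described for MiniLM) and the last-token pooling named here can be illustrated on plain Python lists. The sketch assumes right-padded sequences and an attention mask of 1s and 0s. It is a conceptual illustration, not the workspace's implementation.

```python
from typing import List

Matrix = List[List[float]]  # one row of hidden values per token


def mean_pool(hidden: Matrix, mask: List[int]) -> List[float]:
    """Average the hidden rows of all non-padding tokens."""
    kept = [row for row, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]


def last_token_pool(hidden: Matrix, mask: List[int]) -> List[float]:
    """Return the hidden row of the last non-padding token."""
    for row, m in zip(reversed(hidden), reversed(mask)):
        if m == 1:
            return list(row)
    raise ValueError("mask selects no tokens")
```

Because last-token pooling reads a single position, whichever token ends up last in the window determines the vector, and truncating the input changes which token that is.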

Pros

Multilingual — covers a wide language mix without retraining.
32 768-token context — embeds whole pages or long transcripts without aggressive chunking.
640-dim vectors give noticeably better retrieval on hard or noisy corpora.
100 % managed implementation (no native dependency).
Supports instruction-prefixed query encoding (e.g., "Instruct: Given a web search query…\nQuery: …") so query and document encodings can be specialised.

Cons

Significantly slower than MiniLM / Arctic XS — see the benchmark below.
~540 MB of safetensors weights to download (or pre-bundle) on first use.
Decoder-style last-token pooling is more sensitive to truncation at the right edge of the window.
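The instruction-prefixed query format listed in the pros can be sketched as simple string construction. The template follows the "Instruct: …\nQuery: …" shape quoted there. The function names and the example task sentence are hypothetical, and leaving documents unprefixed is an assumption to verify against the model card.

```python
def format_query(task: str, query: str) -> str:
    """Wrap a search query in the 'Instruct: ...\\nQuery: ...' template."""
    return f"Instruct: {task}\nQuery: {query}"


def format_document(text: str) -> str:
    """Assumption: documents are embedded as-is, without an instruction."""
    return text


if __name__ == "__main__":
    # Hypothetical task description, used for illustration only.
    task = "Given a web search query, retrieve relevant passages"
    print(format_query(task, "how do I rotate an API key?"))
```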

External

Use this when none of the built-ins fit — most commonly when you want a hosted model (OpenAI, Azure OpenAI, Cohere, Google) or a self-hosted OpenAI-compatible server. See Integrating an external provider below.

Pros

Any embedding dimension and any context length the provider supports.
Quality of large hosted models (text-embedding-3-large, etc.) is generally above any CPU-only built-in.
No CPU pressure on the workspace host — embedding throughput scales with the provider's quota, not your hardware.

Cons

Network round-trip per encode — latency depends on the provider.
Costs scale with usage; needs explicit token budgeting.
Data leaves the workspace host; check residency requirements before enabling.
Provider outages stop new vectors from being indexed until you fail over or recover. See LLM Configuration → Fallback and degradation.

Relative speed

The numbers below were measured on a developer laptop CPU at batch size 8. Absolute throughput depends entirely on hardware, so only the ratios between models carry over to other machines.

Model	Vector size	Approx. relative throughput on CPU
MiniLM (`L6-v2`)	384	1.0× (baseline — fastest)
Arctic XS	384	≈ 0.85× (~15 % slower than MiniLM)
Harrier (Int8)	640	≈ 0.12× (~8× slower than MiniLM)

Read this as: if MiniLM embeds N chunks per hour on a given host, Arctic XS embeds roughly 0.85 × N and Harrier embeds roughly N / 8 on the same host. The Harrier slowdown is the cost of (a) a much larger model and (b) a 640-dim output vector. If you index large corpora with Harrier, plan extra background-indexing time.
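The reading above can be turned into a small planning calculation. The sketch below uses the relative throughput figures from the table and a MiniLM baseline that you measure on your own host. The function name and the baseline value in the usage note are hypothetical.

```python
# Relative CPU throughput from the table above (MiniLM = 1.0).
RELATIVE_THROUGHPUT = {"MiniLM": 1.0, "Arctic XS": 0.85, "Harrier": 0.12}


def estimated_hours(model: str, chunk_count: int, minilm_chunks_per_hour: float) -> float:
    """Estimate indexing time from a measured MiniLM baseline on the same host."""
    if minilm_chunks_per_hour <= 0:
        raise ValueError("baseline throughput must be positive")
    return chunk_count / (minilm_chunks_per_hour * RELATIVE_THROUGHPUT[model])


if __name__ == "__main__":
    # Hypothetical baseline: 500 000 chunks per hour with MiniLM.
    for name in RELATIVE_THROUGHPUT:
        hours = estimated_hours(name, 1_000_000, 500_000)
        print(f"{name}: {hours:.1f} h")
```

With that hypothetical baseline, one million chunks take about 2 hours with MiniLM, about 2.4 hours with Arctic XS, and about 16.7 hours with Harrier.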

For absolute numbers on your hardware, use the benchmark in sentence-transformers-sharp — the same benchmark project that produced the figures above.

Picking a model

A pragmatic rule of thumb:

English only, short fields, throughput-bound: start with MiniLM. Move to Arctic XS if you see retrieval misses.
English only, longer documents (paragraphs and up): Arctic XS. Its 512-token window roughly halves the chunk count for typical KB articles compared with MiniLM's 256-token window.
Multilingual content, or long passages, or "the retrieval has to be good": Harrier. Budget for ~8× the indexing time of MiniLM.
You already run hosted LLMs and want best-in-class retrieval without spending CPU: External with a hosted embedding model (e.g. text-embedding-3-small / -large).

You can also mix: index the same field with both Arctic XS (cheap) and Harrier (high quality), and route queries to whichever scope matches the user's need.
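The rule of thumb above can be written down as a small decision function. This is one possible encoding, not product behaviour: the function name, the three field-length labels, and the order in which the conditions are checked are all assumptions made for illustration.

```python
FIELD_LENGTHS = ("short", "paragraphs", "long")


def suggest_model(field_length: str, multilingual: bool = False,
                  quality_critical: bool = False, prefer_hosted: bool = False) -> str:
    """Map the rule of thumb to a model name (hypothetical helper)."""
    if field_length not in FIELD_LENGTHS:
        raise ValueError(f"field_length must be one of {FIELD_LENGTHS}")
    # Assumption: a preference for hosted models overrides the local choices.
    if prefer_hosted:
        return "External"
    if multilingual or quality_critical or field_length == "long":
        return "Harrier"
    if field_length == "paragraphs":
        return "Arctic XS"
    return "MiniLM"  # English only, short fields, throughput-bound


if __name__ == "__main__":
    print(suggest_model("short"))
    print(suggest_model("paragraphs"))
    print(suggest_model("short", multilingual=True))
```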

Integrating an external provider

The External model talks to any provider that speaks the OpenAI embeddings API. That covers OpenAI directly, Azure OpenAI, Cohere via their OpenAI-compatible endpoint, Google Vertex AI's OpenAI shim, and any self-hosted runner that exposes the protocol (Ollama, vLLM, LM Studio, Text Embeddings Inference).
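For orientation, the sketch below shows what a request in the OpenAI embeddings format generally looks like: a POST to the base URL plus /embeddings, with a JSON body carrying the model name and the input texts. It illustrates the protocol, not the workspace's client code. The function names are hypothetical, and Bearer authentication is the OpenAI convention, which some providers replace with a different header.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(base_url: str, api_key: str, model: str,
                             inputs: List[str]) -> urllib.request.Request:
    """Build a POST request for `<base_url>/embeddings` (nothing is sent)."""
    url = base_url.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: Bearer auth, as used by OpenAI itself.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_embeddings_response(payload: bytes) -> List[List[float]]:
    """Return vectors in input order from an OpenAI-style response body."""
    items = json.loads(payload)["data"]
    return [item["embedding"] for item in sorted(items, key=lambda i: i["index"])]
```

Passing the built request to urllib.request.urlopen and the response body to parse_embeddings_response gives a quick way to confirm that an endpoint answers in the expected shape before pointing an index at it.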

Create the index with the External model

Open Settings → AI Indexing.
Pick the External segment.
For the field you want to embed, click Create.

Configure the provider on the index

Click the Advanced settings button (⋯) on the new index.
Fill in:

External provider name — display label only (e.g. OpenAI, Azure, Self-hosted vLLM).
External provider URL — base URL of the embeddings endpoint. For OpenAI this is https://api.openai.com/v1. For an OpenAI-compatible local server, use the server's base URL including the /v1 path (e.g. http://ollama.internal:11434/v1).
External provider model — the provider-side model identifier (e.g. text-embedding-3-small, voyage-3-lite, mxbai-embed-large).
External provider API key — encrypted at rest with MSK_GRAPH_MASTER_KEY.

Save. The workspace re-creates the encoder against the new settings on the next encode.

Initial indexing

The first save starts an initial indexing run that embeds all existing nodes of the configured node type. Depending on how much data is already in the workspace, this can take a while. You can monitor progress on the AI Indexing page.

Run the External provider behind a small reverse proxy (e.g. Azure Front Door or your own gateway) so you can rotate keys, swap providers, and rate-limit without touching workspace configuration. The workspace only needs a stable URL and API key.

The same field can be indexed by multiple models, but a single index is pinned to one model at creation time. Switching the External provider URL / model on an existing index does not re-embed the historical data — you must Clear the index from its advanced settings to force a rebuild against the new provider.

Operational notes

Bundled models (MiniLM, Arctic XS) load in a few hundred milliseconds on first use.
Harrier loads its safetensors weights into memory once per process; expect a one-time ~1–2 GB working set at load time (RAM, not disk).
For Docker deployments, the production pipeline pre-bundles Harrier under /app/models/harrier/harrier-oss-v1-270m.safetensors. The runtime resolver (EmbeddingModelDownloader.GetModelWeightsPath) prefers that location over the per-workspace storage folder when both exist.
Switching a configured index to a different model (or different External provider) requires a re-embed of the indexed data. Plan the rebuild — for large corpora it can take hours.
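The lookup order described for the Docker case can be pictured as follows. The real resolver is EmbeddingModelDownloader.GetModelWeightsPath inside the workspace. The Python function below is a hypothetical analogue written for illustration, and the fallback to the per-workspace copy when only that copy exists is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_weights_path(bundled: Path, workspace: Path) -> Optional[Path]:
    """Prefer the pre-bundled file; otherwise use the per-workspace copy."""
    for candidate in (bundled, workspace):
        if candidate.is_file():
            return candidate
    return None  # neither exists: the weights would have to be downloaded
```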
