Curiosity

Built-in Embedding Models

Curiosity Workspace ships with three embedding models that run locally on CPU, plus an External mode that proxies to any OpenAI-compatible embedding endpoint. The local models are picked per index under Settings → AI Indexing, so the same field can be embedded by more than one model.

This page describes each built-in model and the trade-offs between them, and explains how to plug in an external embedding provider when none of the built-ins fit.

Quick comparison

| Model | Dimensions | Max tokens | Languages | Bundled? | Use when |
| --- | --- | --- | --- | --- | --- |
| MiniLM | 384 | 256 | English | Bundled with the workspace | Lowest memory and fastest throughput; short English text. |
| Arctic XS | 384 | 512 | English | Bundled with the workspace | Slightly higher retrieval quality than MiniLM on the same hardware budget. |
| Harrier | 640 | 32 768 | Multilingual | Downloaded on first use | Multilingual corpora, long passages, or any case where retrieval quality dominates throughput. |
| External | depends on provider | depends on provider | depends on provider | Requires API credentials | Hosted models (OpenAI, Azure OpenAI, Cohere, Google, …), or a self-hosted OpenAI-compatible server. |

The two bundled models (MiniLM, Arctic XS) ship inside the workspace package and are available immediately. Harrier downloads its safetensors weights (~540 MB) into the workspace's Models/Embeddings/ folder on first use; you can pre-download from Settings → AI Indexing → Embedding Models to avoid the wait when the first index is created. In Docker images the Harrier weights are pre-bundled under /app/models/harrier/, so the model is ready without a download.

Pros and cons

MiniLM (all-MiniLM-L6-v2)

Reference: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Architecture: 6-layer MiniLM (a distilled BERT-style encoder), mean-pooled, 22 M parameters, 384-dim vectors.

Pros

  • Smallest memory footprint of the built-ins (model fits in a few tens of MB).
  • Highest throughput on CPU — the fastest of the built-ins.
  • Ships embedded in the package; nothing to download or pre-bundle.
  • Mature, widely benchmarked baseline.

Cons

  • English-only training data — multilingual queries degrade quickly.
  • 256-token context — long fields require chunking with significant overlap.
  • 384-dim vectors give less retrieval headroom than higher-dim models on hard queries.
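
The chunking that the 256-token limit forces can be sketched as a sliding window. This is a minimal illustration, not the workspace's own chunker: the function name, the default overlap of 64 tokens, and the use of whitespace splitting in place of the model's real tokenizer are all assumptions.

```python
from typing import List


def chunk_tokens(tokens: List[str], window: int = 256, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens, so text cut at one
    boundary still appears intact in the neighbouring chunk.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the field
    return chunks


# Whitespace splitting stands in for the real tokenizer (an assumption).
words = ("word " * 600).split()
chunks = chunk_tokens(words, window=256, overlap=64)
print([len(c) for c in chunks])  # [256, 256, 216]
```

A real tokenizer also spends a few positions on special tokens, so in practice the usable window is slightly below the nominal limit.
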
Arctic XS (Snowflake/snowflake-arctic-embed-xs)

Reference: https://huggingface.co/Snowflake/snowflake-arctic-embed-xs

Architecture: 6-layer transformer (~22 M parameters) tuned for retrieval, mean-pooled, 384-dim vectors.

Pros

  • Tuned specifically for retrieval (rather than generic sentence similarity), so it tends to rank relevant passages higher than MiniLM on the same corpus.
  • Same 384-dim vector layout as MiniLM — drop-in replacement on the storage side.
  • Bundled with the workspace.
  • 512-token context — twice the window of MiniLM, fewer chunks per long field.

Cons

  • English-only.
  • Modestly slower than MiniLM on CPU (a small constant per call).
  • Trained for retrieval; "general similarity" (e.g., paraphrase clustering on prose) can be marginally weaker than MiniLM on some tasks.
Harrier (microsoft/harrier-oss-v1-270m)

Reference: https://huggingface.co/microsoft/harrier-oss-v1-270m

Architecture: Gemma3-based decoder, last-token pooled, 270 M parameters, 640-dim vectors. Curiosity Workspace ships the pure-managed variant (SentenceTransformers.Harrier.Small.Pure) — no native ONNX runtime is loaded; the forward pass runs entirely on System.Numerics.Tensors.TensorPrimitives, so it works in AOT and trimmed deployments.

Pros

  • Multilingual — covers a wide language mix without retraining.
  • 32 768-token context — embeds whole pages or long transcripts without aggressive chunking.
  • 640-dim vectors give noticeably better retrieval on hard or noisy corpora.
  • 100 % managed implementation (no native dependency).
  • Supports instruction-prefixed query encoding (e.g., "Instruct: Given a web search query…\nQuery: …") so query and document encodings can be specialised.

Cons

  • Significantly slower than MiniLM / Arctic XS — see the benchmark below.
  • ~540 MB of safetensors weights to download (or pre-bundle) on first use.
  • Decoder-style last-token pooling is more sensitive to truncation at the right edge of the window.
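
The instruction-prefixed query encoding mentioned in the list above can be illustrated with a small formatting helper. This is a sketch only: the function names and the example task string are hypothetical, and passing documents through without a prefix is an assumption rather than documented workspace behaviour.

```python
def format_query(task: str, query: str) -> str:
    """Wrap a query in the "Instruct: ...\\nQuery: ..." template."""
    return f"Instruct: {task}\nQuery: {query}"


def format_document(text: str) -> str:
    """Documents are left unprefixed in this sketch (an assumption)."""
    return text


# The task description below is a placeholder, not a string taken from the model card.
task = "Retrieve passages that answer the question"
print(format_query(task, "How do I rotate an API key?"))
```

Because only the query side carries the instruction, the same document vectors can serve several query tasks without re-indexing.
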
External

Use this when none of the built-ins fit — most commonly when you want a hosted model (OpenAI, Azure OpenAI, Cohere, Google) or a self-hosted OpenAI-compatible server. See Integrating an external provider below.

Pros

  • Any embedding dimension and any context length the provider supports.
  • Quality of large hosted models (text-embedding-3-large, etc.) is generally above any CPU-only built-in.
  • No CPU pressure on the workspace host — embedding throughput scales with the provider's quota, not your hardware.

Cons

  • Network round-trip per encode — latency depends on the provider.
  • Costs scale with usage; needs explicit token budgeting.
  • Data leaves the workspace host; check residency requirements before enabling.
  • Provider outages stop new vectors from being indexed until you fail over or recover. See LLM Configuration → Fallback and degradation.
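
Token budgeting for a hosted provider comes down to simple arithmetic. The helper below is a rough sketch: the function name is hypothetical, and the price is a parameter you would take from your provider's current price list, not a real quote.

```python
def estimate_embedding_cost(chunk_count: int,
                            avg_tokens_per_chunk: int,
                            price_per_million_tokens: float) -> float:
    """Estimate the cost of embedding a corpus once.

    Cost = total tokens / 1,000,000 * price per million tokens.
    """
    total_tokens = chunk_count * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


# Placeholder inputs: 10,000 chunks of ~300 tokens at a made-up price of 0.10 per million tokens.
cost = estimate_embedding_cost(10_000, 300, 0.10)
print(f"Estimated one-off indexing cost: {cost:.2f}")
```

Remember that a rebuild after switching model or provider incurs the full cost again.
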

Relative speed

Numbers below were measured on a developer laptop CPU at batch size 8. The absolute throughput depends entirely on hardware — only the relative speedups are portable.

| Model | Vector size | Approx. relative throughput on CPU |
| --- | --- | --- |
| MiniLM (L6-v2) | 384 | 1.0× (baseline, fastest) |
| Arctic XS | 384 | ≈ 0.85× (~15 % slower than MiniLM) |
| Harrier (Int8) | 640 | ≈ 0.12× (~8× slower than MiniLM) |

Read this as: if MiniLM embeds N chunks per hour on a given host, Arctic XS embeds roughly 0.85 × N and Harrier embeds roughly N / 8 on the same host. The Harrier slowdown is the cost of (a) a much larger model and (b) a 640-dim output vector. If you index large corpora with Harrier, plan extra background-indexing time.
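
The same reading can be turned into a quick planning calculation. The sketch below uses the relative throughput factors from the table; the function name and the MiniLM baseline of 500,000 chunks per hour are hypothetical, so substitute a figure measured on your own host.

```python
# Relative CPU throughput factors from the table above (MiniLM = 1.0).
RELATIVE_THROUGHPUT = {"MiniLM": 1.0, "Arctic XS": 0.85, "Harrier": 0.12}


def estimated_hours(chunk_count: int, minilm_chunks_per_hour: float, model: str) -> float:
    """Estimate indexing time for `model` from a measured MiniLM baseline."""
    chunks_per_hour = minilm_chunks_per_hour * RELATIVE_THROUGHPUT[model]
    return chunk_count / chunks_per_hour


# Hypothetical baseline: MiniLM embeds 500,000 chunks per hour on this host.
for name in RELATIVE_THROUGHPUT:
    hours = estimated_hours(1_000_000, 500_000, name)
    print(f"{name}: {hours:.1f} h")
# Roughly 2.0 h for MiniLM, 2.4 h for Arctic XS, and 16.7 h for Harrier.
```
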

For absolute numbers on your hardware, use the benchmark in sentence-transformers-sharp — the same benchmark project that produced the figures above.

Picking a model

A pragmatic rule of thumb:

  • English only, short fields, throughput-bound: start with MiniLM. Move to Arctic XS if you see retrieval misses.
  • English only, longer documents (paragraphs and up): Arctic XS — the 512-token window halves the number of chunks for typical KB articles.
  • Multilingual content, or long passages, or "the retrieval has to be good": Harrier. Budget for ~8× the indexing time of MiniLM.
  • You already run hosted LLMs and want best-in-class retrieval without spending CPU: External with a hosted embedding model (e.g. text-embedding-3-small / -large).
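
The rule of thumb above can be written down as a small decision function. This is only an illustration: the function and parameter names are hypothetical, and the order in which the conditions are checked is an assumption, since the list does not rank them.

```python
def pick_model(multilingual: bool = False,
               long_passages: bool = False,
               quality_critical: bool = False,
               paragraph_length_fields: bool = False,
               prefer_hosted: bool = False) -> str:
    """Map corpus traits to a starting model, following the rule of thumb."""
    if prefer_hosted:
        # Hosted LLMs already in use and no CPU to spare.
        return "External"
    if multilingual or long_passages or quality_critical:
        # Budget for roughly 8x the indexing time of MiniLM.
        return "Harrier"
    if paragraph_length_fields:
        # English only, paragraphs and up: the 512-token window helps.
        return "Arctic XS"
    # English only, short fields, throughput-bound.
    return "MiniLM"


print(pick_model())                              # MiniLM
print(pick_model(paragraph_length_fields=True))  # Arctic XS
print(pick_model(multilingual=True))             # Harrier
```
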

You can also mix: index the same field with both Arctic XS (cheap) and Harrier (high quality), and route queries to whichever scope matches the user's need.

Integrating an external provider

The External model talks to any provider that speaks the OpenAI embeddings API. That covers OpenAI directly, Azure OpenAI, Cohere via their OpenAI-compatible endpoint, Google Vertex AI's OpenAI shim, and any self-hosted runner that exposes the protocol (Ollama, vLLM, LM Studio, Text Generation Inference).
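
For orientation, a request in the OpenAI embeddings format looks roughly like the sketch below: a POST to `<base URL>/embeddings` with a bearer token and a JSON body naming the model and the input texts. The helper names are hypothetical, and the code illustrates the wire format only; it is not a description of how the workspace issues its calls. It can be useful for smoke-testing a self-hosted endpoint before pointing an index at it.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(base_url: str, api_key: str, model: str,
                             texts: List[str]) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    url = base_url.rstrip("/") + "/embeddings"
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_embeddings_response(payload: dict) -> List[List[float]]:
    """Return the vectors in input order from an OpenAI-style response body."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# "sk-placeholder" is a dummy key; nothing is sent over the network here.
request = build_embeddings_request(
    "https://api.openai.com/v1", "sk-placeholder", "text-embedding-3-small", ["hello"]
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response
# and pass the resulting dict to parse_embeddings_response().
```
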

1. Create the index with the External model

  1. Open Settings → AI Indexing.
  2. Pick the External segment.
  3. For the field you want to embed, click Create.

2. Configure the provider on the index

  1. Click the Advanced settings button on the new index.
  2. Fill in:
     • External provider name: display label only (e.g. OpenAI, Azure, Self-hosted vLLM).
     • External provider URL: base URL of the embeddings API. For OpenAI this is https://api.openai.com/v1. For an OpenAI-compatible local server, use the server's base URL including the /v1 route (e.g. http://ollama.internal:11434/v1).
     • External provider model: the provider-side model identifier (e.g. text-embedding-3-small, voyage-3-lite, mxbai-embed-large).
     • External provider API key: encrypted at rest with MSK_GRAPH_MASTER_KEY.
  3. Save. The workspace re-creates the encoder against the new settings on the next encode.

3. Initial indexing

The first save kicks off an initial indexing run that embeds all existing nodes of the configured node type. Depending on how much data is already in the workspace, this can take a while. You can monitor the indexing progress on the AI Indexing page.

Run the External provider behind a small reverse proxy (e.g. Azure Front Door or your own gateway) so you can rotate keys, swap providers, and rate-limit without touching workspace configuration. The workspace only needs a stable URL + key pair.

The same field can be indexed by multiple models, but a single index is pinned to one model at creation time. Switching the External provider URL / model on an existing index does not re-embed the historical data — you must Clear the index from its advanced settings to force a rebuild against the new provider.

Operational notes

  • Bundled models (MiniLM, Arctic XS) load in a few hundred milliseconds on first use.
  • Harrier loads its safetensors weights into memory once per process; expect a one-time ~1–2 GB working set at load time (RAM, not disk).
  • For Docker deployments, the production pipeline pre-bundles Harrier under /app/models/harrier/harrier-oss-v1-270m.safetensors. The runtime resolver (EmbeddingModelDownloader.GetModelWeightsPath) prefers that location over the per-workspace storage folder when both exist.
  • Switching a configured index to a different model (or different External provider) requires a re-embed of the indexed data. Plan the rebuild — for large corpora it can take hours.
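
The resolution order described for Docker deployments (pre-bundled weights first, per-workspace storage second) can be sketched as follows. This is not the code of EmbeddingModelDownloader.GetModelWeightsPath; the function name and the workspace-side file name are hypothetical, and only the /app/models/harrier/ path comes from this page.

```python
from pathlib import Path
from typing import Optional


def resolve_weights_path(bundled: Path, workspace: Path) -> Optional[Path]:
    """Return the first existing weights file, preferring the bundled copy."""
    for candidate in (bundled, workspace):
        if candidate.is_file():
            return candidate
    return None  # neither copy exists, so a download would be needed


bundled = Path("/app/models/harrier/harrier-oss-v1-270m.safetensors")
# Hypothetical location inside the workspace's Models/Embeddings/ folder.
workspace = Path("Models/Embeddings/harrier-oss-v1-270m.safetensors")
print(resolve_weights_path(bundled, workspace))
```
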

© 2026 Curiosity. All rights reserved.