Types of NLP Models
A side-by-side look at the four kinds of extraction model Curiosity Workspace can run: Spotters, Pattern Spotters, ML / NER, and LLM extraction. Pick the right one — or the right combination — based on your vocabulary, volume, and accuracy needs.
The first two — Spotter and Pattern Spotter — are first-class model types you create and edit in the admin UI under Settings → Entity Capturing (tabs Spotters and Patterns). The other two — ML / NER and LLM extraction — are integrated programmatically: you call them from a data connector, a custom code index, or a custom endpoint, and write the results back to the graph yourself.
To decide whether extraction is the right tool in the first place, see NLP overview. For configuration, see Entity extraction.
Comparison at a glance
| Aspect | Spotter | Pattern Spotter | ML / NER | LLM extraction |
|---|---|---|---|---|
| UI surface | Entity Capturing → Spotters | Entity Capturing → Patterns | None (external library + code) | None (custom code index / endpoint) |
| Best for | Finite vocabularies | Structured identifiers | Generic types (PERSON, ORG, …) | Open-ended fields, intents, summaries |
| Precision | Very high | Medium–high | Medium | Medium–high (depends on prompt) |
| Recall | Limited to the list | Wide on known shapes; misses novel formats | Wide for trained types | Widest |
| Latency | Microseconds | Microseconds | Milliseconds | Hundreds of ms to seconds |
| Cost / 1k docs | Negligible | Negligible | CPU/GPU compute | Provider API call (per token) |
| Maintenance | Update list as vocab grows | Update regex on new shapes | Re-train when accuracy drifts | Update prompt; track model versions |
| Determinism | Deterministic | Deterministic | Mostly deterministic | Stochastic (set temperature = 0 to reduce variance) |
| Multilingual | Per-language dictionary | Pattern-dependent | Per-language model | Inherent (most modern LLMs) |
| Confidence | Binary or boost-weighted | Binary or context-weighted | Probabilistic score | Self-reported, unreliable |
Spotter
Curated list of canonical terms with aliases, matched against tokenized text. Sometimes called a dictionary or gazetteer; in the UI it's just Spotter. To keep the vocabulary in sync with what's already in the graph, build the Spotter from a graph node type: use the New Spotter button on Settings → Entity Capturing → Spotters, or go to Management → Data → node type → Capture → New Spotter.
Pick this when:
- Your vocabulary is enumerable — product catalog, customer list, internal team names.
- You need deterministic results for compliance.
- You can keep the list updated as the catalog changes.
Avoid when:
- The space is open-ended (every possible person name, every potential brand).
- Your alias coverage is poor — you'll under-extract.
Configuration: see Entity extraction → Spotter and Spotter models.
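To make the matching behaviour concrete, here is a minimal Python sketch of dictionary-style spotting: each alias maps to a canonical term and is matched against tokenized, lowercased text, longest alias first. It illustrates the general technique only and is not Curiosity's implementation; `build_index`, `spot`, and the sample vocabulary are hypothetical.

```python
import re

# Hypothetical vocabulary: canonical term -> aliases (illustrative only).
VOCAB = {
    "MacBook Air": ["macbook air", "mba"],
    "MacBook Pro": ["macbook pro", "mbp"],
}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(vocab: dict[str, list[str]]) -> dict[tuple[str, ...], str]:
    """Map every tokenized surface form (canonical term and aliases) to its canonical term."""
    index = {}
    for canonical, aliases in vocab.items():
        for surface in [canonical, *aliases]:
            index[tuple(tokenize(surface))] = canonical
    return index

def spot(text: str, index: dict[tuple[str, ...], str]) -> list[str]:
    """Return canonical terms found in text, trying the longest alias at each position first."""
    tokens = tokenize(text)
    max_len = max(len(key) for key in index)
    found, i = [], 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            canonical = index.get(tuple(tokens[i:i + size]))
            if canonical is not None:
                found.append(canonical)
                i += size  # skip past the matched tokens
                break
        else:
            i += 1  # no alias starts at this token
    return found

INDEX = build_index(VOCAB)
print(spot("My MBA overheats, but the MacBook Pro is fine.", INDEX))
# A bare "macbook" is not in the alias list, so nothing is extracted.
print(spot("my macbook is slow", INDEX))
```

The second call shows the under-extraction risk from the list above: a surface form that is missing from the aliases is simply not found.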
Pattern Spotter
Regex-style matching for codes and identifiers. Created from Settings → Entity Capturing → Patterns → New Pattern (or Management → Data → node type → Capture → New Pattern Spotter).
Pick this when:
- Your entity has a consistent format: TICKET-12345, 0xDEADBEEF, SKU-AB-1234.
- The format is unambiguous in context.
Avoid when:
- The pattern is too generic (\d{4} matches years, postcodes, ages, and asset IDs alike).
- The format changes across vendors or sources without you knowing.
Always pair patterns with context constraints (context_must_include) and exclusions (context_includes → reject) to control over-firing.
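The effect of context constraints can be sketched in plain Python. The example below runs a deliberately generic regex and then keeps a match only if a required word appears nearby and no rejected word does. The function `spot_pattern` and its `must_include`, `reject_if`, and `window` parameters are hypothetical and do not mirror the product's configuration syntax.

```python
import re

def spot_pattern(text, pattern, must_include=(), reject_if=(), window=12):
    """Return regex matches whose surrounding context passes both checks.

    must_include: at least one of these lowercase words must appear near the match.
    reject_if: the match is dropped if any of these lowercase words appears near it.
    window: number of characters inspected on each side of the match.
    """
    hits = []
    for m in re.finditer(pattern, text):
        context = text[max(0, m.start() - window): m.end() + window].lower()
        if must_include and not any(word in context for word in must_include):
            continue
        if any(word in context for word in reject_if):
            continue
        hits.append(m.group())
    return hits

text = "Founded in 2019. Please check asset 4821 in the rack. Postcode 8004 is unrelated."
four_digits = r"\b\d{4}\b"

print(spot_pattern(text, four_digits))
print(spot_pattern(text, four_digits, must_include=("asset",), reject_if=("postcode",)))
```

Without constraints the pattern fires on the year, the asset ID, and the postcode; with them only the asset ID survives.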
ML / NER
Pre-trained named-entity recognizers — spaCy, Stanza, Hugging Face transformer pipelines, or any other library you already use. There is no built-in ML / NER type in the Entity Capturing UI. Instead, run the model externally as part of your ingestion code and write the entities back through the curiosity data connector library: fetch the source data, parse it with the NER model of your choice, and link the captured entities into the graph using the same add_or_update_by_key / link calls a normal connector uses.
Pick this when:
- You need generic types and don't have time to curate dictionaries (PERSON, ORG, LOCATION, MONEY, DATE).
- Your text is well-formed prose (news, emails, articles).
- You already have a Python (or C#) NER toolchain you're happy with.
Avoid when:
- Domain ambiguity is high — "Apple" in a tech corpus is unambiguous; in a grocery corpus it isn't.
- You need to extract specific business entities, not generic types.
Mix with Spotters: run the Spotter first for known entities and let the external NER pick up the long tail.
Example — spaCy + the curiosity Python package
The connector below fetches news articles from a source API, runs spaCy NER on each body, and writes both the article and the captured entities (Person, Organization, Location) into the graph as linked nodes:
# pip install curiosity spacy
# python -m spacy download en_core_web_md
import os

import requests
import spacy
from curiosity import Graph

NLP = spacy.load("en_core_web_md")

# Map spaCy entity labels to graph node types; any other label is ignored.
SPACY_TO_NODE = {
    "PERSON": "Person",
    "ORG": "Organization",
    "GPE": "Location",
    "LOC": "Location",
}

NODE_SCHEMAS = [
    {"type": "Article", "key": "Id", "properties": ["Title", "Body"], "timestamp": "Published"},
    {"type": "Person", "key": "Name"},
    {"type": "Organization", "key": "Name"},
    {"type": "Location", "key": "Name"},
]

EDGE_SCHEMAS = [
    {"name": "Mentions", "reverse": "MentionedIn"},
]


def ingest_article(g: Graph, src: dict) -> None:
    """Upsert one article, then link every distinct entity spaCy finds in its body."""
    g.add_or_update_by_key(
        node_type="Article",
        key=src["id"],
        content={"Title": src["title"], "Body": src["body"], "Published": src["published"]},
    )

    doc = NLP(src["body"])
    seen: set[tuple[str, str]] = set()  # avoids duplicate links within one article
    for ent in doc.ents:
        node_type = SPACY_TO_NODE.get(ent.label_)
        if node_type is None:
            continue
        name = ent.text.strip()
        if not name or (node_type, name) in seen:
            continue
        seen.add((node_type, name))

        g.add_or_update_by_key(node_type=node_type, key=name, content={"Name": name})
        g.link(
            from_type="Article", from_key=src["id"],
            to_type=node_type, to_key=name,
            edge="Mentions", reverse="MentionedIn",
        )


with Graph.connect(
    endpoint=os.environ["CURIOSITY_ENDPOINT"],
    token=os.environ["CURIOSITY_TOKEN"],
    connector_name="news-ner-connector",
) as g:
    for schema in NODE_SCHEMAS:
        g.ensure_node_schema(**schema)
    for edge in EDGE_SCHEMAS:
        g.ensure_edge_schema(**edge)
    g.set_auto_commit_cost(every_nodes=2_000)

    response = requests.get("https://news.example.com/api/articles?limit=500", timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page
    page = response.json()
    for record in page["items"]:
        ingest_article(g, record)
Each spaCy span becomes a typed node and a Mentions edge from the article, so search and graph traversal see the same entities a built-in Spotter would have produced. To swap spaCy for Stanza, Flair, or a Hugging Face pipeline, replace the model call and the loop that reads its entities; the graph side is identical.
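Swapping the recognizer mostly means adapting its output to the `(node_type, name)` pairs the connector writes. The sketch below does this for the list of dicts returned by a Hugging Face `pipeline("ner", aggregation_strategy="simple")` call, which carries `entity_group`, `word`, and `score` keys. The label map, the `min_score` threshold, and the sample data are assumptions; check which labels your chosen model actually emits.

```python
# Assumed label map for a CoNLL-style model; adjust to your model's labels.
HF_TO_NODE = {"PER": "Person", "ORG": "Organization", "LOC": "Location"}

def to_mentions(entities, min_score=0.8):
    """Convert aggregated Hugging Face NER output to unique (node_type, name) pairs."""
    mentions, seen = [], set()
    for ent in entities:
        node_type = HF_TO_NODE.get(ent["entity_group"])
        name = ent["word"].strip()
        # Skip unmapped labels, empty names, and low-confidence predictions.
        if node_type is None or not name or ent["score"] < min_score:
            continue
        if (node_type, name) not in seen:
            seen.add((node_type, name))
            mentions.append((node_type, name))
    return mentions

# In real use the entities would come from the model:
#   from transformers import pipeline
#   entities = pipeline("ner", aggregation_strategy="simple")(body)
# A hand-written sample in the same shape keeps this sketch runnable without a model download.
sample = [
    {"entity_group": "PER", "word": "Ada Lovelace", "score": 0.99, "start": 0, "end": 12},
    {"entity_group": "ORG", "word": "Acme", "score": 0.95, "start": 23, "end": 27},
    {"entity_group": "MISC", "word": "Analytical", "score": 0.90, "start": 40, "end": 50},
    {"entity_group": "ORG", "word": "Acme", "score": 0.97, "start": 60, "end": 64},
    {"entity_group": "LOC", "word": "London", "score": 0.42, "start": 70, "end": 76},
]
print(to_mentions(sample))
```

Each resulting pair can be passed to the same add_or_update_by_key and link calls that ingest_article uses for spaCy spans.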
LLM extraction
Use a language model with a structured-output prompt. Like ML / NER, this is not a UI category — you call it from your own code (most commonly a Custom Code Index, a scheduled task, or a custom endpoint) and link the results back.
Pick this when:
- The field is genuinely open-ended — extracting "customer's intent" from a conversation, structured fields from a contract clause.
- You can afford the latency and per-call cost (typically batch / offline workflows).
- You can stomach occasional drift in output format — and you have validation in place.
Avoid when:
- You have high-volume or low-latency requirements (real-time ingestion).
- The vocabulary is small and stable — Spotters are cheaper and more precise.
- You need exact reproducibility — LLM outputs drift between model versions.
See Prompting patterns → Extraction for templates.
Example — Custom Code Index extracting structured entities with an LLM
The index below runs over SupportCase nodes. For each batch it pulls the case text, asks the configured LLM for a JSON object matching a fixed schema (organizations, people, devices, and the customer's intent), validates the JSON, and links the extracted entities into the graph as Mentions edges. It uses the code index execution scope — Graph, ChatAI, ToIndex, CurrentUser, CancellationToken are all available as top-level identifiers.
using System.Text.Json;
using System.Text.Json.Serialization;
using Mosaik.AI;

// Shape we want the LLM to return; also the JSON schema we send to the model.
public sealed class CaseExtraction
{
    [JsonPropertyName("intent")] public string Intent { get; set; }
    [JsonPropertyName("organizations")] public string[] Organizations { get; set; } = [];
    [JsonPropertyName("people")] public string[] People { get; set; } = [];
    [JsonPropertyName("devices")] public string[] Devices { get; set; } = [];
}

const string SystemPrompt =
    """
    You extract structured facts from a single customer support case.
    Return ONLY a JSON object that matches the provided schema.
    - intent: one short sentence describing what the customer is asking for.
    - organizations, people, devices: deduplicated canonical names from the case.
    Do NOT invent values that aren't present in the text. Use [] when none apply.
    """;

// Build the response_format once; the SDK derives a JSON schema from the CLR type.
var responseFormat = ChatAIProviderShared.GetTypedResponseFormat<CaseExtraction>();

foreach (var uid in ToIndex)
{
    var node = Graph.Get(uid);
    if (node is null) continue;

    var summary = node.GetString(N.SupportCase.Summary) ?? "";
    var content = node.GetString(N.SupportCase.Content) ?? "";
    if (string.IsNullOrWhiteSpace(content)) continue;

    var prompts = new List<IChatAIMessage>
    {
        new ChatAIMessage(ChatAuthorRole.System, SystemPrompt),
        new ChatAIMessage(ChatAuthorRole.User,
            $"SUBJECT: {summary}\n\nBODY:\n{ChatAI.LimitTokens(content, maxTokens: 4_000)}"),
    };

    ChatAIMessage completion;
    try
    {
        completion = await ChatAI.GetCompletionAsync(
            userUID: CurrentUser,
            prompts: prompts,
            maxTokens: 600,
            responseFormat: responseFormat,
            cancellationToken: CancellationToken);
    }
    catch (Exception ex)
    {
        Logger.LogWarning(ex, "LLM extraction failed for {Uid}", uid);
        continue;
    }

    CaseExtraction extracted;
    try
    {
        extracted = JsonSerializer.Deserialize<CaseExtraction>(completion.Text)
                    ?? new CaseExtraction();
    }
    catch (JsonException)
    {
        Logger.LogWarning("LLM returned non-JSON for {Uid}: {Text}", uid, completion.Text);
        continue;
    }

    // Persist the structured intent on the case itself.
    if (!string.IsNullOrWhiteSpace(extracted.Intent))
    {
        Graph.AddOrUpdate(node.UID, n => n.SetString(N.SupportCase.Intent, extracted.Intent));
    }

    // Link captured entities. AddOrUpdateByKey is idempotent: re-running the index
    // collapses duplicates into the same canonical nodes.
    LinkAll(uid, extracted.Organizations, N.Organization.Type, N.Organization.Name);
    LinkAll(uid, extracted.People, N.Person.Type, N.Person.Name);
    LinkAll(uid, extracted.Devices, N.Device.Type, N.Device.Name);
}

await Graph.CommitPendingAsync();

void LinkAll(UID128 caseUid, string[] names, string nodeType, string nameField)
{
    foreach (var raw in names ?? [])
    {
        var name = raw?.Trim();
        if (string.IsNullOrEmpty(name)) continue;

        var entityUid = Graph.AddOrUpdateByKey(nodeType, name, n => n.SetString(nameField, name));
        Graph.Link(caseUid, entityUid, Edges.Mentions, Edges.MentionedIn);
    }
}
The pattern follows the four production stages from Prompting patterns: retrieve the node text, pack a small prompt with a typed response_format (the SDK turns the CaseExtraction CLR type into the JSON schema sent to the provider), generate with a bounded maxTokens, and validate by parsing the JSON before touching the graph. Graph writes happen last, after the output has deserialized cleanly, so a malformed model response never reaches the graph.
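Parsing proves the reply is JSON, not that its contents are sound. A stricter validation step, sketched here in Python for brevity, could also check field types and drop names that never occur in the source text. `validate_extraction` and its rules are hypothetical and not part of the Curiosity SDK, and the literal substring check is a simple heuristic that would also reject legitimately normalized names.

```python
import json

LIST_FIELDS = ("organizations", "people", "devices")

def validate_extraction(raw: str, source_text: str):
    """Parse an LLM reply and keep only well-typed values grounded in the source.

    Returns a cleaned dict, or None when the reply is not a usable JSON object.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    haystack = source_text.lower()
    intent = data.get("intent")
    cleaned = {"intent": intent.strip() if isinstance(intent, str) else ""}
    for field in LIST_FIELDS:
        values = data.get(field)
        if not isinstance(values, list):
            values = []  # wrong type: treat the field as empty
        kept = []
        for value in values:
            if not isinstance(value, str):
                continue
            name = value.strip()
            # Keep a name only if it literally occurs in the case text.
            if name and name.lower() in haystack and name not in kept:
                kept.append(name)
        cleaned[field] = kept
    return cleaned

case = "Our Acme router keeps rebooting. Jane Doe from Acme asked for a replacement."
reply = (
    '{"intent": "Replace a faulty router", "organizations": ["Acme", "Globex"], '
    '"people": ["Jane Doe"], "devices": "router"}'
)
print(validate_extraction(reply, case))
print(validate_extraction("Sure! Here is the JSON you asked for", case))
```

In this sample the ungrounded organization "Globex" is dropped, the mistyped devices field is emptied, and the non-JSON reply is rejected outright.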
Recommended combinations
Most production systems run at least two model types in a pipeline.
| Combination | Where it shines |
|---|---|
| Spotter → Pattern Spotter | Known products + ticket/asset IDs in support text. |
| Spotter → ML / NER | Known accounts/customers + generic people/orgs in CRM notes. |
| Pattern Spotter → LLM (long-tail) | Catch IDs cheaply, send everything else to a model for soft extraction. |
| Spotter → Pattern Spotter → ML / NER → LLM | "Throw everything at it" for high-stakes corpora where coverage matters. |
The order matters: running the cheaper extractors first resolves the clear-cut cases, so the expensive ones don't reprocess them.
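One way to read "cheaper first" is as a cascade in which each stage only sees text that earlier stages did not claim. The Python sketch below masks matched spans before handing the text on. The stage functions are hypothetical stand-ins for a Spotter and a Pattern Spotter, not the product's pipeline.

```python
import re

def spotter_stage(text):
    """Stand-in for a Spotter: match a tiny fixed vocabulary, case-insensitively."""
    return [(m.start(), m.end(), "Product") for m in re.finditer(r"macbook air", text, re.I)]

def pattern_stage(text):
    """Stand-in for a Pattern Spotter: match ticket identifiers."""
    return [(m.start(), m.end(), "Ticket") for m in re.finditer(r"\bTICKET-\d+\b", text)]

def run_cascade(text, stages):
    """Run stages in order, masking each match so later stages skip it."""
    found, remaining = [], text
    for stage in stages:
        for start, end, label in stage(remaining):
            found.append((label, text[start:end]))
            # Replace the span with spaces so offsets stay aligned with the original text.
            remaining = remaining[:start] + " " * (end - start) + remaining[end:]
    return found, remaining

text = "TICKET-12345: MacBook Air will not boot after the update."
entities, leftover = run_cascade(text, [spotter_stage, pattern_stage])
print(entities)
print(leftover.split())
```

The leftover text is what a costlier NER or LLM stage would receive, already stripped of the entities the cheap stages resolved.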
Migration paths
- Start with Spotters. They give the most signal per hour of work. Build the vocabulary from the canonical entities already in your graph — point a New Spotter at the relevant node type and field.
- Add Pattern Spotters next. Find the high-volume IDs the support team copy-pastes and capture them with a regex.
- Add ML / NER when generic types matter. Run it externally (e.g. spaCy through the curiosity Python connector) on the longest fields, not headers.
- Add LLM extraction last. Wire a Custom Code Index or scheduled task to a typed structured-output call. Only reach for it when the gap left by the other three is worth the cost.
Selecting per use case
Translate the business question into a model choice:
| Business question | Best model type |
|---|---|
| "Show me every ticket about MacBook Air." | Spotter (product list). |
| "List every error code that appeared in the last week." | Pattern Spotter. |
| "Which customers and partners did this email mention?" | Spotter + ML / NER. |
| "What does the customer actually want in this 800-word email?" | LLM extraction. |
| "Build a topic taxonomy from 100k support cases." | LLM extraction (offline batch). |
Where to go next
- Entity extraction — full configuration and review workflow.
- NLP overview — how extraction fits with embeddings and the graph.
- Prompting patterns → Extraction — LLM templates.
- Search optimization — turning extracted entities into facets.