Curiosity

Types of NLP Models

A side-by-side look at the four kinds of extraction model Curiosity Workspace can run: Spotters, Pattern Spotters, ML / NER, and LLM extraction. Pick the right one — or the right combination — based on your vocabulary, volume, and accuracy needs.

The first two — Spotter and Pattern Spotter — are first-class model types you create and edit in the admin UI under Settings → Entity Capturing (tabs Spotters and Patterns). The other two — ML / NER and LLM extraction — are integrated programmatically: you call them from a data connector, a custom code index, or a custom endpoint, and write the results back to the graph yourself.

To decide whether extraction is the right tool in the first place, see NLP overview. For configuration, see Entity extraction.

Comparison at a glance

| Aspect | Spotter | Pattern Spotter | ML / NER | LLM extraction |
|---|---|---|---|---|
| UI surface | Entity Capturing → Spotters | Entity Capturing → Patterns | None (external library + code) | None (custom code index / endpoint) |
| Best for | Finite vocabularies | Structured identifiers | Generic types (PERSON, ORG, …) | Open-ended fields, intents, summaries |
| Precision | Very high | Medium–high | Medium | Medium–high (depends on prompt) |
| Recall | Limited to the list | Wide for known shapes; misses novel formats | Wide for trained types | Widest |
| Latency | Microseconds | Microseconds | Milliseconds | Hundreds of milliseconds to seconds |
| Cost / 1k docs | Negligible | Negligible | CPU/GPU compute | Provider API call (per token) |
| Maintenance | Update list as vocab grows | Update regex on new shapes | Re-train when accuracy drifts | Update prompt; track model versions |
| Determinism | Deterministic | Deterministic | Mostly deterministic | Stochastic (temperature = 0 reduces variance) |
| Multilingual | Per-language dictionary | Pattern-dependent | Per-language model | Inherent (most modern LLMs) |
| Confidence | Binary or boost-weighted | Binary or context-weighted | Probabilistic score | Self-reported, unreliable |

Spotter

Curated list of canonical terms with aliases, matched against tokenized text. Sometimes called a dictionary or gazetteer; in the UI it's just Spotter. A spotter built from a graph node type (so the vocabulary stays in sync with what's already in the graph) is offered through the New Spotter button on Settings → Entity Capturing → Spotters (and from Management → Data → node type → Capture → New Spotter).

Pick this when:

  • Your vocabulary is enumerable — product catalog, customer list, internal team names.
  • You need deterministic results for compliance.
  • You can keep the list updated as the catalog changes.

Avoid when:

  • The space is open-ended (every possible person name, every potential brand).
  • Your alias coverage is poor — you'll under-extract.

Configuration: see Entity extraction → Spotter and Spotter models.
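
To make the matching model concrete, here is a minimal Python sketch of dictionary-style spotting: canonical terms with aliases, matched against tokenized text, longest alias first. It illustrates the idea only and is not how Workspace implements Spotters; VOCAB, tokenize, and spot are hypothetical names, and the tokenizer is deliberately naive.

```python
import re

# Hypothetical vocabulary: canonical term -> aliases (illustration only).
VOCAB = {
    "MacBook Air": ["macbook air", "mba"],
    "MacBook Pro": ["macbook pro", "mbp"],
}

def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a real tokenizer would be language-aware."""
    return re.findall(r"\w+", text.lower())

# Index every alias by its token tuple so lookups are exact and deterministic.
ALIAS_INDEX = {
    tuple(tokenize(alias)): canonical
    for canonical, aliases in VOCAB.items()
    for alias in aliases
}
MAX_LEN = max(len(key) for key in ALIAS_INDEX)

def spot(text: str) -> list[str]:
    """Return canonical terms found in text, preferring the longest alias."""
    tokens = tokenize(text)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate window first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            canonical = ALIAS_INDEX.get(tuple(tokens[i:i + n]))
            if canonical is not None:
                found.append(canonical)
                i += n
                break
        else:
            i += 1  # no alias starts at this token
    return found
```

The sketch also shows why alias coverage drives recall: a mention is captured only if some alias tokenizes to exactly the tokens in the text.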

Pattern Spotter

Regex-style matching for codes and identifiers. Created from Settings → Entity Capturing → Patterns → New Pattern (or Management → Data → node type → Capture → New Pattern Spotter).

Pick this when:

  • Your entity has a consistent format — TICKET-12345, 0xDEADBEEF, SKU-AB-1234.
  • The format is unambiguous in context.

Avoid when:

  • The pattern is too generic (\d{4} matches years, postcodes, ages, asset IDs — all of them).
  • The format changes across vendors or sources without you knowing.

Always pair patterns with context constraints (context_must_include) and exclusions (context_includes → reject) to control over-firing.
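
As a rough illustration of why context constraints matter, the Python sketch below filters regex matches by the words around them. It is not the Workspace configuration format: match_with_context, its parameters, and the character window are hypothetical stand-ins for the must-include and reject rules described above.

```python
import re

def match_with_context(text, pattern, must_include=(), reject_if=(), window=40):
    """Yield regex matches whose surrounding text passes both context checks.

    must_include: at least one of these words must appear near the match.
    reject_if:    any of these words near the match discards it.
    """
    for m in re.finditer(pattern, text):
        start = max(0, m.start() - window)
        context = text[start:m.end() + window].lower()
        if must_include and not any(w in context for w in must_include):
            continue
        if any(w in context for w in reject_if):
            continue
        yield m.group(0)

text = "Asset 4821 was replaced in 2019. Call extension 4821 for details."

# The bare pattern fires on the asset ID, the year, and the phone extension.
unfiltered = re.findall(r"\b\d{4}\b", text)

# Requiring "asset" nearby keeps only the asset ID.
hits = list(match_with_context(
    text, r"\b\d{4}\b", must_include=("asset",), reject_if=("extension",), window=12,
))
```

The window size is a tuning knob: too small and legitimate context is missed, too large and unrelated words start to influence the decision.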

ML / NER

Pre-trained named-entity recognizers — spaCy, Stanza, Hugging Face transformer pipelines, or any other library you already use. There is no built-in ML / NER type in the Entity Capturing UI. Instead, run the model externally as part of your ingestion code and write the entities back through the curiosity data connector library: fetch the source data, parse it with the NER model of your choice, and link the captured entities into the graph using the same add_or_update_by_key / link calls a normal connector uses.

Pick this when:

  • You need generic types and don't have time to curate dictionaries (PERSON, ORG, LOCATION, MONEY, DATE).
  • Your text is well-formed prose (news, emails, articles).
  • You already have a Python (or C#) NER toolchain you're happy with.

Avoid when:

  • Domain ambiguity is high — "Apple" in a tech corpus is unambiguous; in a grocery corpus it isn't.
  • You need to extract specific business entities, not generic types.

Mix with Spotters: run the Spotter first for known entities and let the external NER pick up the long tail.

Example — spaCy + the curiosity Python package

The connector below fetches news articles from a source API, runs spaCy NER on each body, and writes both the article and the captured entities (Person, Organization, Location) into the graph as linked nodes:

# pip install curiosity spacy
# python -m spacy download en_core_web_md
import os
import requests
import spacy
from curiosity import Graph

NLP = spacy.load("en_core_web_md")

SPACY_TO_NODE = {
    "PERSON": "Person",
    "ORG":    "Organization",
    "GPE":    "Location",
    "LOC":    "Location",
}

NODE_SCHEMAS = [
    {"type": "Article",      "key": "Id",   "properties": ["Title", "Body"], "timestamp": "Published"},
    {"type": "Person",       "key": "Name"},
    {"type": "Organization", "key": "Name"},
    {"type": "Location",     "key": "Name"},
]

EDGE_SCHEMAS = [
    {"name": "Mentions", "reverse": "MentionedIn"},
]

def ingest_article(g: Graph, src: dict) -> None:
    g.add_or_update_by_key(
        node_type="Article",
        key=src["id"],
        content={"Title": src["title"], "Body": src["body"], "Published": src["published"]},
    )

    doc = NLP(src["body"])
    seen: set[tuple[str, str]] = set()
    for ent in doc.ents:
        node_type = SPACY_TO_NODE.get(ent.label_)
        if node_type is None:
            continue
        name = ent.text.strip()
        if not name or (node_type, name) in seen:
            continue
        seen.add((node_type, name))

        g.add_or_update_by_key(node_type=node_type, key=name, content={"Name": name})
        g.link(
            from_type="Article", from_key=src["id"],
            to_type=node_type,    to_key=name,
            edge="Mentions",      reverse="MentionedIn",
        )

with Graph.connect(
    endpoint=os.environ["CURIOSITY_ENDPOINT"],
    token=os.environ["CURIOSITY_TOKEN"],
    connector_name="news-ner-connector",
) as g:
    for schema in NODE_SCHEMAS:
        g.ensure_node_schema(**schema)
    for edge in EDGE_SCHEMAS:
        g.ensure_edge_schema(**edge)

    g.set_auto_commit_cost(every_nodes=2_000)

    resp = requests.get("https://news.example.com/api/articles?limit=500", timeout=30)
    resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error body
    for record in resp.json()["items"]:
        ingest_article(g, record)

Each spaCy span becomes a typed node and a Mentions edge from the article, so search and graph traversal see the same entities a built-in Spotter would have produced. To swap spaCy for Stanza, Flair, or a Hugging Face pipeline, replace the model call and adjust the label mapping to that library's label names; the graph side is identical.
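
One way to keep the graph side untouched when changing libraries is to normalize every library's output to (label, text) pairs first. The sketch below is an assumption-laden illustration: it assumes a Hugging Face token-classification pipeline run with aggregation_strategy="simple", which returns dicts with entity_group and word keys, and a CoNLL-style label set (PER, ORG, LOC) that varies by checkpoint. The helper names are hypothetical and the sample data is hand-written, not real model output.

```python
# Label maps per library. The Hugging Face labels assume a CoNLL-style model
# (PER / ORG / LOC) and will differ for other checkpoints.
SPACY_TO_NODE = {"PERSON": "Person", "ORG": "Organization", "GPE": "Location", "LOC": "Location"}
HF_TO_NODE = {"PER": "Person", "ORG": "Organization", "LOC": "Location"}

def from_spacy(doc):
    """Yield (label, text) pairs from a spaCy Doc."""
    for ent in doc.ents:
        yield ent.label_, ent.text

def from_hf(results):
    """Yield (label, text) pairs from a Hugging Face token-classification
    pipeline run with aggregation_strategy="simple"."""
    for item in results:
        yield item["entity_group"], item["word"]

def to_graph_entities(pairs, label_map):
    """Map library labels to node types, dropping unmapped labels and duplicates."""
    seen = set()
    out = []
    for label, text in pairs:
        node_type = label_map.get(label)
        name = text.strip()
        if node_type is None or not name or (node_type, name) in seen:
            continue
        seen.add((node_type, name))
        out.append((node_type, name))
    return out

# Hand-written sample in the Hugging Face output shape (not real model output).
sample = [
    {"entity_group": "PER", "word": "Ada Lovelace", "score": 0.99, "start": 0, "end": 12},
    {"entity_group": "MISC", "word": "Analytical Engine", "score": 0.91, "start": 24, "end": 41},
    {"entity_group": "PER", "word": "Ada Lovelace", "score": 0.98, "start": 50, "end": 62},
]
entities = to_graph_entities(from_hf(sample), HF_TO_NODE)
```

Each (node_type, name) pair can then go through the same add_or_update_by_key and link calls used in ingest_article.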

LLM extraction

Use a language model with a structured-output prompt. Like ML / NER, this is not a UI category — you call it from your own code (most commonly a Custom Code Index, a scheduled task, or a custom endpoint) and link the results back.

Pick this when:

  • The field is genuinely open-ended — extracting "customer's intent" from a conversation, structured fields from a contract clause.
  • You can afford the latency and per-call cost (typically batch / offline workflows).
  • You can stomach occasional drift in output format — and you have validation in place.

Avoid when:

  • High volume / low latency requirements (real-time ingestion).
  • The vocabulary is small and stable — Spotters are cheaper and more precise.
  • You need exact reproducibility — LLM outputs drift between model versions.

See Prompting patterns → Extraction for templates.

Example — Custom Code Index extracting structured entities with an LLM

The index below runs over SupportCase nodes. For each batch it pulls the case text, asks the configured LLM for a JSON object matching a fixed schema (organizations, people, devices, and the customer's intent), validates the JSON, and links the extracted entities into the graph as Mentions edges. It uses the code index execution scope: Graph, ChatAI, ToIndex, CurrentUser, and CancellationToken are all available as top-level identifiers.

using System.Text.Json;
using System.Text.Json.Serialization;
using Mosaik.AI;

// Shape we want the LLM to return — also the JSON schema we send to the model.
public sealed class CaseExtraction
{
    [JsonPropertyName("intent")]        public string  Intent        { get; set; }
    [JsonPropertyName("organizations")] public string[] Organizations { get; set; } = [];
    [JsonPropertyName("people")]        public string[] People        { get; set; } = [];
    [JsonPropertyName("devices")]       public string[] Devices       { get; set; } = [];
}

const string SystemPrompt =
    """
    You extract structured facts from a single customer support case.
    Return ONLY a JSON object that matches the provided schema.
    - intent: one short sentence describing what the customer is asking for.
    - organizations, people, devices: deduplicated canonical names from the case.
      Do NOT invent values that aren't present in the text. Use [] when none apply.
    """;

// Build the response_format once — the SDK derives a JSON schema from the CLR type.
var responseFormat = ChatAIProviderShared.GetTypedResponseFormat<CaseExtraction>();

foreach (var uid in ToIndex)
{
    var node = Graph.Get(uid);
    if (node is null) continue;

    var summary = node.GetString(N.SupportCase.Summary) ?? "";
    var content = node.GetString(N.SupportCase.Content) ?? "";
    if (string.IsNullOrWhiteSpace(content)) continue;

    var prompts = new List<IChatAIMessage>
    {
        new ChatAIMessage(ChatAuthorRole.System, SystemPrompt),
        new ChatAIMessage(ChatAuthorRole.User,
            $"SUBJECT: {summary}\n\nBODY:\n{ChatAI.LimitTokens(content, maxTokens: 4_000)}"),
    };

    ChatAIMessage completion;
    try
    {
        completion = await ChatAI.GetCompletionAsync(
            userUID:        CurrentUser,
            prompts:        prompts,
            maxTokens:      600,
            responseFormat: responseFormat,
            cancellationToken: CancellationToken);
    }
    catch (Exception ex)
    {
        Logger.LogWarning(ex, "LLM extraction failed for {Uid}", uid);
        continue;
    }

    CaseExtraction extracted;
    try
    {
        extracted = JsonSerializer.Deserialize<CaseExtraction>(completion.Text)
                    ?? new CaseExtraction();
    }
    catch (JsonException)
    {
        Logger.LogWarning("LLM returned non-JSON for {Uid}: {Text}", uid, completion.Text);
        continue;
    }

    // Persist the structured intent on the case itself.
    if (!string.IsNullOrWhiteSpace(extracted.Intent))
    {
        Graph.AddOrUpdate(node.UID, n => n.SetString(N.SupportCase.Intent, extracted.Intent));
    }

    // Link captured entities. AddOrUpdateByKey is idempotent — re-running the index
    // collapses duplicates into the same canonical nodes.
    LinkAll(uid, extracted.Organizations, N.Organization.Type, N.Organization.Name);
    LinkAll(uid, extracted.People,        N.Person.Type,       N.Person.Name);
    LinkAll(uid, extracted.Devices,       N.Device.Type,       N.Device.Name);
}

await Graph.CommitPendingAsync();

void LinkAll(UID128 caseUid, string[] names, string nodeType, string nameField)
{
    foreach (var raw in names ?? [])
    {
        var name = raw?.Trim();
        if (string.IsNullOrEmpty(name)) continue;

        var entityUid = Graph.AddOrUpdateByKey(nodeType, name, n => n.SetString(nameField, name));
        Graph.Link(caseUid, entityUid, Edges.Mentions, Edges.MentionedIn);
    }
}

The pattern has the four production-shape stages from Prompting patterns: retrieve the node text, pack a small prompt with a typed response_format (the SDK turns the CaseExtraction CLR type into the JSON schema sent to the provider), generate with a bounded maxTokens, and validate by parsing the JSON before touching the graph. Linking happens last, after the output has been deserialized cleanly, so a malformed model response never reaches the graph. Parsing alone does not catch well-formed output that contains invented values; that needs a separate check against the source text.
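
The validate stage can go further than parsing. The Python sketch below shows one possible stricter check, assuming the same JSON shape as CaseExtraction: it rejects replies with the wrong structure and drops any name that does not occur in the source text, which enforces the prompt's rule against invented values. validate_extraction is a hypothetical helper, and the verbatim substring test is intentionally naive (it would also drop a legitimately canonicalized name).

```python
import json
from typing import Optional

LIST_FIELDS = ("organizations", "people", "devices")

def validate_extraction(raw: str, source_text: str) -> Optional[dict]:
    """Parse the model's reply; return None unless it has the expected shape.

    Names that do not occur verbatim (case-insensitively) in source_text are
    dropped, so invented entities never reach the graph.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("intent"), str):
        return None

    haystack = source_text.lower()
    clean = {"intent": data["intent"].strip()}
    for field in LIST_FIELDS:
        values = data.get(field, [])
        if not isinstance(values, list):
            return None
        names = []
        for value in values:
            if not isinstance(value, str):
                continue
            name = value.strip()
            # Keep only grounded, non-empty, not-yet-seen names.
            if name and name.lower() in haystack and name not in names:
                names.append(name)
        clean[field] = names
    return clean
```

A caller would link entities only when the function returns a dict, and log the raw reply otherwise.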

Combining model types

Most production systems run at least two model types in a pipeline.

| Combination | Where it shines |
|---|---|
| Spotter → Pattern Spotter | Known products + ticket/asset IDs in support text. |
| Spotter → ML / NER | Known accounts/customers + generic people/orgs in CRM notes. |
| Pattern Spotter → LLM (long-tail) | Catch IDs cheaply, send everything else to a model for soft extraction. |
| Spotter → Pattern Spotter → ML / NER → LLM | "Throw everything at it" for high-stakes corpora where coverage matters. |

The order matters: cheaper extractors run first and claim the unambiguous cases, so the expensive ones don't reprocess them.
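
A minimal Python sketch of that ordering, under stated assumptions: each extractor is a function from text to (start, end, label) spans, and spans claimed by an earlier stage are blanked out before the next stage runs. The three extractors are toy stand-ins written for this example, not Workspace components.

```python
import re

def spotter(text):
    """Stand-in for a Spotter: exact vocabulary lookup (hypothetical list)."""
    return [(m.start(), m.end(), "Product") for m in re.finditer(r"MacBook Air", text)]

def pattern_spotter(text):
    """Stand-in for a Pattern Spotter: ticket IDs like TICKET-12345."""
    return [(m.start(), m.end(), "Ticket") for m in re.finditer(r"\bTICKET-\d+\b", text)]

def expensive_extractor(text):
    """Stand-in for ML / NER or an LLM: here, any remaining capitalized word."""
    return [(m.start(), m.end(), "Unknown") for m in re.finditer(r"\b[A-Z][A-Za-z]+\b", text)]

def run_pipeline(text, extractors):
    """Run extractors in order, blanking claimed spans so later stages skip them."""
    remaining = text
    spans = []
    for extract in extractors:
        for start, end, label in extract(remaining):
            spans.append((start, end, label, text[start:end]))
            # Overwrite with spaces so character offsets stay aligned.
            remaining = remaining[:start] + " " * (end - start) + remaining[end:]
    return sorted(spans)

result = run_pipeline(
    "Dana opened TICKET-12345 about her MacBook Air",
    [spotter, pattern_spotter, expensive_extractor],
)
```

Without the blanking step the last stage would also fire on TICKET, MacBook, and Air, which is exactly the duplicated work the ordering is meant to avoid.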

Migration paths

  • Start with Spotters. They give the most signal per hour of work. Build the vocabulary from the canonical entities already in your graph — point a New Spotter at the relevant node type and field.
  • Add Pattern Spotters next. Find the high-volume IDs the support team copy-pastes and capture them with a regex.
  • Add ML / NER when generic types matter. Run it externally (e.g. spaCy through the curiosity Python connector) on the longest fields, not headers.
  • Add LLM extraction last. Wire a Custom Code Index or scheduled task to a typed structured-output call. Only reach for it when the gap left by the other three is worth the cost.

Selecting per use case

Translate the business question into a model choice:

| Business question | Best model type |
|---|---|
| "Show me every ticket about MacBook Air." | Spotter (product list). |
| "List every error code that appeared in the last week." | Pattern Spotter. |
| "Which customers and partners did this email mention?" | Spotter + ML / NER. |
| "What does the customer actually want in this 800-word email?" | LLM extraction. |
| "Build a topic taxonomy from 100k support cases." | LLM extraction (offline batch). |

