Curiosity - Entity Extraction

Entity Extraction

Entity extraction finds meaningful spans in text and turns them into structured outputs you can search, facet, and link to the graph. Examples: product names, device IDs, customer names, ticket numbers, locations.

This page covers configuration, model types, confidence thresholds, and the review loop. For where extraction sits in the bigger picture, see NLP overview. For the dictionary/pattern/ML/LLM comparison, see Types of models.

Extraction vs linking

Extraction finds the span in text. Output: { field, start, end, text, type, confidence }.
Linking maps that span to an existing graph node (or creates a new one). Output: an edge from the document to the entity node.

You can extract without linking (just annotations on text). You can't link without extracting.
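To make the two output shapes concrete, here is a minimal Python sketch. The `Mention` fields mirror the extraction output listed above; the `Edge` shape, the half-open offset convention, and the toy `known_nodes` lookup are assumptions made for illustration, not the platform's actual types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """One extracted span, mirroring the extraction output shape."""
    field: str
    start: int          # assumed: inclusive character offset
    end: int            # assumed: exclusive character offset
    text: str
    type: str
    confidence: float


@dataclass(frozen=True)
class Edge:
    """Hypothetical shape for a document-to-entity link."""
    document_id: str
    node_id: str
    edge_type: str = "Mentions"


def link(document_id: str, mention: Mention, known_nodes: dict[str, str]) -> Optional[Edge]:
    """Linking step: resolve the span text against a toy node index."""
    node_id = known_nodes.get(mention.text.lower())
    return Edge(document_id, node_id) if node_id else None


text = "Screen flicker on the MacBook Air"
start = text.index("MacBook Air")
mention = Mention("Description", start, start + len("MacBook Air"),
                  "MacBook Air", "products", 0.9)

# Extraction alone yields the mention; linking needs a mention as input.
edge = link("case-1", mention, {"macbook air": "Device:MBA-2024"})
```

The mention exists whether or not `link` finds a node, which is why extraction without linking is possible and the reverse is not.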

A minimal pipeline configuration

Extraction is configured per (nodeType, field) pair. In YAML the shape is:

pipeline:
  language: auto                  # or "en", "fr", …
  spotters:
    - kind: dictionary
      name: products
      entries:
        - value: MacBook Air
          aliases: [MBA, "MacBook Air 2024", "Mac Book Air"]
          link_to: Device:MBA-2024
        - value: MacBook Pro
          aliases: [MBP]
          link_to: Device:MBP-2024
      min_confidence: 0.8
      case_sensitive: false
    - kind: pattern
      name: ticket-id
      regex: "TICKET-\\d{4,6}"
      link_to_type: SupportCase
      link_strategy: by-key
  exclusions:
    - context_includes: ["e.g.", "for example"]
      reason: "training-mention false positive"

You typically edit this in the admin UI; the YAML above shows what's stored under the hood.

Model types

The first two — Spotter and Pattern Spotter — are configured directly in Settings → Entity Capturing in the admin UI. The other two — ML / NER and LLM extraction — are run programmatically from a data connector, a Custom Code Index, or a custom endpoint, writing entities back as graph nodes and Mentions edges. See Types of models for the full comparison and runnable examples.

Spotter

Curated list of canonical terms with aliases. Use for finite vocabularies that change rarely — product catalogs, customer names, country lists.

- kind: dictionary
  name: countries
  entries:
    - value: "United States"
      aliases: [USA, "U.S.", "United States of America"]
  case_sensitive: false

Strengths: high precision, fast, deterministic. Weakness: misses anything not in the list.
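As a rough illustration of how a dictionary spotter behaves, the Python sketch below maps every alias back to its canonical value and matches whole words only. The function name and the tuple layout are hypothetical, and the real spotter may tokenize and match differently.

```python
import re


def build_dictionary_spotter(entries, case_sensitive=False):
    """Return a function that finds canonical values and their aliases in text."""
    surface_to_canonical = {}
    for entry in entries:
        for surface in [entry["value"], *entry.get("aliases", [])]:
            key = surface if case_sensitive else surface.lower()
            surface_to_canonical[key] = entry["value"]

    # Longest surfaces first, so a long alias wins over a shorter prefix of it.
    alternatives = sorted(surface_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(s) for s in alternatives) + r")(?!\w)",
        0 if case_sensitive else re.IGNORECASE,
    )

    def spot(text):
        results = []
        for m in pattern.finditer(text):
            key = m.group(0) if case_sensitive else m.group(0).lower()
            # (start, end, matched surface, canonical value)
            results.append((m.start(), m.end(), m.group(0), surface_to_canonical[key]))
        return results

    return spot


spot = build_dictionary_spotter([
    {"value": "United States",
     "aliases": ["USA", "U.S.", "United States of America"]},
])
matches = spot("Shipped from the usa to the United States of America.")
```

Both matches resolve to the canonical value "United States", while a string such as "Busan" produces nothing because the alias only matches as a whole word. Anything absent from the list is never found, which is the weakness noted above.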

Pattern Spotter

Regex-style matcher. Use for structured identifiers, codes, and formats.

- kind: pattern
  name: ticket-id
  regex: "TICKET-\\d{4,6}"
  context_must_include: ["ticket", "case"]      # optional disambiguation

Strengths: catches IDs the dictionary can't enumerate. Weakness: over-fires on unrelated number strings unless you constrain by context.
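The effect of a context constraint can be sketched in Python as follows. The sketch assumes that `context_must_include` means "at least one of these terms appears near the match" and uses a hypothetical 40-character window on each side; neither detail is confirmed by this page.

```python
import re

TICKET = re.compile(r"TICKET-\d{4,6}")


def spot_with_context(text, pattern, context_must_include, window=40):
    """Keep a regex match only if a required term appears in the surrounding text."""
    required = [term.lower() for term in context_must_include]
    hits = []
    for m in pattern.finditer(text):
        # Look around the match, not inside it: "TICKET-12345" itself contains "ticket".
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        context = (before + " " + after).lower()
        if any(term in context for term in required):
            hits.append(m.group(0))
    return hits


kept = spot_with_context(
    "Customer reopened case TICKET-12345 yesterday.", TICKET, ["ticket", "case"])
dropped = spot_with_context(
    "The label TICKET-0001 is printed on the box.", TICKET, ["ticket", "case"])
```

The first sentence keeps its match because "case" appears next to it; the second has no supporting term nearby, so the match is discarded.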

ML / NER

A pre-trained model recognizes generic types: PERSON, ORG, LOCATION, DATE, MONEY. Use when the vocabulary is open-ended and a generic type is what you want.

- kind: ml-ner
  model: spacy-en-core-web-md
  types: [PERSON, ORG, DATE]
  min_confidence: 0.7

Strengths: covers unknown entities. Weakness: domain-specific accuracy is uneven; "Apple" in a fruit catalog isn't a company.

LLM extraction

A language model with a structured-output prompt. Use when rules and dictionaries can't reach the long tail — extracting intents, structured fields from quotes, summary-style outputs.

See Prompting patterns → Extraction.

Custom entity examples

Custom dictionary with linking

- kind: dictionary
  name: engineering-teams
  entries:
    - value: Platform Team
      link_to: Team:platform
    - value: Mobile Team
      link_to: Team:mobile
      aliases: [iOS Team, Android Team]
  link_strategy: by-uid

Each extracted mention creates a Mentions edge from the document to the team node — search results can now facet by team.

Pattern + exclusion

- kind: pattern
  name: error-code
  regex: "0x[0-9A-Fa-f]{8}"
  exclusions:
    - context_includes: ["example", "e.g."]

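Catches eight-digit hexadecimal codes in the Windows error style (for instance 0x80070005); skips matches whose surrounding text contains "example" or "e.g.", which are usually documentation mentions.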
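A small Python sketch of the exclusion behaviour, using the same regex and exclusion terms as the configuration above. The 30-character context window and the function name are assumptions; the platform's actual context definition may differ.

```python
import re

ERROR_CODE = re.compile(r"0x[0-9A-Fa-f]{8}")
EXCLUDE_TERMS = ("example", "e.g.")


def spot_error_codes(text, window=30):
    """Return error codes whose nearby text contains none of the exclusion terms."""
    kept = []
    for m in ERROR_CODE.finditer(text):
        context = text[max(0, m.start() - window):m.end() + window].lower()
        if not any(term in context for term in EXCLUDE_TERMS):
            kept.append(m.group(0))
    return kept


real = spot_error_codes("Installer failed with 0x80070005 on reboot.")
doc_mention = spot_error_codes("For example, error 0x12345678 means nothing.")
```

The first call keeps its code; the second returns an empty list because "example" sits within the window.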

Hybrid pipeline

spotters:
  - kind: dictionary
    name: products
    entries: [ … ]
  - kind: pattern
    name: ids
    regex: "ASSET-\\d+"
  - kind: ml-ner
    types: [PERSON, ORG]
    min_confidence: 0.85

Dictionary first (cheapest, most precise), pattern for IDs, ML for everything else.
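One way to combine the outputs of several spotters is to let earlier, more precise spotters claim a span before later ones see it. The Python sketch below shows that idea under stated assumptions: spans are `(start, end, label)` tuples with half-open offsets, and overlap is resolved strictly by list order. How the platform actually resolves overlapping spans is not described on this page.

```python
def merge_by_priority(span_lists):
    """Merge spans from several spotters, highest-priority list first.

    A span from a later list is dropped if it overlaps a span already accepted.
    """
    accepted = []
    for spans in span_lists:
        for start, end, label in spans:
            overlaps = any(start < a_end and a_start < end
                           for a_start, a_end, _ in accepted)
            if not overlaps:
                accepted.append((start, end, label))
    return sorted(accepted)


# Offsets refer to: "Jordan Lee reported ASSET-42 on the MacBook Air"
dictionary_spans = [(36, 47, "products")]              # "MacBook Air"
pattern_spans = [(20, 28, "ids")]                      # "ASSET-42"
ml_spans = [(0, 10, "PERSON"), (36, 43, "ORG")]        # "Jordan Lee", "MacBook"

merged = merge_by_priority([dictionary_spans, pattern_spans, ml_spans])
```

Here the ML guess that "MacBook" is an ORG is discarded because the dictionary already claimed that region, while the PERSON span survives because nothing else covers it.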

Confidence thresholds

Every extractor emits a confidence score. The right threshold depends on what you do with the result.

Downstream use	Suggested floor
Display the extracted entity in the UI	`≥ 0.6`
Use as a facet	`≥ 0.7`
Auto-link to the graph	`≥ 0.85`
Auto-merge / canonicalize	`≥ 0.95`

Don't auto-link below 0.85 — wrong edges are expensive to clean up later.
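The suggested floors can be applied as a simple gate. In the Python sketch below the numeric floors come from the table above, while the action names and the function are illustrative only.

```python
# Floors from the table above; action names are hypothetical labels.
FLOORS = [
    ("display", 0.60),
    ("facet", 0.70),
    ("auto_link", 0.85),
    ("auto_merge", 0.95),
]


def allowed_actions(confidence):
    """Return every downstream use whose floor the confidence score meets."""
    return [action for action, floor in FLOORS if confidence >= floor]


high = allowed_actions(0.90)   # display, facet, and auto_link, but not auto_merge
low = allowed_actions(0.50)    # nothing: below every floor
```

Keeping the floors in one table makes it easy to tighten a single use, for example auto-linking, without touching the others.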

Review workflow

1. Sample 100–200 documents. Stratify by source, length, language.
2. Run extraction in shadow mode. Don't write to the graph yet.
3. Hand-label. Mark each extracted span as TP (correct) or FP (wrong), and record each entity the extractor missed as FN.
4. Compute precision / recall.
   - precision = TP / (TP + FP)
   - recall = TP / (TP + FN)
5. Iterate. High FP → add exclusions, raise threshold. Low recall → expand dictionary, add aliases.
6. Promote to production when precision is acceptable. "Acceptable" depends on use:
   - Search/facet use: precision ≥ 0.8.
   - Auto-link: precision ≥ 0.95.
7. Schedule re-review. Quarterly, or after every dictionary update.
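The precision and recall formulas from the review steps can be computed directly from the labelled sample. This Python sketch assumes spans are compared by exact match on `(doc_id, start, end, type)`; partial-overlap scoring is a different policy and would need its own rule.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall from sets of (doc_id, start, end, type) tuples."""
    tp = len(predicted & gold)     # extracted and correct
    fp = len(predicted - gold)     # extracted but wrong
    fn = len(gold - predicted)     # present in the labels but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = {
    ("d1", 0, 10, "PERSON"), ("d1", 20, 28, "ids"), ("d1", 36, 47, "products"),
    ("d2", 5, 16, "products"), ("d2", 30, 42, "ids"),
}
predicted = {
    ("d1", 0, 10, "PERSON"), ("d1", 20, 28, "ids"), ("d1", 36, 47, "products"),
    ("d2", 50, 54, "ids"),         # a false positive
}

precision, recall = precision_recall(predicted, gold)
```

With three correct spans out of four predicted and five labelled, precision is 0.75 and recall is 0.6, which would fall short of the 0.8 floor suggested for search and facet use.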

Common pitfalls

Aliases as an afterthought. "MBA," "Mac Book Air," "MacBook Air 2024" — write them all in the dictionary, not just the canonical form.
Patterns without context. "\d{4}" matches any four-digit run: years, postcodes, fragments of longer IDs. Constrain with surrounding tokens.
No exclusions. "For example, error 0x12345678" is a documentation mention, not a real error. Exclude.
Linking before extraction is trustworthy. Wrong edges in the graph are worse than no edges.
Single-model bias. Hybrid pipelines (dictionary + pattern + ML) usually outperform any single model, because each spotter covers cases the others miss.
No re-extraction after schema change. Adding a new spotter? Re-process the existing corpus; otherwise the new rules apply only to new documents.

Parsing a string programmatically

From a custom endpoint, scheduled task, or Custom Code Index you can run the exact same flow a field goes through during ingestion — pipeline parsing followed by entity linking — against an arbitrary string, without storing a node. This is the Graph.Parsing helper. It is the read-side counterpart to Graph.Embeddings: it resolves the pipeline registered for a (nodeType, field) pair, parses the text with it, and then runs the document-to-graph index registered for that same pair (if any) so extracted entities resolve to real graph nodes.

Use it to preview how a piece of text would be parsed and linked — for example to power a "test this rule" UI, to classify inbound text before deciding where to store it, or to extract links from text that never becomes a node.

`Graph.Parsing.ParseAsync`

Parse a string as if it were the value of a field, and return the linked Catalyst document:

var doc = await Graph.Parsing.ParseAsync(
    text:     "Screen flicker on the MacBook Air after the latest update",
    nodeType: N.SupportCase.Type,
    field:    N.SupportCase.Description,
    cancellationToken);

// ParseAsync returns null when no pipeline is registered for the (nodeType, field) pair.
if (doc != null)
{
    // SelectMany needs `using System.Linq;` in scope.
    foreach (var entity in doc.SelectMany(span => span.GetEntities()))
    {
        // entity.EntityType.Type     → the spotter/pattern type that matched
        // entity.Value               → the matched span of text
        // entity.EntityType.TargetUID → the linked graph node, if entity linking resolved one
    }
}

Parameter	Type	Notes
`text`	`string`	The text to parse.
`nodeType`	`string`	The node type whose pipeline and entity-linking config to apply.
`field`	`string`	The field within that node type.
`cancellationToken`	`CancellationToken`	Optional. Cancels parsing and linking.

Returns the parsed Document, or null if no pipeline is registered for the (nodeType, field) pair. If a pipeline exists but no document-to-graph index is registered, the document is still parsed (spotters, patterns, tokenization run) but entity mentions are not resolved to nodes — TargetUID stays unset for non-linked spotters.

`Graph.Parsing.GetLinksAsync`

When you only need the edges entity linking would create — not the full annotated document — call GetLinksAsync:

var edges = await Graph.Parsing.GetLinksAsync(
    text:     "Screen flicker on the MacBook Air after the latest update",
    nodeType: N.SupportCase.Type,
    field:    N.SupportCase.Description,
    cancellationToken);

foreach (var edge in edges)
{
    // edge.NodeUID      → the linked node
    // edge.NodeTypeUID  → its type
    // edge.EdgeTypeUID  → the edge type that would connect the document to it
}

Returns the list of Edges produced by the document-to-graph index for that (nodeType, field) pair. The list is empty when no pipeline or no document-to-graph index is registered, so a missing configuration is not an error — it just means there is nothing to link.

These edges are computed in memory and not written to the graph; persisting them is the caller's decision.

Where to go next

Types of models — dictionary / pattern / ML / LLM head-to-head.
Prompting patterns → Extraction — LLM-driven extraction.
Embeddings — when semantic similarity covers what extraction can't.
Reindexing and re-embedding — re-processing after rule changes.