Entity Extraction
Entity extraction finds meaningful spans in text and turns them into structured outputs you can search, facet, and link to the graph. Examples: product names, device IDs, customer names, ticket numbers, locations.
This page covers configuration, model types, confidence thresholds, and the review loop. For where extraction sits in the bigger picture, see NLP overview. For the dictionary/pattern/ML/LLM comparison, see Types of models.
Extraction vs linking
- Extraction finds the span in text. Output: { field, start, end, text, type, confidence }.
- Linking maps that span to an existing graph node (or creates a new one). Output: an edge from the document to the entity node.
You can extract without linking (just annotations on text). You can't link without extracting.
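As an illustration of the two outputs, here is a minimal Python sketch. The field names on `Span` follow the output shape listed above; everything else (the `MentionEdge` class, the `link` helper, the exclusive `end` offset, and the lookup table keyed by lowercased text) is a hypothetical stand-in, not the platform's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Span:
    """Extraction output: where the mention is and what it is."""
    field: str
    start: int          # character offset of the first character (assumed inclusive)
    end: int            # character offset just past the last character (assumed exclusive)
    text: str
    type: str
    confidence: float


@dataclass(frozen=True)
class MentionEdge:
    """Linking output: an edge from the document to an entity node (hypothetical shape)."""
    document_uid: str
    entity_uid: str
    edge_type: str = "Mentions"


def link(span: Span, document_uid: str, known_nodes: Dict[str, str]) -> Optional[MentionEdge]:
    """Return an edge when the span's text maps to a known node, else None."""
    entity_uid = known_nodes.get(span.text.lower())
    if entity_uid is None:
        return None  # extraction without linking: the span stays a plain annotation
    return MentionEdge(document_uid, entity_uid)


description = "Screen flicker on the MacBook Air"
start = description.index("MacBook Air")
span = Span("Description", start, start + len("MacBook Air"), "MacBook Air", "products", 0.92)
edge = link(span, "SupportCase:123", {"macbook air": "Device:MBA-2024"})
unlinked = link(span, "SupportCase:123", {})
```

The `link` function takes a `Span` as input, which mirrors the rule above: a span can exist without an edge, but an edge cannot exist without a span.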
A minimal pipeline configuration
A pipeline is configured per (nodeType, field) pair. In YAML the shape is:
pipeline:
  language: auto # or "en", "fr", …
  spotters:
    - kind: dictionary
      name: products
      entries:
        - value: MacBook Air
          aliases: [MBA, "MacBook Air 2024", "Mac Book Air"]
          link_to: Device:MBA-2024
        - value: MacBook Pro
          aliases: [MBP]
          link_to: Device:MBP-2024
      min_confidence: 0.8
      case_sensitive: false
    - kind: pattern
      name: ticket-id
      regex: "TICKET-\\d{4,6}"
      link_to_type: SupportCase
      link_strategy: by-key
      exclusions:
        - context_includes: ["e.g.", "for example"]
          reason: "training-mention false positive"
You typically edit this in the admin UI; the YAML above shows what's stored under the hood.
Model types
The first two — Spotter and Pattern Spotter — are configured directly in Settings → Entity Capturing in the admin UI. The other two — ML / NER and LLM extraction — are run programmatically from a data connector, a Custom Code Index, or a custom endpoint, writing entities back as graph nodes and Mentions edges. See Types of models for the full comparison and runnable examples.
Spotter
Curated list of canonical terms with aliases. Use for finite vocabularies that change rarely — product catalogs, customer names, country lists.
- kind: dictionary
  name: countries
  entries:
    - value: "United States"
      aliases: [USA, "U.S.", "United States of America"]
  case_sensitive: false
Strengths: high precision, fast, deterministic. Weakness: misses anything not in the list.
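To make the alias behaviour concrete, the following Python sketch shows one way a dictionary matcher could work. It is an assumption about the general technique, not the platform's implementation: `build_dictionary_matcher` is a hypothetical helper that maps every surface form (canonical value or alias) back to the canonical value and prefers the longest match.

```python
import re
from typing import Callable, Dict, List, Tuple


def build_dictionary_matcher(
    entries: Dict[str, List[str]], case_sensitive: bool = False
) -> Callable[[str], List[Tuple[int, int, str]]]:
    """Build a matcher that returns (start, end, canonical_value) for each mention."""
    surface_to_canonical: Dict[str, str] = {}
    for value, aliases in entries.items():
        for surface in [value, *aliases]:
            key = surface if case_sensitive else surface.lower()
            surface_to_canonical[key] = value

    # Longest surface form first, so "United States of America" wins over "United States".
    alternatives = sorted(surface_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(s) for s in alternatives) + r")(?!\w)",
        0 if case_sensitive else re.IGNORECASE,
    )

    def match(text: str) -> List[Tuple[int, int, str]]:
        results = []
        for m in pattern.finditer(text):
            key = m.group(0) if case_sensitive else m.group(0).lower()
            results.append((m.start(), m.end(), surface_to_canonical[key]))
        return results

    return match


match = build_dictionary_matcher(
    {"United States": ["USA", "U.S.", "United States of America"]}
)
text = "Shipped from the USA; the U.S. office confirmed."
mentions = match(text)
```

Both aliases in the sample sentence resolve to the canonical "United States", while a country that is not in the list produces no match, which is the weakness noted above.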
Pattern Spotter
Regex-style matcher. Use for structured identifiers, codes, and formats.
- kind: pattern
  name: ticket-id
  regex: "TICKET-\\d{4,6}"
  context_must_include: ["ticket", "case"] # optional disambiguation
Strengths: catches IDs the dictionary can't enumerate. Weakness: over-fires on unrelated number strings unless you constrain by context.
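The effect of a context constraint can be sketched in Python as follows. This is an assumption about how such a filter might behave, not a description of the platform's matcher: `find_with_context` is a hypothetical helper, the 40-character window is an arbitrary choice, and the matched text itself is left out of the context so that "TICKET-…" does not satisfy its own "ticket" requirement.

```python
import re
from typing import List

TICKET_ID = re.compile(r"TICKET-\d{4,6}")


def find_with_context(
    text: str,
    pattern: re.Pattern,
    context_must_include: List[str],
    window: int = 40,
) -> List[str]:
    """Keep a match only if a required term appears near it (simple substring check)."""
    kept = []
    for m in pattern.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        context = (before + " " + after).lower()
        if any(term.lower() in context for term in context_must_include):
            kept.append(m.group(0))
    return kept


with_context = find_with_context(
    "Customer reopened case TICKET-48213 yesterday.", TICKET_ID, ["ticket", "case"]
)
without_context = find_with_context(
    "Build label TICKET-0001 is a placeholder string.", TICKET_ID, ["ticket", "case"]
)
```

The first sentence keeps its match because "case" appears nearby; the second is dropped because neither required term surrounds the identifier.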
ML / NER
Pre-trained model recognizes generic types: PERSON, ORG, LOCATION, DATE, MONEY. Use when the vocabulary is open-ended and a generic type is what you want.
- kind: ml-ner
  model: spacy-en-core-web-md
  types: [PERSON, ORG, DATE]
  min_confidence: 0.7
Strengths: covers unknown entities. Weakness: domain-specific accuracy is uneven; "Apple" in a fruit catalog isn't a company.
LLM extraction
A language model with a structured-output prompt. Use when rules and dictionaries can't reach the long tail — extracting intents, structured fields from quotes, summary-style outputs.
See Prompting patterns → Extraction.
Custom entity examples
Custom dictionary with linking
- kind: dictionary
  name: engineering-teams
  entries:
    - value: Platform Team
      link_to: Team:platform
    - value: Mobile Team
      link_to: Team:mobile
      aliases: [iOS Team, Android Team]
  link_strategy: by-uid
Each extracted mention creates a Mentions edge from the document to the team node — search results can now facet by team.
Pattern + exclusion
- kind: pattern
  name: error-code
  regex: "0x[0-9A-Fa-f]{8}"
  exclusions:
    - context_includes: ["example", "e.g."]
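Catches 8-digit hexadecimal codes such as Windows-style error codes (for example 0x80070005) and ignores mentions whose surrounding text marks them as documentation examples. Note that the regex matches any 8-digit hex literal, so add context constraints if other hex values appear in your text.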
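An exclusion rule is the inverse of a context requirement: a match is dropped when a listed term appears near it. The Python sketch below illustrates that idea under stated assumptions. `apply_exclusions` is a hypothetical helper, and the 30-character window and substring comparison are illustrative choices rather than the platform's behaviour.

```python
import re
from typing import List, Tuple

ERROR_CODE = re.compile(r"0x[0-9A-Fa-f]{8}")


def apply_exclusions(
    text: str,
    pattern: re.Pattern,
    context_includes: List[str],
    window: int = 30,
) -> Tuple[List[str], List[str]]:
    """Split matches into (kept, excluded) based on terms found near each match."""
    kept: List[str] = []
    excluded: List[str] = []
    for m in pattern.finditer(text):
        # A wide window can pull in terms from a neighbouring sentence, so keep it tight.
        context = text[max(0, m.start() - window):m.end() + window].lower()
        if any(term.lower() in context for term in context_includes):
            excluded.append(m.group(0))
        else:
            kept.append(m.group(0))
    return kept, excluded


terms = ["example", "e.g."]
real_kept, real_excluded = apply_exclusions(
    "Installer failed with 0x80070005 during update.", ERROR_CODE, terms
)
doc_kept, doc_excluded = apply_exclusions(
    "For example, error 0x12345678", ERROR_CODE, terms
)
```

Returning the excluded matches alongside the kept ones makes it easier to audit whether an exclusion is hiding real errors.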
Hybrid pipeline
spotters:
  - kind: dictionary
    name: products
    entries: [ … ]
  - kind: pattern
    name: ids
    regex: "ASSET-\\d+"
  - kind: ml-ner
    types: [PERSON, ORG]
    min_confidence: 0.85
Dictionary first (cheapest, most precise), pattern for IDs, ML for everything else.
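One way to picture this ordering is as overlap resolution by source priority. The Python sketch below is an assumption made for illustration: the page does not say how the platform resolves overlapping spans, and `Candidate`, `PRIORITY`, `resolve`, and `span_of` are hypothetical names. The candidate spans are written by hand rather than produced by real spotters.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    start: int
    end: int
    type: str
    source: str  # "dictionary", "pattern", or "ml-ner"


# Lower number wins when two candidates overlap (assumed precedence).
PRIORITY = {"dictionary": 0, "pattern": 1, "ml-ner": 2}


def overlaps(a: Candidate, b: Candidate) -> bool:
    return a.start < b.end and b.start < a.end


def resolve(candidates: List[Candidate]) -> List[Candidate]:
    """Accept candidates in priority order, skipping any that overlap an accepted span."""
    accepted: List[Candidate] = []
    for cand in sorted(candidates, key=lambda c: (PRIORITY[c.source], c.start)):
        if not any(overlaps(cand, kept) for kept in accepted):
            accepted.append(cand)
    return sorted(accepted, key=lambda c: c.start)


def span_of(text: str, surface: str, type_: str, source: str) -> Candidate:
    start = text.index(surface)
    return Candidate(start, start + len(surface), type_, source)


text = "Apple MacBook Air ASSET-42 reported by Jane Doe"
candidates = [
    span_of(text, "Apple MacBook", "ORG", "ml-ner"),        # overlaps the dictionary hit
    span_of(text, "Jane Doe", "PERSON", "ml-ner"),
    span_of(text, "ASSET-42", "ids", "pattern"),
    span_of(text, "MacBook Air", "products", "dictionary"),
]
resolved = resolve(candidates)
```

In this hand-made example the ML candidate "Apple MacBook" is discarded because it overlaps the dictionary match, while the non-overlapping ML candidate "Jane Doe" survives.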
Confidence thresholds
Every extractor emits a confidence score. The right threshold depends on what you do with the result.
| Downstream use | Suggested floor |
|---|---|
| Display the extracted entity in the UI | ≥ 0.6 |
| Use as a facet | ≥ 0.7 |
| Auto-link to the graph | ≥ 0.85 |
| Auto-merge / canonicalize | ≥ 0.95 |
Don't auto-link below 0.85 — wrong edges are expensive to clean up later.
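The table can be read as a set of gates that open cumulatively as confidence rises. The short Python sketch below encodes the suggested floors from the table; the `allowed_uses` function and the use names are hypothetical labels chosen for this illustration.

```python
from typing import List

# Suggested floors from the table above, lowest first.
FLOORS = [
    ("display", 0.6),
    ("facet", 0.7),
    ("auto_link", 0.85),
    ("auto_merge", 0.95),
]


def allowed_uses(confidence: float) -> List[str]:
    """Return every downstream use whose floor the confidence meets."""
    return [use for use, floor in FLOORS if confidence >= floor]


facet_only = allowed_uses(0.8)
linkable = allowed_uses(0.85)
too_low = allowed_uses(0.5)
```

A span at 0.8 can be displayed and used as a facet but is held back from auto-linking, which is the behaviour the 0.85 rule asks for.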
Review workflow
- Sample 100–200 documents. Stratify by source, length, language.
- Run extraction in shadow mode. Don't write to the graph yet.
- Hand-label. Mark each extracted span as TP (correct) or FP (wrong), and record each span the pipeline missed as FN.
- Compute precision / recall.
  - precision = TP / (TP + FP)
  - recall = TP / (TP + FN)
- Iterate. High FP → add exclusions, raise threshold. Low recall → expand dictionary, add aliases.
- Promote to production when precision is acceptable. "Acceptable" depends on use:
  - Search/facet use: precision ≥ 0.8.
  - Auto-link: precision ≥ 0.95.
- Schedule re-review. Quarterly, or after every dictionary update.
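The arithmetic in the workflow above can be sketched in a few lines of Python. The formulas and promotion thresholds come from the list; the function names and the label counts are made up for illustration and do not come from a real review.

```python
from typing import Tuple

# Precision gates from the promotion step above.
PRECISION_GATES = {"search_facet": 0.8, "auto_link": 0.95}


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall from hand-labelled counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def ready_for(use: str, precision: float) -> bool:
    """Check a measured precision against the gate for a given use."""
    return precision >= PRECISION_GATES[use]


# Illustrative counts only: 90 correct spans, 10 wrong spans, 30 missed spans.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
facet_ready = ready_for("search_facet", precision)
link_ready = ready_for("auto_link", precision)
```

With these example counts, precision is 0.9 and recall is 0.75: good enough for search and faceting, not yet for auto-linking, and the recall figure suggests expanding the dictionary before the next round.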
Common pitfalls
- Aliases as an afterthought. "MBA," "Mac Book Air," "MacBook Air 2024" — write them all in the dictionary, not just the canonical form.
- Patterns without context. "\d{4}" matches every year, postcode, and asset ID. Constrain with surrounding tokens.
- No exclusions. "For example, error 0x12345678" is a documentation mention, not a real error. Exclude.
- Linking before extraction is trustworthy. Wrong edges in the graph are worse than no edges.
- Single-model bias. Hybrid pipelines (dictionary + pattern + ML) consistently beat any single model.
- No re-extraction after schema change. Adding a new spotter? Re-process the existing corpus; otherwise the new rules apply only to new documents.
Parsing a string programmatically
From a custom endpoint, scheduled task, or Custom Code Index you can run the exact same flow a field goes through during ingestion — pipeline parsing followed by entity linking — against an arbitrary string, without storing a node. This is the Graph.Parsing helper. It is the read-side counterpart to Graph.Embeddings: it resolves the pipeline registered for a (nodeType, field) pair, parses the text with it, and then runs the document-to-graph index registered for that same pair (if any) so extracted entities resolve to real graph nodes.
Use it to preview how a piece of text would be parsed and linked — for example to power a "test this rule" UI, to classify inbound text before deciding where to store it, or to extract links from text that never becomes a node.
Graph.Parsing.ParseAsync
Parse a string as if it were the value of a field, and return the linked Catalyst document:
var doc = await Graph.Parsing.ParseAsync(
    text: "Screen flicker on the MacBook Air after the latest update",
    nodeType: N.SupportCase.Type,
    field: N.SupportCase.Description,
    cancellationToken);

// ParseAsync returns null when no pipeline is registered for the (nodeType, field) pair.
if (doc != null)
{
    foreach (var entity in doc.SelectMany(span => span.GetEntities()))
    {
        // entity.EntityType.Type → the spotter/pattern type that matched
        // entity.Value → the matched span of text
        // entity.EntityType.TargetUID → the linked graph node, if entity linking resolved one
    }
}
| Parameter | Type | Notes |
|---|---|---|
| text | string | The text to parse. |
| nodeType | string | The node type whose pipeline and entity-linking config to apply. |
| field | string | The field within that node type. |
| cancellationToken | CancellationToken | Optional. Cancels parsing and linking. |
Returns the parsed Document, or null if no pipeline is registered for the (nodeType, field) pair. If a pipeline exists but no document-to-graph index is registered, the document is still parsed (spotters, patterns, tokenization run) but entity mentions are not resolved to nodes — TargetUID stays unset for non-linked spotters.
Graph.Parsing.GetLinksAsync
When you only need the edges entity linking would create — not the full annotated document — call GetLinksAsync:
var edges = await Graph.Parsing.GetLinksAsync(
    text: "Screen flicker on the MacBook Air after the latest update",
    nodeType: N.SupportCase.Type,
    field: N.SupportCase.Description,
    cancellationToken);

foreach (var edge in edges)
{
    // edge.NodeUID → the linked node
    // edge.NodeTypeUID → its type
    // edge.EdgeTypeUID → the edge type that would connect the document to it
}
Returns the list of Edges produced by the document-to-graph index for that (nodeType, field) pair. The list is empty when no pipeline or no document-to-graph index is registered, so a missing configuration is not an error — it just means there is nothing to link.
These edges are computed in memory and not written to the graph; persisting them is the caller's decision.
Where to go next
- Types of models — dictionary / pattern / ML / LLM head-to-head.
- Prompting patterns → Extraction — LLM-driven extraction.
- Embeddings — when semantic similarity covers what extraction can't.
- Reindexing and re-embedding — re-processing after rule changes.