# Pipeline Usage
The Pipeline class is the central orchestrator in Catalyst. It defines the sequence of processing steps that are applied to documents.
## Creating a Pipeline
Before creating a pipeline for a specific language, install the corresponding NuGet package and register its models. For example, to add English language support, install the package from the command line and then register the models in code:
dotnet add package Catalyst.Models.English
// Register the language models
Catalyst.Models.English.Register();
### Default Pipeline
Pipeline.ForAsync creates a default pipeline for a given language. It typically includes a tokenizer, a sentence detector, and a part-of-speech (POS) tagger.
var nlp = await Pipeline.ForAsync(Language.English);
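For context, a complete console program around this call might look like the sketch below. The `Storage.Current` line and the `catalyst-models` folder name are assumptions about where downloaded model files should be cached, and the sample sentence is arbitrary.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

// Register the English models shipped in the Catalyst.Models.English package
Catalyst.Models.English.Register();

// Assumption: cache model files in a local folder named "catalyst-models"
Storage.Current = new DiskStorage("catalyst-models");

// Build the default English pipeline and run it on a single document
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);

// Dump the annotated document as JSON to inspect tokens and tags
var json = doc.ToJson();
Console.WriteLine(json);
```

Printing the JSON is a quick way to check which annotations the pipeline actually produced before writing code that depends on them.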
### Tokenizer-only Pipeline
If you only need tokenization (and optionally sentence detection), use Pipeline.TokenizerForAsync. This is faster than the default pipeline as it skips the part-of-speech tagging step.
var nlp = await Pipeline.TokenizerForAsync(Language.English, sentenceDetector: true);
You can also create a pipeline and explicitly disable the tagger to improve performance:
// Creates a pipeline with tokenizer and sentence detector, but no tagger
var nlp = await Pipeline.ForAsync(Language.English, tagger: false);
## Customizing the Pipeline
Pipelines are flexible and allow you to add or remove processing steps.
### Adding Processes
You can add any model that implements IProcess to the pipeline.
var nlp = await Pipeline.ForAsync(Language.English);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(Language.English, Version.Latest, "WikiNER"));
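Once an entity recognizer is part of the pipeline, processed documents carry entity annotations. The sketch below lists them for one document; the sample sentence is made up, and the `Version` alias is only there to avoid a clash with `System.Version`.

```csharp
using System;
using System.Linq;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version; // avoids ambiguity with System.Version

Catalyst.Models.English.Register();

var nlp = await Pipeline.ForAsync(Language.English);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(Language.English, Version.Latest, "WikiNER"));

// Illustrative input; any English text works here
var doc = new Document("Jeff Bezos founded Amazon in Seattle.", Language.English);
nlp.ProcessSingle(doc);

// Each span (sentence) exposes the entities detected inside it
var entities = doc.SelectMany(span => span.GetEntities()).ToList();
foreach (var entity in entities)
{
    Console.WriteLine($"{entity.Value} [{entity.EntityType.Type}]");
}
```

Which entities are reported depends on the model, so the output is not shown here.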
### Custom Order
When you add a process using Add(), Catalyst automatically maintains a logical order:
- Normalizers
- Tokenizers
- Sentence Detectors
- Taggers
- Others (e.g., Entity Recognizers)
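To make the idea of a category-based order concrete, here is a small self-contained sketch that does not use Catalyst at all. The rank table, step names, and the use of a stable sort are all hypothetical illustrations of the list above, not Catalyst's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical ranks mirroring the list above; a lower rank runs earlier
var rank = new Dictionary<string, int>
{
    ["Normalizer"] = 0,
    ["Tokenizer"] = 1,
    ["SentenceDetector"] = 2,
    ["Tagger"] = 3,
    ["Other"] = 4,
};

// Steps in the order a caller happened to add them
var added = new List<(string Name, string Kind)>
{
    ("WikiNER", "Other"),
    ("POS tagger", "Tagger"),
    ("Tokenizer", "Tokenizer"),
    ("Gazetteer", "Other"),
    ("Sentence detector", "SentenceDetector"),
};

// OrderBy is a stable sort, so steps of the same kind keep their insertion order
var ordered = added.OrderBy(step => rank[step.Kind]).Select(step => step.Name).ToList();

Console.WriteLine(string.Join(" -> ", ordered));
// prints: Tokenizer -> Sentence detector -> POS tagger -> WikiNER -> Gazetteer
```

The point of the sketch is that the caller's call order only matters within a category; across categories the rank decides.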
### Removing Processes
You can remove models from the pipeline if they are no longer needed.
nlp.RemoveAll(p => p is ITagger);
## Processing Documents
### Single Document
For processing a single document, use ProcessSingle.
var doc = new Document("Text to process", Language.English);
nlp.ProcessSingle(doc);
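After processing, the annotations live on the document itself. The sketch below prints each token with its part-of-speech tag. It assumes that a document enumerates its sentence spans, that a span enumerates its tokens, and that tokens expose `Value` and `POS`; check these names against the Catalyst version you use.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

Catalyst.Models.English.Register();

var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("Text to process", Language.English);
nlp.ProcessSingle(doc);

// Assumption: iterating a document yields spans, and iterating a span yields tokens
var tokenCount = 0;
foreach (var span in doc)
{
    foreach (var token in span)
    {
        // Assumption: Value is the token text and POS its part-of-speech tag
        Console.WriteLine($"{token.Value}\t{token.POS}");
        tokenCount++;
    }
}
```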
### Multiple Documents
For processing large numbers of documents, use Process. This method leverages multi-threading and lazy evaluation for better performance.
IEnumerable<IDocument> docs = GetDocuments();
var processedDocs = nlp.Process(docs);
foreach (var doc in processedDocs)
{
    // Do something with the processed document
}
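The snippet above leaves `GetDocuments` undefined. One possible shape for it, shown here as a hypothetical helper, streams one document per line from a text file so that the input stays lazy too. The helper's `path` parameter and the `corpus.txt` file name are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Catalyst;
using Mosaik.Core;

Catalyst.Models.English.Register();
var nlp = await Pipeline.ForAsync(Language.English);

// Nothing is read or processed until the foreach starts pulling results
var processed = 0;
foreach (var doc in nlp.Process(GetDocuments("corpus.txt")))
{
    processed++;
}
Console.WriteLine($"Processed {processed} documents");

// Hypothetical helper: one document per non-empty line.
// File.ReadLines reads lazily, so the whole corpus is never held in memory.
static IEnumerable<IDocument> GetDocuments(string path)
{
    foreach (var line in File.ReadLines(path))
    {
        if (!string.IsNullOrWhiteSpace(line))
        {
            yield return new Document(line, Language.English);
        }
    }
}
```

Because both the source and the results are enumerated lazily, avoid calling `ToList()` on either side unless the corpus comfortably fits in memory.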
## Storing and Loading Pipelines
You can store a configured pipeline and its models into a single binary file and load it back later.
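// Store (File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes)
using (var f = File.Create("my-pipeline.bin"))
{
    nlp.PackTo(f);
}
// Load (declared outside the using block so the pipeline stays in scope afterwards)
Pipeline nlp2;
using (var f = File.OpenRead("my-pipeline.bin"))
{
    nlp2 = await Pipeline.LoadFromPackedAsync(f);
}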
## Neuralyzers
Neuralyzers are pattern-based components that you add to a pipeline to correct mistakes made by other models, for example by adding an entity that a model missed or forgetting one that it tagged incorrectly.
var neuralyzer = new Neuralyzer(Language.English, 0, "fixes");
neuralyzer.TeachAddPattern("Organization", "Amazon", mp => mp.Add(new PatternUnit(P.Single().WithToken("Amazon"))));
nlp.UseNeuralyzer(neuralyzer);
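The example above only adds an entity. A fuller sketch that pairs it with a forget pattern is shown below, so that "Amazon" stops being tagged as a person and is tagged as an organization instead. It assumes the WikiNER recognizer is in the pipeline and that `P` is an alias for `Catalyst.PatternUnitPrototype`; the sample sentence is made up.

```csharp
using System;
using System.Linq;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using P = Catalyst.PatternUnitPrototype;
using Version = Mosaik.Core.Version; // avoids ambiguity with System.Version

Catalyst.Models.English.Register();

var nlp = await Pipeline.ForAsync(Language.English);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(Language.English, Version.Latest, "WikiNER"));

var neuralyzer = new Neuralyzer(Language.English, 0, "fixes");

// Forget: stop treating the token "Amazon" as a Person entity
neuralyzer.TeachForgetPattern("Person", "Amazon",
    mp => mp.Add(new PatternUnit(P.Single().WithToken("Amazon").WithEntityType("Person"))));

// Add: tag the token "Amazon" as an Organization instead
neuralyzer.TeachAddPattern("Organization", "Amazon",
    mp => mp.Add(new PatternUnit(P.Single().WithToken("Amazon"))));

nlp.UseNeuralyzer(neuralyzer);

// Illustrative input to see the corrections applied
var doc = new Document("Amazon opened a new office in Seattle.", Language.English);
nlp.ProcessSingle(doc);

var entities = doc.SelectMany(span => span.GetEntities()).ToList();
foreach (var entity in entities)
{
    Console.WriteLine($"{entity.Value} [{entity.EntityType.Type}]");
}
```

Keeping the forget and add patterns next to each other makes the intended correction easy to review later.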