# Models in Catalyst

Catalyst provides several types of models to handle different NLP tasks. This page explains the most common model types and how to use them.

# Tokenizers

Tokenizers split raw text into individual tokens (words, punctuation, etc.).

# FastTokenizer

The default tokenizer in Catalyst. It is highly efficient and non-destructive.

var tokenizer = new FastTokenizer(Language.English);

# Part-of-Speech (POS) Taggers

POS taggers assign grammatical tags (e.g., Noun, Verb, Adjective) to each token. Catalyst uses Universal Dependencies for POS tagging.

# AveragePerceptronTagger

A fast and accurate POS tagger based on the averaged perceptron algorithm.

var tagger = await AveragePerceptronTagger.FromStoreAsync(Language.English, Version.Latest, "");

Example Output:

For the sentence: "The quick brown fox jumps over the lazy dog"

Token POS Tag
The DET
quick ADJ
brown ADJ
fox NOUN
jumps VERB
over ADP
the DET
lazy ADJ
dog NOUN

# Named Entity Recognition (NER)

NER models identify and categorize entities in text (e.g., Names, Organizations, Locations). Catalyst supports three main types:

# 1. Spotter

A gazetteer-like model that matches a predefined set of words or phrases.

var spotter = new Spotter(Language.Any, 0, "programming", "ProgrammingLanguage");
spotter.AddEntry("C#");
spotter.AddEntry("Python");

Example Output:

For the sentence: "I love coding in C# and Python."

Entity Entity Type
C# ProgrammingLanguage
Python ProgrammingLanguage

# 2. PatternSpotter

A rule-based model that uses complex patterns of tokens to identify entities. Conceptual equivalent of RegEx but on tokens.

var isApattern = new PatternSpotter(Language.English, 0, "is-a-pattern", "IsA");
isApattern.NewPattern(
    "Is+Noun",
    mp => mp.Add(
        new PatternUnit(P.Single().WithToken("is").WithPOS(PartOfSpeech.VERB)),
        new PatternUnit(P.Multiple().WithPOS(PartOfSpeech.NOUN, PartOfSpeech.PROPN, PartOfSpeech.DET, PartOfSpeech.ADJ))
));

Example Output:

For the sentence: "Catalyst is a high-performance library."

Match Entity Type
is a high-performance library IsA

# 3. AveragePerceptronEntityRecognizer

A statistical model for NER, typically trained on large datasets like WikiNER.

var ner = await AveragePerceptronEntityRecognizer.FromStoreAsync(Language.English, Version.Latest, "WikiNER");

# Iterating through Entities

Once a document has been processed by an NER model, you can iterate through the captured entities:

foreach (var span in doc)
{
    foreach (var entity in span.GetEntities())
    {
        Console.WriteLine($"Entity: {entity.Value} [{entity.EntityType.Type}]");
    }
}

Or using LINQ:

var entities = doc.SelectMany(span => span.GetEntities());
foreach(var entity in entities)
{
    Console.WriteLine($"Entity: {entity.Value} [{entity.EntityType.Type}]");
}

# Embeddings

Embeddings represent words or documents as dense vectors in a continuous vector space.

# FastText

Supports training and using FastText word and document embeddings.

var ft = new FastText(Language.English, 0, "my-fasttext-model");
ft.Train(nlp.Process(docs));

Example: Vector Retrieval and Similarity

// Get vector for a word
float[] vector = ft.GetVector("apple", Language.English);

// Compute similarity between two words
float[] vector1 = ft.GetVector("apple", Language.English);
float[] vector2 = ft.GetVector("orange", Language.English);
float similarity = vector1.CosineSimilarityWith(vector2);

# Language Detectors

Language detectors identify the language of a given text.

# FastTextLanguageDetector

Uses FastText models for accurate language detection.

var detector = await FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");

# LanguageDetector

Derived from Google's CLD3 (Compact Language Detector 3).

var detector = await LanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");

# Normalizers

Normalizers transform text to a standard form (e.g., lowercasing, removing punctuation).

Normalizer Input Output
LowerCaseNormalizer "Hello World" "hello world"
UpperCaseNormalizer "Hello World" "HELLO WORLD"
HtmlNormalizer "<b>Hello</b>" "Hello"
FoldToAsciiNormalizer "Crème brûlée" "Creme brulee"
RemovePunctuationNormalizer "Hello, World!" "Hello World"

Usage:

nlp.Add(new LowerCaseNormalizer());