# 🚀 Catalyst: High-Performance NLP for .NET

Catalyst is a pure C# Natural Language Processing (NLP) library built for speed. Inspired by spaCy's design, it brings production-grade NLP capabilities to the .NET ecosystem, including pre-trained models, word embeddings, and entity recognition.


# Why Catalyst?

  • Built for Speed: Process over 1,000,000 tokens/s on a modern CPU.
  • Pure C#: No Python dependencies or heavy wrappers. Native, modern .NET support.
  • Cross-Platform: Runs on Windows, Linux, and macOS, including ARM devices.
  • Non-Destructive: Tokenization preserves the original text, allowing for perfect mapping between processed tokens and raw input.
  • spaCy-Inspired Pipeline: A familiar "Pipeline" architecture for tokenization, lemmatization, POS tagging, and NER.
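
To make the "non-destructive" point concrete, here is a minimal, self-contained sketch of the general idea. It is not Catalyst's implementation: `TokenSpan`, `NonDestructiveDemo`, and `Tokenize` are hypothetical names, and splitting on whitespace is a deliberate simplification. Each token is stored as a pair of offsets into the raw input, so the original text is never altered and every token maps back to exact character positions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration: a token is a pair of offsets
// (Begin inclusive, End exclusive) into the original string.
public readonly record struct TokenSpan(int Begin, int End);

public static class NonDestructiveDemo
{
    // Splits on whitespace, recording offsets instead of copying or normalizing text.
    public static List<TokenSpan> Tokenize(string text)
    {
        var tokens = new List<TokenSpan>();
        int i = 0;
        while (i < text.Length)
        {
            // Skip any run of whitespace.
            while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
            int start = i;
            // Consume the run of non-whitespace characters.
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;
            if (i > start) tokens.Add(new TokenSpan(start, i));
        }
        return tokens;
    }

    public static void Main()
    {
        const string text = "The  quick brown fox"; // note the double space
        foreach (var t in Tokenize(text))
        {
            // The raw input is untouched, so offsets always point at the right characters.
            Console.WriteLine($"[{t.Begin}, {t.End}) '{text[t.Begin..t.End]}'");
        }
    }
}
```

Because the double space is skipped rather than rewritten, the second token still starts at offset 5 in the raw input.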

# 🛠 Core Features

| Feature | Description |
| --- | --- |
| Tokenization | High-speed, non-destructive tokenization that is more than 99.9% RegEx-free. |
| Entity Recognition | Supports Gazetteer (Spotter), Rule-based (PatternSpotter), and Perceptron models. |
| Embeddings | Out-of-the-box support for training FastText and StarSpace embeddings. |
| Lemmatization | Root-form extraction using lookup tables ported from spaCy. |
| Language Detection | Language detection using FastText or CLD3. |
| Modern Serialization | Efficient binary serialization via MessagePack for fast model loading. |

# 🏁 Getting Started

## 1. Installation

Install the core library and the language package for your target language via NuGet:

```bash
dotnet add package Catalyst
dotnet add package Catalyst.Models.English
```

## 2. Basic Usage

The following example registers the English models, loads the pipeline, processes a single document, and prints each token with its part-of-speech tag and lemma:

```csharp
using System;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;

// 1. Register the language and set the storage location for model files
Catalyst.Models.English.Register();
Storage.Current = new DiskStorage("catalyst-models");

// 2. Load the NLP pipeline
var nlp = await Pipeline.ForAsync(Language.English);

// 3. Process your document
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);

// 4. Access results (tokens, POS tags, lemmas)
foreach (var span in doc)
{
    foreach (var token in span)
    {
        Console.WriteLine($"{token.Value} [{token.POS}] -> {token.Lemma}");
    }
}
```

# 🌍 Language Support

Catalyst provides pre-trained models for a wide variety of languages, trained on data from the Universal Dependencies project. All language data is distributed as modular NuGet packages, so your application only carries the languages it needs.

  • Available Packages: English, French, German, Spanish, Italian, and more.
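
Switching languages follows the same pattern as the English example in Getting Started. The sketch below assumes the French package is installed (`dotnet add package Catalyst.Models.French`) and that it exposes the same `Register()` entry point as the English package. The sample sentence is arbitrary, and `ToJson()` is used only as a quick way to inspect the result.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

// Assumption: the Catalyst.Models.French NuGet package is referenced by the project.
Catalyst.Models.French.Register();
Storage.Current = new DiskStorage("catalyst-models");

// Load the pipeline for the registered language.
var nlp = await Pipeline.ForAsync(Language.French);

var doc = new Document("Le renard brun saute par-dessus le chien paresseux", Language.French);
nlp.ProcessSingle(doc);

// Dump the processed document (tokens, tags, lemmas) as JSON for inspection.
Console.WriteLine(doc.ToJson());
```

Only the packages you register are loaded, so adding a language costs one package reference and one `Register()` call.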

# 🧠 Advanced Capabilities

## Multi-threaded Processing

Leverage .NET's native multi-threading to process large collections of documents efficiently:

```csharp
// GetLargeDocumentCollection() is a placeholder for your own source of documents
var docs = GetLargeDocumentCollection();
var processedDocs = nlp.Process(docs); // Internally parallelized & lazy-evaluated
```
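
Because the result is lazy-evaluated, nothing is processed until the sequence is enumerated. The following sketch shows one way to consume it. The two input strings are made-up sample data, and the token count reuses the nested `foreach` pattern from Basic Usage, on the assumption that the documents returned by `Process` can be enumerated the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Catalyst;
using Mosaik.Core;

Catalyst.Models.English.Register();
Storage.Current = new DiskStorage("catalyst-models");
var nlp = await Pipeline.ForAsync(Language.English);

// Made-up sample input; in practice this would stream from a file or database.
IEnumerable<IDocument> docs = new[] { "The first document.", "A second, slightly longer document." }
    .Select(text => new Document(text, Language.English));

// Processing happens as the loop pulls documents from the lazy sequence.
foreach (var doc in nlp.Process(docs))
{
    var tokenCount = 0;
    foreach (var span in doc)
    {
        foreach (var token in span)
        {
            tokenCount++;
        }
    }
    Console.WriteLine($"{tokenCount} tokens: {doc.Value}");
}
```

Avoid calling `ToList()` on the result for very large collections unless you need every processed document in memory at once.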

## Pattern Matching (Entity Spotting)

Create complex rule-based entity recognizers using the PatternSpotter:

```csharp
using P = Catalyst.PatternUnitPrototype; // alias used by the pattern builder below

var spotter = new PatternSpotter(Language.English, 0, tag: "tech-stack", captureTag: "Tech");
spotter.NewPattern("C#", mp => mp.Add(new PatternUnit(P.Single().WithToken("C#"))));
nlp.Add(spotter);
```
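
Once the spotter is part of the pipeline, captured entities can be read back from a processed document. The sketch below is modeled on the entity recognition sample and assumes that `Span.GetEntities()` returns the captured token groups and that each group exposes `Value` and `EntityType.Type`. The input sentence is made up, and whether "C#" is captured depends on the tokenizer keeping it as a single token.

```csharp
using System;
using System.Linq;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using P = Catalyst.PatternUnitPrototype;

Catalyst.Models.English.Register();
Storage.Current = new DiskStorage("catalyst-models");
var nlp = await Pipeline.ForAsync(Language.English);

// Same spotter as in the pattern-matching example.
var spotter = new PatternSpotter(Language.English, 0, tag: "tech-stack", captureTag: "Tech");
spotter.NewPattern("C#", mp => mp.Add(new PatternUnit(P.Single().WithToken("C#"))));
nlp.Add(spotter);

var doc = new Document("We build our services in C# and deploy them on Linux.", Language.English);
nlp.ProcessSingle(doc);

// Assumption: GetEntities() yields one item per captured entity in the span.
foreach (var entity in doc.SelectMany(span => span.GetEntities()))
{
    Console.WriteLine($"{entity.Value} [{entity.EntityType.Type}]");
}
```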

# 📖 Learn More

  • Tutorials: Deep dives into NER, Embeddings, and Training.
  • Contributing: We welcome PRs! Help us make .NET NLP even faster.

Maintained by Curiosity. Licensed under MIT.