# Introduction to Catalyst

Catalyst is a high-performance Natural Language Processing (NLP) library for C#. Inspired by the design of spaCy, Catalyst provides a fast and modern way to process text in .NET.

# Key Features

  • Fast & Modern: Built in pure C#, supporting .NET Standard 2.0 and later.
  • Cross-Platform: Runs on Windows, Linux, macOS, and ARM.
  • Efficient Tokenization: Non-destructive, RegEx-free tokenization capable of processing over 1 million tokens per second.
  • Comprehensive NLP Tasks: Support for Tokenization, Sentence Detection, Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Language Detection, and more.
  • Pre-trained Models: Easy access to pre-trained models for various languages.

# Core Concepts

Understanding Catalyst requires familiarity with its core building blocks:

# 1. Document

The Document class represents a piece of text to be processed. It holds the original text and all the metadata generated during processing (tokens, spans, entities, etc.). See Document for more details.

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);

# 2. Span

A Span represents a segment of a Document, typically a sentence. A Document can contain multiple spans.

# 3. Token

A Token is the basic unit of text, such as a word or punctuation mark, within a Span.

# 4. Pipeline

A Pipeline is a sequence of processing steps (models) that are applied to a Document. A typical pipeline includes a tokenizer, a sentence detector, and a POS tagger.

var nlp = await Pipeline.ForAsync(Language.English);
nlp.ProcessSingle(doc);

# 5. Language

Catalyst uses the Language enum to specify the language of a document or model. It supports a wide range of languages.

# Getting Started

To use Catalyst, you need to install the Catalyst NuGet package.

# Language Packages

All language-specific data and models are provided as separate NuGet packages. You can find all available packages here.

Before using a language, you must install its respective NuGet package and register it in your code.

# Example: Adding English language support via dotnet CLI
dotnet add package Catalyst.Models.English
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;

// Register the English language models
Catalyst.Models.English.Register();

// Configure storage for lazy-loading models
Storage.Current = new DiskStorage("catalyst-models");

// Create a pipeline for English
var nlp = await Pipeline.ForAsync(Language.English);

// Create and process a document
var doc = new Document("Hello, world!", Language.English);
nlp.ProcessSingle(doc);

// Access the results
Console.WriteLine(doc.ToJson());

# Storage

Catalyst uses a storage mechanism to load and cache models. By default, it can download models from an online repository or load them from a local disk using DiskStorage.

Storage.Current = new DiskStorage("catalyst-models");

# Supported Languages

Below is a list of supported languages and their corresponding NuGet packages.

For a full list of supported languages, check the Languages folder in the repository or search NuGet for Catalyst.Models.