# The Document Class

The Document class is the primary data structure in Catalyst. It represents the text being processed and stores all the linguistic annotations generated by the NLP pipeline.

# Creating a Document

You can create a new document by providing the raw text and its language.

using Catalyst;
using Mosaik.Core;

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
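A freshly created document holds only raw text. As a minimal sketch of how annotations get attached, the snippet below runs the document through an English pipeline. The `Catalyst.Models.English.Register()` call, `Storage.Current`, `Pipeline.ForAsync` and `ProcessSingle` are not described in this section; they are assumed from typical Catalyst usage and require the `Catalyst.Models.English` package. The `catalyst-models` folder name is a placeholder.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

// Assumption: the Catalyst.Models.English NuGet package is installed.
// Register() makes the English models available to the pipeline.
Catalyst.Models.English.Register();

// Assumption: models are cached on disk; the folder name is a placeholder.
Storage.Current = new DiskStorage("catalyst-models");

// Assumed entry point for building a default English pipeline.
var nlp = await Pipeline.ForAsync(Language.English);

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);

// Process the document in place; annotations are stored on the document itself.
nlp.ProcessSingle(doc);

Console.WriteLine($"Parsed: {doc.IsParsed}, spans: {doc.SpansCount}, tokens: {doc.TokensCount}");
```

The same `doc` instance is reused after processing, which is why the properties listed in the next section can be read directly from it.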

# Important Properties

  • Value: The original raw text of the document.
  • Language: The language of the document.
  • UID: A unique identifier for the document.
  • Metadata: A dictionary for storing arbitrary key-value pairs associated with the document.
  • Labels: A list of string labels (e.g., for document classification).
  • Spans: An enumerable of all Span objects (sentences) in the document.
  • SpansCount: The total number of spans in the document.
  • TokensCount: The total number of tokens across all spans.
  • EntitiesCount: The total number of recognized entities in the document.
  • IsParsed: A boolean indicating whether the document has been tokenized.
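The properties above can be read straight from a document instance. The sketch below is illustrative only: it assumes that Metadata maps string keys to string values and that Labels is a mutable list of strings, and it guards against either collection being uninitialized. The `source` key and the `animals` label are made-up values.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);

// Raw text, language and identifier are available right after construction.
Console.WriteLine(doc.Value);
Console.WriteLine(doc.Language);
Console.WriteLine(doc.UID);

// Assumption: Metadata is a string-to-string dictionary; the key is hypothetical.
if (doc.Metadata != null)
{
    doc.Metadata["source"] = "example";
}

// Assumption: Labels is a mutable list of strings; the label is hypothetical.
doc.Labels?.Add("animals");

// The counts describe annotations, so they only become meaningful
// once the document has been processed by a pipeline.
Console.WriteLine($"IsParsed: {doc.IsParsed}");
Console.WriteLine($"Spans: {doc.SpansCount}, tokens: {doc.TokensCount}, entities: {doc.EntitiesCount}");
```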

# Key Methods

# TokenizedValue

Returns the tokenized text of the document. Pass mergeEntities: true to merge each recognized entity into a single token.

string tokenized = doc.TokenizedValue(mergeEntities: true);

# ToTokenList

Flattens all tokens from all spans into a single list of IToken objects.

List<IToken> allTokens = doc.ToTokenList();
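As a rough illustration of working with the flattened list, the sketch below prints each token with its character offsets. The pipeline setup (`Register`, `Storage.Current`, `Pipeline.ForAsync`, `ProcessSingle`) is assumed from typical Catalyst usage and is not covered in this section. The `Begin`, `End` and `Value` members of IToken are likewise assumptions about the token interface.

```csharp
using System;
using System.Collections.Generic;
using Catalyst;
using Mosaik.Core;

// Assumed setup: English models registered and cached in a placeholder folder.
Catalyst.Models.English.Register();
Storage.Current = new DiskStorage("catalyst-models");
var nlp = await Pipeline.ForAsync(Language.English);

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.ProcessSingle(doc);

// One flat list, regardless of how many spans the document contains.
List<IToken> allTokens = doc.ToTokenList();

foreach (IToken token in allTokens)
{
    // Assumption: Begin and End are character offsets into doc.Value.
    Console.WriteLine($"{token.Begin}-{token.End}: {token.Value}");
}
```

A flat list is convenient when sentence boundaries do not matter, for example when counting word frequencies across the whole document.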

# AddSpan

Manually adds a span covering the given start and end character indices and returns the new span.

var span = doc.AddSpan(0, 10);

# ToStringWithReplacements

Generates a new string in which each recognized entity is replaced by the value returned from a custom function. Returning null from the function keeps the original text.

string anonymized = doc.ToStringWithReplacements(entity =>
{
    if (entity.EntityType.Type == "Person") return "[REDACTED]";
    return null; // Keep original
});

# Clear

Removes all tokens and spans from the document, but keeps the raw text and metadata.

doc.Clear();
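To make the effect of Clear concrete, the sketch below adds a span by hand, clears the document, and then prints the span count and the raw text. It uses only members described in this section and treats the behavior described above as an assumption rather than a guarantee.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

string text = "The quick brown fox jumps over the lazy dog";
var doc = new Document(text, Language.English);

// Add a span manually so there is something to remove.
doc.AddSpan(0, 10);
Console.WriteLine($"Spans before Clear: {doc.SpansCount}");

doc.Clear();

// Per the description above, spans and tokens are gone while the text remains.
Console.WriteLine($"Spans after Clear: {doc.SpansCount}");
Console.WriteLine($"Text still present: {doc.Value == text}");
```

Clearing is useful when the same document should be reprocessed by a different pipeline without being reallocated.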

# Serialization and Deserialization

Catalyst provides several ways to save and load documents.

# JSON Serialization

You can convert a document to a JSON string and back.

// Serialize to JSON
string json = doc.ToJson();

// Deserialize from JSON
Document doc2 = Document.FromJson(json);
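Because ToJson returns a plain string, persisting a document to disk needs nothing beyond the standard file APIs. The following is a minimal sketch of a file round trip; the file name and temporary location are arbitrary choices for illustration.

```csharp
using System;
using System.IO;
using Catalyst;
using Mosaik.Core;

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);

// Hypothetical location for the serialized document.
string path = Path.Combine(Path.GetTempPath(), "document.json");

// Save the document as JSON text.
File.WriteAllText(path, doc.ToJson());

// Load it back into a new Document instance.
Document restored = Document.FromJson(File.ReadAllText(path));

Console.WriteLine($"Text preserved: {restored.Value == doc.Value}");
```

JSON is human-readable and easy to inspect, which makes it a reasonable default for debugging and small datasets.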

# Binary Serialization (MessagePack)

For high-performance scenarios, Catalyst supports binary serialization using MessagePack. This is often used internally when storing models or large corpora.

# Immutable Documents

The ImmutableDocument class provides an immutable, memory-efficient representation of a document. It is useful for scenarios where you want to ensure the document data is not changed after processing.

// Convert to ImmutableDocument
ImmutableDocument immutableDoc = doc.ToImmutable();

// Convert back to mutable Document
Document mutableDoc = immutableDoc.ToMutable();