#
The Document Class
The Document class is the primary data structure in Catalyst. It represents the text being processed and stores all the linguistic annotations generated by the NLP pipeline.
#
Creating a Document
You can create a new document by providing the raw text and its language.
using Catalyst;
using Mosaik.Core;
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
#
Important Properties
- Value: The original raw text of the document.
- Language: The language of the document.
- UID: A unique identifier for the document.
- Metadata: A dictionary for storing arbitrary key-value pairs associated with the document.
- Labels: A list of string labels (e.g., for document classification).
- Spans: An enumerable of all Span objects (sentences) in the document.
- SpansCount: The total number of spans in the document.
- TokensCount: The total number of tokens across all spans.
- EntitiesCount: The total number of recognized entities in the document.
- IsParsed: A boolean indicating if the document has been tokenized.
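The count properties and IsParsed only become informative once a pipeline has run over the document. The following is a minimal sketch of reading them before and after processing. It assumes the Catalyst.Models.English package is referenced and that models may be downloaded into a local folder; the folder name `catalyst-models` is an arbitrary choice for illustration, and the pipeline setup calls follow the usual Catalyst getting-started pattern rather than anything described on this page.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

// Assumption: the Catalyst.Models.English NuGet package is installed.
Catalyst.Models.English.Register();

// Assumption: models are cached on disk; the folder name is arbitrary.
Storage.Current = new DiskStorage("catalyst-models");

var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);

// Before processing, the document only holds its raw text.
Console.WriteLine($"Parsed before processing: {doc.IsParsed}");

// Run the pipeline in place; annotations are stored on the same instance.
nlp.ProcessSingle(doc);

Console.WriteLine($"Language: {doc.Language}");
Console.WriteLine($"Spans:    {doc.SpansCount}");
Console.WriteLine($"Tokens:   {doc.TokensCount}");
Console.WriteLine($"Entities: {doc.EntitiesCount}");
Console.WriteLine($"Parsed:   {doc.IsParsed}");
```

The exact counts depend on the models loaded, so no expected output is shown here.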
#
Key Methods
#
TokenizedValue
Returns the tokenized text of the document. You can optionally merge recognized entities into single tokens.
string tokenized = doc.TokenizedValue(mergeEntities: true);
#
ToTokenList
Flattens all tokens from all spans into a single list of IToken objects.
List<IToken> allTokens = doc.ToTokenList();
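Because ToTokenList returns a flat list, it combines naturally with LINQ. As a rough sketch, the helper below counts tokens per part-of-speech tag. `DocumentStats` and `CountByPartOfSpeech` are hypothetical names introduced for illustration and are not part of Catalyst; the sketch also assumes that IToken exposes a `POS` property and that the document has already been processed by a pipeline with a tagger.

```csharp
using System.Collections.Generic;
using System.Linq;
using Catalyst;

public static class DocumentStats
{
    // Hypothetical helper, not part of Catalyst.
    // Assumption: IToken exposes a POS property holding the part-of-speech tag.
    public static Dictionary<string, int> CountByPartOfSpeech(Document doc)
    {
        return doc.ToTokenList()
                  // Group tokens by the string form of their tag.
                  .GroupBy(token => token.POS.ToString())
                  // Map each tag to the number of tokens carrying it.
                  .ToDictionary(group => group.Key, group => group.Count());
    }
}
```

Calling `DocumentStats.CountByPartOfSpeech(doc)` on a processed document yields one entry per tag that occurs in it, and the values sum to the number of tokens in the flattened list.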
#
AddSpan
Manually adds a span to the document by specifying the start and end character indices.
var span = doc.AddSpan(0, 10);
#
ToStringWithReplacements
Returns a new string in which each recognized entity is replaced by the value your callback returns. Returning null from the callback keeps the original text for that entity.
string anonymized = doc.ToStringWithReplacements(entity =>
{
    if (entity.EntityType.Type == "Person") return "[REDACTED]";
    return null; // Keep original
});
#
Clear
Removes all tokens and spans from the document, but keeps the raw text and metadata.
doc.Clear();
#
Serialization and Deserialization
Catalyst provides several ways to save and load documents.
#
JSON Serialization
ToJson serializes a document to a JSON string, and the static Document.FromJson method rebuilds a document from that string.
// Serialize to JSON
string json = doc.ToJson();
// Deserialize from JSON
Document doc2 = Document.FromJson(json);
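Since the JSON form is a plain string, it can be stored with ordinary file APIs. The sketch below writes a document to disk and reads it back, using only the ToJson and FromJson calls shown above plus System.IO. The temp-file location and the file name `catalyst-document.json` are arbitrary choices for illustration.

```csharp
using System;
using System.IO;
using Catalyst;
using Mosaik.Core;

var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);

// Arbitrary location; a temp file keeps the sketch self-contained.
string path = Path.Combine(Path.GetTempPath(), "catalyst-document.json");

// Persist the document as JSON text.
File.WriteAllText(path, doc.ToJson());

// Load it back and compare the raw text of both documents.
Document restored = Document.FromJson(File.ReadAllText(path));
Console.WriteLine(restored.Value == doc.Value);
```

Comparing Value is only a basic sanity check; a processed document would also warrant comparing span and token counts.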
#
Binary Serialization (MessagePack)
For high-performance scenarios, Catalyst supports binary serialization using MessagePack. This is often used internally when storing models or large corpora.
#
Immutable Documents
The ImmutableDocument class provides an immutable, memory-efficient representation of a document. It is useful for scenarios where you want to ensure the document data is not changed after processing.
// Convert to ImmutableDocument
ImmutableDocument immutableDoc = doc.ToImmutable();
// Convert back to mutable Document
Document mutableDoc = immutableDoc.ToMutable();