Preamble

Defining what a document is can be confusing when files and NLP are mixed. In this section, Document refers to the raw text and associated metadata created when a raw piece of text is passed through a Natural Language Processing pipeline. Curiosity uses the open-source Catalyst NLP library as the foundation of its machine learning capabilities. In the graph database, NLP documents are stored using the protected node type _Document.

NLP documents are different from files in several ways:

  • A file may have many documents, whereas a document represents a single piece of text (usually a page or paragraph) of a given file.
  • A document need not be related to a file, e.g. it could be created from a text field of a given node on the graph.
  • Files contain more data than the pure text, e.g. metadata (author, date, etc.), images, tables, and more.

The process of transforming a file into NLP Documents is fully automated in a Curiosity instance.

Introduction

There are a few key concepts that are important when trying to understand and design NLP models.

Tokens

Tokens are the basic units for text processing in NLP; they are the "words" of the system. Tokens are extracted from the raw text in the document using a tokenizer model that tells the system where one "word" ends and the next begins.

Tokenization is a complex process, and the output tokens can be different from "normal" words because they can include:

  • Numbers, dates and other formats, e.g. 122.79 or 2020-04-23
  • References separated by special characters that would normally denote the end of a word, e.g. A320-200 or acronyms like A.B.C.
  • Captured entities, which appear as a single token when iterating over the captured tokens of a document, e.g. iPhone XS, New York or REF 15.02472
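
The first two cases above can be illustrated with a small regular-expression tokenizer. This is only a sketch in Python using the standard library: it is not the Catalyst tokenizer, and the pattern and the function name tokenize are assumptions made for illustration. It does not cover the third case (multi-word entities), which depends on entity capturing.

```python
import re

# Illustrative pattern only (an assumption, not the Catalyst tokenizer).
# Alternatives are tried in order, so the more specific formats come first.
TOKEN_PATTERN = re.compile(
    r"""
    \d{4}-\d{2}-\d{2}                  # ISO-style dates, e.g. 2020-04-23
  | \d+(?:\.\d+)?                      # integers and decimals, e.g. 122.79
  | (?:[A-Za-z]\.){2,}                 # dotted acronyms, e.g. A.B.C.
  | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)+     # hyphenated references, e.g. A320-200
  | \w+                                # ordinary words
  | [^\w\s]                            # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of token strings found in text, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The A320-200 cost 122.79 on 2020-04-23."))
```

Note that the hyphen in A320-200 and the dots in 122.79 stay inside their tokens, while the final full stop becomes a token of its own.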

Sentences

Sentences (also known as Spans) are what you would expect them to be: short pieces of text separated by punctuation (or special characters like carriage returns). Sentence detection is performed after tokenization but before other models such as entity detection.
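
As a rough sketch of why sentence detection runs on tokens rather than on raw characters, the Python function below groups a token list into sentences. The function name split_sentences and the rule of closing a sentence on ".", "!" or "?" are assumptions for illustration; a real sentence detection model is more sophisticated.

```python
def split_sentences(tokens):
    """Group tokens into sentences, closing each one on '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without closing punctuation
        sentences.append(current)
    return sentences


tokens = ["Check", "A.B.C.", "now", ".", "Done", "!"]
print(split_sentences(tokens))
```

Because the acronym A.B.C. is already a single token, its dots do not end the sentence early.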

Entities

Entities represent Tokens that have been marked as being of interest. They are captured using Named Entity Recognition (NER) models and carry metadata such as the Entity Type. They typically include things like names, reference numbers, error codes, dates and currencies, and are used in a Curiosity system to enrich your data and build up your knowledge graph.
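
To make the idea of a token carrying an Entity Type concrete, here is a minimal pattern-based sketch in Python. It is not how Catalyst NER models work internally; the Entity class, the PATTERNS table and the entity type names ("Date", "AircraftModel") are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    text: str          # the token text that was captured
    entity_type: str   # metadata attached to the capture
    token_index: int   # position of the token in the document


# Hypothetical entity types and patterns, for illustration only.
PATTERNS = {
    "Date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "AircraftModel": re.compile(r"A\d{3}-\d{3}"),
}


def capture_entities(tokens):
    """Mark every token that fully matches one of the patterns."""
    entities = []
    for index, token in enumerate(tokens):
        for entity_type, pattern in PATTERNS.items():
            if pattern.fullmatch(token):
                entities.append(Entity(token, entity_type, index))
                break  # one type per token in this sketch
    return entities


print(capture_entities(["The", "A320-200", "flew", "on", "2020-04-23", "."]))
```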

Document

A document is a collection of the raw text (before processing) together with its Tokens, Sentence boundaries and captured Entities.
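
One way to picture this structure is as a small record type. The Python sketch below is an assumption for illustration, not the Catalyst Document class: the field names and the choice to store sentence boundaries as (start, end) token index pairs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str                                       # raw text, before processing
    tokens: list = field(default_factory=list)      # token strings, in order
    sentences: list = field(default_factory=list)   # (start, end) token indexes, end exclusive
    entities: list = field(default_factory=list)    # (token_index, entity_type) pairs

    def sentence_tokens(self, index):
        """Return the tokens that belong to the sentence at the given position."""
        start, end = self.sentences[index]
        return self.tokens[start:end]


doc = Document(
    text="Hi there. Bye.",
    tokens=["Hi", "there", ".", "Bye", "."],
    sentences=[(0, 3), (3, 5)],
)
print(doc.sentence_tokens(1))
```

Storing boundaries instead of copies of the text keeps the raw text as the single source of truth, with tokens, sentences and entities layered on top of it.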

Models and Pipelines

Models are deterministic or machine learning algorithms that perform a specific operation on a given NLP Document. A collection of models that run in sequence is called a Pipeline. Models are used for multiple NLP tasks, such as splitting the raw text into Tokens, capturing Entities, or deciding on the role of each word in a sentence (Part of Speech Tagging). Models are usually language dependent (i.e. a model for English is different from a model designed or trained for German).

There are other types of NLP models such as text similarity and classification, but they usually run separately from a Pipeline.

A typical Pipeline setup consists of a set of models such as:

  • Language Detection
  • Tokenization
  • Sentence Detection
  • Entity Capturing
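
The setup listed above can be sketched as a list of functions applied in order, each one reading what the previous steps produced. This Python sketch only shows the sequencing idea: every function here is a toy stand-in, the dictionary-based document and all names are assumptions, and none of it reflects the Catalyst API.

```python
import re


def detect_language(doc):
    # Placeholder: a real model would inspect the text; English is assumed here.
    doc["language"] = "en"
    return doc


def tokenize(doc):
    # Toy tokenizer: runs of word characters, or single punctuation characters.
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def detect_sentences(doc):
    # Record (start, end) token index ranges, end exclusive.
    spans, start = [], 0
    for i, token in enumerate(doc["tokens"]):
        if token in {".", "!", "?"}:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(doc["tokens"]):
        spans.append((start, len(doc["tokens"])))
    doc["sentences"] = spans
    return doc


def capture_entities(doc):
    # Toy rule: capitalised tokens that do not start a sentence.
    sentence_starts = {start for start, _ in doc["sentences"]}
    doc["entities"] = [
        (i, token)
        for i, token in enumerate(doc["tokens"])
        if token[0].isupper() and i not in sentence_starts
    ]
    return doc


PIPELINE = [detect_language, tokenize, detect_sentences, capture_entities]


def run_pipeline(text):
    """Pass a new document through every model in order."""
    doc = {"text": text}
    for model in PIPELINE:
        doc = model(doc)
    return doc


print(run_pipeline("We flew to Paris. It rained."))
```

The order matters: the toy entity rule relies on the sentence boundaries, which in turn rely on the tokens.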

For more information on models, check the Natural Language Processing articles.
