Curiosity provides ways to train and consume similarity models both from within the system and externally via custom endpoints.

This article covers how to train and consume similarity models in the user interface. Refer to the documentation on endpoints for examples of how to expose similarity outside of your Curiosity application.

Training a similarity model

Our recommended model for similarity, based on our experience with enterprise data, is built on the graph relationships extracted from your data into your knowledge graph. The advantage of building the similarity model this way is that you can guide it to learn from the features that matter for defining similarity in your use case. Out of the box, we use the _Token and _Abbreviation nodes, which are usually already captured in every Curiosity application, but you can configure this as needed for your use case and data.
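To make the idea of similarity driven by graph relationships more concrete, here is a minimal Python sketch. It is a toy illustration only and not Curiosity's actual model or API: the file names, the neighbor keys, and the use of Jaccard overlap are all assumptions chosen for readability, whereas the trained model works with an embedding index.

```python
# Toy illustration only: NOT Curiosity's actual model or API.
# Each file is described by the set of graph neighbors it is connected to.
# File names and neighbor keys below are made up for this example.
neighbors = {
    "report_a.pdf": {"_Token:orbit", "_Token:lander", "_Abbreviation:ESA"},
    "report_b.pdf": {"_Token:orbit", "_Token:lander", "_Abbreviation:NASA"},
    "invoice.pdf": {"_Token:payment", "_Token:total"},
}


def jaccard(a: set, b: set) -> float:
    """Share of neighbors two items have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(name: str, graph: dict) -> list:
    """Rank every other item by neighbor overlap with `name`."""
    scores = [
        (other, jaccard(graph[name], graph[other]))
        for other in graph
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("report_a.pdf", neighbors)
```

In this sketch the two reports rank as most similar to each other because they share most of their neighbors, which is the intuition behind choosing which neighboring data types feed the model.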

To get started, open the Data Hub and click the Train Similarity option under Quick Actions.

You will be presented with a pop-up to choose the data type you want to train a model for. Select it from the list (for example, select File to train a similarity model that can be used across all files).

Once you have selected a data type, you'll see the option to create the required embedding index. Click Create Index to start.

You'll be presented with the options for the index. If you are training more than one index for a given data type, you will need to add an identifier tag in the Extra Tag field - otherwise leave it empty.

Use the Edges To Follow and Nodes To Follow options to define which relationships and neighboring data types are relevant for defining similarity in your data.

For example, in our online Space Library demo, you can configure the index so that the following data types contribute to similarity:

  • Abbreviations

  • Asteroids

  • Tokens

  • Authors

  • Missions

  • Organizations

This can be defined as follows:

You don't need to fill in both the Nodes To Follow and Edges To Follow fields, but if the relationship types matter for your use case, use both to narrow down the data used as input for training similarity.
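The effect of combining the two fields can be sketched in Python as follows. This is an illustration under stated assumptions, not the Curiosity API: the edge type names (HasAuthor, Mentions, UploadedBy), the sample edges, and the `select_inputs` helper are hypothetical, and the sketch assumes that a field left empty applies no restriction.

```python
# Hypothetical data and names, for illustration only (not the Curiosity API).
# Each edge: (source item, edge type, neighbor node type, neighbor key).
edges = [
    ("paper_1", "HasAuthor", "Author", "J. Doe"),
    ("paper_1", "Mentions", "Mission", "Rosetta"),
    ("paper_1", "Mentions", "Asteroid", "Ceres"),
    ("paper_1", "UploadedBy", "User", "admin"),
]


def select_inputs(edges, nodes_to_follow=None, edges_to_follow=None):
    """Keep only edges that pass every filter that was actually provided.

    A filter left as None is not applied (assumed to mirror an empty field).
    """
    kept = []
    for source, edge_type, node_type, key in edges:
        if nodes_to_follow is not None and node_type not in nodes_to_follow:
            continue
        if edges_to_follow is not None and edge_type not in edges_to_follow:
            continue
        kept.append((source, edge_type, node_type, key))
    return kept


# Only a node filter: the User neighbor is dropped, three edges remain.
by_nodes = select_inputs(edges, nodes_to_follow={"Author", "Mission", "Asteroid"})

# Node filter plus edge filter: only the two "Mentions" edges remain.
by_both = select_inputs(
    edges,
    nodes_to_follow={"Author", "Mission", "Asteroid"},
    edges_to_follow={"Mentions"},
)
```

Setting both filters never yields more input than setting one of them, which is why using both is a way to narrow the training data.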

Finally, click Save to create the index, and then click Train to train your initial model.

Depending on how much data you have, training can take anywhere from a few minutes to a few hours. You can follow the progress in the user interface or check it in the logs.

Retraining a model

Once trained, the model does not need to be retrained unless you add a significant volume of new data or relationships. New data uses the existing model to create the embedding vector it needs to be indexed for similarity search.
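One way to picture why an existing model can serve new data, and why a large volume of new relationships eventually calls for retraining, is the Python sketch below. It is a simplified assumption, not a description of how Curiosity computes embeddings: the `model` dictionary, its two-dimensional vectors, and the averaging in `embed` are all hypothetical.

```python
import math

# Hypothetical "trained model": one vector per neighbor node seen in training.
model = {
    "Mission:Rosetta": [1.0, 0.0],
    "Asteroid:Ceres": [0.0, 1.0],
    "Author:J. Doe": [1.0, 1.0],
}


def embed(neighbor_keys, model):
    """Average the vectors of known neighbors; unknown neighbors are skipped."""
    known = [model[key] for key in neighbor_keys if key in model]
    if not known:
        return None  # nothing the model has seen before
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# A new item linked to one known and one unseen neighbor still gets a vector.
new_item = embed(["Mission:Rosetta", "Mission:Hera"], model)

# An item linked only to unseen neighbors cannot be placed by the old model.
unseen_only = embed(["Mission:Hera"], model)
```

In this picture, a new item is indexed as long as it connects to neighbors the model already knows, while items dominated by unseen neighbors are represented poorly until the model is retrained.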

If you have a constant influx of data, you can also create a scheduled task to retrain your model periodically.

Exploring similarity in the user interface

The easiest way to let your users get suggestions from the similarity model you just trained is to add a Similar component in the Node Renderer settings for your data type. For example, you can use the Tab component and add a new tab that shows the similar data.
