To make images and scanned files searchable, you can enable automatic OCR using the out-of-the-box support for the AWS Textract APIs in your Curiosity application.

You will need:

  • An active Amazon Web Services account
  • The AWS region you would like to call the API from
  • A valid AWS Key ID and Secret Access Key with access to the AWS Textract API

Note: AWS Textract is a paid service from Amazon; you can check the current prices for using the API on the AWS website. AWS offers a free tier that you can use to test the API.

Set up your AWS Textract account

Please follow the official AWS guide to set up your account for AWS Textract, create the required IAM user with access to the API, and generate the Access Key ID and Secret Access Key. The IAM user must have the AmazonTextractFullAccess managed policy attached to it.

To retrieve the Access Key ID and Secret Access Key values, navigate to the AWS IAM Users page, select the user you created, and open the Security credentials tab. Create an access key there, which gives you the Access Key ID and Secret Access Key as a pair. Store both values in a safe place, as you will not be able to retrieve the secret key again.

Configuring AWS Textract on Curiosity

In your Curiosity application, navigate to Settings > Data > OCR Settings, and enter the values for AWS Region, AWS Access Key ID and AWS Secret Access Key. Click Save to store your changes. The value you enter for AWS Region must be a valid region identifier such as us-west-1 or eu-central-1, not the region's display name.
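
If you want to sanity-check the region value before pasting it into the settings page, a minimal Python sketch along these lines can help. It is not part of Curiosity or of the AWS SDK: the function name looks_like_region_id and the pattern are assumptions made for illustration, and the check only looks at the general shape of the identifier. It does not confirm that the region exists or that AWS Textract is offered there.

```python
import re

# Assumed general shape of a region identifier: a two-letter prefix,
# one or more lowercase words, and a trailing number, joined by hyphens
# (for example "us-west-1" or "eu-central-1"). Format check only.
_REGION_PATTERN = re.compile(r"[a-z]{2}(-[a-z]+)+-\d+")


def looks_like_region_id(value: str) -> bool:
    """Return True if value has the general shape of a region identifier."""
    # Surrounding whitespace is ignored, since it is easy to copy by accident.
    return _REGION_PATTERN.fullmatch(value.strip()) is not None


if __name__ == "__main__":
    for candidate in ["us-west-1", "eu-central-1", "US West (N. California)"]:
        print(candidate, "->", looks_like_region_id(candidate))
```

A display name such as "US West (N. California)" fails the check, while identifiers such as us-west-1 pass it.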

If you have already added files to your system, you may want to reprocess them so that OCR is applied where needed. To do so, run the following query in the Shell interface, which sets the Indexed flag to false on the file entries:

Q().StartAt("_FileEntry").Tx().Set("Indexed", false).Commit();
