Curiosity automatically extracts the full text content for search indexing from the following file types:

  • PDF documents (.pdf)
  • PostScript documents (.ps)
  • Word documents (.doc, .docx, .dot, .dotx, .docm, .dotm)
  • PowerPoint slides (.ppt, .pptx, .pps, .ppsx, .potx, .ppam, .ppsm, .pptm, .potm, .ppa)
  • Excel sheets (.xls, .xlt, .xla, .xlsx, .xltx, .xlsm, .xltm, .xlam, .xlsb)
  • Email files (.msg, .eml)
  • Visio drawings (.vsd)
  • Open XML Paper documents (.xps)
  • Images (.gif, .jpg, .png, .bmp, .tiff), with OCR support if configured
  • AutoCAD drawings (.dxf, .dwg)
  • Photoshop images (.psd)
  • Webpages (.html)
  • Plain text (.txt)
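
If you want to check in advance whether a file will be picked up for full-text extraction, the list above can be expressed as a simple lookup. The following is only a minimal sketch: the names `INDEXED_EXTENSIONS` and `indexed_category` are hypothetical and are not part of Curiosity, and the table simply mirrors the extensions listed above.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the list above. The dictionary and the function
# below are illustrative helpers, not part of Curiosity itself.
INDEXED_EXTENSIONS = {
    "PDF documents": {".pdf"},
    "PostScript documents": {".ps"},
    "Word documents": {".doc", ".docx", ".dot", ".dotx", ".docm", ".dotm"},
    "PowerPoint slides": {".ppt", ".pptx", ".pps", ".ppsx", ".potx",
                          ".ppam", ".ppsm", ".pptm", ".potm", ".ppa"},
    "Excel sheets": {".xls", ".xlt", ".xla", ".xlsx", ".xltx",
                     ".xlsm", ".xltm", ".xlam", ".xlsb"},
    "Email files": {".msg", ".eml"},
    "Visio drawings": {".vsd"},
    "Open XML Paper documents": {".xps"},
    "Images": {".gif", ".jpg", ".png", ".bmp", ".tiff"},
    "AutoCAD drawings": {".dxf", ".dwg"},
    "Photoshop images": {".psd"},
    "Webpages": {".html"},
    "Plain text": {".txt"},
}


def indexed_category(filename: str) -> Optional[str]:
    """Return the category from the list above, or None if the extension is not listed."""
    suffix = Path(filename).suffix.lower()  # compare case-insensitively
    for category, extensions in INDEXED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

For example, `indexed_category("Report.PDF")` returns `"PDF documents"`, while `indexed_category("archive.zip")` returns `None` because .zip is not in the list. The sketch matches only the extensions exactly as listed, so variants that the list does not mention are treated as unsupported.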

For non-PDF file types, Curiosity also generates thumbnails and, where supported, a PDF preview, so you can view the file content in the browser without downloading the file to your computer.

Search on images using OCR

If your Curiosity system has been configured to perform OCR (using one of the supported services: AWS Textract, Azure Cloud Vision, or On-Premise Azure Cloud Vision), you can also search the content of images and scanned PDFs once they have been processed. Curiosity automatically processes your files and enriches the PDF previews with the extracted text, so that their content is searchable within your system just like any other document.
