# Multimodal Search
Curiosity Workspace includes capabilities to extract searchable text from images, audio, and video files using Optical Character Recognition (OCR) and Speech-to-Text (STT).
## Optical Character Recognition (OCR)
OCR extracts text from scanned documents and images, making them searchable within the workspace.
### Supported Image Formats
Curiosity can process the following image types for OCR:
- Common Images: .png, .jpg, .jpeg, .gif, .bmp, .webp, .svg
- Professional Formats: .tif, .tiff, .dng, .raw, .heic, .heif, .psb
- Other: .odg, .otg, .odi
- PDF Scans: .pdf files containing image-only content.
### Multi-Language Support
The OCR engine supports multiple languages, including English, French, Spanish, German, and Portuguese.
## Speech-to-Text (STT)
STT converts spoken language from audio and video files into searchable text. Curiosity leverages Whisper models for high-accuracy transcription.
### Supported Video Formats
.mp4, .wmv, .mpeg, .avi, .mkv, .mov, .ogv, .3gp, .flv
### Supported Audio Formats
.mp3, .wav, .mka, .wma, .flac, .aac, .aiff, .m4a, .oga, .weba, .webm
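
Before uploading a batch of files, it can help to check which ones fall into the supported lists on this page. The following is a minimal sketch in Python, not part of Curiosity Workspace itself: the function name `extraction_kind` and its return labels are hypothetical, and the extension sets are simply copied from the lists above. It only inspects the file extension, so it cannot tell whether a PDF is actually image-only.

```python
from pathlib import Path
from typing import Optional

# Extension sets copied from the format lists on this page.
OCR_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".svg",
    ".tif", ".tiff", ".dng", ".raw", ".heic", ".heif", ".psb",
    ".odg", ".otg", ".odi",
}
VIDEO_EXTENSIONS = {
    ".mp4", ".wmv", ".mpeg", ".avi", ".mkv", ".mov", ".ogv", ".3gp", ".flv",
}
AUDIO_EXTENSIONS = {
    ".mp3", ".wav", ".mka", ".wma", ".flac", ".aac", ".aiff", ".m4a",
    ".oga", ".weba", ".webm",
}


def extraction_kind(filename: str) -> Optional[str]:
    """Hypothetical helper: guess the extraction path from the extension."""
    ext = Path(filename).suffix.lower()  # case-insensitive match
    if ext in OCR_EXTENSIONS:
        return "ocr"
    if ext == ".pdf":
        # Only image-only PDFs are listed as OCR inputs; the extension
        # alone cannot confirm that, so this case is flagged separately.
        return "ocr-if-scanned"
    if ext in VIDEO_EXTENSIONS:
        return "stt-video"
    if ext in AUDIO_EXTENSIONS:
        return "stt-audio"
    return None  # not in any list on this page
```

For example, `extraction_kind("meeting.mkv")` returns `"stt-video"`, while a file such as `notes.txt` returns `None`.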
### Features
- Searchable Transcripts: Find specific words spoken within a video or audio file.
- Timestamped Navigation: Jump directly to the relevant part of the media file from search results.
- Language Detection: Automatically recognizes and transcribes a broad range of languages.
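
To illustrate how searchable transcripts and timestamped navigation fit together, here is a small Python sketch. It is an assumption about the general idea, not a description of how Curiosity stores or queries transcripts: the `Segment` type, the function names, and the sample data are all hypothetical. A transcript is modeled as a list of timed segments, a query is matched against each segment's text, and the start time of a hit is formatted as a jump target.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical transcript segment with times in seconds."""
    start: float
    end: float
    text: str


def find_in_transcript(segments: List[Segment], query: str) -> List[Segment]:
    """Return segments whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [s for s in segments if needle in s.text.casefold()]


def format_timestamp(seconds: float) -> str:
    """Format a position in seconds as HH:MM:SS (fractions are dropped)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Made-up sample data for illustration only.
segments = [
    Segment(0.0, 4.2, "Welcome to the quarterly review."),
    Segment(4.2, 9.8, "Revenue grew in the second quarter."),
    Segment(75.5, 81.0, "Next, the hiring plan."),
]

for hit in find_in_transcript(segments, "hiring"):
    print(format_timestamp(hit.start), hit.text)
```

Each hit carries its own start time, which is what lets a search result link to a position inside the media file rather than to the file as a whole.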