# Multimodal Search Tutorial

This tutorial guides you through configuring and using multimodal search for images, audio, and video files in Curiosity Workspace.

## Overview

Multimodal search allows you to:

  • Search for text within images via optical character recognition (OCR).
  • Search for spoken words in audio and video via speech-to-text (STT) transcription.
  • Find visually similar images.

## Prerequisites

  • A running Curiosity Workspace.
  • Sample media files (images, MP3s, MP4s).
  • API keys for OCR or STT providers (if using external services).

## Step 1: Configuring OCR for Images

  1. Go to Admin → NLP Configuration.
  2. Enable the OCR Engine.
  3. Select the supported languages.
  4. Upload an image with text to verify that the text is extracted and searchable.
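Before relying on the workspace's extraction, you can preview what an OCR engine is likely to see. The sketch below is a minimal local check, assuming the open-source pytesseract library and a hypothetical sample.png; it is not the engine Curiosity Workspace uses internally:

```python
# Minimal local OCR sanity check (a sketch, not Curiosity Workspace's
# internal engine). Requires the Tesseract binary plus the pytesseract
# and Pillow packages; "sample.png" is a hypothetical test image.
from PIL import Image
import pytesseract

image = Image.open("sample.png")

# "lang" should match one of the languages you enabled in
# Admin -> NLP Configuration (e.g., "eng" for English).
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```

If this local pass already returns garbled text, the image itself is likely the problem rather than the workspace configuration.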

## Step 2: Configuring Speech-to-Text (STT)

  1. Go to Admin → AI Integrations.
  2. Configure a Speech-to-Text provider (e.g., OpenAI Whisper).
  3. Specify the audio/video formats to process.
  4. Upload an audio file. Once processed, you should be able to search for phrases within the transcript.
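To illustrate what the provider does with your media, the sketch below transcribes a file with OpenAI's hosted Whisper API via the official openai Python package. The file name meeting.mp3 and the hosted-API setup are assumptions; your configured provider may run locally or differ in detail:

```python
# Sketch: transcribing an audio file with OpenAI's hosted Whisper API.
# Assumes the OPENAI_API_KEY environment variable is set and that
# "meeting.mp3" exists; both are placeholders for this example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# The returned text is what becomes searchable once the workspace
# has processed the upload.
print(transcript.text)
```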

## Step 3: Visual Similarity Search

  1. Ensure an embedding model for images is configured.
  2. Upload images to the workspace.
  3. Use the Similar Images feature in the UI to find visually related assets.
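Under the hood, visual similarity search generally means embedding every image into a vector and ranking candidates by cosine similarity to a query image. The sketch below illustrates the principle with the sentence-transformers CLIP wrapper; the model name clip-ViT-B-32 and the file names are assumptions, not the workspace's internal pipeline:

```python
# Sketch: the core idea behind visual similarity search. Images are
# embedded into vectors; visually similar images score close to 1.0
# under cosine similarity. File names are placeholders.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed a query image and a set of candidate images.
query = model.encode(Image.open("query.jpg"))
candidates = model.encode([Image.open("a.jpg"), Image.open("b.jpg")])

# One similarity score per candidate; higher means more similar.
print(util.cos_sim(query, candidates))
```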

## Troubleshooting

  • Low OCR Quality: Ensure images have sufficient resolution and clear, high-contrast text; a preprocessing sketch follows this list.
  • Processing Latency: Media processing is resource-intensive; monitor CPU and memory usage while large batches are being ingested.
  • Unsupported Formats: Verify that your files match the supported formats listed in the Multimodal Search Overview.
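When OCR quality is low, basic preprocessing before upload often helps. The sketch below converts an image to grayscale, upscales it, and binarizes it with Pillow; the threshold of 160 and the file names are assumptions to tune for your own images:

```python
# Sketch: basic image cleanup before OCR. Upscaling and binarizing
# often improve recognition of small or low-contrast text.
from PIL import Image

img = Image.open("scan.png").convert("L")  # grayscale

# Double the resolution so small glyphs have more pixels to work with.
img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)

# Binarize: pixels brighter than the (tunable) threshold become white,
# the rest black, sharpening the text/background contrast.
img = img.point(lambda p: 255 if p > 160 else 0)

img.save("scan_clean.png")
```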