# Multimodal Search Tutorial

This tutorial guides you through configuring and using multimodal search for images, audio, and video files in Curiosity Workspace.

## Overview

Multimodal search allows you to:

  • Search for text within images via optical character recognition (OCR).
  • Search for spoken words in audio and video via speech-to-text (STT) transcription.
  • Find visually similar images.

## Prerequisites

  • A running Curiosity Workspace.
  • Sample media files (images, MP3s, MP4s).
  • API keys for OCR or STT providers (if using external services).

## Step 1: Configuring OCR for Images

  1. Go to Admin → NLP Configuration.
  2. Enable the OCR Engine.
  3. Select the supported languages.
  4. Upload an image with text to verify that the text is extracted and searchable.
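Before relying on the workspace's extraction, you can preview what an OCR engine is likely to see. The sketch below is a minimal local check, assuming the open-source pytesseract library and a hypothetical sample.png; it is not the engine Curiosity Workspace uses internally:

```python
# Minimal local OCR sanity check (a sketch, not Curiosity Workspace's
# internal engine). Requires the Tesseract binary plus the pytesseract
# and Pillow packages; "sample.png" is a hypothetical test image.
from PIL import Image
import pytesseract

image = Image.open("sample.png")

# "lang" should match one of the languages you enabled in
# Admin -> NLP Configuration (e.g., "eng" for English).
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```

If this local pass already returns garbled text, the image itself is likely the problem rather than the workspace configuration.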

## Step 2: Configuring Speech-to-Text (STT)

  1. Go to Admin → AI Integrations.
  2. Configure a Speech-to-Text provider (e.g., OpenAI Whisper).
  3. Specify the audio/video formats to process.
  4. Upload an audio file. Once processed, you should be able to search for phrases within the transcript.
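To illustrate what the provider does with your media, the sketch below transcribes a file with OpenAI's hosted Whisper API via the official openai Python package. The file name meeting.mp3 and the hosted-API setup are assumptions; your configured provider may run locally or differ in detail:

```python
# Sketch: transcribing an audio file with OpenAI's hosted Whisper API.
# Assumes the OPENAI_API_KEY environment variable is set and that
# "meeting.mp3" exists; both are placeholders for this example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# The returned text is what becomes searchable once the workspace
# has processed the upload.
print(transcript.text)
```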

## Step 3: Visual Similarity Search

  1. Ensure an embedding model for images is configured.
  2. Upload images to the workspace.
  3. Use the Similar Images feature in the UI to find visually related assets.
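Under the hood, visual similarity search generally means embedding every image into a vector and ranking candidates by cosine similarity to a query image. The sketch below illustrates the principle with the sentence-transformers CLIP wrapper; the model name clip-ViT-B-32 and the file names are assumptions, not the workspace's internal pipeline:

```python
# Sketch: the core idea behind visual similarity search. Images are
# embedded into vectors; visually similar images score close to 1.0
# under cosine similarity. File names are placeholders.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed a query image and a set of candidate images.
query = model.encode(Image.open("query.jpg"))
candidates = model.encode([Image.open("a.jpg"), Image.open("b.jpg")])

# One similarity score per candidate; higher means more similar.
print(util.cos_sim(query, candidates))
```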

## Troubleshooting

  • Low OCR Quality: Ensure images have sufficient resolution and clear, high-contrast text; a preprocessing sketch follows this list.
  • Processing Latency: Media processing is resource-intensive; monitor CPU and memory usage while large batches are being ingested.
  • Unsupported Formats: Verify that your files match the supported formats listed in the Multimodal Search Overview.
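When OCR quality is low, basic preprocessing before upload often helps. The sketch below converts an image to grayscale, upscales it, and binarizes it with Pillow; the threshold of 160 and the file names are assumptions to tune for your own images:

```python
# Sketch: basic image cleanup before OCR. Upscaling and binarizing
# often improve recognition of small or low-contrast text.
from PIL import Image

img = Image.open("scan.png").convert("L")  # grayscale

# Double the resolution so small glyphs have more pixels to work with.
img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)

# Binarize: pixels brighter than the (tunable) threshold become white,
# the rest black, sharpening the text/background contrast.
img = img.point(lambda p: 255 if p > 160 else 0)

img.save("scan_clean.png")
```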