# Multimodal Search Tutorial
This tutorial guides you through configuring and using multimodal search for images, audio, and video files in Curiosity Workspace.
## Overview
Multimodal search allows you to:
- Search for text within images (OCR).
- Search for spoken words in audio and video (Speech-to-Text).
- Find visually similar images.
## Prerequisites
- A running Curiosity Workspace.
- Sample media files (images, MP3s, MP4s).
- API keys for OCR or STT providers (if using external services).
## Step 1: Configuring OCR for Images
- Go to Admin → NLP Configuration.
- Enable the OCR Engine.
- Select the supported languages.
- Upload an image with text to verify that the text is extracted and searchable (the sketch after this list shows what the extraction step produces).
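
Curiosity's OCR engine performs the extraction automatically once enabled. For a sense of what that step produces, here is a minimal sketch using the open-source Tesseract engine via `pytesseract`; the library choice, language code, and file name are assumptions for illustration, not Curiosity's internal implementation.

```python
# Illustrative OCR sketch using open-source Tesseract via pytesseract.
# NOT Curiosity's internal engine -- just the extract-then-index idea.
# Requires: pip install pytesseract pillow, plus a local Tesseract install.
from PIL import Image
import pytesseract

image = Image.open("sample_invoice.png")  # hypothetical sample file

# Extract text; `lang` mirrors the language selection made in Step 1.
text = pytesseract.image_to_string(image, lang="eng")

print(text)  # this extracted text is what becomes searchable
```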
## Step 2: Configuring Speech-to-Text (STT)
- Go to Admin → AI Integrations.
- Configure a Speech-to-Text provider (e.g., OpenAI Whisper).
- Specify the audio/video formats to process.
- Upload an audio file. Once processed, you should be able to search for phrases within the transcript (see the sketch after this list for what the transcription step yields).
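
The workspace invokes the configured STT provider for you. As a rough illustration of what the transcription step yields, here is a minimal sketch using the open-source `openai-whisper` package; the model size and file name are assumptions, and a hosted Whisper integration may behave differently.

```python
# Illustrative transcription sketch using the open-source openai-whisper
# package (pip install openai-whisper; requires ffmpeg on the system).
# The workspace calls its configured STT provider for you -- this only
# shows what that processing step produces.
import whisper

model = whisper.load_model("base")  # model size chosen here is an assumption

# Transcribe a sample file; works for any audio/video container ffmpeg reads.
result = model.transcribe("sample_meeting.mp3")  # hypothetical sample file

print(result["text"])  # the transcript that becomes searchable
```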
## Step 3: Visual Similarity Search
- Ensure an embedding model for images is configured.
- Upload images to the workspace.
- Use the Similar Images feature in the UI to find visually related assets (the sketch after this list illustrates the underlying embedding comparison).
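
Under the hood, visual similarity search compares image embeddings. The sketch below illustrates the idea with a CLIP model exposed through `sentence-transformers`; the model name and file names are assumptions and need not match the embedding model configured in your workspace.

```python
# Illustrative embedding-based image similarity using sentence-transformers'
# CLIP wrapper (pip install sentence-transformers pillow). Model name and
# file names are assumptions; your configured embedding model may differ.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed a query image (index 0) and a small candidate set.
filenames = ["query.jpg", "cat_a.jpg", "cat_b.jpg", "beach.jpg"]
embeddings = model.encode([Image.open(f) for f in filenames])

# Cosine similarity of the query against each candidate.
scores = util.cos_sim(embeddings[0], embeddings[1:])
for name, score in zip(filenames[1:], scores[0]):
    print(f"{name}: {score.item():.3f}")  # higher = more visually similar
```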
## Troubleshooting
- Low OCR Quality: Ensure images have sufficient resolution and clear text; a rough preflight check is sketched after this list.
- Processing Latency: Media processing can be resource-intensive; monitor your system's performance.
- Unsupported Formats: Verify that your files match the supported formats listed in the Multimodal Search Overview.
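
As a rough preflight for the resolution issue above, a script like the following can flag images that are likely too small for reliable OCR before you upload them; the pixel threshold is an assumption to tune against your own documents.

```python
# Rough preflight check for OCR-friendly images using Pillow
# (pip install pillow). The 1000-pixel threshold is an assumption --
# adjust it to the kind of documents you actually index.
from PIL import Image

MIN_DIMENSION = 1000  # hypothetical minimum edge length for legible OCR

def ocr_friendly(path: str) -> bool:
    """Return True if the image is likely large enough for decent OCR."""
    with Image.open(path) as img:
        width, height = img.size
    return min(width, height) >= MIN_DIMENSION

print(ocr_friendly("sample_invoice.png"))  # hypothetical sample file
```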