Skip to content

Multimodal Agents

MultiModalAgent (buddy.multimodal) extends Agent with structured handling for images, audio, and video alongside text. It defines typed analysis results and a fusion pipeline for reasoning across modalities.

Optional dependency

Vision and audio processing require the multimodal extra:

pip install "buddy-ai[multimodal]"

This installs Pillow, opencv-python, librosa, and soundfile. The feature is gated — confirm it with check_feature("multimodal").

from buddy import check_feature, MultiModalAgent, ModalityType, MultiModalResponse
assert check_feature("multimodal")

Modalities

ModalityType enumerates supported inputs:

Member Value
ModalityType.TEXT "text"
ModalityType.IMAGE "image"
ModalityType.AUDIO "audio"
ModalityType.VIDEO "video"

A MultiModalAgent always supports TEXT. Image, audio, and video become available when the corresponding model wrapper (vision_model, audio_model, video_model) is configured on the agent. Inspect what is enabled with the supported_modalities property:

from buddy.models.openai import OpenAIChat

agent = MultiModalAgent(model=OpenAIChat(id="gpt-4o"))
print(agent.supported_modalities)   # [ModalityType.TEXT] until others are set

Modality guards

Calling process_image, process_audio, or process_video without the matching model raises ValueError ("… processing not available. Configure *_model."). Configure the relevant model wrapper before invoking these methods.

Processing methods

Method Returns Analysis types
process_image(image, prompt, analysis_type) ImageAnalysis general, detailed, objects, text, faces
process_audio(audio, prompt, analysis_type) AudioAnalysis transcription, analysis, speaker, emotion
process_video(video, prompt, analysis_type) VideoAnalysis general, activities, objects, faces, audio
multimodal_understanding(inputs, prompt, fusion_strategy) MultiModalResponse early, late, hybrid

Inputs are flexible: images accept a file path, raw bytes, a data: URL, or a PIL Image; audio and video accept a path or bytes.

Result schemas

The analysis result models are Pydantic types you can rely on:

  • ImageAnalysisdescription, objects (ObjectDetection), faces (FaceDetection), text_content (OCRResult), scene_type, colors, tags, confidence_score.
  • AudioAnalysistranscription, segments (SpeechSegment), speakers, language, sentiment, emotions.
  • VideoAnalysisdescription, duration, frame_rate, resolution, activities (ActivityDetection), objects_timeline, faces_timeline, audio_analysis.
  • MultiModalResponsetext_response, modality_analyses, cross_modal_insights, confidence_score, processing_time.

Cross-modal understanding

multimodal_understanding runs each modality, optionally fuses them, and asks the agent's base model to produce a unified answer:

response = agent.multimodal_understanding(
    inputs={
        ModalityType.TEXT: "Describe what's happening in this scene.",
        ModalityType.IMAGE: "scene.jpg",
    },
    prompt="Summarize the scene for an accessibility caption.",
    fusion_strategy="hybrid",
)
print(response.text_response)
print(response.cross_modal_insights)

The modality_weights attribute controls how much each modality contributes to the overall confidence_score (text 1.0, video 0.9, image 0.8, audio 0.7 by default).

Provider wiring

The processing methods return fully typed analysis objects, but the actual detection/transcription quality depends on the model wrapper you attach. Plug a provider-backed vision/audio/video model into vision_model / audio_model / video_model to get real results rather than schema defaults.