Message Types

Enum: MessageType

Defined in messaging/base.py:

Value   Description
TEXT    Plain text message
IMAGE   Image message
FILE    Document/file attachment
AUDIO   Audio/voice message
VIDEO   Video message
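
The actual definition in messaging/base.py is not reproduced on this page. As a minimal sketch of what such an enum could look like, assuming lowercase string values (the member names come from the table, the values are hypothetical):

```python
from enum import Enum


class MessageType(Enum):
    """Kinds of inbound message, mirroring the table above."""

    # The string values are assumptions; only the member names are documented.
    TEXT = "text"    # plain text message
    IMAGE = "image"  # image message
    FILE = "file"    # document/file attachment
    AUDIO = "audio"  # audio/voice message
    VIDEO = "video"  # video message
```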

Processing Rules

AUDIO

  • If MessageType.AUDIO AND message.text is empty → audio is transcribed via Gemini 2.5 Flash (Vertex AI), and the transcript replaces message.text
  • If MessageType.AUDIO AND message.text is present → audio is IGNORED, text is processed as-is
  • Metadata keys: audio_data (bytes), audio_mime_type (str)
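
The two AUDIO rules can be read as a single branch on whether text is present. The sketch below illustrates that branch only; the `Message` dataclass, the `resolve_audio_text` name, and the injected `transcribe` callable are hypothetical stand-ins, not the project's real classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Message:
    # Hypothetical stand-in for the real message class.
    text: Optional[str] = ""
    metadata: dict = field(default_factory=dict)


def resolve_audio_text(
    message: Message,
    transcribe: Callable[[bytes, str], Optional[str]],
) -> Optional[str]:
    """Return the text to process for an AUDIO message."""
    if message.text:
        # Text present: the audio is ignored and the text is used as-is.
        return message.text
    # Text empty: the transcript takes the place of message.text.
    return transcribe(
        message.metadata["audio_data"],
        message.metadata["audio_mime_type"],
    )
```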

FILE

  • FILE (document) messages keep their text, if any; the text is treated as a question about the file
  • If no text, default prompt is "Analise este arquivo."
  • File bytes are passed as Part.from_bytes() to ADK — Gemini 2.5 Flash natively processes PDFs, images, CSVs, DOCX, etc.
  • Text is passed as Part.from_text()
  • Metadata keys: file_data (bytes), file_mime_type (str), file_name (str)
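
As a rough illustration of the FILE rules, the helper below picks the prompt and pulls the payload out of the metadata. The function name is hypothetical; the comment at the end shows how the result might be wrapped if the parts come from `google.genai.types`, which is an assumption about the project's imports.

```python
from typing import Optional, Tuple

DEFAULT_FILE_PROMPT = "Analise este arquivo."


def file_prompt_and_payload(
    text: Optional[str], metadata: dict
) -> Tuple[str, bytes, str]:
    """Return (prompt, file bytes, MIME type) for a FILE message."""
    # The user's text is kept as the question; otherwise use the default prompt.
    prompt = text if text else DEFAULT_FILE_PROMPT
    return prompt, metadata["file_data"], metadata["file_mime_type"]


# Assuming google.genai.types is the source of Part, the parts could then be
# built roughly as:
#   types.Part.from_text(text=prompt)
#   types.Part.from_bytes(data=file_bytes, mime_type=mime_type)
```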

TEXT

  • Processed normally as the user's message

Adapter Implementation

WhatsApp Official (audio/file)

  • Detects type:"audio" → calls _download_media(media_id) → 2 API calls (GET media info → GET download URL)
  • Detects type:"document" → same _download_media() flow
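
The two-call download can be sketched as follows. This is not the adapter's actual `_download_media()`; the Graph API version in the base URL, the function names, and the injectable `get` parameter (added so the flow can be exercised without network access) are assumptions.

```python
import json
import urllib.request
from typing import Callable, Tuple

GRAPH_BASE = "https://graph.facebook.com/v19.0"  # API version is an assumption


def _get(url: str, token: str) -> bytes:
    """Authenticated GET returning the raw response body."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def download_media(
    media_id: str, token: str, get: Callable[[str, str], bytes] = _get
) -> Tuple[bytes, str]:
    """Return (media bytes, MIME type) for a WhatsApp media ID."""
    # Call 1: media info, which includes a download URL and the MIME type.
    info = json.loads(get(f"{GRAPH_BASE}/{media_id}", token))
    # Call 2: fetch the bytes from that URL with the same bearer token.
    data = get(info["url"], token)
    return data, info.get("mime_type", "application/octet-stream")
```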

WhatsApp Evolution (audio/file)

  • Detects messageType:"audioMessage" or "ptt" → calls _download_audio_media() via Evolution /message/getMedia endpoint
  • Detects messageType:"documentMessage" → calls _download_media_message()

Telegram (audio/file)

  • Detects message.voice or message.audio → calls _download_audio(file_id) via bot.get_file() — only downloads if no text present
  • Detects message.document → calls _download_file(file_id) — keeps both text and file
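
The Telegram decision (skip the audio download when text is present, always download documents) might look like the sketch below. It works on the message as a plain dict in Bot API JSON shape; treating a caption as text and the `plan_telegram_download` name are assumptions, not the adapter's confirmed behaviour.

```python
from typing import Optional, Tuple


def plan_telegram_download(message: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (kind, file_id) for the media to download, or (None, None)."""
    # Assumption: a caption counts as text for the purpose of this rule.
    text = message.get("text") or message.get("caption") or ""

    media = message.get("voice") or message.get("audio")
    if media:
        # Audio is only downloaded when there is no text to process.
        if text:
            return (None, None)
        return ("audio", media["file_id"])

    document = message.get("document")
    if document:
        # Documents are downloaded regardless; text and file are both kept.
        return ("file", document["file_id"])

    return (None, None)
```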

WebChat (audio/file)

  • Detects audio_data (base64) + audio_mime_type in JSON body → decodes via _decode_audio()
  • Detects file_data (base64) + file_mime_type + file_name → decodes via _decode_file() (max 20MB)
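
A minimal sketch of the file-decoding step, assuming the 20MB limit applies to the decoded bytes and is counted in binary megabytes. The function below is illustrative and is not the adapter's actual `_decode_file()`; returning `None` on any rejection is also an assumption.

```python
import base64
import binascii
from typing import Optional

MAX_FILE_BYTES = 20 * 1024 * 1024  # 20MB, assumed to mean binary megabytes


def decode_file(body: dict) -> Optional[dict]:
    """Decode a WebChat file upload into metadata keys, or None if rejected."""
    required = ("file_data", "file_mime_type", "file_name")
    if not all(body.get(key) for key in required):
        return None  # not a file message
    try:
        # validate=True rejects characters outside the base64 alphabet.
        raw = base64.b64decode(body["file_data"], validate=True)
    except (binascii.Error, ValueError):
        return None
    if len(raw) > MAX_FILE_BYTES:
        return None  # over the size limit
    return {
        "file_data": raw,
        "file_mime_type": body["file_mime_type"],
        "file_name": body["file_name"],
    }
```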

Audio Transcriber

File: messaging/audio/transcriber.py

Uses Gemini 2.5 Flash via Vertex AI (not GCP Speech-to-Text):

  • Direct API call (not through ADK) to avoid audio token cost in main conversation
  • Language hint detection: checks first 200 bytes for Portuguese accent chars (ãáàâãéêíóôõúç) → hints ["pt-BR", "en-US", "es-ES"]
  • Max file size: 10MB (configurable via AUDIO_MAX_BYTES)
  • Returns None on failure (processor falls back to error message)
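
The size limit and the None-on-failure contract can be sketched as a thin wrapper around the model call. Reading AUDIO_MAX_BYTES from an environment variable, the function names, and the injected `call_model` callable are all assumptions; the real Vertex AI call is deliberately left out.

```python
import os
from typing import Callable, Optional

DEFAULT_AUDIO_MAX_BYTES = 10 * 1024 * 1024  # 10MB default


def max_audio_bytes() -> int:
    # Assumption: AUDIO_MAX_BYTES is read from the environment.
    return int(os.environ.get("AUDIO_MAX_BYTES", DEFAULT_AUDIO_MAX_BYTES))


def safe_transcribe(
    audio: bytes,
    mime_type: str,
    call_model: Callable[[bytes, str], Optional[str]],
) -> Optional[str]:
    """Return the transcript, or None if the audio is too large or the call fails."""
    if len(audio) > max_audio_bytes():
        return None
    try:
        transcript = call_model(audio, mime_type)
    except Exception:
        # Any failure is reported as None; the caller decides how to respond.
        return None
    return transcript or None
```

Returning None rather than raising keeps the error handling in one place: the processor only has to check for a missing transcript before falling back to its error message.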