
Build a multi-modal agent in 50 lines

7 min read · published 2026-04-22 · category: Agents

A small Python agent that uses vision, tool calling, and TTS — all through one AIgateway key, switching modalities by changing the model slug.

The promise of "every modality through one API" is hard to internalize until you write a small agent that uses three of them in 50 lines. Here's that agent.

It looks at an image (vision), decides what tool to call (tool calling), then speaks the answer (text-to-speech). One key, one client, three different models.

The agent

from openai import OpenAI
import json
import base64

client = OpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)

def look_at(image_path: str, question: str) -> str:
    """Ask a vision model about an image, forcing a structured answer via a tool call."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="anthropic/claude-opus-4.7",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{img_b64}"
                }},
            ],
        }],
        tools=[{
            "type": "function",
            "function": {
                "name": "answer",
                "description": "Return the final answer to the user.",
                "parameters": {
                    "type": "object",
                    "properties": {"text": {"type": "string"}},
                    "required": ["text"],
                },
            },
        }],
        tool_choice={"type": "function", "function": {"name": "answer"}},
    )
    args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
    return args["text"]

def speak(text: str, out_path: str = "out.mp3") -> None:
    """Stream TTS audio straight to a file."""
    with client.audio.speech.with_streaming_response.create(
        model="deepgram/aura-2-en",
        voice="orion",
        input=text,
    ) as r:
        r.stream_to_file(out_path)

if __name__ == "__main__":
    answer = look_at("photo.jpg", "What's in this picture?")
    print(answer)
    speak(answer)
    print("wrote out.mp3")
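
Because tool_choice forces a call to the answer function, look_at can index tool_calls[0] directly. If you ever relax that constraint, a defensive extractor is a small addition — a sketch; extract_answer is not part of the guide's code:

```python
import json
from types import SimpleNamespace

def extract_answer(message) -> str:
    # tool_choice forces a call to "answer", so tool_calls is normally set;
    # fall back to plain text content if the constraint is ever relaxed.
    if message.tool_calls:
        return json.loads(message.tool_calls[0].function.arguments)["text"]
    return message.content or ""
```

It works on anything shaped like the SDK's message object (a .tool_calls list and a .content string), e.g. resp.choices[0].message.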

Why this matters

Three different providers, three different request shapes, three different response shapes — collapsed into one client and one key. If you wanted to A/B Opus 4.7 against GPT-5.4 for the vision step, you change one string. If you wanted to swap Aura 2 for ElevenLabs, you change one string.
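
Concretely, the A/B is one config entry — a sketch; the variant names and the GPT-5.4 slug format are assumptions, and only the model string varies between them:

```python
VISION_MODELS = {
    "a": "anthropic/claude-opus-4.7",  # the slug used in the agent above
    "b": "openai/gpt-5.4",             # assumed slug format
}

def vision_request(variant: str, messages: list) -> dict:
    # The request shape is identical across providers; only the slug changes.
    return {"model": VISION_MODELS[variant], "messages": messages}
```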

The agent doesn't know or care which provider is upstream. That's the entire pitch of the aggregator: your code is portable across every model in the catalog.

Next

Add an x-aig-tag header per call so you can see what the vision step costs versus the TTS step. When a new vision model lands, run an eval against your real questions and swap it in if it wins.
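
A minimal way to tag each step — extra_headers is a standard per-request option on the OpenAI Python client; the tag values here are illustrative:

```python
def tagged(step: str) -> dict:
    # Per-step attribution header for the gateway's cost breakdown.
    return {"x-aig-tag": step}

# Usage (sketch):
# client.chat.completions.create(..., extra_headers=tagged("vision"))
# client.audio.speech.with_streaming_response.create(..., extra_headers=tagged("tts"))
```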
