
Build a multi-modal agent in 50 lines

7 min read · published 2026-04-22 · category: Agents

A small Python agent that uses vision, tool calling, and TTS — all through one AIgateway key, switching modalities by changing the model slug.

The promise of "every modality through one API" is hard to internalize until you write a small agent that uses three of them in 50 lines. Here's that agent.

It looks at an image (vision), decides what tool to call (tool calling), then speaks the answer (text-to-speech). One key, one client, three different models.

The agent

from openai import OpenAI
import json
import base64

client = OpenAI(
    base_url="https://api.aigateway.sh/v1",
    api_key="sk-aig-...",
)

def look_at(image_path: str, question: str) -> str:
    """Ask a vision model about an image, forcing a structured answer via a tool call."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="anthropic/claude-opus-4.7",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{img_b64}"
                }},
            ],
        }],
        tools=[{
            "type": "function",
            "function": {
                "name": "answer",
                "description": "Return the final answer to the user.",
                "parameters": {
                    "type": "object",
                    "properties": {"text": {"type": "string"}},
                    "required": ["text"],
                },
            },
        }],
        tool_choice={"type": "function", "function": {"name": "answer"}},
    )
    args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
    return args["text"]

def speak(text: str, out_path: str = "out.mp3") -> None:
    """Stream TTS audio straight to a file."""
    with client.audio.speech.with_streaming_response.create(
        model="deepgram/aura-2-en",
        voice="orion",
        input=text,
    ) as r:
        r.stream_to_file(out_path)

if __name__ == "__main__":
    answer = look_at("photo.jpg", "What's in this picture?")
    print(answer)
    speak(answer)
    print("wrote out.mp3")
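
Because tool_choice forces a call to the answer function, look_at can index tool_calls[0] directly. If you ever relax that constraint, a defensive extractor is a small addition — a sketch; extract_answer is not part of the guide's code:

```python
import json
from types import SimpleNamespace

def extract_answer(message) -> str:
    # tool_choice forces a call to "answer", so tool_calls is normally set;
    # fall back to plain text content if the constraint is ever relaxed.
    if message.tool_calls:
        return json.loads(message.tool_calls[0].function.arguments)["text"]
    return message.content or ""
```

It works on anything shaped like the SDK's message object (a .tool_calls list and a .content string), e.g. resp.choices[0].message.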

Why this matters

Three different providers, three different request shapes, three different response shapes — collapsed into one client and one key. If you wanted to A/B Opus 4.7 against GPT-5.4 for the vision step, you change one string. If you wanted to swap Aura 2 for ElevenLabs, you change one string.
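
Concretely, the A/B is one config entry — a sketch; the variant names and the GPT-5.4 slug format are assumptions, and only the model string varies between them:

```python
VISION_MODELS = {
    "a": "anthropic/claude-opus-4.7",  # the slug used in the agent above
    "b": "openai/gpt-5.4",             # assumed slug format
}

def vision_request(variant: str, messages: list) -> dict:
    # The request shape is identical across providers; only the slug changes.
    return {"model": VISION_MODELS[variant], "messages": messages}
```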

The agent doesn't know or care which provider is upstream. That's the entire pitch of the aggregator: your code is portable across every model in the catalog.

Next

Add an x-aig-tag header per call so you can see what the vision step costs versus the TTS step. When a new vision model lands, run an eval against your real questions and swap it in if it wins.
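
A minimal way to tag each step — extra_headers is a standard per-request option on the OpenAI Python client; the tag values here are illustrative:

```python
def tagged(step: str) -> dict:
    # Per-step attribution header for the gateway's cost breakdown.
    return {"x-aig-tag": step}

# Usage (sketch):
# client.chat.completions.create(..., extra_headers=tagged("vision"))
# client.audio.speech.with_streaming_response.create(..., extra_headers=tagged("tts"))
```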
