Look, I’m a backend engineer. I don’t have time to read through 40 pages of model cards before picking an API. I just need to know: which multimodal model handles my use case without breaking the bank or my sanity?
So I spent a weekend testing every model I could get my hands on via a unified endpoint (shout-out to Global API for not making me manage ten different provider keys). Here’s what I found, some code you can steal, and the honest trade-offs.
The Contenders
I stuck with the same lineup that’s been floating around the Hacker News threads lately. It’s mostly Chinese labs, because let’s be real, they’re the ones shipping open-weight multimodal models that actually compete. The full list (with output prices I didn’t invent):
| Model | Provider | Modalities | Output $/M tokens | Context window |
| --- | --- | --- | --- | --- |
| Qwen3-VL-32B | Qwen | Image + Text | $0.52 | 32K |
| Qwen3-VL-30B-A3B | Qwen | Image + Text | $0.52 | 32K |
| Qwen3-VL-8B | Qwen | Image + Text | $0.50 | 32K |
| Qwen3-Omni-30B | Qwen | Image + Audio + Video + T | | |
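To turn those per-million-token output prices into something you can budget against, here is a rough sketch. The three prices are copied from the list above; the workload numbers (requests per day, tokens per response) are made up for illustration, and input-token pricing is not covered here, so treat the result as a lower bound rather than a real invoice.

```python
# Output prices in USD per million tokens, copied from the list above.
# Qwen3-Omni-30B is left out because its price is not shown in this section.
OUTPUT_PRICE_PER_M = {
    "Qwen3-VL-32B": 0.52,
    "Qwen3-VL-30B-A3B": 0.52,
    "Qwen3-VL-8B": 0.50,
}


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Estimate output-token cost in USD (input tokens are NOT included)."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests/day, 500 output tokens each, 30 days.
    monthly_tokens = 10_000 * 500 * 30
    for model in OUTPUT_PRICE_PER_M:
        cost = output_cost_usd(model, monthly_tokens)
        print(f"{model}: ${cost:.2f}/month for output tokens")
```

At that assumed volume (150M output tokens a month) the gap between the $0.52 and $0.50 models is only a few dollars, so on output price alone the choice among these three barely matters.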