I gave GLM-4.5-Air (106B, open weights) 12 coding tasks through opencode on my RTX 3090. It scored 0% — never edited a single file.
Same model, same GPU, same tasks, but driven by a ~150-line LangGraph agent instead: 93%.
The model was never the problem. The orchestrator was. Here's the benchmark — including the part nobody else measures, the electricity cost per correct task.
Setup
RTX 3090 (24 GB) + 128 GB RAM, models via ollama, Q4 quants, temp 0.2
5 recent open models × 2 orchestrators (opencode vs custom LangGraph ReAct with ollama-native tool-calling)
17 graded tasks (12 coding in Python/JS/C++ + 5 general-agent) with hidden unit tests
Every run priced in GPU watts via my open-source homelab-monitor
Results
Model
tok/s
opencode adh.
LangGraph adh.
LangGraph coding
LangGraph general
Qwen3-Coder 30B-A3B
130
92%
100%
100%
100%
GLM-4.5-Air 106B
5.7
0%
100%
89%
100%
Devstral Small 24B
49
8%
53%
8%
40%
Seed-OSS 36B
9.5
0%
7%
0%
20%
Discussion
Say something first
It all starts with you—share your thoughts now.