Author: Jia Jingqiu - English web edition: 2026-07-01
Original canonical version: https://www.jiajingqiu.com/agent-skills-multimodal/
Abstract
Multimodal agents are moving from "can call tools" toward "can execute complex workflows reliably." In production, the bottleneck is often not model capability but skill design: when a skill should trigger, how much context it should load, which steps it should follow, how it should choose branches, how it should avoid stale instructions, and how success should be evaluated. Inspired by Matt Pocock's framework for writing great agent skills, this paper proposes a multimodal multi-skill architecture for commerce workflows. Instead of relying on a single large skill, a production system should compose small, inspectable skills for product truth, evidence gating, routing, keyframes, generation prompts, video QA, publication acceptance, and failure memory. The goal is not more instructions but a more predictable process.
1. Thesis: