If you've used ChatGPT or Claude, you've probably noticed that the first token takes noticeably longer to appear, while the rest stream out almost instantly.
Behind the scenes, this is a deliberate engineering decision called KV caching, and its purpose is to make LLM inference faster.
Before we get into the technical details, here's a side-by-side comparison of LLM inference with and without KV caching:

[Video: side-by-side demo of LLM inference with vs. without KV caching]
Now let's understand how it works, from first principles.
Part 1: How LLMs generate tokens
The transformer processes all input tokens and produces a hidden state for each one. Those hidden states get projected into vocabulary space, producing logits (one score per word in the vocabulary).
But only the logits from the last token matter. You sample from them, get the next token, append it to the input, and repeat.
This is the key insight: to generate the next token, you only need the hidden state of the most recent token. Every other hidden state is an intermediate byproduct.
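To make this concrete, here's a minimal sketch of that generation loop in Python, assuming a Hugging Face-style causal LM (`gpt2` and greedy argmax decoding are just stand-ins for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works the same way; gpt2 is a small stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The first token is slow because", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        # The model produces logits for EVERY position in the sequence...
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

        # ...but only the last position's logits are used for the next token.
        # (Greedy argmax here for simplicity; real systems usually sample.)
        next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)

        # Append the new token and repeat: the ENTIRE sequence
        # is re-processed from scratch on every step.
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Notice the redundancy: each iteration re-runs the model over the whole sequence, even though only the last position's output is ever used. That repeated work is exactly what KV caching eliminates.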