LLMs are significantly cheaper if you only append the tokens.
- If you only append tokens, you can reuse the KV cache built up on earlier runs instead of recomputing keys and values for the entire prefix.
- That can be a quadratic saving over the whole generation: without the cache, producing token n means re-running the forward pass over tokens 0 through n-1, so N tokens cost on the order of N²/2 token passes instead of N (see the sketches after this list).
- That's one of the reasons various UIs lean on chat as their core interaction model: a chat transcript is append-only, so every request stays naturally within the cache.
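
To make the mechanism concrete, here is a toy single-head decode step in NumPy. The names (`decode_step`, `Wq`, the dimension `D`) are invented for the sketch, and it flattens a lot: real implementations cache per layer and per head, and in a deep model the cached keys/values at higher layers also encode the prefix, which is why editing anything earlier invalidates the cache from that point on. But it shows the core trick: the only new work per appended token is its own projections, while earlier keys and values come straight out of the cache.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size; Wq/Wk/Wv stand in for trained weights
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

def decode_step(x, k_cache, v_cache):
    """Single-head attention for one appended token.

    Only the new token is projected; keys/values for every earlier
    token are reused from the cache rather than recomputed.
    """
    q = x @ Wq
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    w = np.exp(K @ q / np.sqrt(D))    # attention scores over the whole prefix
    return (w / w.sum()) @ V          # softmax-weighted mix of cached values

k_cache, v_cache = [], []
for x in rng.normal(size=(6, D)):     # stand-ins for embedded tokens
    out = decode_step(x, k_cache, v_cache)
```

And the quadratic gap itself is just the arithmetic of reprocessing the prefix, counted in token forward passes:

```python
N = 1000  # tokens generated

# No cache: step n re-runs the forward pass over all n tokens so far.
without_cache = N * (N + 1) // 2   # 1 + 2 + ... + N = 500500 token passes

# Append-only KV cache: each step processes only the newest token.
with_cache = N                     # 1000 token passes
```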