KV cache memory calculator: how much does your LLM actually use?

Source: DEV Community
Before you can compress something, you need to know how big it is. Most engineers know the KV cache is "large," but few have actually calculated the exact number. This post gives you the formula, a table for popular models, and a one-liner to compute it yourself.

## The formula

KV cache bytes = 2 × L × H × d × T × 2

Where:

- 2 — one K tensor and one V tensor
- L — number of transformer layers
- H — number of attention heads (or KV heads for GQA models)
- d — head dimension (= hidden_size / num_heads)
- T — sequence length in tokens
- 2 — bytes per value in FP16

That's it. No approximation. This is the exact allocation.

## Memory table: popular models

**Llama-3-8B** (L=32, H=8 KV heads, d=128)

| Context | KV cache |
| --- | --- |
| 4K tokens | 0.5 GB |
| 32K tokens | 4 GB |
| 128K tokens | 16 GB |

**Mistral-7B** (L=32, H=8 KV heads, d=128)

| Context | KV cache |
| --- | --- |
| 4K tokens | 0.5 GB |
| 32K tokens | 4 GB |
| 128K tokens | 16 GB |

**Llama-3-70B** (L=80, H=8 KV heads, d=128)

| Context | KV cache |
| --- | --- |
| 4K tokens | 1.25 GB |
| 32K tokens | 10 GB |
| 128K tokens | 40 GB |

**Mixtral-8x7B** (L=32, H=8 KV heads, d=12
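The formula above is easy to turn into code. Here is a minimal sketch of the promised one-liner, wrapped in a helper for readability (the function name and argument order are my own, not from the post):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    # 2 (K and V tensors) x L x H x d x T x bytes per value (FP16 = 2)
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Llama-3-8B at a 4K context: L=32, 8 KV heads (GQA), d=128
size = kv_cache_bytes(32, 8, 128, 4096)
print(f"{size / 2**30:.2f} GB")  # prints "0.50 GB"
```

Swap in the `num_hidden_layers`, `num_key_value_heads`, and head-dimension values from any model's config to reproduce the tables above, or change `bytes_per_value` to 1 for FP8/INT8 caches.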