← back

KV Cache Memory Budget and Eviction Policy

#435 · Deep Learning · Medium

⊣ Solve on deep-ml.com

Problem

Implement a KV cache memory budget manager with eviction policies. Given a fixed memory budget (in bytes), manage KV cache entries for multiple sequences. Support eviction policies: LRU (least recently used), LFU (least frequently used), and a token-importance-based policy. Track cache utilization and eviction statistics.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
def kv_cache_manager(
    memory_budget_bytes: int,
    entry_size_bytes: int,
    operations: list[dict],
    eviction_policy: str = "lru"
) -> dict:
    """
    operations: list of {"type": "read"|"write"|"evict_all", "key": str, ...}
    entry_size_bytes: bytes per KV cache entry (per token per layer)
    """
    max_entries = memory_budget_bytes // entry_size_bytes

    cache = {}        # key -> {"last_access": int, "frequency": int, "importance": float}
    access_counter = 0
    stats = {"hits": 0, "misses": 0, "evictions": 0, "writes": 0}

    def evict_one():
        if not cache:
            return
        if eviction_policy == "lru":
            victim = min(cache, key=lambda k: cache[k]["last_access"])
        elif eviction_policy == "lfu":
            victim = min(cache, key=lambda k: (cache[k]["frequency"], cache[k]["last_access"]))
        elif eviction_policy == "importance":
            victim = min(cache, key=lambda k: cache[k]["importance"])
        else:
            victim = min(cache, key=lambda k: cache[k]["last_access"])
        del cache[victim]
        stats["evictions"] += 1

    results = []
    for op in operations:
        access_counter += 1

        if op["type"] == "read":
            key = op["key"]
            if key in cache:
                cache[key]["last_access"] = access_counter
                cache[key]["frequency"] += 1
                stats["hits"] += 1
                results.append({"key": key, "status": "hit"})
            else:
                stats["misses"] += 1
                results.append({"key": key, "status": "miss"})

        elif op["type"] == "write":
            key = op["key"]
            importance = op.get("importance", 1.0)
            if key in cache:
                cache[key]["last_access"] = access_counter
                cache[key]["frequency"] += 1
                cache[key]["importance"] = importance
            else:
                while len(cache) >= max_entries:
                    evict_one()
                cache[key] = {
                    "last_access": access_counter,
                    "frequency": 1,
                    "importance": importance
                }
                stats["writes"] += 1
            results.append({"key": key, "status": "written"})

        elif op["type"] == "evict_all":
            stats["evictions"] += len(cache)
            cache.clear()
            results.append({"status": "cleared"})

    total_ops = stats["hits"] + stats["misses"]
    hit_rate = stats["hits"] / total_ops if total_ops > 0 else 0.0

    return {
        "stats": stats,
        "hit_rate": round(hit_rate, 4),
        "cache_utilization": round(len(cache) / max_entries * 100, 2) if max_entries > 0 else 0,
        "current_entries": len(cache),
        "max_entries": max_entries,
        "memory_used_bytes": len(cache) * entry_size_bytes,
        "eviction_policy": eviction_policy
    }

Explanation

  1. The KV cache stores key and value tensors for previously processed tokens. Its memory grows linearly with sequence length and batch size.
  2. When the memory budget is exceeded, entries must be evicted:
  3. - LRU: evicts the entry least recently accessed, assuming recent tokens are more relevant.
  4. - LFU: evicts the entry with the fewest accesses, keeping frequently referenced tokens.
  5. - Importance-based: evicts the entry with the lowest assigned importance score (e.g., based on attention weights).
  6. Each entry represents one token's K and V tensors across all layers. Entry size = 2 n_layers n_kv_heads head_dim dtype_bytes.
  7. The manager tracks hits, misses, and evictions to measure cache efficiency.
  8. Advanced systems like PagedAttention (vLLM) manage KV cache in fixed-size blocks rather than per-token entries for better memory utilization.

Complexity

  • Time: O(n * m) where n is the number of operations and m is cache size (for eviction search)
  • Space: O(m) for the cache entries