← back

KV Cache Tiered Offloading Simulator

#436 · Machine Learning · Medium

⊣ Solve on deep-ml.com

Problem

Implement a KV Cache Tiered Offloading Simulator. Given a list of token access events (each with a token ID and timestamp), a fast-tier (GPU) cache capacity, and an eviction policy, simulate offloading cold KV cache entries from the fast tier to a slow tier (CPU/host memory). Return the number of fast-tier hits, slow-tier hits, and misses.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
def kv_cache_tiered_offload(
    accesses: list[tuple[int, int]],
    fast_capacity: int,
    slow_capacity: int
) -> dict[str, int]:
    from collections import OrderedDict

    fast_cache: OrderedDict[int, int] = OrderedDict()
    slow_cache: OrderedDict[int, int] = OrderedDict()

    fast_hits = 0
    slow_hits = 0
    misses = 0

    for token_id, timestamp in accesses:
        if token_id in fast_cache:
            fast_hits += 1
            fast_cache.move_to_end(token_id)
            fast_cache[token_id] = timestamp
        elif token_id in slow_cache:
            slow_hits += 1
            slow_cache.pop(token_id)
            if len(fast_cache) >= fast_capacity:
                evicted_key, evicted_ts = fast_cache.popitem(last=False)
                if len(slow_cache) >= slow_capacity:
                    slow_cache.popitem(last=False)
                slow_cache[evicted_key] = evicted_ts
            fast_cache[token_id] = timestamp
        else:
            misses += 1
            if len(fast_cache) >= fast_capacity:
                evicted_key, evicted_ts = fast_cache.popitem(last=False)
                if len(slow_cache) >= slow_capacity:
                    slow_cache.popitem(last=False)
                slow_cache[evicted_key] = evicted_ts
            fast_cache[token_id] = timestamp

    return {
        "fast_hits": fast_hits,
        "slow_hits": slow_hits,
        "misses": misses
    }

Explanation

  1. Maintain two ordered dictionaries representing the fast tier (GPU) and slow tier (CPU) caches, both using LRU eviction order.
  2. For each access, check the fast cache first. If found, it is a fast-tier hit and we refresh its position.
  3. If found in the slow cache, it is a slow-tier hit. Promote the entry to the fast cache, evicting the LRU fast-cache entry to the slow tier if needed.
  4. On a miss, load the token into the fast cache, offloading the LRU entry to the slow tier when the fast cache is full.
  5. If the slow tier is also full, evict its LRU entry entirely.

Complexity

  • Time: O(n) where n is the number of access events
  • Space: O(fast_capacity + slow_capacity)