
Unified History Injection for Autoregressive Video Diffusion

#463 · Deep Learning · Medium


Problem

Implement unified history injection for autoregressive video diffusion. Given the current noisy latent chunk, a set of clean history latents from previously generated chunks, and blending weights, inject the history context into the current chunk's denoising process.
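Read against the reference solution, the task can be written per frame as follows. The notation is introduced here for illustration and is not part of the original problem statement: x_t is frame t of the current latent (t = 0, ..., T-1), h^{(k)}_t is frame t of the k-th history latent with T_k frames, w_k is its blend weight, m_t is the injection mask value for frame t, and y_t is the output frame.

```latex
\bar{h}_t = \frac{1}{W} \sum_{k \,:\, t < T_k} w_k \, h^{(k)}_t ,
\qquad
W = \sum_{k} w_k ,
\qquad
y_t = (1 - m_t)\, x_t + m_t\, \bar{h}_t .
```

This assumes W > 0. When the total weight is not positive, the solution skips the division and blends with the unnormalized sum.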

Solution

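def history_injection(
    current_latent: list[list[float]],
    history_latents: list[list[list[float]]],
    blend_weights: list[float],
    injection_mask: list[float]
) -> list[list[float]]:
    # No history to inject, or no frames to inject into: return the input unchanged.
    if not history_latents or not current_latent:
        return current_latent
    T, D = len(current_latent), len(current_latent[0])
    # Accumulate the weighted sum of history latents, frame by frame.
    weighted = [[0.0] * D for _ in range(T)]
    total_w = 0.0
    for latent, w in zip(history_latents, blend_weights):
        # Align from frame 0; a history shorter than T covers only its own frames.
        for t in range(min(len(latent), T)):
            for d in range(D):
                weighted[t][d] += w * latent[t][d]
        total_w += w
    # Normalize by the total weight so the aggregate keeps the latents' scale.
    if total_w > 0:
        weighted = [[v / total_w for v in row] for row in weighted]
    # Per-frame blend: mask 0 keeps the current latent, mask 1 takes the history.
    return [[(1 - injection_mask[t]) * current_latent[t][d] + injection_mask[t] * weighted[t][d]
             for d in range(D)] for t in range(T)]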

Explanation

  1. Weighted aggregation: Combine multiple history latents (from prior chunks at different temporal distances) using learned or scheduled blend weights. Normalize by total weight to maintain scale.
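  2. Temporal alignment: History latents may span different numbers of frames, so each one is aligned to the current chunk from frame 0 and contributes only to the first min(len(latent), T) frames. Frames beyond a shorter history's length receive nothing from it, yet the sum is still divided by the full total weight.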
  3. Masked injection: The injection_mask controls where history information is injected. Values near 1 give more weight to history (for overlap regions), while values near 0 preserve the current denoising trajectory (for new frames).
  4. This approach unifies short-term (recent chunk overlap) and long-term (compressed history) conditioning into a single injection mechanism.
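The steps above can be traced on a small example. This is a minimal sketch: the helper names aggregate_history and masked_blend and all sample values are assumptions made for illustration, splitting the solution's logic into the aggregation (steps 1 and 2) and the masked injection (step 3).

```python
# Illustrative sketch only: helper names and sample values are assumptions,
# not part of the original solution.

def aggregate_history(history_latents, blend_weights, T, D):
    """Steps 1-2: weighted, normalized sum of history latents aligned at frame 0."""
    acc = [[0.0] * D for _ in range(T)]
    total_w = 0.0
    for latent, w in zip(history_latents, blend_weights):
        for t in range(min(len(latent), T)):  # shorter histories cover fewer frames
            for d in range(D):
                acc[t][d] += w * latent[t][d]
        total_w += w
    if total_w > 0:
        acc = [[v / total_w for v in row] for row in acc]
    return acc


def masked_blend(current_latent, history, injection_mask):
    """Step 3: per-frame blend between the current latent and the history."""
    return [
        [(1 - m) * c + m * h for c, h in zip(cur_row, hist_row)]
        for cur_row, hist_row, m in zip(current_latent, history, injection_mask)
    ]


current = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # T = 3 frames, D = 2
history = [
    [[4.0, 4.0], [4.0, 4.0], [4.0, 4.0]],        # recent chunk, 3 frames
    [[8.0, 8.0]],                                # older chunk, 1 frame
]
weights = [3.0, 1.0]
mask = [1.0, 0.5, 0.0]                           # overlap, transition, new frame

agg = aggregate_history(history, weights, T=3, D=2)
out = masked_blend(current, agg, mask)
print(agg)  # [[5.0, 5.0], [3.0, 3.0], [3.0, 3.0]]
print(out)  # [[5.0, 5.0], [2.0, 2.0], [1.0, 1.0]]
```

Frame 0 aggregates both histories: (3 * 4 + 1 * 8) / 4 = 5. Frames 1 and 2 see only the recent chunk, but the older chunk's weight still counts in the normalizer, so they come out as 12 / 4 = 3 rather than 4. The mask then takes the history fully at frame 0, averages at frame 1, and leaves frame 2 untouched.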

Complexity

  • Time: O(K * T * d), where K is the number of history latents, T is the number of frames in the current chunk, and d is the latent dimension
  • Space: O(T * d) for the weighted history accumulator
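The time bound is reached only when every history latent spans at least T frames; shorter histories do proportionally less work. A rough way to check this is to mirror the loop bounds of the aggregation step and count the inner-loop iterations. The helper count_multiply_adds below is hypothetical and exists only for this illustration.

```python
# Illustrative sketch: count_multiply_adds is a hypothetical helper that mirrors
# the loop bounds of the aggregation step and counts inner-loop iterations.

def count_multiply_adds(history_lengths, T, D):
    ops = 0
    for length in history_lengths:        # K history latents
        for _ in range(min(length, T)):   # at most T frames each
            ops += D                      # D multiply-adds per frame
    return ops


# K = 4 histories that all span the full chunk: K * T * D operations.
full = count_multiply_adds([16, 16, 16, 16], T=16, D=8)
# Same K, but older histories are shorter, so less work is done.
short = count_multiply_adds([16, 4, 2, 1], T=16, D=8)
print(full, short)  # 512 184
```

The final normalization and the masked blend each add another T * D operations, which does not change the overall bound.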