Guidance Attention Mask for Chunked Video

#461 · Deep Learning · Medium

Problem

Implement a guidance attention mask for chunked autoregressive video diffusion. Given the chunk size (the number of frames being generated), the overlap size, and whether a previous chunk exists, produce a binary attention mask that lets the current chunk's frames attend to the conditioning (overlap) frames carried over from the previous chunk and causally to each other, while blocking attention to later frames.

Solution

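def guidance_attention_mask(
    chunk_size: int, overlap_size: int, has_previous: bool
) -> list[list[float]]:
    """Return a square mask: 1.0 where row i may attend to column j, else 0.0."""
    # Overlap frames are only present when a previous chunk exists.
    total = overlap_size + chunk_size if has_previous else chunk_size
    mask = [[0.0] * total for _ in range(total)]
    if has_previous:
        # Overlap frames attend to each other in both directions.
        for i in range(overlap_size):
            for j in range(overlap_size):
                mask[i][j] = 1.0
        # New frames attend to every overlap frame, then causally to new frames.
        for i in range(overlap_size, total):
            for j in range(overlap_size):
                mask[i][j] = 1.0
            for j in range(overlap_size, i + 1):
                mask[i][j] = 1.0
    else:
        # First chunk: plain causal (lower-triangular) mask.
        for i in range(total):
            for j in range(i + 1):
                mask[i][j] = 1.0
    return mask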

Explanation

When there is a previous chunk, the first overlap_size positions hold the conditioning frames carried over from it. They attend to each other freely (in both directions) but not to the new frames that follow.
New frames in the current chunk (indices overlap_size to total - 1) attend to all overlap frames (for temporal conditioning), to themselves, and to all preceding new frames (causal within the chunk).
When there is no previous chunk (first chunk), we use standard causal masking where each frame attends to itself and all prior frames.
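The rules above can be restated as one predicate per (row, column) pair, which makes the resulting pattern easier to check by hand. The following is a minimal sketch, not part of the reference solution; the names allowed and compact_mask are hypothetical and introduced only for illustration.

```python
def allowed(i: int, j: int, overlap_size: int, has_previous: bool) -> bool:
    """Hypothetical helper: may the frame at row i attend to column j?"""
    # Causal rule: every frame sees itself and everything before it.
    if j <= i:
        return True
    # Extra rule: overlap frames also see later overlap frames.
    return has_previous and i < overlap_size and j < overlap_size


def compact_mask(
    chunk_size: int, overlap_size: int, has_previous: bool
) -> list[list[float]]:
    """Hypothetical compact restatement of the mask rules."""
    total = overlap_size + chunk_size if has_previous else chunk_size
    return [
        [1.0 if allowed(i, j, overlap_size, has_previous) else 0.0
         for j in range(total)]
        for i in range(total)
    ]


# Worked example: 2 overlap frames followed by 3 new frames.
example = compact_mask(chunk_size=3, overlap_size=2, has_previous=True)
for row in example:
    print(" ".join(str(int(v)) for v in row))
# 1 1 0 0 0
# 1 1 0 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 1 1 1
```

The top-left 2x2 block of ones is the bidirectional overlap region; the remaining rows form the usual lower-triangular causal pattern.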
The mask is 1.0 where attention is allowed and 0.0 where it is blocked.
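To illustrate how a 1.0/0.0 mask of this kind is typically consumed, here is a hedged sketch of a masked softmax over a single row of attention scores. It assumes blocked positions should receive zero weight; the function name masked_softmax_row is hypothetical, and real attention implementations usually do this on tensors rather than Python lists.

```python
import math


def masked_softmax_row(scores: list[float], mask_row: list[float]) -> list[float]:
    """Softmax over one row of scores, giving blocked positions zero weight."""
    visible = [s for s, m in zip(scores, mask_row) if m == 1.0]
    if not visible:
        # Assumption: a fully blocked row yields all-zero weights.
        return [0.0] * len(scores)
    peak = max(visible)  # subtract the max for numerical stability
    exps = [
        math.exp(s - peak) if m == 1.0 else 0.0
        for s, m in zip(scores, mask_row)
    ]
    denom = sum(exps)
    return [e / denom for e in exps]


# The third position is blocked, so the first two share all the weight.
weights = masked_softmax_row([0.5, 0.5, 2.0], [1.0, 1.0, 0.0])
print(weights)  # [0.5, 0.5, 0.0]
```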

Complexity

Time: O(n^2) where n = overlap_size + chunk_size (n = chunk_size for the first chunk)
Space: O(n^2) for the attention mask
