Implement a guidance attention mask for chunked autoregressive video diffusion. Given the number of frames per chunk, the overlap size, and whether a previous chunk exists, produce a binary attention mask that lets the current chunk's frames attend to the conditioning (overlap) frames from the previous chunk and causally to each other, while blocking attention to later frames.

def guidance_attention_mask(
    chunk_size: int, overlap_size: int, has_previous: bool
) -> list[list[float]]:
    # Overlap (conditioning) frames come first, followed by the current chunk.
    total = overlap_size + chunk_size if has_previous else chunk_size
    mask = [[0.0] * total for _ in range(total)]
    if has_previous:
        # Overlap frames attend to each other freely.
        for i in range(overlap_size):
            for j in range(overlap_size):
                mask[i][j] = 1.0
        for i in range(overlap_size, total):
            # Current-chunk frames see every overlap frame ...
            for j in range(overlap_size):
                mask[i][j] = 1.0
            # ... plus themselves and earlier frames of the current chunk.
            for j in range(overlap_size, i + 1):
                mask[i][j] = 1.0
    else:
        # First chunk: plain causal (lower-triangular) mask.
        for i in range(total):
            for j in range(i + 1):
                mask[i][j] = 1.0
    return mask

When a previous chunk exists, the first overlap_size tokens are conditioning frames carried over. They can attend to each other freely.
The remaining tokens (indices overlap_size to total - 1) can attend to all overlap frames (for temporal conditioning) and to themselves and all preceding frames in the current chunk (causal within the chunk).
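
The attention rule described here can also be written as a single condition: query frame i may attend to key frame j when j is an overlap frame or j is not later than i. The sketch below is one possible compact formulation, not part of the original solution; the name compact_guidance_mask is hypothetical, and it assumes the same token layout (overlap frames first, then the current chunk). It prints the mask for a small case so the block structure is visible.

```python
def compact_guidance_mask(
    chunk_size: int, overlap_size: int, has_previous: bool
) -> list[list[float]]:
    # Hypothetical helper: same layout as the loop-based version,
    # overlap frames first, then the current chunk's frames.
    overlap = overlap_size if has_previous else 0
    total = overlap + chunk_size
    # Row i may see column j if j is an overlap frame (j < overlap)
    # or j is not in the future (j <= i), i.e. j < max(overlap, i + 1).
    return [
        [1.0 if j < max(overlap, i + 1) else 0.0 for j in range(total)]
        for i in range(total)
    ]


mask = compact_guidance_mask(chunk_size=3, overlap_size=2, has_previous=True)
for row in mask:
    print(" ".join(str(int(v)) for v in row))
# 1 1 0 0 0
# 1 1 0 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 1 1 1
```

The top-left 2x2 block is the fully connected overlap region, and the last three rows show the current chunk seeing every overlap frame plus a causal triangle over its own frames. With has_previous set to False, the overlap is ignored and the result is an ordinary lower-triangular causal mask.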