Video Generation Latent Space Memory Estimation

#446 · Deep Learning · Medium

Problem

Estimate the GPU memory required to store the latent space representation during video generation. Given the video resolution, number of frames, the spatial and temporal downsampling factors of the VAE, the number of latent channels, and the data type, compute the latent tensor shape and its memory footprint.

Solution

def video_latent_memory(
    height: int,
    width: int,
    num_frames: int,
    spatial_downsample: int,
    temporal_downsample: int,
    latent_channels: int,
    bytes_per_element: int = 2,
    batch_size: int = 1
) -> dict:
    latent_h = height // spatial_downsample
    latent_w = width // spatial_downsample
    latent_t = num_frames // temporal_downsample

    num_elements = batch_size * latent_channels * latent_t * latent_h * latent_w

    memory_bytes = num_elements * bytes_per_element
    memory_mb = memory_bytes / (1024 ** 2)
    memory_gb = memory_bytes / (1024 ** 3)

    return {
        "latent_shape": [batch_size, latent_channels, latent_t, latent_h, latent_w],
        "num_elements": num_elements,
        "memory_bytes": memory_bytes,
        "memory_mb": round(memory_mb, 2),
        "memory_gb": round(memory_gb, 4)
    }

Explanation

The VAE encoder compresses the video spatially by spatial_downsample in each dimension and temporally by temporal_downsample.
The latent tensor shape is [batch, channels, T/t_ds, H/s_ds, W/s_ds].
Multiply all dimensions together to get the total number of elements.
Multiply by bytes per element (2 for float16/bfloat16, 4 for float32) to get the memory footprint.
This latent tensor is what the diffusion model operates on; reducing its size dramatically reduces compute compared to pixel-space diffusion.

Complexity

Time: O(1)
Space: O(1)

← #445 #447 →