#446 · Deep Learning · Medium
⊣ Solve on deep-ml.comEstimate the GPU memory required to store the latent space representation during video generation. Given the video resolution, number of frames, the spatial and temporal downsampling factors of the VAE, the number of latent channels, and the data type, compute the latent tensor shape and its memory footprint.
def video_latent_memory(
height: int,
width: int,
num_frames: int,
spatial_downsample: int,
temporal_downsample: int,
latent_channels: int,
bytes_per_element: int = 2,
batch_size: int = 1
) -> dict:
latent_h = height // spatial_downsample
latent_w = width // spatial_downsample
latent_t = num_frames // temporal_downsample
num_elements = batch_size * latent_channels * latent_t * latent_h * latent_w
memory_bytes = num_elements * bytes_per_element
memory_mb = memory_bytes / (1024 ** 2)
memory_gb = memory_bytes / (1024 ** 3)
return {
"latent_shape": [batch_size, latent_channels, latent_t, latent_h, latent_w],
"num_elements": num_elements,
"memory_bytes": memory_bytes,
"memory_mb": round(memory_mb, 2),
"memory_gb": round(memory_gb, 4)
}spatial_downsample in each dimension and temporally by temporal_downsample.[batch, channels, T/t_ds, H/s_ds, W/s_ds].