VLM Visual Token Count from Image Resolution and Patch Size

#442 · Deep Learning · Easy

Problem

Calculate the number of visual tokens produced by a Vision-Language Model (VLM) given an input image resolution and the patch size used by the vision encoder. Optionally account for any spatial downsampling applied after the vision encoder.

Solution

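def vlm_visual_token_count(
    image_height: int,
    image_width: int,
    patch_size: int,
    downsample_factor: int = 1,
) -> dict:
    """Return the patch grid and the final visual token count for one image."""
    # Guard against division by zero and meaningless negative sizes.
    if patch_size <= 0 or downsample_factor <= 0:
        raise ValueError("patch_size and downsample_factor must be positive")

    # Non-overlapping patches; floor division drops any partial patch at the edge.
    patches_h = image_height // patch_size
    patches_w = image_width // patch_size
    total_patches = patches_h * patches_w

    # Optional spatial downsampling, applied to each side of the patch grid.
    effective_h = patches_h // downsample_factor
    effective_w = patches_w // downsample_factor
    visual_tokens = effective_h * effective_w

    return {
        "patches_h": patches_h,
        "patches_w": patches_w,
        "total_patches": total_patches,
        "visual_tokens": visual_tokens,
    }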

Explanation

The vision encoder (e.g., ViT) divides the image into non-overlapping patches of size patch_size x patch_size.
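The number of patches along each dimension is dimension // patch_size (floor division), so any leftover pixels that do not fill a whole patch are dropped.
Each patch produces one token embedding, so total_patches is the visual token count before any downsampling.
Some VLMs apply a spatial downsampling layer (e.g., 2x2 pooling) after the encoder. It shrinks each side of the patch grid by downsample_factor, which reduces the token count by a factor of roughly downsample_factor^2 (exactly so when both grid sides are divisible by downsample_factor).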
The final visual token count is what gets concatenated with text tokens and fed to the language model.
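
To make the arithmetic concrete, here is a minimal sketch that restates the counting rule in a compact helper and runs it on a few sizes. The helper name count_visual_tokens and the example resolutions are illustrative assumptions, not part of the original solution, and the sketch ignores extras that some encoders add, such as a class token.

```python
def count_visual_tokens(
    height: int, width: int, patch_size: int, downsample_factor: int = 1
) -> int:
    """Hypothetical helper: same rule as the solution, returning only the final count."""
    # Patch grid from floor division (partial edge patches are dropped).
    grid_h = height // patch_size
    grid_w = width // patch_size
    # Downsampling shrinks each side of the grid, again with floor division.
    return (grid_h // downsample_factor) * (grid_w // downsample_factor)


# 224 // 14 = 16 patches per side, so 16 * 16 = 256 tokens.
print(count_visual_tokens(224, 224, 14))       # 256
# A 2x downsample gives an 8 x 8 grid.
print(count_visual_tokens(224, 224, 14, 2))    # 64
# 336 // 14 = 24 patches per side.
print(count_visual_tokens(336, 336, 14))       # 576
# Non-divisible sizes: 500 // 16 = 31 and 300 // 16 = 18.
print(count_visual_tokens(500, 300, 16))       # 558
# Flooring again: 31 // 2 = 15 and 18 // 2 = 9, so 135 rather than 558 / 4.
print(count_visual_tokens(500, 300, 16, 2))    # 135
```

The last two calls show why the reduction is only approximately downsample_factor^2: floor division is applied to each side of the grid, so odd grid sizes lose an extra row or column.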

Complexity

Time: O(1)
Space: O(1)
