VLM Visual Token Count from Image Resolution and Patch Size

#442 · Deep Learning · Easy

Problem

Calculate the number of visual tokens produced by a Vision-Language Model (VLM) given an input image resolution and the patch size used by the vision encoder. Optionally account for any spatial downsampling applied after the vision encoder.

Solution

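def vlm_visual_token_count(
    image_height: int,
    image_width: int,
    patch_size: int,
    downsample_factor: int = 1
) -> dict:
    """Count the visual tokens a VLM produces for a single image.

    Floor division is used throughout, so pixels that do not fill a whole
    patch, and patches that do not fill a whole downsampling window, are
    dropped from the count.
    """
    # Patch grid produced by the vision encoder.
    patches_h = image_height // patch_size
    patches_w = image_width // patch_size
    total_patches = patches_h * patches_w

    # Grid after optional spatial downsampling (a factor of 1 means none).
    effective_h = patches_h // downsample_factor
    effective_w = patches_w // downsample_factor
    visual_tokens = effective_h * effective_w

    return {
        "patches_h": patches_h,
        "patches_w": patches_w,
        "total_patches": total_patches,
        "visual_tokens": visual_tokens
    }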
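
The solution assumes positive integer inputs; a patch_size or downsample_factor of 0 would raise ZeroDivisionError. A minimal sketch of an input check that could run first is shown here. The helper name validate_vlm_inputs and the choice to raise ValueError are assumptions for illustration, not part of the original solution.

```python
def validate_vlm_inputs(image_height, image_width, patch_size, downsample_factor=1):
    """Hypothetical pre-check: raise ValueError unless every input is a positive int."""
    values = {
        "image_height": image_height,
        "image_width": image_width,
        "patch_size": patch_size,
        "downsample_factor": downsample_factor,
    }
    for name, value in values.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")


# Valid inputs pass silently; invalid ones raise before any division happens.
validate_vlm_inputs(224, 224, 16)
try:
    validate_vlm_inputs(224, 224, 0)
except ValueError as err:
    print(err)
```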

Explanation

  1. The vision encoder (e.g., ViT) divides the image into non-overlapping patches of size patch_size x patch_size.
  2. The number of patches along each dimension is dimension // patch_size; in this solution, floor division drops any leftover pixels that do not fill a whole patch.
  3. Each patch produces one token embedding, so total patches equals the initial visual token count.
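  4. Some VLMs apply a spatial downsampling layer (e.g., 2x2 pooling) after the encoder. This shrinks each grid dimension by downsample_factor, so the token count falls by a factor of downsample_factor^2 when both grid dimensions are divisible by it, and by slightly more otherwise because of floor division.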
  5. The final visual token count is what gets concatenated with text tokens and fed to the language model.
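
The steps above can be traced with concrete numbers. The sketch below is a compact restatement of the same arithmetic; the helper name count_tokens and the two image sizes are illustrative assumptions, not values from the problem statement. The second case shows what floor division does when the dimensions do not divide evenly.

```python
def count_tokens(height, width, patch_size, downsample_factor=1):
    """Hypothetical helper: return (total_patches, visual_tokens)."""
    # Steps 1-2: patch grid along each dimension.
    patches_h = height // patch_size
    patches_w = width // patch_size
    # Step 3: one token per patch.
    total_patches = patches_h * patches_w
    # Step 4: each grid dimension shrinks by the downsample factor.
    visual_tokens = (patches_h // downsample_factor) * (patches_w // downsample_factor)
    return total_patches, visual_tokens


# Evenly divisible: a 24 x 24 patch grid, pooled to 12 x 12.
print(count_tokens(336, 336, 14, 2))  # prints (576, 144)

# Not evenly divisible: a 31 x 23 patch grid, pooled to 15 x 11.
print(count_tokens(500, 375, 16, 2))  # prints (713, 165)
```

In the first case the token count drops by exactly 4 (576 to 144). In the second it drops by more than 4 (713 to 165), because the odd row and column of patches are discarded.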

Complexity

  • Time: O(1)
  • Space: O(1)