How Transformers learned to see in patches

Vision Transformers turned pictures into patch tokens so attention could connect distant parts of an object without relying only on local filters.

Introduction

Transformers learned to process images by borrowing an idea from language: turn a complex input into a sequence of tokens. In a Vision Transformer (ViT), an image is divided into small square regions called patches, and each patch becomes a token that can interact with every other token through self-attention. This seemingly simple change is important because it allows the model to connect distant parts of the same object directly, rather than relying solely on layers of local filters to pass information across an image. The result is a system that can recognise relationships spanning large visual distances and build a coherent understanding of objects from scattered visual evidence. [arXiv]

Within the broader story of how Transformers expanded beyond text, image patches were the key adaptation that made visual data compatible with the attention mechanism. Once images could be represented as patch tokens, the same core Transformer architecture used for language could be applied to vision tasks. [arXiv]

How an image becomes a sequence of patch tokens

A digital image is naturally arranged as a two-dimensional grid of pixels, whereas a Transformer expects a sequence. Vision Transformers bridge this gap by splitting an image into fixed-size, non-overlapping patches, often 16×16 pixels. Each patch is flattened into a vector and passed through a learned linear projection to produce an embedding, creating a sequence of visual tokens analogous to words in a sentence. [arXiv, Dive into Deep Learning]
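The patch-and-embed step can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the 224×224 input size, the 192-dimensional embedding, and the function names `patchify` and `embed_patches` are assumptions chosen for the example, and the projection matrix is random rather than learned.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns shape (num_patches, patch_size * patch_size * C), with patches
    ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

def embed_patches(patches: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Map each flattened patch to an embedding with one linear projection."""
    return patches @ projection

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                      # assumed input size
patches = patchify(image, patch_size=16)               # 14 x 14 = 196 patches
projection = rng.normal(scale=0.02, size=(768, 192))   # random stand-in for learned weights
tokens = embed_patches(patches, projection)

print(patches.shape, tokens.shape)
```

Each row of `tokens` plays the role of one "word" in the sequence the Transformer receives.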

Consider a photograph containing a dog. Instead of treating every pixel as its own input, the model receives dozens or hundreds of patch tokens. Some tokens may contain part of an ear, others a paw, fur texture, or background grass. The Transformer does not initially know which patches belong together. Its task is to learn those relationships through attention. [arXiv]

This patch representation serves two purposes:

  • It converts an image into a format the Transformer can process.
  • It reduces the sequence length compared with treating every pixel as a separate token, making computation practical. [arXiv]

The choice of patch size is partly an engineering compromise. Larger patches reduce computational cost but lose fine detail; smaller patches preserve more information but increase the number of tokens sharply, and the cost of self-attention grows with the square of the token count. Researchers continue to explore this trade-off, including approaches that use smaller regions or even individual pixels. [arXiv]
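The trade-off is easy to see with some arithmetic. The sketch below assumes a 224×224 image (a size chosen for illustration) and counts tokens and pairwise attention comparisons for a few patch sizes; `num_tokens` is a hypothetical helper name.

```python
def num_tokens(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches that tile the image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)

# Self-attention compares every token with every token, so the number of
# pairwise scores per head and per layer is the token count squared.
for patch_size in (32, 16, 8, 1):
    n = num_tokens(224, 224, patch_size)
    print(f"patch {patch_size:>2}x{patch_size:<2} -> {n:>6} tokens, {n * n:>13,} attention pairs")
```

Halving the patch side quadruples the token count and multiplies the number of attention pairs by sixteen, which is why pixel-level tokens (patch size 1) are so much more expensive.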

Why attention helps connect distant visual regions

The central reason patches help Transformers see objects is that self-attention allows any patch to compare itself with any other patch in the image. Unlike traditional convolutional networks, which primarily exchange information through local neighbourhoods, a Vision Transformer can establish long-range connections within a single layer. [arXiv, ResearchGate]

Imagine a photograph of a bicycle. The front wheel may appear on one side of the image while the rear wheel appears far away. A convolutional network typically builds understanding gradually, combining local features layer by layer until distant regions influence one another. A Transformer can directly relate the two wheel patches through attention, even if they are separated by large portions of the image. [arXiv]
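A single-head self-attention layer over patch tokens can be sketched as follows. This is a simplified illustration with random weights, assuming 196 tokens of width 64; real models use multiple heads, learned parameters, and additional layers around the attention step.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a (num_tokens, dim) sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # one score per pair of tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 196, 64                                # assumed: 14 x 14 patches, width 64
tokens = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)

# Token 0 is the top-left patch and token 195 the bottom-right patch; the
# weight between them is computed directly in this one layer.
print(weights.shape, weights[0, 195] > 0)
```

The `weights` matrix has an entry for every pair of patches, so no pair has to wait for information to pass through intermediate neighbours.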

This ability becomes especially useful when:

  • An object spans a large area.
  • Parts of an object are separated by cluttered backgrounds.
  • Recognition depends on relationships between distant regions.
  • Multiple objects interact across the scene. [arXiv]

Researchers analysing Vision Transformers have observed that attention patterns often evolve from relatively local interactions in early layers towards broader image-wide relationships in deeper layers. This suggests that the model gradually assembles object-level understanding by linking increasingly distant visual evidence. [arXiv]
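One way such local-versus-global behaviour could be quantified is an average "attention distance": how far, in patch units, each query patch looks on average, weighted by its attention. The sketch below is an assumed formulation for illustration, not necessarily the exact measure used in the cited analysis, and `mean_attention_distance` is a hypothetical name.

```python
import numpy as np

def mean_attention_distance(weights: np.ndarray, grid_size: int) -> float:
    """Attention-weighted average distance between query and key patches.

    `weights` is a (N, N) attention matrix with rows summing to 1, where
    N = grid_size ** 2 and tokens are ordered row by row across the grid.
    """
    ys, xs = np.divmod(np.arange(grid_size ** 2), grid_size)
    coords = np.stack([ys, xs], axis=1).astype(float)
    # Euclidean distance between every pair of patch positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return float((weights * dist).sum(axis=1).mean())

grid = 14
n = grid * grid
purely_local = np.eye(n)                  # every patch attends only to itself
fully_global = np.full((n, n), 1.0 / n)   # every patch attends equally everywhere

print(mean_attention_distance(purely_local, grid))
print(mean_attention_distance(fully_global, grid))
```

Applied to real attention maps layer by layer, a rising value would correspond to the shift from local to image-wide interactions described above.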

A useful way to think about attention is as a dynamic routing system. Instead of being forced to process neighbouring patches first, the model learns which regions matter to each other for the current task. Patches representing different parts of the same object can reinforce one another regardless of their position within the image. [arXiv]

Why patches are more than a computational shortcut

It might seem that patches exist only to reduce the number of tokens, but they also encourage the model to work with meaningful visual units rather than isolated pixels. A 16×16 patch often contains edges, textures, colours, or partial object components that provide richer information than single pixels alone. [arXiv]

This creates a useful intermediate representation. The model can learn that certain collections of patches frequently occur together as parts of cars, faces, animals, or buildings. Through training on large datasets, the attention layers learn patterns that link these pieces into larger structures. [arXiv]

Interestingly, newer research suggests that patches may not be strictly necessary in principle. Experiments have shown that Transformers can operate directly on individual pixels and still learn useful visual representations, although the computational cost becomes much higher. These findings imply that patches are not the source of visual understanding itself; rather, they provide an efficient way to make attention-based vision practical at scale. [arXiv]

Where Vision Transformers still need spatial tricks

A challenge arises because a Transformer does not inherently understand spatial layout. If patch tokens were presented without positional information, the model would treat them as an unordered collection: it could not tell whether a given patch came from the top-left corner or the bottom-right. [ICLR Blog Posts]

To solve this problem, Vision Transformers add positional embeddings that encode where each patch came from. These signals give the model the spatial context it needs to recover the image's layout. Without them, attention could compare patches but would struggle to understand their arrangement. [ICLR Blog Posts]
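The effect of positional embeddings can be checked with a small experiment: without them, shuffling the patches leaves a pooled attention output unchanged; with them, the shuffled image produces a different result. The sketch uses random stand-ins for the learned projections and positional embeddings, and the sizes and the name `attention_pool` are assumptions for illustration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, w_q, w_k, w_v):
    """One self-attention layer followed by mean pooling over tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return (weights @ v).mean(axis=0)

rng = np.random.default_rng(0)
n, d = 16, 8                                   # assumed: 4 x 4 patches, width 8
patches = rng.normal(size=(n, d))
w = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
pos = rng.normal(size=(n, d))                  # random stand-in for learned positional embeddings
perm = np.arange(n)[::-1]                      # scramble the layout by reversing patch order

# Without positions: the shuffled image gives the same pooled output.
same_without_pos = np.allclose(attention_pool(patches, *w),
                               attention_pool(patches[perm], *w))

# With positions: each slot adds its own embedding, so order now matters.
same_with_pos = np.allclose(attention_pool(patches + pos, *w),
                            attention_pool(patches[perm] + pos, *w))

print(same_without_pos, same_with_pos)
```

The first comparison holds because self-attention treats its input as a set; the second fails because the positional term ties each token to a location.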

Another limitation is that pure Transformers lack some of the built-in visual assumptions (inductive biases) that make convolutional networks efficient learners. CNNs naturally exploit locality and translation equivariance, whereas Vision Transformers must largely learn these properties from data. As a result, early Vision Transformers typically required very large training datasets before matching or surpassing strong convolutional models. [arXiv, NCBI]

This challenge has inspired many hybrid and hierarchical designs that reintroduce spatial structure while preserving the advantages of attention. Some models organise patches into regions, while others perform attention at multiple scales so that local detail and global context can be combined more efficiently. [arXiv]
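As a rough illustration of attention restricted to regions, the sketch below builds a mask that lets each patch attend only to patches in the same non-overlapping window. This is a generic windowing scheme assumed for the example, not the specific mechanism of RegionViT or any other cited model; `window_mask` is a hypothetical helper.

```python
import numpy as np

def window_mask(grid_size: int, window: int) -> np.ndarray:
    """Boolean (N, N) mask: True where two patches share a window.

    Patches are ordered row by row on a grid_size x grid_size grid that is
    tiled by non-overlapping window x window regions.
    """
    if grid_size % window:
        raise ValueError("grid_size must be divisible by window")
    ys, xs = np.divmod(np.arange(grid_size ** 2), grid_size)
    window_id = (ys // window) * (grid_size // window) + (xs // window)
    return window_id[:, None] == window_id[None, :]

mask = window_mask(grid_size=14, window=7)     # four 7 x 7 windows
local_pairs = int(mask.sum())
global_pairs = mask.size

print(local_pairs, global_pairs)
```

In an attention layer, scores at positions where the mask is False would be set to a large negative value before the softmax. Restricting attention this way cuts the number of pairs but removes direct long-range links, which is why such designs usually add a second, coarser path for global context.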

The key idea behind object perception in Vision Transformers

Image patches give Transformers a way to convert pictures into token sequences, but attention is what turns those tokens into object understanding. By allowing every patch to communicate with every other patch, the model can assemble scattered visual clues into coherent objects and scenes. Rather than building understanding solely from local neighbourhoods outward, it can reason about global relationships from the start. [arXiv, Hugging Face]

That combination, with patches as visual tokens and attention as the mechanism for linking them, is what enabled Transformers to move from language into computer vision and begin recognising objects in images. [arXiv]

Endnotes

  1. Source: arxiv.org
    Title: arXiv An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    Link: https://arxiv.org/abs/2010.11929
    Source snippet

    An Image is Worth 16x16 Words: Transformers for Image Recognition at ScaleOctober 22, 2020...

    Published: October 22, 2020

  2. Source: arxiv.org
    Link: https://arxiv.org/abs/2406.09415
    Source snippet

    An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual PixelsJune 13, 2024...

    Published: June 13, 2024

  3. Source: researchgate.net
    Title: ResearchGate: A Comparative Analysis of Convolutional Neural Network
    Link: https://www.researchgate.net/publication/381001655_A_Comparative_Analysis_of_Convolutional_Neural_Network_and_Vision_Transformer_Embeddings_on_a_Novel_Domain-Specific_Task
    Source snippet

    A Comparative Analysis of Convolutional Neural Network...May 30, 2024 — The purpose of our study was to compare the performance of embed...

    Published: May 30, 2024

  4. Source: arxiv.org
    Title: CNNs. In CNNs, locality, two-dimensional neighborhood
    Link: https://arxiv.org/pdf/2010.11929
    Source snippet

    arXiv:2010.11929v2 [cs.CV] 3 Jun 2021October 22, 2020 — by A Dosovitskiy · 2020 · Cited by 94276 — We note that Vision Transformer h...

    Published: October 22, 2020

  5. Source: ncbi.nlm.nih.gov
    Title: NCBI: Transformers and Visual Transformers
    Link: https://www.ncbi.nlm.nih.gov/books/NBK597474/
    Source snippet

    and Visual Transformers - NCBI23 Jul 2023 — Since CNNs were specifically created for vision tasks, their architecture includes spatial in...

  6. Source: arxiv.org
    Title: arXiv RegionViT: Regional-to-Local Attention for Vision Transformers
    Link: https://arxiv.org/abs/2106.02689
    Source snippet

    RegionViT: Regional-to-Local Attention for Vision TransformersJune 4, 2021...

    Published: June 4, 2021

  7. Source: arxiv.org
    Title: arXiv Vision Transformers with Hierarchical Attention
    Link: https://arxiv.org/abs/2106.03180

  8. Source: arxiv.org
    Link: https://arxiv.org/vc/arxiv/papers/2305/2305.09880v2.pdf
    Source snippet

    A survey of the Vision Transformers and its CNN-...by A Khan · Cited by 436 — In the realm of computer vision tasks, vision transformers...

  9. Source: arxiv.org
    Link: https://arxiv.org/pdf/2305.09880
    Source snippet

    A survey of the Vision Transformers and their CNN-...by A Khan · 2023 · Cited by 438 — In the realm of computer vision tasks, ViTs have...

  10. Source: researchgate.net
    Link: https://www.researchgate.net/publication/352016819_Not_All_Images_are_Worth_16x16_Words_Dynamic_Vision_Transformers_with_Adaptive_Sequence_Length
    Source snippet

    (PDF) Not All Images are Worth 16x16 Words: Dynamic...In this paper, we argue that every image has its own characteristics, and ideally...

  11. Source: d2l.ai
    Link: https://d2l.ai/chapter_attention-mechanisms-and-transformers/vision-transformer.html
    Source snippet

    Dive into Deep Learning11.8. Transformers for VisionIn this way, image patches can be treated similarly to tokens in text sequences by Tr...

  12. Source: kmsrogerkim.github.io
    Title: Roger’s Blog ViT: AN IMAGE IS WORTH 16X16 WORDS
    Link: https://kmsrogerkim.github.io/ai/vit/
    Source snippet

    ViT: AN IMAGE IS WORTH 16X16 WORDS - Roger's Blog18 Aug 2025 — Just like how strings were tokenized and embedded, an image is now split i...

  13. Source: huggingface.co
    Title: when transformers invade computer vision
    Link: https://huggingface.co/blog/RDTvlokip/when-transformers-invade-computer-vision
    Source snippet

    ! 🖼️4 Nov 2025 — Inductive bias: locality, translation equivariance. ViT thinking 👁️. Global from start (attention across all patches); L...

  14. Source: iclr-blogposts.github.io
    Title: positional embedding
    Link: https://iclr-blogposts.github.io/2025/blog/positional-embedding/
    Source snippet

    ICLR Blog PostsPositional Embeddings in Transformer ModelsApr 28, 2025 — This blog post examines positional encoding techniques, emphasiz...

  15. Source: cameronrwolfe.substack.com
    Title: vision transformers
    Link: https://cameronrwolfe.substack.com/p/vision-transformers
    Source snippet

    Transformers - by Cameron R. Wolfe, Ph.D.An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [1]. Although the tra...

  16. Source: github.com
    Title: Original paper: An Image is Worth 16x16 Words: Transformers for Image
    Link: https://github.com/SkiddieAhn/Code-Vision-Transformer
    Source snippet

    SkiddieAhn/Code-Vision-Transformer: [ICLR 2021] An...Provide the PyTorch tutorial code for understanding ViT (Vision Transformer) model...

  17. Source: github.com
    Title: Vision Transformer
    Link: https://github.com/tahmid0007/VisionTransformer
    Source snippet

    tahmid0007/VisionTransformer: A...A complete easy to follow implementation of Google's Vision Transformer proposed in "AN IMAGE IS WORTH...

  18. Source: patrick-llgc.github.io
    Title: Also, maybe we need inductive bias for vision tasks
    Link: https://patrick-llgc.github.io/Learning-[Deep-Learning
    Source snippet

    ViT: An Image is Worth 16x16 Words: Transformers for...Transformers lack some inductive biases inherent to CNNs, such as translation equ...

  19. Source: cgarbin.github.io
    Title: vision transformers properties
    Link: https://cgarbin.github.io/vision-transformers-properties/
    Source snippet

    Vision transformer propertiesJul 23, 2022 — This lack of inductive bias in the network architecture is a fundamental difference between t...

  20. Source: maurocomi.com
    Link: https://maurocomi.com/blog/vit.html
    Source snippet

    AX: Building a Vision Transformer from Scratch8 Apr 2025 — Attention Magic: Each embedding “looks” at all the other embeddings and decide...
