How Transformers learned to see in patches
Vision Transformers turned pictures into patch tokens so attention could connect distant parts of an object without relying only on local filters.
On this page
- How an image becomes a sequence of patch tokens
- Why attention helps connect distant visual regions
- Where Vision Transformers still need spatial tricks
Introduction
Transformers learned to process images by borrowing an idea from language: turn a complex input into a sequence of tokens. In a Vision Transformer (ViT), an image is divided into small square regions called patches, and each patch becomes a token that can interact with every other token through self-attention. This seemingly simple change matters because it lets the model connect distant parts of the same object directly, rather than relying solely on layers of local filters to pass information across an image. The result is a system that can recognise relationships spanning large visual distances and build a coherent understanding of objects from scattered visual evidence. [arXiv]
Within the broader story of how Transformers expanded beyond text, image patches were the key adaptation that made visual data compatible with the attention mechanism. Once images could be represented as patch tokens, the same core Transformer architecture used for language could be applied to vision tasks. [arXiv]
How an image becomes a sequence of patch tokens
A digital image is naturally arranged as a two-dimensional grid of pixels, whereas a Transformer expects a sequence. Vision Transformers bridge this gap by splitting an image into fixed-size patches, often 16×16 pixels. Each patch is flattened and linearly projected into an embedding vector, creating a sequence of visual tokens analogous to words in a sentence. [arXiv; Dive into Deep Learning]
Consider a photograph containing a dog. Instead of treating every pixel as its own input, the model typically receives a few hundred patch tokens; a 224×224 image cut into 16×16 patches, for example, yields 14 × 14 = 196 tokens. Some tokens may contain part of an ear, others a paw, fur texture, or background grass. The Transformer does not initially know which patches belong together. Its task is to learn those relationships through attention. [arXiv]
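The patch-splitting step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference ViT implementation: the function name `patchify`, the 224×224 input size, the embedding width of 64, and the randomly initialised projection are all assumptions made for the example. In a trained model the projection weights are learned.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    ordered row by row across the patch grid.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch_size * patch_size * channels)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real photograph
patches = patchify(image, patch_size=16)   # shape (196, 768)

# Illustrative linear projection to the model width (random here, learned in practice).
embed_dim = 64
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ projection              # shape (196, 64)
```

Each row of `tokens` plays the role of one word embedding in a sentence: the first row comes from the top-left 16×16 region, the second from the region immediately to its right, and so on.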
This patch representation serves two purposes:
- It converts an image into a format the Transformer can process.
- It reduces the sequence length compared with treating every pixel as a separate token, making computation practical. [arXiv]
The choice of patch size is partly an engineering compromise. Larger patches reduce computational cost but lose fine detail; smaller patches preserve more information but increase the number of tokens, and with it the cost of attention, which grows with the square of the sequence length. Researchers continue to explore this trade-off, including approaches that use smaller regions or even individual pixels. [arXiv]
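To make the trade-off concrete, the sketch below counts tokens and token-to-token comparisons for a few patch sizes. The 224×224 image size and the helper names are assumptions chosen for illustration, and the pair count covers only a single attention map in one layer, not the full cost of a model.

```python
def token_count(image_side: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_side % patch_size:
        raise ValueError("image_side must be divisible by patch_size")
    return (image_side // patch_size) ** 2

def attention_pairs(num_tokens: int) -> int:
    """Entries in one full self-attention map: every token compared with every token."""
    return num_tokens ** 2

for patch_size in (32, 16, 8, 1):
    n = token_count(224, patch_size)
    print(f"patch {patch_size:>2}px: {n:>6} tokens, {attention_pairs(n):>13,} pairs")
```

Halving the patch side quadruples the token count and multiplies the number of pairwise comparisons by sixteen, which is why pixel-level tokens (patch size 1) are so much more expensive than 16×16 patches.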
Why attention helps connect distant visual regions
The central reason patches help Transformers see objects is that self-attention allows any patch to compare itself with any other patch in the image. Unlike traditional convolutional networks, which primarily exchange information through local neighbourhoods, a Vision Transformer can establish long-range connections from its very first layer. [arXiv; ResearchGate]
Imagine a photograph of a bicycle. The front wheel may appear on one side of the image while the rear wheel appears far away. A convolutional network typically builds understanding gradually, combining local features layer by layer until distant regions influence one another. A Transformer can directly relate the two wheel patches through attention, even if they are separated by large portions of the image. [arXiv]
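A rough calculation shows how gradual the convolutional route can be. The sketch assumes a plain stack of stride-1 convolutions with no pooling or dilation, for which the receptive field grows by `kernel_size - 1` pixels per layer; real CNNs use striding and pooling to grow it much faster, so treat the numbers as an upper-bound illustration. The function names and the 150-pixel distance are hypothetical.

```python
import math

def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field (in pixels) of a stack of stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_span(distance_px: int, kernel_size: int = 3) -> int:
    """Fewest such layers before one output unit sees two points distance_px apart."""
    return math.ceil(distance_px / (kernel_size - 1))

wheel_distance = 150  # hypothetical pixel distance between the two wheels
conv_layers = layers_to_span(wheel_distance)
print(conv_layers, receptive_field(conv_layers))  # 75 layers, 151-pixel receptive field
```

A single self-attention layer, by contrast, lets the two wheel patches exchange information in one step regardless of how far apart they are.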
This ability becomes especially useful when:
- An object spans a large area.
- Parts of an object are separated by cluttered backgrounds.
- Recognition depends on relationships between distant regions.
- Multiple objects interact across the scene. [arXiv]
Researchers analysing Vision Transformers have observed that attention patterns often evolve from relatively local interactions in early layers towards broader image-wide relationships in deeper layers. This suggests that the model gradually assembles object-level understanding by linking increasingly distant visual evidence. [arXiv]
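One way to quantify "local" versus "image-wide" attention is to weight the pixel distance between patch centres by the attention each query places on each key. The sketch below is a simplified version of that idea under stated assumptions: a single head, no class token, and hand-made attention maps instead of ones taken from a trained model. The function names are hypothetical.

```python
import numpy as np

def patch_centres(grid_side: int, patch_size: int) -> np.ndarray:
    """(N, 2) array of (y, x) pixel coordinates of patch centres, row-major order."""
    coords = (np.arange(grid_side) + 0.5) * patch_size
    ys, xs = np.meshgrid(coords, coords, indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1)

def mean_attention_distance(attn: np.ndarray, grid_side: int, patch_size: int) -> float:
    """Average pixel distance between query and key patches, weighted by attention.

    attn has shape (N, N) with each row summing to 1.
    """
    centres = patch_centres(grid_side, patch_size)
    dist = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
    return float((attn * dist).sum(axis=1).mean())

grid_side, patch_size = 14, 16
n = grid_side ** 2
local_attn = np.eye(n)                   # every patch attends only to itself
global_attn = np.full((n, n), 1.0 / n)   # every patch attends equally to all patches

local_distance = mean_attention_distance(local_attn, grid_side, patch_size)
global_distance = mean_attention_distance(global_attn, grid_side, patch_size)
```

A head that behaves like `local_attn` scores zero, while one that spreads attention across the image scores much higher; tracking this number layer by layer is one way to see the local-to-global trend described above.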
A useful way to think about attention is as a dynamic routing system. Instead of being forced to process neighbouring patches first, the model learns which regions matter to each other for the current task. Patches representing different parts of the same object can reinforce one another regardless of their position within the image. [arXiv]
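The routing can be written down directly as scaled dot-product self-attention. The following is a minimal single-head sketch with randomly initialised weights, so the routing it produces is meaningless; in a trained model the matrices `w_q`, `w_k`, and `w_v` (names chosen here for illustration) are learned, and several heads run in parallel.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence of patch tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, N): every patch scored against every patch
    weights = softmax(scores, axis=-1)        # each row is one patch's routing distribution
    return weights @ v, weights

rng = np.random.default_rng(0)
num_tokens, dim = 196, 32
tokens = rng.normal(size=(num_tokens, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
output, weights = self_attention(tokens, w_q, w_k, w_v)
```

Row `i` of `weights` says how much patch `i` draws from every other patch. Nothing in the computation favours neighbours: a patch in one corner can place most of its weight on a patch in the opposite corner if their query and key vectors match.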
Why patches are more than a computational shortcut
It might seem that patches exist only to reduce the number of tokens, but they also encourage the model to work with meaningful visual units rather than isolated pixels. A 16×16 patch often contains edges, textures, colours, or partial object components that provide richer information than single pixels alone. [arXiv]
This creates a useful intermediate representation. The model can learn that certain collections of patches frequently occur together as parts of cars, faces, animals, or buildings. Through repeated exposure to large datasets, attention learns patterns linking these pieces into larger structures. [arXiv]
Interestingly, newer research suggests that patches may not be strictly necessary in principle. Experiments have shown that Transformers can operate directly on individual pixels and still learn useful visual representations, although the computational cost becomes much higher. These findings imply that patches are not the source of visual understanding itself; rather, they provide an efficient way to make attention-based vision practical at scale. [arXiv]
Where Vision Transformers still need spatial tricks
A challenge arises because a Transformer does not inherently understand spatial layout. If patch tokens were presented without positional information, the model would treat them as an unordered collection. It could still see what each patch contains, but it could not tell whether a given patch came from the top-left corner or the bottom-right. [ICLR Blog Posts]
To solve this problem, Vision Transformers add a positional embedding to each patch token that encodes where the patch came from. These signals provide the spatial context needed to reconstruct the image’s structure. Without them, attention could compare patches but would struggle to understand their arrangement. [ICLR Blog Posts]
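The effect can be demonstrated with a small experiment: shuffle the patches and check whether a pooled attention output changes. This sketch assumes a single attention layer followed by mean pooling, with random weights and random positional embeddings standing in for learned ones; the function `attention_pool` is a hypothetical helper, not part of any library.

```python
import numpy as np

def attention_pool(tokens, w_q, w_k, w_v):
    """One self-attention layer followed by mean pooling over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v).mean(axis=0)

rng = np.random.default_rng(0)
num_patches, dim = 16, 8
patches = rng.normal(size=(num_patches, dim))
pos = rng.normal(size=(num_patches, dim))   # learned in practice; random here
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

shuffled = patches[rng.permutation(num_patches)]   # same patches, scrambled layout

# Without positions, the scrambled image produces the same pooled representation.
same_without_pos = np.allclose(
    attention_pool(patches, w_q, w_k, w_v),
    attention_pool(shuffled, w_q, w_k, w_v),
)
# With positions added per grid slot, the two layouts become distinguishable.
same_with_pos = np.allclose(
    attention_pool(patches + pos, w_q, w_k, w_v),
    attention_pool(shuffled + pos, w_q, w_k, w_v),
)
```

Here `same_without_pos` comes out true because self-attention followed by pooling is indifferent to token order, while adding `pos` ties each token to a grid slot, so rearranging the patches changes the result.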
Another limitation is that pure Transformers lack some of the built-in visual assumptions that make convolutional networks efficient learners. CNNs naturally exploit locality and translation equivariance, whereas Vision Transformers must largely learn these properties from data. As a result, early Vision Transformers typically required very large training datasets before matching or surpassing strong convolutional models. [arXiv; NCBI]
This challenge has inspired many hybrid and hierarchical designs that reintroduce spatial structure while preserving the advantages of attention. Some models organise patches into regions, while others perform attention at multiple scales so that local detail and global context can be combined more efficiently. [arXiv]
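The efficiency argument for regional designs can be seen with simple counting. The sketch below compares full global attention with attention restricted to non-overlapping square windows of patches. It is a generic illustration and not the mechanism of RegionViT or any other specific model; the 56×56 patch grid and 7×7 window are assumed values.

```python
def global_attention_pairs(grid_side: int) -> int:
    """Token pairs compared when every patch attends to every patch."""
    num_tokens = grid_side ** 2
    return num_tokens ** 2

def windowed_attention_pairs(grid_side: int, window_side: int) -> int:
    """Token pairs compared when attention is confined to non-overlapping windows."""
    if grid_side % window_side:
        raise ValueError("grid_side must be divisible by window_side")
    num_windows = (grid_side // window_side) ** 2
    tokens_per_window = window_side ** 2
    return num_windows * tokens_per_window ** 2

full = global_attention_pairs(56)
local = windowed_attention_pairs(56, 7)
print(full, local, full // local)
```

Windowing cuts the comparisons by a factor equal to the number of windows, at the price of losing direct long-range links inside a layer. Hierarchical models recover global context by other means, such as coarser region-level tokens or attention at multiple scales.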
The key idea behind object perception in Vision Transformers
Image patches give Transformers a way to convert pictures into token sequences, but attention is what turns those tokens into object understanding. By allowing every patch to communicate with every other patch, the model can assemble scattered visual clues into coherent objects and scenes. Rather than building understanding solely from local neighbourhoods outward, it can reason about global relationships from the start. [arXiv; Hugging Face]
That combination, patches as visual tokens and attention as the mechanism for linking them, is what enabled Transformers to move from language into computer vision and begin recognising objects in images. [arXiv]
Further Reading
Books and field guides related to How Transformers learned to see in patches. Use these as the next step if you want deeper reading beyond the article.
Dive into Deep Learning
Includes modern computer-vision and transformer concepts in an accessible format.
Transformers for Natural Language Processing and Computer Vision
Covers Vision Transformers and the adaptation of attention models to images.
Deep Learning
Builds the foundational understanding needed for Vision Transformer architectures.
Computer Vision
Provides the vision background that Vision Transformers seek to improve upon.
Endnotes
-
Source: arxiv.org
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Link: https://arxiv.org/abs/2010.11929
Published: October 22, 2020
-
Source: arxiv.org
Title: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Link: https://arxiv.org/abs/2406.09415
Published: June 13, 2024
-
Source: researchgate.net
Title: A Comparative Analysis of Convolutional Neural Network and Vision Transformer Embeddings on a Novel Domain-Specific Task
Link: https://www.researchgate.net/publication/381001655_A_Comparative_Analysis_of_Convolutional_Neural_Network_and_Vision_Transformer_Embeddings_on_a_Novel_Domain-Specific_Task
Published: May 30, 2024
-
Source: arxiv.org
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929v2 [cs.CV], PDF)
Link: https://arxiv.org/pdf/2010.11929
Published: October 22, 2020
-
Source: ncbi.nlm.nih.gov
Title: Transformers and Visual Transformers
Link: https://www.ncbi.nlm.nih.gov/books/NBK597474/
Published: July 23, 2023
-
Source: arxiv.org
Title: RegionViT: Regional-to-Local Attention for Vision Transformers
Link: https://arxiv.org/abs/2106.02689
Published: June 4, 2021
-
Source: arxiv.org
Title: Vision Transformers with Hierarchical Attention
Link: https://arxiv.org/abs/2106.03180
-
Source: arxiv.org
Title: A survey of the Vision Transformers and its CNN-...
Link: https://arxiv.org/vc/arxiv/papers/2305/2305.09880v2.pdf
-
Source: arxiv.org
Title: A survey of the Vision Transformers and their CNN-...
Link: https://arxiv.org/pdf/2305.09880
-
Source: researchgate.net
Title: Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length
Link: https://www.researchgate.net/publication/352016819_Not_All_Images_are_Worth_16x16_Words_Dynamic_Vision_Transformers_with_Adaptive_Sequence_Length
-
Source: d2l.ai
Title: Dive into Deep Learning, 11.8. Transformers for Vision
Link: https://d2l.ai/chapter_attention-mechanisms-and-transformers/vision-transformer.html
-
Source: kmsrogerkim.github.io
Title: ViT: AN IMAGE IS WORTH 16X16 WORDS (Roger's Blog)
Link: https://kmsrogerkim.github.io/ai/vit/
Published: August 18, 2025
-
Source: huggingface.co
Title: when transformers invade computer vision
Link: https://huggingface.co/blog/RDTvlokip/when-transformers-invade-computer-vision
Published: November 4, 2025
-
Source: iclr-blogposts.github.io
Title: Positional Embeddings in Transformer Models (ICLR Blog Posts)
Link: https://iclr-blogposts.github.io/2025/blog/positional-embedding/
Published: April 28, 2025
-
Source: cameronrwolfe.substack.com
Title: Vision Transformers, by Cameron R. Wolfe, Ph.D.
Link: https://cameronrwolfe.substack.com/p/vision-transformers
-
Source: github.com
Title: SkiddieAhn/Code-Vision-Transformer (PyTorch tutorial code for understanding ViT)
Link: https://github.com/SkiddieAhn/Code-Vision-Transformer
-
Source: github.com
Title: tahmid0007/VisionTransformer
Link: https://github.com/tahmid0007/VisionTransformer
-
Source: patrick-llgc.github.io
Title: ViT: An Image is Worth 16x16 Words: Transformers for...
Link: https://patrick-llgc.github.io/Learning-[Deep-Learning
-
Source: cgarbin.github.io
Title: Vision transformer properties
Link: https://cgarbin.github.io/vision-transformers-properties/
Published: July 23, 2022
-
Source: maurocomi.com
Title: AX: Building a Vision Transformer from Scratch
Link: https://maurocomi.com/blog/vit.html
Published: April 8, 2025
Additional References
-
Source: medium.com
Title: A Patch is More than 16*16 Pixels, by Mengliu Zhao
Link: https://medium.com/data-science/a-patch-is-more-than-16-16-pixels-699359211513
-
Source: medium.com
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Link: https://medium.com/%40EleventhHourEnthusiast/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-d5a9ad816a80
-
Source: medium.com
Title: Building a Vision Transformer (ViT) from Scratch in PyTorch
Link: https://medium.com/%40ozzgur.sanli/building-a-vision-transformer-vit-from-scratch-in-pytorch-a-data-driven-guide-a2e34a774488
-
Source: sebastianraschka.com
Title: Chapter 13: Large Training Sets for Vision Transformers
Link: https://sebastianraschka.com/books/ml-q-and-ai-chapters/ch13/
-
Source: openreview.net
Title: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Link: https://openreview.net/forum?id=tjNf0L8QjR
-
Source: openreview.net
Title: ViTAE: Vision Transformer Advanced by Exploring Intrinsic...
Link: https://openreview.net/pdf?id=_WnAQKse_uK
-
Source: facebook.com
Title: Transformers for Image Recognition at Scale
Link: https://www.facebook.com/groups/DeepNetGroup/posts/1281346242258255/
-
Source: medium.com
Title: Vision Transformer (ViT): AN IMAGE IS WORTH 16X16...
Link: https://medium.com/%40ManishChablani/vision-transformer-vit-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-a4bd5c6f17a7
-
Source: pinecone.io
Title: Vision Transformers (ViT) Explained
Link: https://www.pinecone.io/learn/series/image-search/vision-transformers/
-
Source: scispace.com
Link: https://scispace.com/papers/an-image-is-worth-16x16-words-transformers-for-image-v85s5ahlww