How Transformers learned to see in patches

Introduction

Transformers learned to process images by borrowing an idea from language: turn a complex input into a sequence of tokens. In a Vision Transformer (ViT), an image is divided into small square regions called patches, and each patch becomes a token that can interact with every other token through self-attention. This seemingly simple change is important because it allows the model to connect distant parts of the same object directly, rather than relying solely on layers of local filters to pass information across an image. The result is a system that can recognise relationships spanning large visual distances and build a coherent understanding of objects from scattered visual evidence. [arXiv: An Image is Worth 16x16 Words, 2020]

Image patches illustration 1

Within the broader story of how Transformers expanded beyond text, image patches were the key adaptation that made visual data compatible with the attention mechanism. Once images could be represented as patch tokens, the same core Transformer architecture used for language could be applied to vision tasks. [arXiv: An Image is Worth 16x16 Words, 2020]

How an image becomes a sequence of patch tokens

A digital image is naturally arranged as a two-dimensional grid of pixels, whereas a Transformer expects a sequence. Vision Transformers bridge this gap by splitting an image into fixed-size patches, often 16×16 pixels. Each patch is flattened into a vector of pixel values and passed through a learned linear projection to produce an embedding, creating a sequence of visual tokens analogous to words in a sentence. [arXiv: An Image is Worth 16x16 Words, 2020; Dive into Deep Learning]
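The splitting and flattening step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the reference implementation: the 224×224 image size, the embedding width of 64, the random projection matrix, and the function name `image_to_patch_tokens` are all assumptions chosen for the example. In a trained model the projection weights are learned.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C): group pixels by patch.
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector: (rows * cols, P * P * C).
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # stand-in for a real photograph (assumed size)
tokens = image_to_patch_tokens(image)    # one row per patch, in row-major grid order

# Linear projection to the model width. The width (64) and the random weights
# are illustrative assumptions; a real ViT learns this matrix during training.
embed_dim = 64
projection = rng.normal(scale=0.02, size=(tokens.shape[1], embed_dim))
patch_embeddings = tokens @ projection

print(tokens.shape)            # (196, 768)
print(patch_embeddings.shape)  # (196, 64)
```

Under these assumptions a 224×224 RGB image becomes 196 tokens, each carrying the 768 raw values of one 16×16 patch before projection.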

Consider a photograph containing a dog. Instead of processing millions of individual pixel values directly, the model receives dozens or hundreds of patch tokens. Some tokens may contain part of an ear, others a paw, fur texture, or background grass. The Transformer does not initially know which patches belong together. Its task is to learn those relationships through attention. [arXiv: An Image is Worth 16x16 Words, 2020]

This patch representation serves two purposes:

It converts an image into a format the Transformer can process.
It reduces the sequence length compared with treating every pixel as a separate token, making computation practical. [arXiv: An Image is Worth 16x16 Words, 2020]

The choice of patch size is partly an engineering compromise. Larger patches reduce computational cost but lose fine detail; smaller patches preserve more information but increase the number of tokens dramatically, and the cost of self-attention grows with the square of the token count. Researchers continue to explore this trade-off, including approaches that use smaller regions or even individual pixels. [arXiv: An Image is Worth More Than 16x16 Patches, 2024]
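The trade-off is easy to see with a little arithmetic. The sketch below assumes a square 224×224 image (a size chosen for illustration, not stated above) and counts tokens and token pairs, the quantity that full self-attention has to score, for several patch sizes. The helper name `token_count` is hypothetical.

```python
def token_count(image_side, patch_side):
    """Number of non-overlapping square patches in a square image."""
    if image_side % patch_side:
        raise ValueError("image_side must be divisible by patch_side")
    return (image_side // patch_side) ** 2

IMAGE_SIDE = 224  # assumed for illustration

for patch_side in (32, 16, 8, 1):
    tokens = token_count(IMAGE_SIDE, patch_side)
    pairs = tokens * tokens  # entries in one full attention matrix
    print(f"patch {patch_side:>2}x{patch_side:<2} -> {tokens:>6} tokens, {pairs:>13,} pairs")

# patch 32x32 ->     49 tokens,         2,401 pairs
# patch 16x16 ->    196 tokens,        38,416 pairs
# patch  8x8  ->    784 tokens,       614,656 pairs
# patch  1x1  ->  50176 tokens, 2,517,630,976 pairs
```

Halving the patch side quadruples the token count and multiplies the number of attention pairs by sixteen, which is why treating every pixel as a token is so much more expensive.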

Why attention helps connect distant visual regions

The central reason patches help Transformers see objects is that self-attention allows any patch to compare itself with any other patch in the image. Unlike traditional convolutional networks, which primarily exchange information through local neighbourhoods, a Vision Transformer can establish long-range connections within a single layer. [arXiv: An Image is Worth 16x16 Words, 2020; ResearchGate]

Imagine a photograph of a bicycle. The front wheel may appear on one side of the image while the rear wheel appears far away. A convolutional network typically builds understanding gradually, combining local features layer by layer until distant regions influence one another. A Transformer can directly relate the two wheel patches through attention, even if they are separated by large portions of the image. [arXiv: An Image is Worth 16x16 Words, 2020]
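A single-head version of scaled dot-product self-attention makes this concrete. The sketch below is a simplified illustration with random weights; the token width of 32, the 14×14 grid, and the function names are assumptions for the example, and a real ViT uses multiple heads, learned weights, and further layers around this step.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (N, N): every patch vs every patch
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_patches, dim = 196, 32                    # 14x14 grid, illustrative width
tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)

# In row-major order, patch 0 is the top-left corner and patch 195 the
# bottom-right. One layer already gives them a direct, non-zero connection.
top_left, bottom_right = 0, 195
print(weights.shape)                          # (196, 196)
print(weights[top_left, bottom_right] > 0)    # True
```

The attention matrix has an entry for every pair of patches, so the distance between two patches on the image grid places no limit on whether they can exchange information.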

This ability becomes especially useful when:

An object spans a large area.
Parts of an object are separated by cluttered backgrounds.
Recognition depends on relationships between distant regions.
Multiple objects interact across the scene. [arXiv: An Image is Worth 16x16 Words, 2020]

Researchers analysing Vision Transformers have observed that attention patterns often evolve from relatively local interactions in early layers towards broader image-wide relationships in deeper layers. This suggests that the model gradually assembles object-level understanding by linking increasingly distant visual evidence. [arXiv: An Image is Worth 16x16 Words, 2020]
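One way to quantify how local or global an attention pattern is, is to average the pixel distance between each query patch and the patches it attends to, weighted by the attention weights. The sketch below is an assumed, simplified version of such a measurement, not the exact procedure used in any particular study; the function names and the 4×4 toy grid are hypothetical.

```python
import numpy as np

def patch_centres(grid_side, patch_size=16):
    """Pixel coordinates of each patch centre, in row-major patch order."""
    rows, cols = np.divmod(np.arange(grid_side * grid_side), grid_side)
    return (np.stack([rows, cols], axis=1) + 0.5) * patch_size

def mean_attention_distance(weights, grid_side, patch_size=16):
    """Attention-weighted average distance (in pixels) between patches."""
    centres = patch_centres(grid_side, patch_size)
    diffs = centres[:, None, :] - centres[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)           # (N, N)
    return float((weights * distances).sum(axis=1).mean())

grid_side = 4
n = grid_side * grid_side

self_only = np.eye(n)                  # each patch attends only to itself
uniform = np.full((n, n), 1.0 / n)     # each patch attends equally to all patches

print(mean_attention_distance(self_only, grid_side))   # 0.0
print(mean_attention_distance(uniform, grid_side))     # larger: attention spans the grid
```

Applied to the attention weights of each layer in a trained model, a measure like this would rise with depth if attention does broaden from local to image-wide relationships.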

A useful way to think about attention is as a dynamic routing system. Instead of being forced to process neighbouring patches first, the model learns which regions matter to each other for the current task. Patches representing different parts of the same object can reinforce one another regardless of their position within the image. [arXiv: An Image is Worth 16x16 Words, 2020]

Image patches illustration 2

Why patches are more than a computational shortcut

It might seem that patches exist only to reduce the number of tokens, but they also encourage the model to work with meaningful visual units rather than isolated pixels. A 16×16 patch often contains edges, textures, colours, or partial object components that provide richer information than single pixels alone. [arXiv: An Image is Worth 16x16 Words, 2020]

This creates a useful intermediate representation. The model can learn that certain collections of patches frequently occur together as parts of cars, faces, animals, or buildings. Through repeated exposure to large datasets, the attention layers learn patterns that link these pieces into larger structures. [arXiv: An Image is Worth 16x16 Words, 2020]

Interestingly, newer research suggests that patches may not be strictly necessary in principle. Experiments have shown that Transformers can operate directly on individual pixels and still learn useful visual representations, although the computational cost becomes much higher. These findings imply that patches are not the source of visual understanding itself; rather, they provide an efficient way to make attention-based vision practical at scale. [arXiv: An Image is Worth More Than 16x16 Patches, 2024]

Where Vision Transformers still need spatial tricks

A challenge arises because a Transformer does not inherently understand spatial layout. If patch tokens were presented without positional information, the model would treat them as an unordered collection. A patch from the top-left corner would be processed exactly as if it had come from the bottom-right. [ICLR Blog Posts: Positional Embeddings in Transformer Models, 2025]

To solve this problem, Vision Transformers add positional embeddings that encode where each patch came from. These signals provide the spatial context needed to reconstruct the image’s structure. Without them, attention could compare patches but would struggle to understand their arrangement. [ICLR Blog Posts: Positional Embeddings in Transformer Models, 2025]
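The effect can be demonstrated with a small experiment. In the sketch below, which uses random weights and a hypothetical 3×3 grid of patch embeddings, the pooled output of self-attention is unchanged when the patches are reordered, unless a position-dependent embedding is added first. The positional embeddings here are random stand-ins for values a real model would learn or compute.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x, w_q, w_k, w_v):
    """Single-head self-attention; returns one output vector per token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
num_patches, dim = 9, 8                           # 3x3 grid, illustrative width
patches = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
position = rng.normal(size=(num_patches, dim))    # stand-in positional embeddings

reorder = np.arange(num_patches)[::-1]            # present the patches in reverse order

# Without positions: the pooled summary ignores where each patch sits.
plain = attend(patches, w_q, w_k, w_v).mean(axis=0)
plain_reordered = attend(patches[reorder], w_q, w_k, w_v).mean(axis=0)
same_without_position = np.allclose(plain, plain_reordered)

# With positions: the embedding is tied to the slot, so moving a patch
# to a different slot changes the result.
placed = attend(patches + position, w_q, w_k, w_v).mean(axis=0)
placed_reordered = attend(patches[reorder] + position, w_q, w_k, w_v).mean(axis=0)
same_with_position = np.allclose(placed, placed_reordered)

print(same_without_position, same_with_position)
```

The first comparison holds because self-attention treats its input as a set; the second fails because the added embeddings tie each patch to the slot it occupies.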

Another limitation is that pure Transformers lack some of the built-in visual assumptions that make convolutional networks efficient learners. CNNs naturally exploit locality and translation equivariance, whereas Vision Transformers must often learn these properties from data. As a result, early Vision Transformers typically required very large training datasets before matching or surpassing strong convolutional models. [arXiv: An Image is Worth 16x16 Words, 2020; NCBI]

This challenge has inspired many hybrid and hierarchical designs that reintroduce spatial structure while preserving the advantages of attention. Some models organise patches into regions, while others perform attention at multiple scales so that local detail and global context can be combined more efficiently. [arXiv: RegionViT: Regional-to-Local Attention for Vision Transformers, 2021]
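As a rough illustration of how attention can be restricted to regions, the sketch below builds a boolean mask that only allows patches in the same square window to attend to one another. This is a generic window mask written for this example, not the RegionViT algorithm or any specific published design; the function name, the 14×14 grid, and the 7×7 window are assumptions.

```python
import numpy as np

def window_mask(grid_side, window_side):
    """Boolean (N, N) mask: True where two patches share a square window."""
    if grid_side % window_side:
        raise ValueError("grid_side must be divisible by window_side")
    rows, cols = np.divmod(np.arange(grid_side * grid_side), grid_side)
    windows_per_row = grid_side // window_side
    window_id = (rows // window_side) * windows_per_row + cols // window_side
    return window_id[:, None] == window_id[None, :]

mask = window_mask(grid_side=14, window_side=7)   # four 7x7 windows

allowed = int(mask.sum())
total = mask.size
print(allowed, total, allowed / total)            # 9604 38416 0.25

# In an attention layer, scores at positions where the mask is False would
# typically be set to a large negative value before the softmax.
```

With four windows, only a quarter of the patch pairs are scored, which cuts cost but removes direct long-range links. Designs of this kind usually add a second mechanism, such as coarser regional tokens or multiple scales, to carry global context.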

Image patches illustration 3

The key idea behind object perception in Vision Transformers

Image patches give Transformers a way to convert pictures into token sequences, but attention is what turns those tokens into object understanding. By allowing every patch to communicate with every other patch, the model can assemble scattered visual clues into coherent objects and scenes. Rather than building understanding solely from local neighbourhoods outward, it can reason about global relationships from the start. [arXiv: An Image is Worth 16x16 Words, 2020; Hugging Face]

That combination, patches as visual tokens and attention as the mechanism for linking them, is what enabled Transformers to move from language into computer vision and begin recognising objects in images. [arXiv: An Image is Worth 16x16 Words, 2020]

Endnotes

Source: arxiv.org
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Link: https://arxiv.org/abs/2010.11929
Source snippet
An Image is Worth 16x16 Words: Transformers for Image Recognition at ScaleOctober 22, 2020...

Published: October 22, 2020
Source: arxiv.org
Link: https://arxiv.org/abs/2406.09415
Source snippet
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual PixelsJune 13, 2024...

Published: June 13, 2024
Source: researchgate.net
Title: A Comparative Analysis of Convolutional Neural Network and Vision Transformer Embeddings on a Novel Domain-Specific Task
Link: https://www.researchgate.net/publication/381001655_A_Comparative_Analysis_of_Convolutional_Neural_Network_and_Vision_Transformer_Embeddings_on_a_Novel_Domain-Specific_Task
Source snippet
A Comparative Analysis of Convolutional Neural Network...May 30, 2024 — The purpose of our study was to compare the performance of embed...

Published: May 30, 2024
Source: arxiv.org
Title: CNNs. In CNNs, locality, two-dimensional neighborhood
Link: https://arxiv.org/pdf/2010.11929
Source snippet
arXiv:2010.11929v2 [cs.CV] 3 Jun 2021October 22, 2020 — by A Dosovitskiy · 2020 · Cited by 94276 — We note that Vision Transformer h...

Published: October 22, 2020
Source: ncbi.nlm.nih.gov
Title: Transformers and Visual Transformers
Link: https://www.ncbi.nlm.nih.gov/books/NBK597474/
Source snippet
and Visual Transformers - NCBI23 Jul 2023 — Since CNNs were specifically created for vision tasks, their architecture includes spatial in...
Source: arxiv.org
Title: RegionViT: Regional-to-Local Attention for Vision Transformers
Link: https://arxiv.org/abs/2106.02689
Source snippet
RegionViT: Regional-to-Local Attention for Vision TransformersJune 4, 2021...

Published: June 4, 2021
Source: arxiv.org
Title: Vision Transformers with Hierarchical Attention
Link: https://arxiv.org/abs/2106.03180
Source: arxiv.org
Link: https://arxiv.org/vc/arxiv/papers/2305/2305.09880v2.pdf
Source snippet
A survey of the Vision Transformers and its CNN-...by A Khan · Cited by 436 — In the realm of computer vision tasks, vision transformers...
Source: arxiv.org
Link: https://arxiv.org/pdf/2305.09880
Source snippet
A survey of the Vision Transformers and their CNN-...by A Khan · 2023 · Cited by 438 — In the realm of computer vision tasks, ViTs have...
Source: researchgate.net
Link: https://www.researchgate.net/publication/352016819_Not_All_Images_are_Worth_16x16_Words_Dynamic_Vision_Transformers_with_Adaptive_Sequence_Length
Source snippet
(PDF) Not All Images are Worth 16x16 Words: Dynamic...In this paper, we argue that every image has its own characteristics, and ideally...
Source: d2l.ai
Link: https://d2l.ai/chapter_attention-mechanisms-and-transformers/vision-transformer.html
Source snippet
Dive into Deep Learning11.8. Transformers for VisionIn this way, image patches can be treated similarly to tokens in text sequences by Tr...
Source: kmsrogerkim.github.io
Title: ViT: AN IMAGE IS WORTH 16X16 WORDS (Roger’s Blog)
Link: https://kmsrogerkim.github.io/ai/vit/
Source snippet
ViT: AN IMAGE IS WORTH 16X16 WORDS - Roger's Blog18 Aug 2025 — Just like how strings were tokenized and embedded, an image is now split i...
Source: huggingface.co
Title: when transformers invade computer vision
Link: https://huggingface.co/blog/RDTvlokip/when-transformers-invade-computer-vision
Source snippet
! 🖼️4 Nov 2025 — Inductive bias: locality, translation equivariance. ViT thinking 👁️. Global from start (attention across all patches); L...
Source: iclr-blogposts.github.io
Title: positional embedding
Link: https://iclr-blogposts.github.io/2025/blog/positional-embedding/
Source snippet
ICLR Blog PostsPositional Embeddings in Transformer ModelsApr 28, 2025 — This blog post examines positional encoding techniques, emphasiz...
Source: cameronrwolfe.substack.com
Title: vision transformers
Link: https://cameronrwolfe.substack.com/p/vision-transformers
Source snippet
Transformers - by Cameron R. Wolfe, Ph.D.An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [1]. Although the tra...
Source: github.com
Title: Original paper: An Image is Worth 16x16 Words: Transformers for Image
Link: https://github.com/SkiddieAhn/Code-Vision-Transformer
Source snippet
SkiddieAhn/Code-Vision-Transformer: [ICLR 2021] An...Provide the PyTorch tutorial code for understanding ViT (Vision Transformer) model...
Source: github.com
Title: Vision Transformer
Link: https://github.com/tahmid0007/VisionTransformer
Source snippet
tahmid0007/VisionTransformer: A...A complete easy to follow implementation of Google's Vision Transformer proposed in "AN IMAGE IS WORTH...
Source: patrick-llgc.github.io
Title: Also, maybe we need inductive bias for vision tasks
Link: https://patrick-llgc.github.io/Learning-[Deep-Learning
Source snippet
ViT: An Image is Worth 16x16 Words: Transformers for...Transformers lack some inductive biases inherent to CNNs, such as translation equ...
Source: cgarbin.github.io
Title: vision transformers properties
Link: https://cgarbin.github.io/vision-transformers-properties/
Source snippet
Vision transformer propertiesJul 23, 2022 — This lack of inductive bias in the network architecture is a fundamental difference between t...
Source: maurocomi.com
Link: https://maurocomi.com/blog/vit.html
Source snippet
AX: Building a Vision Transformer from Scratch8 Apr 2025 — Attention Magic: Each embedding “looks” at all the other embeddings and decide...

Additional References

Source: medium.com
Link: https://medium.com/data-science/a-patch-is-more-than-16-16-pixels-699359211513
Source snippet
A Patch is More than 16*16 Pixels | by Mengliu ZhaoThe Vision Transformer (ViT) uses 16*16 size patches as input tokens. It all dates bac...
Source: medium.com
Link: https://medium.com/%40EleventhHourEnthusiast/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-d5a9ad816a80
Source snippet
An Image is Worth 16x16 Words: Transformers for...The authors introduced ViT, a new approach to image classification, leveraging transfo...
Source: medium.com
Link: https://medium.com/%40ozzgur.sanli/building-a-vision-transformer-vit-from-scratch-in-pytorch-a-data-driven-guide-a2e34a774488
Source snippet
Building a Vision Transformer (ViT) from Scratch in PyTorchWe will build a Vision Transformer from scratch, using the famous “An Image is...
Source: sebastianraschka.com
Link: https://sebastianraschka.com/books/ml-q-and-ai-chapters/ch13/
Source snippet
Chapter 13: Large Training Sets for Vision TransformersLike fully connected networks, ViT architecture (and transformer architecture in g...
Source: openreview.net
Link: https://openreview.net/forum?id=tjNf0L8QjR
Source snippet
An Image is Worth More Than 16x16 Patches: Exploring...by DK Nguyen · Cited by 83 — The paper studies the role of locality biases in Vis...
Source: openreview.net
Link: https://openreview.net/pdf?id=_WnAQKse_uK
Source snippet
ViTAE: Vision Transformer Advanced by Exploring Intrinsic...by Y Xu · Cited by 517 — In this paper, we re-design the transformer block b...
Source: facebook.com
Link: https://www.facebook.com/groups/DeepNetGroup/posts/1281346242258255/
Source snippet
Transformers for Image Recognition at ScaleHave a look at this great article explaining the Vit. ViT — An Image is worth 16x16 words: Tra...
Source: medium.com
Link: https://medium.com/%40ManishChablani/vision-transformer-vit-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-a4bd5c6f17a7
Source snippet
Vision Transformer (ViT) — AN IMAGE IS WORTH 16X16...Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficie...
Source: pinecone.io
Link: https://www.pinecone.io/learn/series/image-search/vision-transformers/
Source snippet
Vision Transformers (ViT) ExplainedTherefore, we create image patches and embed those as patch embeddings.... Dosovitskiy et al., An Ima...
Source: scispace.com
Link: https://scispace.com/papers/an-image-is-worth-16x16-words-transformers-for-image-v85s5ahlww
