Why deeper layers should learn to do nothing

Introduction

Residual blocks make identity mapping easier by changing what the added layers have to learn. In a plain deep network, extra layers must actively recreate the input if they are not needed. In a residual block, the input already travels around the learned layers through a shortcut, so the learned branch only has to add a correction. If no correction is useful, that branch can move towards zero and the whole block behaves almost like “do nothing”. This is the practical reason residual learning helped with the degradation problem: deeper models became less risky because unnecessary depth no longer had to perfectly rebuild what was already there. The original ResNet paper framed this directly: if identity mapping is optimal, it should be easier to push the residual part to zero than to make stacked nonlinear layers learn identity from scratch. [cv-foundation.org]

Identity mapping illustration 1

The degradation problem was not just “too many parameters”

The puzzle behind residual networks was simple but important. If a shallower network works well, a deeper version should be able to match it by copying the shallower solution and letting the extra layers act as identity mappings. In theory, the deeper model contains the shallower model as a special case. In practice, plain deep networks often failed to find that special case during training, and training error could get worse as depth increased. [cv-foundation.org]

That matters because degradation is an optimisation problem, not merely an overfitting problem. If test accuracy falls because the model memorises the training data, that is one kind of failure. But if training error itself rises when more layers are added, the model is struggling to fit even the data it is trained on. Residual blocks were designed for exactly this situation: not to make every layer powerful, but to make harmless extra depth easier to represent.

The identity mapping case is the cleanest way to see the problem. A layer stack that should do nothing still contains weights, nonlinearities and normalisation steps. Learning “copy the input exactly” through that machinery can be surprisingly awkward. A residual block avoids making the learned path responsible for the copy.

Why zero change is easier than rebuilding the input

A residual block can be summarised as: output equals the input plus a learned change. The shortcut carries the input forward directly, while the residual branch learns only the difference between the desired output and that input. In ResNet notation, instead of forcing layers to learn a full mapping, the architecture asks them to learn a residual mapping relative to the input. [cv-foundation.org]
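Written out in the notation of the ResNet paper, with x the block input, H(x) the mapping the block should ideally compute, and F the residual branch with weights W_i, the summary above becomes the following. This is a restatement of the paper's formulation rather than a derivation, and it leaves out the activation that the original design applies after the addition.

```latex
\begin{align}
  F(x) &:= H(x) - x, \\
  y &= F(x, \{W_i\}) + x .
\end{align}
% If the identity mapping is optimal, H(x) = x, so the target for the
% learned branch is simply F(x) = 0.
```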

This changes the default behaviour of the block. In a plain stack, “do nothing” means the weights and activations must combine to reproduce the input. In a residual block, “do nothing” means the residual branch contributes approximately zero. That is a much simpler target for optimisation: suppress the correction rather than reconstruct the whole signal.
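The contrast can be made concrete with a toy example. The sketch below uses plain Python lists in place of tensors, and the helper names (`linear`, `plain_block`, `residual_block`) are illustrative, not taken from any library. It assumes a two-layer branch with a ReLU in between, and it omits normalisation and the post-addition activation of the original design.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a square weight matrix (a list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def plain_block(w1, w2, x):
    # Plain stack: the output is whatever the learned layers produce.
    return relu(linear(w2, relu(linear(w1, x))))

def residual_block(w1, w2, x):
    # Residual form: the learned branch is added to the unchanged input.
    branch = linear(w2, relu(linear(w1, x)))
    return [a + b for a, b in zip(x, branch)]

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# Zero weights: the plain stack wipes the signal, the residual block copies it.
plain_zero_out = plain_block(zeros, zeros, x)
residual_zero_out = residual_block(zeros, zeros, x)

# Even identity weights do not give a plain stack an exact copy here,
# because the ReLUs discard the negative component.
plain_identity_out = plain_block(identity, identity, x)
```

With all weights at zero, the residual block returns `x` unchanged while the plain stack returns zeros. The last line shows that simply setting the plain stack's weights to the identity matrix is not enough to copy an input with negative entries.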

The distinction is small in notation but large in training. A plain layer stack has to preserve all useful information through every transformation. A residual block can preserve information through the shortcut and use the main branch only where change is beneficial. This makes identity mapping an accessible fallback rather than a fragile solution hidden somewhere in a complicated parameter space.

Identity mapping illustration 2

The shortcut makes depth a safer bet

Residual blocks made extra depth less dangerous because additional layers no longer had to justify themselves immediately. If a new block helps, its residual branch can learn a useful correction. If it does not help, the shortcut allows the signal to pass through with limited disturbance. This is why residual learning turned “add more layers” from a risky bet into a more manageable design choice.

Later work on identity mappings sharpened this idea. He, Zhang, Ren and Sun argued that forward and backward signals can propagate directly from one residual unit to any other unit when both the skip connection and the after-addition activation act as identity mappings. Their ablation experiments supported the importance of keeping this path clean, leading to the pre-activation residual unit, where normalisation and activation are moved before the weight layers rather than after the addition. [arXiv]
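The propagation argument can be stated compactly. Following the notation of the identity-mappings paper, let x_l be the input to unit l, F the residual branch with weights W_l, and \mathcal{E} the loss. The equations below assume both the shortcut and the after-addition step are exact identities.

```latex
\begin{align}
  x_{l+1} &= x_l + F(x_l, W_l), \\
  x_L &= x_l + \sum_{i=l}^{L-1} F(x_i, W_i), \\
  \frac{\partial \mathcal{E}}{\partial x_l}
    &= \frac{\partial \mathcal{E}}{\partial x_L}
       \left( 1 + \frac{\partial}{\partial x_l}
       \sum_{i=l}^{L-1} F(x_i, W_i) \right).
\end{align}
```

The constant 1 in the last line is the part of the gradient that reaches unit l directly from the deeper unit L, without passing through any weight layer.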

The lesson is not that every residual block should literally do nothing. It is that the network should not be punished for having blocks that temporarily or permanently make little change. Some blocks may learn strong transformations; others may behave closer to identity. Residual design gives the optimiser that flexibility.

Why implementation details matter

Identity mapping is easiest when the shortcut is genuinely simple. The strongest version is an unweighted identity shortcut: the block input is added directly to the residual branch output, with no extra parameters and no extra transformation. Educational treatments of ResNet often emphasise this point because the shortcut itself introduces little computational burden while preserving a clean route for information. [ApX Machine Learning]

The pre-activation design goes further by reducing obstacles on the identity path. In the original post-activation layout, operations such as ReLU after the addition could still interfere with direct signal propagation. The pre-activation paper treated batch normalisation and ReLU as part of the residual branch instead, helping the main skip route remain closer to identity. This design was reported to make very deep networks, including 1001-layer ResNets on CIFAR benchmarks, easier to train and better at generalising than the earlier formulation. [arXiv]
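A small sketch can show what "interfering with the identity path" means. It is a toy illustration under stated assumptions, not the paper's implementation: vectors are plain Python lists, batch normalisation is left out, and `zero_branch` is a hypothetical residual branch that has learned to output nothing.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def zero_branch(v):
    # Hypothetical residual branch that contributes no correction.
    return [0.0 for _ in v]

def post_activation_unit(x, branch):
    # Original layout: ReLU is applied after the addition, on the main path.
    summed = [a + b for a, b in zip(x, branch(x))]
    return relu(summed)

def pre_activation_unit(x, branch):
    # Pre-activation layout: the activation sits inside the branch,
    # so nothing is applied to the signal after the addition.
    return [a + b for a, b in zip(x, branch(relu(x)))]

def stack(unit, x, depth, branch):
    for _ in range(depth):
        x = unit(x, branch)
    return x

x = [0.5, -1.5, 2.0]
post_out = stack(post_activation_unit, x, 10, zero_branch)
pre_out = stack(pre_activation_unit, x, 10, zero_branch)
```

Even with a branch that contributes nothing, the post-activation stack alters the signal because the ReLU on the main path clips the negative entry. The pre-activation stack returns the input unchanged.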

This is a useful implementation lesson for artificial intelligence systems more broadly: the abstract idea “add a shortcut” is not enough. Whether the shortcut is clean, scaled, gated, normalised or interrupted can change how easily a model preserves information.
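As one illustration of why a scaled shortcut behaves differently from a clean one, the sketch below passes a scalar through a stack of blocks whose residual branch outputs zero and whose shortcut multiplies the signal by a constant. The function name and the chosen scales are illustrative assumptions. The identity-mappings paper analyses this constant-scaling case more carefully.

```python
def scaled_shortcut_stack(x, scale, depth):
    """Pass x through `depth` blocks whose residual branch outputs zero
    and whose shortcut multiplies the signal by `scale`."""
    for _ in range(depth):
        residual = 0.0          # the branch has learned to do nothing
        x = scale * x + residual
    return x

clean = scaled_shortcut_stack(1.0, 1.0, 50)   # unchanged signal
shrunk = scaled_shortcut_stack(1.0, 0.5, 50)  # 0.5 ** 50, vanishingly small
grown = scaled_shortcut_stack(1.0, 1.5, 50)   # 1.5 ** 50, very large
```

Only the unscaled shortcut leaves the signal intact over 50 blocks. Any constant factor other than 1 compounds with depth, and the same multiplicative factor applies to gradients flowing back along the shortcut.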

Identity mapping illustration 3

What residual blocks do not magically solve

Residual blocks make identity mapping easier, but they do not remove every training difficulty. The residual branch still has to learn useful transformations, and very deep networks can still be sensitive to initialisation, normalisation, learning rate and architecture details. The identity route is a stabilising design choice, not a guarantee that any arbitrarily deep model will train well.

Research after ResNet also refined the story. A NeurIPS 2020 paper argued that batch normalisation helps deep residual networks partly because, at initialisation, it downscales the residual branch relative to the skip connection, making blocks close to identity on average. That supports the broader idea that trainable deep residual networks often begin from, or are biased towards, a safe near-identity behaviour before learning stronger changes. [arXiv]
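The variance argument can be illustrated with a toy simulation. This is a rough sketch, not the paper's experiment. It assumes each residual branch can be replaced by fresh random noise that is independent of the skip signal and normalised to zero mean and unit variance, as a crude stand-in for a batch-normalised branch at initialisation. The names `normalise` and `simulate` are hypothetical.

```python
import random
import statistics

def normalise(v):
    """Shift and scale v to zero mean and unit variance."""
    mean = statistics.fmean(v)
    std = statistics.pstdev(v)
    return [(a - mean) / std for a in v]

def simulate(num_blocks, width=2000, seed=0):
    """Return, for each block, branch variance divided by skip variance."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    ratios = []
    for _ in range(num_blocks):
        # Stand-in for a normalised residual branch (assumed independent of x).
        branch = normalise([rng.gauss(0.0, 3.0) for _ in range(width)])
        ratios.append(statistics.pvariance(branch) / statistics.pvariance(x))
        # Residual update: skip signal plus branch output.
        x = [a + b for a, b in zip(x, branch)]
    return ratios

ratios = simulate(16)
```

Under these assumptions the skip signal's variance grows by roughly one unit per block while each branch stays at unit variance. The branch's relative contribution therefore shrinks with depth, so later blocks sit closer to identity.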

The practical takeaway is straightforward: residual blocks make deeper layers safer because they turn “learn to copy everything” into “learn only what needs to change”. That single shift explains why identity mapping became such a central mechanism in making deeper neural networks trainable.


Endnotes

Source: cv-foundation.org
Title: Deep Residual Learning for Image Recognition (He et al., CVPR 2016)
Link: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

Source: arxiv.org
Title: Deep Residual Learning for Image Recognition (arXiv:1512.03385v1, 10 Dec 2015)
Link: https://arxiv.org/pdf/1512.03385

Source: arxiv.org
Title: Identity Mappings in Deep Residual Networks
Published: March 16, 2016
Link: https://arxiv.org/abs/1603.05027

Source: arxiv.org
Title: Identity Mappings in Deep Residual Networks (arXiv:1603.05027v3, 25 Jul 2016)
Link: https://arxiv.org/pdf/1603.05027

Source: arxiv.org
Link: https://arxiv.org/abs/2002.10444

Source: proceedings.neurips.cc
Link: https://proceedings.neurips.cc/paper/2020/file/e6b738eca0e6792ba8a9cbcba6c1881d-Paper.pdf

Source: apxml.com
Title: Understanding Residual Connections and Skip Architectures (ApX Machine Learning)
Link: https://apxml.com/courses/cnns-for-computer-vision/chapter-1-cnn-foundations-modern-architectures/residual-connections-skip-architectures

Additional References

Source: youtube.com
Title: The “Shortcut” That Made AI See Like a Human
Link: https://www.youtube.com/watch?v=GMoyIT8wG1E

Source: youtu.be
Link: https://youtu.be/O45AaRPQhuI

Source: youtube.com
Title: Why Deep Networks Fail & How ResNet Fixes It
Link: https://www.youtube.com/watch?v=M108HPERPc8

Source: d2l.ai
Link: https://d2l.ai/chapter_convolutional-modern/resnet.html

Source: youtube.com
Title: Why ResNets Work So Well | Deep Learning Explained
Link: https://www.youtube.com/watch?v=rvzzbysndo4

Source: youtu.be
Link: https://youtu.be/Wss3WMy5OxE

Source: youtube.com
Title: Deep Residual Learning for Image Recognition: The ResNet Paper Explained
Link: https://www.youtube.com/watch?v=EQSp5L1cOBE

Source: youtube.com
Title: Why Residual Connections (ResNet) Work
Link: https://www.youtube.com/watch?v=Gey9CG6R6w8

Source: icml.cc
Title: ICML 2016 tutorial: Deep Residual Networks (Kaiming He)
Link: https://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
