Why deeper layers should learn to do nothing
Residual blocks make it simpler for extra layers to do nothing when no improvement is needed, reducing the degradation problem.
On this page
- The degradation problem in deeper plain networks
- Why learning zero change is easier than rebuilding input
- How residual blocks turn depth into a safer bet
Introduction
Residual blocks make identity mapping easier by changing what the added layers have to learn. In a plain deep network, extra layers must actively recreate the input if they are not needed. In a residual block, the input already travels around the learned layers through a shortcut, so the learned branch only has to add a correction. If no correction is useful, that branch can move towards zero and the whole block behaves almost like “do nothing”. This is the practical reason residual learning helped with the degradation problem: deeper models became less risky because unnecessary depth no longer had to perfectly rebuild what was already there. The original ResNet paper framed this directly: if identity mapping is optimal, it should be easier to push the residual part to zero than to make stacked nonlinear layers learn identity from scratch. [cv-foundation.org]
The degradation problem was not just “too many parameters”
The puzzle behind residual networks was simple but important. If a shallower network works well, a deeper version should be able to match it by copying the shallower solution and letting the extra layers act as identity mappings. In theory, the deeper model contains the shallower model as a special case. In practice, plain deep networks often failed to find that special case during training, and training error could get worse as depth increased. [cv-foundation.org]
That matters because degradation is an optimisation problem, not merely an overfitting problem. If test accuracy falls because the model memorises the training data, that is one kind of failure. But if training error itself rises when more layers are added, the model is struggling to fit even the data it is trained on. Residual blocks were designed for exactly this situation: not to make every layer powerful, but to make harmless extra depth easier to represent.
The identity mapping case is the cleanest way to see the problem. A layer stack that should do nothing still contains weights, nonlinearities and normalisation steps. Learning “copy the input exactly” through that machinery can be surprisingly awkward. A residual block avoids making the learned path responsible for the copy.
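A small sketch can make the awkwardness concrete. It assumes a single plain layer made of a linear map followed by ReLU, with normalisation left out for brevity; the helper names are illustrative and not taken from any library. Even when the weights are set to the exact identity matrix, the nonlinearity still alters the signal.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def plain_layer(weights, v):
    """One plain layer: linear map followed by ReLU, with no shortcut."""
    return relu(linear(weights, v))

identity_weights = [[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]]
x = [2.0, -1.5, 0.5]

# Even with exact identity weights, ReLU zeroes the negative entry,
# so this plain layer cannot reproduce x.
plain_out = plain_layer(identity_weights, x)
print(plain_out)  # [2.0, 0.0, 0.5]
```

Copying the input through such a layer would need extra machinery (for example, representing positive and negative parts separately), which is the kind of solution an optimiser may struggle to find.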
Why zero change is easier than rebuilding the input
A residual block can be summarised as: output equals the input plus a learned change. The shortcut carries the input forward directly, while the residual branch learns only the difference between the desired output and that input. In ResNet notation, instead of forcing layers to learn a full mapping, the architecture asks them to learn a residual mapping relative to the input. [cv-foundation.org]
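In symbols, following the notation of the ResNet paper, write the desired mapping as $\mathcal{H}(x)$ and the residual branch as $\mathcal{F}$ with weights $\{W_i\}$. The summary above then reads:

```latex
\mathcal{F}(x) := \mathcal{H}(x) - x
\qquad\Longrightarrow\qquad
y = \mathcal{F}(x, \{W_i\}) + x
```

When the identity is the best mapping, $\mathcal{H}(x) = x$, so the target for the learned branch is simply $\mathcal{F}(x) = 0$.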
This changes the default behaviour of the block. In a plain stack, “do nothing” means the weights and activations must combine to reproduce the input. In a residual block, “do nothing” means the residual branch contributes approximately zero. That is a much simpler target for optimisation: suppress the correction rather than reconstruct the whole signal.
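The difference in default behaviour can be sketched with a toy two-layer branch. The functions below are illustrative stand-ins (plain Python lists, no normalisation), not a real framework API. With every weight set to zero, the plain stack erases the signal while the residual block returns its input untouched.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def branch(w1, w2, v):
    """Two-layer learned path: linear -> ReLU -> linear."""
    return linear(w2, relu(linear(w1, v)))

def plain_block(w1, w2, v):
    """Plain stack: the learned path is the only route for the signal."""
    return branch(w1, w2, v)

def residual_block(w1, w2, v):
    """Residual block: shortcut plus learned correction."""
    return [a + f for a, f in zip(v, branch(w1, w2, v))]

zeros = [[0.0] * 3 for _ in range(3)]
x = [2.0, -1.5, 0.5]

plain_out = plain_block(zeros, zeros, x)        # signal is wiped out
residual_out = residual_block(zeros, zeros, x)  # signal passes through unchanged
print(plain_out)     # [0.0, 0.0, 0.0]
print(residual_out)  # [2.0, -1.5, 0.5]
```

In this sketch, "do nothing" for the residual block is the all-zero weight setting, whereas the plain stack has no equally simple setting that returns `x`.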
The distinction is small in notation but large in training. A plain layer stack has to preserve all useful information through every transformation. A residual block can preserve information through the shortcut and use the main branch only where change is beneficial. This makes identity mapping an accessible fallback rather than a fragile solution hidden somewhere in a complicated parameter space.
The shortcut makes depth a safer bet
Residual blocks made extra depth less dangerous because additional layers no longer had to justify themselves immediately. If a new block helps, its residual branch can learn a useful correction. If it does not help, the shortcut allows the signal to pass through with limited disturbance. This is why residual learning turned “add more layers” from a risky bet into a more manageable design choice.
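As a rough illustration of that safety margin, the sketch below appends ten residual blocks to a stand-in "shallow model". The shallow model's weights, the block form `v + W * relu(v)`, and all names are hypothetical choices made for the example. With zero branches the deeper model reproduces the shallow one exactly; with tiny random branches the output drifts only slightly.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(weights, v):
    """Shortcut plus a one-layer correction: v + W * relu(v)."""
    return [a + f for a, f in zip(v, linear(weights, relu(v)))]

def shallow_model(v):
    """Stand-in for an already-trained shallow network (hypothetical weights)."""
    return linear([[0.5, -1.0], [2.0, 0.25]], v)

def deeper_model(v, extra_blocks):
    """Shallow model followed by additional residual blocks."""
    out = shallow_model(v)
    for weights in extra_blocks:
        out = residual_block(weights, out)
    return out

rng = random.Random(0)
x = [1.0, -2.0]

# Ten extra blocks whose branches are exactly zero:
# the deeper model equals the shallow one.
zero_blocks = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(10)]
same = deeper_model(x, zero_blocks) == shallow_model(x)

# Ten extra blocks with tiny random branches: the output moves only slightly.
small_blocks = [[[rng.uniform(-1e-3, 1e-3) for _ in range(2)] for _ in range(2)]
                for _ in range(10)]
drift = max(abs(a - b)
            for a, b in zip(deeper_model(x, small_blocks), shallow_model(x)))
print(same, drift)
```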
Later work on identity mappings sharpened this idea. He, Zhang, Ren and Sun argued that forward and backward signals can propagate directly from one residual unit to another when both the skip connection and the after-addition pathway behave as identity mappings. Their ablation experiments supported the importance of keeping this path clean, leading to the pre-activation residual unit, where normalisation and activation are moved before the weight layers rather than after the addition. [arXiv]
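The propagation argument can be written out as follows, assuming both the shortcut and the after-addition function are exact identities, so that each unit computes $x_{l+1} = x_l + \mathcal{F}(x_l, W_l)$. Here $E$ stands for the training loss, and $l < L$ index a shallower and a deeper unit.

```latex
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)

\frac{\partial E}{\partial x_l}
  = \frac{\partial E}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i) \right)
```

The additive term $1$ means part of the gradient at unit $L$ reaches unit $l$ directly, without passing through any weight layer.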
The lesson is not that every residual block should literally do nothing. It is that the network should not be punished for having blocks that temporarily or permanently make little change. Some blocks may learn strong transformations; others may behave closer to identity. Residual design gives the optimiser that flexibility.
Why implementation details matter
Identity mapping is easiest when the shortcut is genuinely simple. The strongest version is an unweighted identity shortcut: the block input is added directly to the residual branch output, with no extra parameters and no extra transformation. Educational treatments of ResNet often emphasise this point because the shortcut itself introduces little computational burden while preserving a clean route for information. [ApX Machine Learning]
The pre-activation design goes further by reducing obstacles on the identity path. In the original post-activation layout, operations such as ReLU after the addition could still interfere with direct signal propagation. The pre-activation paper treated batch normalisation and ReLU as part of the residual branch instead, helping the main skip route remain closer to identity. This design was reported to make very deep networks, including 1001-layer ResNets on CIFAR benchmarks, easier to train and better at generalising than the earlier formulation. [arXiv]
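A simplified sketch of the two layouts shows where the obstacle sits. Batch normalisation is omitted here to keep the example short, and the function names are illustrative only. With a zero residual branch, the post-activation block still clips negative values through its final ReLU, while the pre-activation block returns the input exactly.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def post_activation_block(w1, w2, v):
    """Original layout (BN omitted): ReLU is applied after the addition."""
    f = linear(w2, relu(linear(w1, v)))
    return relu([a + b for a, b in zip(v, f)])

def pre_activation_block(w1, w2, v):
    """Pre-activation layout (BN omitted): activations sit inside the branch."""
    f = linear(w2, relu(linear(w1, relu(v))))
    return [a + b for a, b in zip(v, f)]

zeros = [[0.0] * 3 for _ in range(3)]
x = [2.0, -1.5, 0.5]

post_out = post_activation_block(zeros, zeros, x)
pre_out = pre_activation_block(zeros, zeros, x)
print(post_out)  # [2.0, 0.0, 0.5]
print(pre_out)   # [2.0, -1.5, 0.5]
```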
This is a useful implementation lesson for artificial intelligence systems more broadly: the abstract idea “add a shortcut” is not enough. Whether the shortcut is clean, scaled, gated, normalised or interrupted can change how easily a model preserves information.
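One way to see why a scaled shortcut matters is to follow a single number through many blocks. The toy function below is hypothetical: it assumes every residual branch outputs zero and every shortcut multiplies by the same constant `scale`. Any constant other than 1 compounds with depth.

```python
def through_scaled_shortcuts(signal, scale, depth):
    """Pass a scalar through `depth` blocks with zero branches and a scaled shortcut."""
    for _ in range(depth):
        branch_output = 0.0  # assumption: the learned branch contributes nothing
        signal = scale * signal + branch_output
    return signal

clean = through_scaled_shortcuts(1.0, 1.0, 100)      # identity shortcut
damped = through_scaled_shortcuts(1.0, 0.9, 100)     # shrinks roughly like 0.9 ** 100
amplified = through_scaled_shortcuts(1.0, 1.1, 100)  # grows roughly like 1.1 ** 100
print(clean, damped, amplified)
```

After 100 blocks the identity shortcut leaves the value at 1.0, the 0.9 shortcut reduces it to well below 0.0001, and the 1.1 shortcut inflates it past 10,000. The same compounding applies to gradients flowing backwards.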
What residual blocks do not magically solve
Residual blocks make identity mapping easier, but they do not remove every training difficulty. The residual branch still has to learn useful transformations, and very deep networks can still be sensitive to initialisation, normalisation, learning rate and architecture details. The identity route is a stabilising design choice, not a guarantee that any arbitrarily deep model will train well.
Research after ResNet also refined the story. A NeurIPS 2020 paper argued that batch normalisation helps deep residual networks partly because, at initialisation, it downscales the residual branch relative to the skip connection, making blocks close to identity on average. That supports the broader idea that trainable deep residual networks often begin from, or are biased towards, a safe near-identity behaviour before learning stronger changes. [arXiv]
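A toy simulation gives a feel for this effect. It is a deliberately crude model, not the paper's analysis: each normalised branch output is replaced by independent unit-variance noise, and the width `n` and `depth` are arbitrary choices. Because the skip path accumulates variance block after block while each normalised branch stays at unit variance, the branch becomes steadily smaller relative to the skip path.

```python
import random
import statistics

rng = random.Random(0)
n = 2000     # number of activations tracked (arbitrary choice)
depth = 16   # number of residual blocks (arbitrary choice)

skip = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unit-variance input
ratios = []
for _ in range(depth):
    # Stand-in for a normalised branch output: zero mean, unit variance,
    # assumed independent of the skip path (a simplification).
    branch = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ratios.append(statistics.pstdev(branch) / statistics.pstdev(skip))
    skip = [s + b for s, b in zip(skip, branch)]

print(round(ratios[0], 2), round(ratios[-1], 2))
```

Under these assumptions the ratio starts near 1 at the first block and falls to roughly 1/4 by the sixteenth, which matches the intuition that later blocks begin closer to identity.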
The practical takeaway is straightforward: residual blocks make deeper layers safer because they turn “learn to copy everything” into “learn only what needs to change”. That single shift explains why identity mapping became such a central mechanism in making deeper neural networks trainable.
Endnotes
-
Source: cv-foundation.org
Title: Deep Residual Learning for Image Recognition (K He, CVPR 2016)
Link: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf
-
Source: arxiv.org
Title: Deep Residual Learning for Image Recognition (arXiv:1512.03385v1 [cs.CV], 10 Dec 2015)
Link: https://arxiv.org/pdf/1512.03385
-
Source: arxiv.org
Title: Identity Mappings in Deep Residual Networks
Link: https://arxiv.org/abs/1603.05027
Published: March 16, 2016
-
Source: arxiv.org
Title: Identity Mappings in Deep Residual Networks (arXiv:1603.05027v3 [cs.CV], 25 Jul 2016)
Link: https://arxiv.org/pdf/1603.05027
-
Source: arxiv.org
Link: https://arxiv.org/abs/2002.10444
-
Source: proceedings.neurips.cc
Link: https://proceedings.neurips.cc/paper/2020/file/e6b738eca0e6792ba8a9cbcba6c1881d-Paper.pdf
-
Source: apxml.com
Title: Understanding Residual Connections and Skip Architectures (ApX Machine Learning)
Link: https://apxml.com/courses/cnns-for-computer-vision/chapter-1-cnn-foundations-modern-architectures/residual-connections-skip-architectures


