Within Skip links
Why gradients survive in residual networks
Identity shortcuts give error signals a cleaner route backward, helping earlier layers keep learning in very deep networks.
On this page
- How backpropagation weakens in stacked layers
- Why the identity path preserves learning signals
- What later analyses say about gradient stability
Page outline Jump by section
Introduction
Identity shortcuts help very deep neural networks remain trainable because they give learning signals a direct route through the network. In an ordinary deep stack, the error signal used during backpropagation must pass through many layers, each of which can shrink, distort, or block the gradient. As depth increases, useful gradients may become so weak that early layers stop learning effectively. Residual networks address this problem by adding identity shortcuts—paths that carry information forward unchanged and allow gradients to travel backward with minimal interference. Rather than solving every optimisation challenge, these shortcuts make it far easier for learning signals to survive across hundreds of layers, which is one of the key reasons deep residual networks became practical. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
How backpropagation weakens in stacked layers
Training a neural network depends on backpropagation. The model computes an error at the output, then propagates information about that error backwards so each weight can be adjusted. In a shallow network this process is relatively straightforward. In a very deep network, however, the gradient is repeatedly transformed as it passes through layer after layer.
Each layer effectively multiplies and reshapes the incoming gradient. When this happens dozens or hundreds of times, the resulting signal can become extremely small. Earlier layers then receive little information about how they should change. Even when gradients do not completely vanish, they can become noisy or unstable, making optimisation increasingly difficult. This is one reason deeper plain networks often showed higher training error than shallower ones despite having greater theoretical capacity. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
The crucial point is that learning depends not only on a network’s representational power but also on whether useful error information can reach the parameters that need updating. If the gradient becomes weak before it reaches early layers, those layers learn slowly or not at all.
Why the identity path preserves learning signals
Residual networks introduce a shortcut that copies the input directly to the output of a residual block. Instead of forcing every signal to pass through multiple learned transformations, the architecture provides an alternative route that leaves the signal unchanged.
The key insight from later analyses of residual networks is that this identity path affects both forward information flow and backward gradient flow. During backpropagation, part of the gradient can travel through the shortcut connection without being repeatedly modified by weight matrices and nonlinear activations. As a result, the network retains a direct channel through which learning signals can reach earlier layers. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
A useful way to think about this mechanism is as a parallel route. The residual branch may alter, amplify, or reduce gradients, but the identity branch remains available. Because the shortcut contributes a direct term to the backward signal, the gradient is less dependent on the behaviour of every intermediate layer. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
This does not mean gradients become perfectly constant. The residual branch still influences learning, and extremely deep networks can still face optimisation challenges. However, the identity path prevents the network from relying exclusively on a long chain of transformations. That greatly reduces the risk that learning signals disappear before reaching the earliest layers. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
Why an unchanged shortcut matters more than a modified one
Research following the original ResNet work examined what happens when the shortcut is altered. The results were revealing: identity shortcuts generally produced better optimisation behaviour than shortcuts that scaled, gated, or otherwise modified the signal.
Kaiming He and colleagues showed that when the shortcut remains an identity mapping, both forward activations and backward gradients can propagate more directly across many residual units. Their experiments found that introducing operations which interfere with the shortcut path makes optimisation harder and increases training difficulty. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
In practical terms, the shortcut is most valuable when it behaves as a clean highway rather than another transformation layer. Every extra operation on that route creates another opportunity for the gradient to be altered or weakened.
A concrete intuition: learning changes instead of rebuilding everything
The gradient-flow advantage is closely connected to how residual blocks formulate the learning task.
A conventional deep layer stack must learn a complete transformation at every stage. Residual blocks instead learn modifications to an existing representation. If the current representation is already useful, the residual branch can learn a small adjustment while the shortcut preserves the original information.
This means the optimiser is not forced to continually reconstruct information that earlier layers have already produced. The identity path guarantees that existing useful representations remain available, while the residual path focuses on refinements. As a consequence, gradients often correspond to smaller corrective updates rather than attempts to rebuild entire mappings from scratch. [Medium+2Michael Brenndoerfer]medium.comResNet. How skip connections enabled very deep…By embedding simple identity “skip” connections. ResNet enabled information to bypas…
The result is a network that behaves less like one enormous chain of dependent transformations and more like a sequence of incremental improvements layered on top of a stable signal.
What later analyses say about gradient stability
Subsequent theoretical and empirical work reinforced the idea that identity mappings are central to residual learning. The 2016 “Identity Mappings in Deep Residual Networks” analysis argued that direct signal propagation is a fundamental reason residual networks converge reliably at extreme depths. Experiments with pre-activation residual blocks further improved this property by keeping the shortcut path as close as possible to a true identity mapping. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
Researchers also began viewing identity parameterisation as a broader optimisation principle. Rather than forcing deep models to learn around difficult transformations, architectures can be designed so that doing nothing—or staying close to the identity function—is easy. Theoretical work showed that this design choice changes the optimisation landscape in favourable ways and helps explain why residual architectures are easier to train than equally deep plain networks. [arXiv]arxiv.orgarXiv Identity Matters in Deep LearningarXiv Identity Matters in Deep Learning
Although modern explanations differ in detail, they generally agree on the central mechanism: identity shortcuts create stable routes for information and gradients. By ensuring that learning signals can move across many layers without excessive degradation, they allow very deep networks to continue updating early representations effectively. [arXiv+2Emergent Mind]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
The key takeaway
Identity shortcuts keep gradients usable because they provide a direct path through the network that does not repeatedly transform the learning signal. When gradients can travel backward through this clean route, early layers continue receiving meaningful updates even as network depth grows. The shortcut therefore acts less as a computational trick and more as a structural guarantee that information and error signals can survive long journeys through a deep model. That guarantee is one of the main reasons residual networks succeeded where many earlier very deep architectures struggled. [arXiv+2arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…
Amazon book picks
Further Reading
Books and field guides related to Why gradients survive in residual networks. Use these as the next step if you want deeper reading beyond the article.
Deep Learning
Rating: 3.5/5 from 6 Google Books ratings
Strong coverage of backpropagation, gradients, and optimisation.
Dive into Deep Learning
Useful treatment of deep architectures and gradient-based learning.
Neural Networks and Deep Learning
Explains gradient propagation and training dynamics.
Hands-on Machine Learning with Scikit-Learn, Keras, and Tenso...
Helps readers connect gradient concepts to practical networks.
Endnotes
-
Source: arxiv.org
Title: arXiv Identity Mappings in Deep Residual Networks
Link: https://arxiv.org/abs/1603.05027Source snippet
Identity Mappings in Deep Residual NetworksMarch 16, 2016...
Published: March 16, 2016
-
Source: arxiv.org
Link: https://arxiv.org/pdf/1603.05027Source snippet
Kaiming He, Xiangyu... activation is used, these projection shortcuts are also with pre-activation.Read more...
-
Source: medium.com
Link: https://medium.com/%40tnodecode/resnet-e7e0cba19e04Source snippet
ResNet. How skip connections enabled very deep…By embedding simple identity “skip” connections. ResNet enabled information to bypas...
-
Source: mbrenndoerfer.com
Link: https://mbrenndoerfer.com/writing/residual-connections-gradient-highways-deep-transformersSource snippet
Michael BrenndoerferResidual Connections: The Gradient Highways Enabling...6 Jun 2025 — By adding a simple shortcut that lets informatio...
-
Source: arxiv.org
Title: arXiv Identity Matters in [Deep Learning]({{ ‘deep-learning/’ | relative_url }})
Link: https://arxiv.org/abs/1611.04231 -
Source: arxiv.org
Link: https://arxiv.org/html/2602.09190v1Source snippet
Gradient Residual Connections9 Feb 2026 — In contrast, our gradient residual substantially improves approximation quality. We then introd...
-
Source: lzhangstat.medium.com
Title: reading notes identity mappings in deep residual networks e1980b3aa753
Link: https://lzhangstat.medium.com/reading-notes-identity-mappings-in-deep-residual-networks-e1980b3aa753Source snippet
Notes: Identity Mappings in Deep Residual NetworksIn this paper, we analyze the propagation formulations behind the residual building blo...
-
Source: medium.com
Link: https://medium.com/deepreview/review-of-identity-mappings-in-deep-residual-networks-ad6533452f33Source snippet
Review of Identity Mappings in Deep Residual NetworksDeep residual learning by Kaiming et al. provides a solid framework to optimize deep...
-
Source: medium.com
Link: https://medium.com/data-science/introduction-to-resnets-c0a830a288a4Source snippet
Introduction to ResNetsThis shortcut connection is based on a more advanced description from the subsequent paper, “Identity Mappings in...
-
Source: emergentmind.com
Title: residual connections and identity mappings
Link: https://www.emergentmind.com/topics/residual-connections-and-identity-mappingsSource snippet
Residual Connections & Identity MappingsFeb 15, 2026 — Explore how residual connections with identity mappings enhance deep neural networ...
-
Source: collab.dvb.bayern
Title: bayern Deep Residual Networks
Link: https://collab.dvb.bayern/spaces/TUMlfdv/pages/69119929/Deep%2BResidual%2BNetworksSource snippet
Residual Networks - BayernCollabHe, Kaiming, et al. "Identity mappings in deep residual networks." European Conference on Computer Vision...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=De8STTDu9SsSource snippet
Identity Mappings | Lecture 9 (Part 1) | Applied Deep LearningThis paper is proposing is that there should be no non-linearity or no oper...
Additional References
-
Source: apxml.com
Link: https://apxml.com/courses/cnns-for-computer-vision/chapter-1-cnn-foundations-modern-architectures/residual-connections-skip-architecturesSource snippet
ApX Machine LearningUnderstanding Residual Connections and Skip ArchitecturesIdentity Mappings in Deep Residual Networks, Kaiming He, Xia...
-
Source: deepai.org
Link: https://deepai.org/[machine-learningSource snippet
Residual Connections DefinitionBy providing shortcuts, residual connections allow the gradient to be directly backpropagated to earlier l...
-
Source: merriam-webster.com
Link: https://www.merriam-webster.com/dictionary/identitySource snippet
IDENTITY Definition & MeaningThe meaning of IDENTITY is the distinguishing character or personality of an individual: individuality. How...
-
Source: researchgate.net
Link: https://www.researchgate.net/publication/319770414_Identity_Mappings_in_Deep_Residual_NetworksSource snippet
Identity Mappings in Deep Residual NetworksDeep residual networks have emerged as a family of extremely deep architectures showing compel...
-
Source: github.com
Link: https://github.com/abhshkdz/papers/blob/master/reviews/identity-mappings-in-deep-residual-networks.mdSource snippet
Identity Mappings in Deep Residual NetworksIn this paper, they propose a residual block with both h(x) and f(x) as identity mappings, whi...
-
Source: scispace.com
Link: https://scispace.com/pdf/identity-mappings-in-deep-residual-networks-2cmcwh2vhk.pdf -
Source: researchgate.net
Link: https://www.researchgate.net/publication/308277201_Identity_Mappings_in_Deep_Residual_NetworksSource snippet
Identity Mappings in Deep Residual NetworksIn this paper, we analyze the propagation formulations behind the residual building blocks, wh...
-
Source: reddit.com
Link: https://www.reddit.com/r/MachineLearning/comments/px3hzd/d_has_the_resnet_hypothesis_been_debunked/Source snippet
[D] Has the ResNet Hypothesis been debunked?The ResNet architecture was [invented]({{ 'fake-citations-d81942/' | relative_url }}) to solve the degradation problem that has been empirical...
-
Source: semanticscholar.org
Link: https://www.semanticscholar.org/paper/Identity-Mappings-in-Deep-Residual-Networks-He-Zhang/77f0a39b8e02686fd85b01971f8feb7f60971f80Source snippet
Zhang, author. Jian Sun · Published in European Conference on… 16 March 2016 · Computer Science.Read more...
Published: March 2016
-
Source: github.com
Link: https://github.com/FlorianMuellerklein/Identity-Mapping-ResNet-LasagneSource snippet
For Wide-ResNet the paper and test results are for depth 16 and width multiplier of 4. This...Read more...
Topic Tree



