Why gradients survive in residual networks

Introduction

Identity shortcuts help very deep neural networks remain trainable because they give learning signals a direct route through the network. In an ordinary deep stack, the error signal used during backpropagation must pass through many layers, each of which can shrink, distort, or block the gradient. As depth increases, useful gradients may become so weak that early layers stop learning effectively. Residual networks address this problem by adding identity shortcuts—paths that carry information forward unchanged and allow gradients to travel backward with minimal interference. Rather than solving every optimisation challenge, these shortcuts make it far easier for learning signals to survive across hundreds of layers, which is one of the key reasons deep residual networks became practical. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

Gradient flow illustration 1

How backpropagation weakens in stacked layers

Training a neural network depends on backpropagation. The model computes an error at the output, then propagates information about that error backwards so each weight can be adjusted. In a shallow network this process is relatively straightforward. In a very deep network, however, the gradient is repeatedly transformed as it passes through layer after layer.

Each layer effectively multiplies and reshapes the incoming gradient. When this happens dozens or hundreds of times, the resulting signal can become extremely small. Earlier layers then receive little information about how they should change. Even when gradients do not completely vanish, they can become noisy or unstable, making optimisation increasingly difficult. This is one reason deeper plain networks often showed higher training error than shallower ones despite having greater theoretical capacity. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

The crucial point is that learning depends not only on a network’s representational power but also on whether useful error information can reach the parameters that need updating. If the gradient becomes weak before it reaches early layers, those layers learn slowly or not at all.

Why the identity path preserves learning signals

Residual networks introduce a shortcut that copies the input directly to the output of a residual block. Instead of forcing every signal to pass through multiple learned transformations, the architecture provides an alternative route that leaves the signal unchanged.

The key insight from later analyses of residual networks is that this identity path affects both forward information flow and backward gradient flow. During backpropagation, part of the gradient can travel through the shortcut connection without being repeatedly modified by weight matrices and nonlinear activations. As a result, the network retains a direct channel through which learning signals can reach earlier layers. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

A useful way to think about this mechanism is as a parallel route. The residual branch may alter, amplify, or reduce gradients, but the identity branch remains available. Because the shortcut contributes a direct term to the backward signal, the gradient is less dependent on the behaviour of every intermediate layer. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

This does not mean gradients become perfectly constant. The residual branch still influences learning, and extremely deep networks can still face optimisation challenges. However, the identity path prevents the network from relying exclusively on a long chain of transformations. That greatly reduces the risk that learning signals disappear before reaching the earliest layers. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

Gradient flow illustration 2

Why an unchanged shortcut matters more than a modified one

Research following the original ResNet work examined what happens when the shortcut is altered. The results were revealing: identity shortcuts generally produced better optimisation behaviour than shortcuts that scaled, gated, or otherwise modified the signal.

Kaiming He and colleagues showed that when the shortcut remains an identity mapping, both forward activations and backward gradients can propagate more directly across many residual units. Their experiments found that introducing operations which interfere with the shortcut path makes optimisation harder and increases training difficulty. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

In practical terms, the shortcut is most valuable when it behaves as a clean highway rather than another transformation layer. Every extra operation on that route creates another opportunity for the gradient to be altered or weakened.

A concrete intuition: learning changes instead of rebuilding everything

The gradient-flow advantage is closely connected to how residual blocks formulate the learning task.

A conventional deep layer stack must learn a complete transformation at every stage. Residual blocks instead learn modifications to an existing representation. If the current representation is already useful, the residual branch can learn a small adjustment while the shortcut preserves the original information.

This means the optimiser is not forced to continually reconstruct information that earlier layers have already produced. The identity path guarantees that existing useful representations remain available, while the residual path focuses on refinements. As a consequence, gradients often correspond to smaller corrective updates rather than attempts to rebuild entire mappings from scratch. [Medium+2Michael Brenndoerfer]medium.comResNet. How skip connections enabled very deep…By embedding simple identity “skip” connections. ResNet enabled information to bypas…

The result is a network that behaves less like one enormous chain of dependent transformations and more like a sequence of incremental improvements layered on top of a stable signal.

Gradient flow illustration 3

What later analyses say about gradient stability

Subsequent theoretical and empirical work reinforced the idea that identity mappings are central to residual learning. The 2016 “Identity Mappings in Deep Residual Networks” analysis argued that direct signal propagation is a fundamental reason residual networks converge reliably at extreme depths. Experiments with pre-activation residual blocks further improved this property by keeping the shortcut path as close as possible to a true identity mapping. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

Researchers also began viewing identity parameterisation as a broader optimisation principle. Rather than forcing deep models to learn around difficult transformations, architectures can be designed so that doing nothing—or staying close to the identity function—is easy. Theoretical work showed that this design choice changes the optimisation landscape in favourable ways and helps explain why residual architectures are easier to train than equally deep plain networks. [arXiv]arxiv.orgarXiv Identity Matters in Deep LearningarXiv Identity Matters in Deep Learning

Although modern explanations differ in detail, they generally agree on the central mechanism: identity shortcuts create stable routes for information and gradients. By ensuring that learning signals can move across many layers without excessive degradation, they allow very deep networks to continue updating early representations effectively. [arXiv+2Emergent Mind]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

The key takeaway

Identity shortcuts keep gradients usable because they provide a direct path through the network that does not repeatedly transform the learning signal. When gradients can travel backward through this clean route, early layers continue receiving meaningful updates even as network depth grows. The shortcut therefore acts less as a computational trick and more as a structural guarantee that information and error signals can survive long journeys through a deep model. That guarantee is one of the main reasons residual networks succeeded where many earlier very deep architectures struggled. [arXiv+2arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

Amazon book picks

Marketplace Samples

Example marketplace items related to this page. Use the search link to explore similar finds on eBay.

Example eBay listing

Palace Learning 4 Pack - Cable Machine Workout Posters Volume 1 & Volume 2 + Dum

Search eBay.co.uk: machine learning poster

Browse similar on eBay.co.uk

Example eBay listing

Palace Learning 3 Pack - Cable Machine Workout Posters 18" x 24", LAMINATED

Search eBay.co.uk: machine learning poster

Browse similar on eBay.co.uk

Example eBay listing

Learning Machine - Smart Brain Educ Framed Wall Art Poster Canvas Print Picture

Search eBay.co.uk: machine learning poster

Browse similar on eBay.co.uk

Example eBay listing

Palace Learning 3 Pack - Cable Machine Workout Posters 18" x 24", LAMINATED

Search eBay.co.uk: machine learning poster

Browse similar on eBay.co.uk

Browse more on eBay.co.uk

Example items shown for inspiration; availability and pricing can change. Branchoria may earn a commission if you purchase through outbound eBay links.

Endnotes

Source: arxiv.org
Title: arXiv Identity Mappings in Deep Residual Networks
Link: https://arxiv.org/abs/1603.05027
Source snippet
Identity Mappings in Deep Residual NetworksMarch 16, 2016...

Published: March 16, 2016
Source: arxiv.org
Link: https://arxiv.org/pdf/1603.05027
Source snippet
Kaiming He, Xiangyu... activation is used, these projection shortcuts are also with pre-activation.Read more...
Source: medium.com
Link: https://medium.com/%40tnodecode/resnet-e7e0cba19e04
Source snippet
ResNet. How skip connections enabled very deep…By embedding simple identity “skip” connections. ResNet enabled information to bypas...
Source: mbrenndoerfer.com
Link: https://mbrenndoerfer.com/writing/residual-connections-gradient-highways-deep-transformers
Source snippet
Michael BrenndoerferResidual Connections: The Gradient Highways Enabling...6 Jun 2025 — By adding a simple shortcut that lets informatio...
Source: arxiv.org
Title: arXiv Identity Matters in [Deep Learning]({{ ‘deep-learning/’ | relative_url }})
Link: https://arxiv.org/abs/1611.04231
Source: arxiv.org
Link: https://arxiv.org/html/2602.09190v1
Source snippet
Gradient Residual Connections9 Feb 2026 — In contrast, our gradient residual substantially improves approximation quality. We then introd...
Source: lzhangstat.medium.com
Title: reading notes identity mappings in deep residual networks e1980b3aa753
Link: https://lzhangstat.medium.com/reading-notes-identity-mappings-in-deep-residual-networks-e1980b3aa753
Source snippet
Notes: Identity Mappings in Deep Residual NetworksIn this paper, we analyze the propagation formulations behind the residual building blo...
Source: medium.com
Link: https://medium.com/deepreview/review-of-identity-mappings-in-deep-residual-networks-ad6533452f33
Source snippet
Review of Identity Mappings in Deep Residual NetworksDeep residual learning by Kaiming et al. provides a solid framework to optimize deep...
Source: medium.com
Link: https://medium.com/data-science/introduction-to-resnets-c0a830a288a4
Source snippet
Introduction to ResNetsThis shortcut connection is based on a more advanced description from the subsequent paper, “Identity Mappings in...
Source: emergentmind.com
Title: residual connections and identity mappings
Link: https://www.emergentmind.com/topics/residual-connections-and-identity-mappings
Source snippet
Residual Connections & Identity MappingsFeb 15, 2026 — Explore how residual connections with identity mappings enhance deep neural networ...
Source: collab.dvb.bayern
Title: bayern Deep Residual Networks
Link: https://collab.dvb.bayern/spaces/TUMlfdv/pages/69119929/Deep%2BResidual%2BNetworks
Source snippet
Residual Networks - BayernCollabHe, Kaiming, et al. "Identity mappings in deep residual networks." European Conference on Computer Vision...
Source: youtube.com
Link: https://www.youtube.com/watch?v=De8STTDu9Ss
Source snippet
Identity Mappings | Lecture 9 (Part 1) | Applied Deep LearningThis paper is proposing is that there should be no non-linearity or no oper...

Additional References

Source: apxml.com
Link: https://apxml.com/courses/cnns-for-computer-vision/chapter-1-cnn-foundations-modern-architectures/residual-connections-skip-architectures
Source snippet
ApX Machine LearningUnderstanding Residual Connections and Skip ArchitecturesIdentity Mappings in Deep Residual Networks, Kaiming He, Xia...
Source: deepai.org
Link: https://deepai.org/[machine-learning
Source snippet
Residual Connections DefinitionBy providing shortcuts, residual connections allow the gradient to be directly backpropagated to earlier l...
Source: merriam-webster.com
Link: https://www.merriam-webster.com/dictionary/identity
Source snippet
IDENTITY Definition & MeaningThe meaning of IDENTITY is the distinguishing character or personality of an individual: individuality. How...
Source: researchgate.net
Link: https://www.researchgate.net/publication/319770414_Identity_Mappings_in_Deep_Residual_Networks
Source snippet
Identity Mappings in Deep Residual NetworksDeep residual networks have emerged as a family of extremely deep architectures showing compel...
Source: github.com
Link: https://github.com/abhshkdz/papers/blob/master/reviews/identity-mappings-in-deep-residual-networks.md
Source snippet
Identity Mappings in Deep Residual NetworksIn this paper, they propose a residual block with both h(x) and f(x) as identity mappings, whi...
Source: scispace.com
Link: https://scispace.com/pdf/identity-mappings-in-deep-residual-networks-2cmcwh2vhk.pdf
Source: researchgate.net
Link: https://www.researchgate.net/publication/308277201_Identity_Mappings_in_Deep_Residual_Networks
Source snippet
Identity Mappings in Deep Residual NetworksIn this paper, we analyze the propagation formulations behind the residual building blocks, wh...
Source: reddit.com
Link: https://www.reddit.com/r/MachineLearning/comments/px3hzd/d_has_the_resnet_hypothesis_been_debunked/
Source snippet
[D] Has the ResNet Hypothesis been debunked?The ResNet architecture was [invented]({{ 'fake-citations-d81942/' | relative_url }}) to solve the degradation problem that has been empirical...
Source: semanticscholar.org
Link: https://www.semanticscholar.org/paper/Identity-Mappings-in-Deep-Residual-Networks-He-Zhang/77f0a39b8e02686fd85b01971f8feb7f60971f80
Source snippet
Zhang, author. Jian Sun · Published in European Conference on… 16 March 2016 · Computer Science.Read more...

Published: March 2016
Source: github.com
Link: https://github.com/FlorianMuellerklein/Identity-Mapping-ResNet-Lasagne
Source snippet
For Wide-ResNet the paper and test results are for depth 16 and width multiplier of 4. This...Read more...

Why gradients survive in residual networks

Introduction

How backpropagation weakens in stacked layers

Why the identity path preserves learning signals

Why an unchanged shortcut matters more than a modified one

A concrete intuition: learning changes instead of rebuilding everything

What later analyses say about gradient stability

The key takeaway

Further Reading

Deep Learning

Dive into Deep Learning

Neural Networks and Deep Learning

Hands-on Machine Learning with Scikit-Learn, Keras, and Tenso...

Marketplace Samples

Palace Learning 4 Pack - Cable Machine Workout Posters Volume 1 & Volume 2 + Dum

Palace Learning 3 Pack - Cable Machine Workout Posters 18" x 24", LAMINATED

Learning Machine - Smart Brain Educ Framed Wall Art Poster Canvas Print Picture

Palace Learning 3 Pack - Cable Machine Workout Posters 18" x 24", LAMINATED

Endnotes

Additional References

Follow this branch

Parent topic

Related pages 2