Within Skip links

Why gradients survive in residual networks

Identity shortcuts give error signals a cleaner route backward, helping earlier layers keep learning in very deep networks.

On this page

  • How backpropagation weakens in stacked layers
  • Why the identity path preserves learning signals
  • What later analyses say about gradient stability
Preview for Why gradients survive in residual networks

Introduction

Identity shortcuts help very deep neural networks remain trainable because they give learning signals a direct route through the network. In an ordinary deep stack, the error signal used during backpropagation must pass through many layers, each of which can shrink, distort, or block the gradient. As depth increases, useful gradients may become so weak that early layers stop learning effectively. Residual networks address this problem by adding identity shortcuts—paths that carry information forward unchanged and allow gradients to travel backward with minimal interference. Rather than solving every optimisation challenge, these shortcuts make it far easier for learning signals to survive across hundreds of layers, which is one of the key reasons deep residual networks became practical. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

Gradient flow illustration 1

How backpropagation weakens in stacked layers

Training a neural network depends on backpropagation. The model computes an error at the output, then propagates information about that error backwards so each weight can be adjusted. In a shallow network this process is relatively straightforward. In a very deep network, however, the gradient is repeatedly transformed as it passes through layer after layer.

Each layer effectively multiplies and reshapes the incoming gradient. When this happens dozens or hundreds of times, the resulting signal can become extremely small. Earlier layers then receive little information about how they should change. Even when gradients do not completely vanish, they can become noisy or unstable, making optimisation increasingly difficult. This is one reason deeper plain networks often showed higher training error than shallower ones despite having greater theoretical capacity. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

The crucial point is that learning depends not only on a network’s representational power but also on whether useful error information can reach the parameters that need updating. If the gradient becomes weak before it reaches early layers, those layers learn slowly or not at all.

Why the identity path preserves learning signals

Residual networks introduce a shortcut that copies the input directly to the output of a residual block. Instead of forcing every signal to pass through multiple learned transformations, the architecture provides an alternative route that leaves the signal unchanged.

The key insight from later analyses of residual networks is that this identity path affects both forward information flow and backward gradient flow. During backpropagation, part of the gradient can travel through the shortcut connection without being repeatedly modified by weight matrices and nonlinear activations. As a result, the network retains a direct channel through which learning signals can reach earlier layers. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

A useful way to think about this mechanism is as a parallel route. The residual branch may alter, amplify, or reduce gradients, but the identity branch remains available. Because the shortcut contributes a direct term to the backward signal, the gradient is less dependent on the behaviour of every intermediate layer. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

This does not mean gradients become perfectly constant. The residual branch still influences learning, and extremely deep networks can still face optimisation challenges. However, the identity path prevents the network from relying exclusively on a long chain of transformations. That greatly reduces the risk that learning signals disappear before reaching the earliest layers. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

Gradient flow illustration 2

Why an unchanged shortcut matters more than a modified one

Research following the original ResNet work examined what happens when the shortcut is altered. The results were revealing: identity shortcuts generally produced better optimisation behaviour than shortcuts that scaled, gated, or otherwise modified the signal.

Kaiming He and colleagues showed that when the shortcut remains an identity mapping, both forward activations and backward gradients can propagate more directly across many residual units. Their experiments found that introducing operations which interfere with the shortcut path makes optimisation harder and increases training difficulty. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

In practical terms, the shortcut is most valuable when it behaves as a clean highway rather than another transformation layer. Every extra operation on that route creates another opportunity for the gradient to be altered or weakened.

A concrete intuition: learning changes instead of rebuilding everything

The gradient-flow advantage is closely connected to how residual blocks formulate the learning task.

A conventional deep layer stack must learn a complete transformation at every stage. Residual blocks instead learn modifications to an existing representation. If the current representation is already useful, the residual branch can learn a small adjustment while the shortcut preserves the original information.

This means the optimiser is not forced to continually reconstruct information that earlier layers have already produced. The identity path guarantees that existing useful representations remain available, while the residual path focuses on refinements. As a consequence, gradients often correspond to smaller corrective updates rather than attempts to rebuild entire mappings from scratch. [Medium+2Michael Brenndoerfer]medium.comResNet. How skip connections enabled very deep…By embedding simple identity “skip” connections. ResNet enabled information to bypas…

The result is a network that behaves less like one enormous chain of dependent transformations and more like a sequence of incremental improvements layered on top of a stable signal.

Gradient flow illustration 3

What later analyses say about gradient stability

Subsequent theoretical and empirical work reinforced the idea that identity mappings are central to residual learning. The 2016 “Identity Mappings in Deep Residual Networks” analysis argued that direct signal propagation is a fundamental reason residual networks converge reliably at extreme depths. Experiments with pre-activation residual blocks further improved this property by keeping the shortcut path as close as possible to a true identity mapping. [arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

Researchers also began viewing identity parameterisation as a broader optimisation principle. Rather than forcing deep models to learn around difficult transformations, architectures can be designed so that doing nothing—or staying close to the identity function—is easy. Theoretical work showed that this design choice changes the optimisation landscape in favourable ways and helps explain why residual architectures are easier to train than equally deep plain networks. [arXiv]arxiv.orgarXiv Identity Matters in Deep LearningarXiv Identity Matters in Deep Learning

Although modern explanations differ in detail, they generally agree on the central mechanism: identity shortcuts create stable routes for information and gradients. By ensuring that learning signals can move across many layers without excessive degradation, they allow very deep networks to continue updating early representations effectively. [arXiv+2Emergent Mind]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

The key takeaway

Identity shortcuts keep gradients usable because they provide a direct path through the network that does not repeatedly transform the learning signal. When gradients can travel backward through this clean route, early layers continue receiving meaningful updates even as network depth grows. The shortcut therefore acts less as a computational trick and more as a structural guarantee that information and error signals can survive long journeys through a deep model. That guarantee is one of the main reasons residual networks succeeded where many earlier very deep architectures struggled. [arXiv+2arXiv]arxiv.orgarXiv Identity Mappings in Deep Residual NetworksIdentity Mappings in Deep Residual NetworksMarch 16, 2016…Published: March 16, 2016

Amazon book picks

Further Reading

Books and field guides related to Why gradients survive in residual networks. Use these as the next step if you want deeper reading beyond the article.

BookCover for Deep Learning

Deep Learning

By Ian Goodfellow, Yoshua Bengio et al.

Rating: 3.5/5 from 6 Google Books ratings

Strong coverage of backpropagation, gradients, and optimisation.

eBay marketplace picks

Marketplace Samples

Example marketplace items related to this page. Use the search link to explore similar finds on eBay.

Using USA

Endnotes

  1. Source: arxiv.org
    Title: arXiv Identity Mappings in Deep Residual Networks
    Link: https://arxiv.org/abs/1603.05027
    Source snippet

    Identity Mappings in Deep Residual NetworksMarch 16, 2016...

    Published: March 16, 2016

  2. Source: arxiv.org
    Link: https://arxiv.org/pdf/1603.05027
    Source snippet

    Kaiming He, Xiangyu... activation is used, these projection shortcuts are also with pre-activation.Read more...

  3. Source: medium.com
    Link: https://medium.com/%40tnodecode/resnet-e7e0cba19e04
    Source snippet

    ResNet. How skip connections enabled very deep…By embedding simple identity “skip” connections. ResNet enabled information to bypas...

  4. Source: mbrenndoerfer.com
    Link: https://mbrenndoerfer.com/writing/residual-connections-gradient-highways-deep-transformers
    Source snippet

    Michael BrenndoerferResidual Connections: The Gradient Highways Enabling...6 Jun 2025 — By adding a simple shortcut that lets informatio...

  5. Source: arxiv.org
    Title: arXiv Identity Matters in [Deep Learning]({{ ‘deep-learning/’ | relative_url }})
    Link: https://arxiv.org/abs/1611.04231

  6. Source: arxiv.org
    Link: https://arxiv.org/html/2602.09190v1
    Source snippet

    Gradient Residual Connections9 Feb 2026 — In contrast, our gradient residual substantially improves approximation quality. We then introd...

  7. Source: lzhangstat.medium.com
    Title: reading notes identity mappings in deep residual networks e1980b3aa753
    Link: https://lzhangstat.medium.com/reading-notes-identity-mappings-in-deep-residual-networks-e1980b3aa753
    Source snippet

    Notes: Identity Mappings in Deep Residual NetworksIn this paper, we analyze the propagation formulations behind the residual building blo...

  8. Source: medium.com
    Link: https://medium.com/deepreview/review-of-identity-mappings-in-deep-residual-networks-ad6533452f33
    Source snippet

    Review of Identity Mappings in Deep Residual NetworksDeep residual learning by Kaiming et al. provides a solid framework to optimize deep...

  9. Source: medium.com
    Link: https://medium.com/data-science/introduction-to-resnets-c0a830a288a4
    Source snippet

    Introduction to ResNetsThis shortcut connection is based on a more advanced description from the subsequent paper, “Identity Mappings in...

  10. Source: emergentmind.com
    Title: residual connections and identity mappings
    Link: https://www.emergentmind.com/topics/residual-connections-and-identity-mappings
    Source snippet

    Residual Connections & Identity MappingsFeb 15, 2026 — Explore how residual connections with identity mappings enhance deep neural networ...

  11. Source: collab.dvb.bayern
    Title: bayern Deep Residual Networks
    Link: https://collab.dvb.bayern/spaces/TUMlfdv/pages/69119929/Deep%2BResidual%2BNetworks
    Source snippet

    Residual Networks - BayernCollabHe, Kaiming, et al. "Identity mappings in deep residual networks." European Conference on Computer Vision...

  12. Source: youtube.com
    Link: https://www.youtube.com/watch?v=De8STTDu9Ss
    Source snippet

    Identity Mappings | Lecture 9 (Part 1) | Applied Deep LearningThis paper is proposing is that there should be no non-linearity or no oper...

Additional References

  1. Source: apxml.com
    Link: https://apxml.com/courses/cnns-for-computer-vision/chapter-1-cnn-foundations-modern-architectures/residual-connections-skip-architectures
    Source snippet

    ApX Machine LearningUnderstanding Residual Connections and Skip ArchitecturesIdentity Mappings in Deep Residual Networks, Kaiming He, Xia...

  2. Source: deepai.org
    Link: https://deepai.org/[machine-learning
    Source snippet

    Residual Connections DefinitionBy providing shortcuts, residual connections allow the gradient to be directly backpropagated to earlier l...

  3. Source: merriam-webster.com
    Link: https://www.merriam-webster.com/dictionary/identity
    Source snippet

    IDENTITY Definition & MeaningThe meaning of IDENTITY is the distinguishing character or personality of an individual: individuality. How...

  4. Source: researchgate.net
    Link: https://www.researchgate.net/publication/319770414_Identity_Mappings_in_Deep_Residual_Networks
    Source snippet

    Identity Mappings in Deep Residual NetworksDeep residual networks have emerged as a family of extremely deep architectures showing compel...

  5. Source: github.com
    Link: https://github.com/abhshkdz/papers/blob/master/reviews/identity-mappings-in-deep-residual-networks.md
    Source snippet

    Identity Mappings in Deep Residual NetworksIn this paper, they propose a residual block with both h(x) and f(x) as identity mappings, whi...

  6. Source: scispace.com
    Link: https://scispace.com/pdf/identity-mappings-in-deep-residual-networks-2cmcwh2vhk.pdf

  7. Source: researchgate.net
    Link: https://www.researchgate.net/publication/308277201_Identity_Mappings_in_Deep_Residual_Networks
    Source snippet

    Identity Mappings in Deep Residual NetworksIn this paper, we analyze the propagation formulations behind the residual building blocks, wh...

  8. Source: reddit.com
    Link: https://www.reddit.com/r/MachineLearning/comments/px3hzd/d_has_the_resnet_hypothesis_been_debunked/
    Source snippet

    [D] Has the ResNet Hypothesis been debunked?The ResNet architecture was [invented]({{ 'fake-citations-d81942/' | relative_url }}) to solve the degradation problem that has been empirical...

  9. Source: semanticscholar.org
    Link: https://www.semanticscholar.org/paper/Identity-Mappings-in-Deep-Residual-Networks-He-Zhang/77f0a39b8e02686fd85b01971f8feb7f60971f80
    Source snippet

    Zhang, author. Jian Sun · Published in European Conference on… 16 March 2016 · Computer Science.Read more...

    Published: March 2016

  10. Source: github.com
    Link: https://github.com/FlorianMuellerklein/Identity-Mapping-ResNet-Lasagne
    Source snippet

    For Wide-ResNet the paper and test results are for depth 16 and width multiplier of 4. This...Read more...

Topic Tree

Follow this branch

Parent topic

Skip links Why do deep networks need shortcuts?

Related pages 2