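[Reading Notes] "Watermelon Book" Chapter 5: Neural Networks, Supplement 4
![[Reading Notes] "Watermelon Book" Chapter 5: Neural Networks, Supplement 4](/images/machine-learning_huda146d3825fc6d502e05b38609bff098_23368_900x500_fit_q75_box.jpg)
Chapter 5: Neural Networks, Supplement 4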
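PS: I originally planned to organize the material I had collected online while studying, but most blog posts are wall-to-wall formulas and dull to read. Just as I was about to write my own, I found a good article that is simple and clear, without piles of formulas, so I translated it here.
Original article: Backpropagation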
Backpropagation is an algorithm that calculates the partial derivative of every node in your model (e.g. a ConvNet or a neural network). Those partial derivatives are used during the training phase of your model, where a loss function states how far you are from the correct result. This error is propagated backward from the model output to its first layers. Backpropagation is easier to implement if you structure your model as a computational graph.
The most important thing to keep in mind here is how to calculate the forward propagation of each block and its gradient. In fact, most of the code in deep learning libraries is about implementing the forward and backward passes of those gates.
Some examples of basic blocks are add, multiply, exp, and max. All we need to do is observe their forward and backward calculations.
Some other derivatives:
$$f(x) = \frac{1}{x} \quad \rightarrow \quad \frac{df}{dx} = - \frac{1}{x^2} $$
$$f_c(x) = c + x \quad \rightarrow \quad \frac{df}{dx} = 1$$
$$f(x) = e^x \quad \rightarrow \quad \frac{df}{dx} = e^x$$
$$f_a(x) = ax \quad \rightarrow \quad \frac{df}{dx} = a$$
Observe that we output 2 gradients because we have 2 inputs. Also observe that we need to save (cache) the previous inputs in memory.
Imagine that you have an output $y$ that is a function of $g$, which is a function of $f$, which is a function of $x$. If you want to know how much $g$ changes with a small change $dx$, that is $\frac{dg}{dx}$, use the chain rule. The chain rule is a formula for computing the derivative of the composition of two or more functions.
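Written out for the composition described above, the chain rule multiplies the local derivatives along the path from the output back to $x$. The concrete functions in the second line are a hypothetical example chosen for illustration and do not come from the original article:

$$\frac{dg}{dx} = \frac{dg}{df} \cdot \frac{df}{dx} \qquad \frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{df} \cdot \frac{df}{dx}$$

$$f(x) = 2x, \quad g(f) = f^2 \quad \rightarrow \quad \frac{dg}{dx} = 2f \cdot 2 = 8x$$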
The chain rule is the workhorse of backpropagation, so it is important to understand it now. In the picture below, a node $f(x,y)$ computes some function of two inputs $x$ and $y$ and outputs $z$. On the right side, the same node receives from somewhere (the loss function) a gradient $dL/dz$, which means "how much will $L$ change with a small change in $z$?". Since the node has 2 inputs, it has 2 gradients: one showing how $L$ changes with a small change $dx$, and the other showing how $L$ changes with a small change $dy$.
To calculate these gradients we need the incoming gradient $dL/dz$ ($dout$) and the derivative of the function $f(x,y)$ at that particular input; then we simply multiply them. We also need the cached input that was saved during forward propagation.
Observe below the implementation of the multiply and add gates in Python.
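The original listing is not reproduced in these notes. A minimal sketch of the two gates might look like the following; the class names `MultiplyGate` and `AddGate` and the `forward`/`backward` method names are assumptions chosen for illustration, not the original author's code:

```python
class MultiplyGate:
    """z = x * y. Both inputs are cached because each gradient needs the other input."""

    def forward(self, x, y):
        self.x, self.y = x, y  # cache inputs for the backward pass
        return x * y

    def backward(self, dout):
        dx = dout * self.y  # local derivative dz/dx = y
        dy = dout * self.x  # local derivative dz/dy = x
        return dx, dy


class AddGate:
    """z = x + y. Both local derivatives are 1, so no cache is needed."""

    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # The incoming gradient is passed unchanged to both inputs.
        return dout, dout
```

Each `backward` call returns one gradient per input, which is why the multiply gate has to remember what it saw during `forward`.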
With what we have learned so far, let's calculate the partial derivatives of some graphs.
Here we have a graph for the function $f(x,y,z) = (x+y)*z$, where $q = x + y$ is the output of the sum gate.
- Start from the output node $f$, and take the gradient of $f$ with respect to some criterion ($dout$) to be $1$.
- $dq=(dout(1) * z)$, which is $-4$ (how the output will change with a change in $q$).
- $dz=(dout(1) * q)$, which is $3$ (how the output will change with a change in $z$).
- The sum gate distributes its input gradient, so $dx=-4$, $dy=-4$ (how the output will change with $x$ and $y$).
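The steps above can be sketched in a few lines of Python. The text only gives $q = 3$ and $z = -4$; the inputs `x = -2` and `y = 5` below are an assumption chosen so that `x + y` equals 3, and the function name is hypothetical:

```python
def forward_backward(x, y, z):
    """Forward and backward pass for f(x, y, z) = (x + y) * z."""
    # Forward pass: keep the intermediate value q for the backward pass.
    q = x + y
    f = q * z

    # Backward pass: start with a gradient of 1 at the output.
    dout = 1.0
    dq = dout * z  # multiply gate: gradient w.r.t. q is the other input, z
    dz = dout * q  # multiply gate: gradient w.r.t. z is the other input, q
    dx = dq * 1.0  # sum gate passes dq through unchanged
    dy = dq * 1.0
    return f, {"dx": dx, "dy": dy, "dz": dz, "dq": dq}


f, grads = forward_backward(x=-2.0, y=5.0, z=-4.0)
print(f, grads)
```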
The following graph represents the forward propagation of a simple 2-input neural network with one output layer and a sigmoid activation.
- Start from the output node, considering that our error ($dout$) is $1$.
- The gradient at the input of the $1/x$ node will be $-1/(1.37^2)$, which is $-0.53$.
- The increment ($+1$) node does not change the gradient at its input, so it will be $(-0.53 * 1)$, which is $-0.53$.
- The gradient at the input of the exp node will be $(\exp(-1(\text{cached input})) * -0.53)$, which is $-0.2$.
- The gradient at the input of the negative gain ($*-1$) node will be $(-1 * -0.2)$, which is $0.2$.
- The sum node distributes the gradient, so $dw2=0.2$, and the other sum node feeding into it also receives $0.2$.
- That sum node again distributes the gradient, so both of its inputs again receive $0.2$.
- $dw0$ will be $(0.2 * -1)$, which is $-0.2$.
- $dx0$ will be $(0.2 * 2)$, which is $0.4$.
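The walk-through above can be checked numerically with a short sketch. The list implies `w0 = 2` and `x0 = -1`. The remaining values (`w1 = -3`, `x1 = -2`, `w2 = -3`) are assumptions chosen so that the cached input of the exp node is $-1$, as in the text. The function name is hypothetical:

```python
import math


def neuron_forward_backward(w0, x0, w1, x1, w2):
    """Sigmoid neuron 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))), gate by gate."""
    # Forward pass, keeping every intermediate value as a cache.
    s = w0 * x0 + w1 * x1 + w2  # two multiply gates and two sum gates
    neg = -s                    # negative gain node (*-1)
    e = math.exp(neg)           # exp node
    inc = 1.0 + e               # increment node (+1)
    out = 1.0 / inc             # 1/x node

    # Backward pass, starting with dout = 1 at the output.
    dout = 1.0
    dinc = dout * (-1.0 / inc ** 2)  # d(1/x)/dx = -1/x^2
    de = dinc * 1.0                  # +1 node passes the gradient through
    dneg = de * math.exp(neg)        # d(e^x)/dx = e^x at the cached input
    ds = dneg * -1.0                 # *-1 node
    grads = {
        "dw2": ds,  # sum nodes distribute the same gradient
        "dw0": ds * x0, "dx0": ds * w0,
        "dw1": ds * x1, "dx1": ds * w1,
    }
    return out, grads


out, grads = neuron_forward_backward(w0=2.0, x0=-1.0, w1=-3.0, x1=-2.0, w2=-3.0)
print(round(out, 2), {k: round(v, 2) for k, v in grads.items()})
```

Rounded to one decimal place, `dw0`, `dx0` and `dw2` match the values $-0.2$, $0.4$ and $0.2$ in the list.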
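PS: What follows is additional supplementary material.
PS: Note that both the vanishing gradient and exploding gradient problems arise during backpropagation, that is, while gradients are propagated from the later layers back to the earlier ones.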
Vanishing gradients (gradient vanishing problem)
Exploding gradients (gradient exploding problem)
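Exploding gradients are the exact opposite of the vanishing gradient problem shown in the figure above: when $\text{abs}(w) > 1$, the factors multiplied together during backpropagation grow as layers are added, so the resulting gradient update increases exponentially with depth. This is the exploding gradient problem.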
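To make the exponential behavior concrete, here is a minimal sketch. It assumes a deliberately simplified chain of layers that all share one scalar weight `w`, with an optional activation derivative `act_grad` per layer. Both names and the simplification are assumptions for illustration, not something stated in the source:

```python
def gradient_scale(w, n_layers, act_grad=1.0):
    """Magnitude of the factor a gradient picks up after passing back
    through n_layers layers, each contributing w * act_grad."""
    return abs(w * act_grad) ** n_layers


for w in (0.5, 1.0, 1.5):
    # Per-layer factors below 1 shrink the gradient, above 1 blow it up.
    print(w, gradient_scale(w, n_layers=20))

# With a sigmoid, the derivative is at most 0.25, so the per-layer
# factor can stay below 1 even when abs(w) is 1.
print(gradient_scale(1.0, n_layers=20, act_grad=0.25))
```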
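This method comes from a paper Hinton published in 2006. To address the gradient problem, Hinton proposed unsupervised layer-wise training. The basic idea is to train one layer of hidden nodes at a time, using the output of the previous hidden layer as input and passing this layer's output on as input to the next hidden layer; this process is layer-wise "pre-training". Once pre-training is complete, the whole network is "fine-tuned" (fine-tuning). Hinton used this method when training Deep Belief Networks: after each layer had been pre-trained, the BP algorithm was used to train the whole network. The idea amounts to first finding local optima and then combining them to search for the global optimum. The method has some benefits, but it is not used much today.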
Another way to address exploding gradients is weight regularization.
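The source does not say which regularizer it has in mind. As one common possibility, the sketch below assumes an L2 penalty $\frac{\lambda}{2}\sum w^2$ added to the loss, which adds $\lambda w$ to each weight's gradient and so pulls large weights back toward zero. The function names and the coefficient `lam` are hypothetical:

```python
def l2_penalty(weights, lam):
    """Extra loss term 0.5 * lam * sum(w^2) that discourages large weights."""
    return 0.5 * lam * sum(w * w for w in weights)


def l2_regularized_grads(weights, grads, lam):
    """Add the penalty's gradient (lam * w) to each data-loss gradient."""
    return [g + lam * w for w, g in zip(weights, grads)]


weights = [3.0, -4.0]
data_grads = [0.0, 0.0]  # pretend the data loss is flat at this point
print(l2_penalty(weights, lam=0.1))
print(l2_regularized_grads(weights, data_grads, lam=0.1))
```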
(3) Use the ReLU family of activation functions
(4) Batch Normalization
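Batchnorm is one of the most important results to come out of deep learning so far. It is now widely used across major network architectures, where it speeds up convergence and improves training stability. At its core, Batchnorm addresses the gradient problems that arise during backpropagation. Its full name is batch normalization, abbreviated BN: a normalization step rescales the output signal x to mean 0 and variance 1, which keeps the network stable.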
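The normalization step described above can be sketched for a batch of scalar activations as follows. The learnable scale `gamma`, shift `beta`, and the small constant `eps` are standard parts of batch normalization but are not mentioned in the text, so treat them, and the function name, as assumptions of this sketch:

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars to roughly zero mean and unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # biased batch variance
    # eps avoids division by zero when the batch has no spread.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

This covers only the training-time forward pass; running statistics for inference and the backward pass are left out to keep the example short.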
BP networks are mainly used in the following four areas.