The optimization of deep learning models is based on gradient descent. Deep learning frameworks such as PyTorch and TensorFlow can be roughly divided into three parts: the model API, gradient calculation, and GPU acceleration. Gradient calculation plays a central role, and its core technology is automatic differentiation.

**1 Differentiation methods**

There are four differentiation methods:

Manual differentiation means working out the derivative by hand with the differentiation rules and then coding the resulting formula. The result is exact and efficient; the main disadvantage is that it takes effort.

Numerical differentiation approximates the derivative directly from its definition, using a finite difference with a small step h. This method is simple to implement, but it suffers from two serious problems: truncation error and roundoff error. Even so, it is a good way to check whether a gradient computed by another method is accurate.
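As a rough illustration of both error sources, the sketch below applies a forward difference to sin at x = 1 with three step sizes, then uses a central difference as a gradient check. The helper names (`forward_diff`, `central_diff`), the test function, and the step sizes are choices made for this example only.

```python
import math

def forward_diff(f, x, h):
    """Forward difference (f(x + h) - f(x)) / h; truncation error is O(h)."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Central difference; truncation error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
exact = math.cos(x)  # analytic derivative of sin at x

# Large h: truncation error dominates. Tiny h: roundoff error dominates.
errors = {h: abs(forward_diff(math.sin, x, h) - exact) for h in (1e-1, 1e-8, 1e-15)}
for h, err in errors.items():
    print(f"h = {h:g}  absolute error = {err:.3e}")

# Gradient check: compare the analytic derivative with a central difference.
check_passed = abs(central_diff(math.sin, x, 1e-5) - exact) < 1e-8
print("gradient check passed:", check_passed)
```

The middle step size should give the smallest error of the three, which is why gradient checks typically use a moderate h rather than the smallest representable one.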

Another method is symbolic differentiation, which hands the work we did in manual differentiation to the computer. Its limitation is that the expression must be closed-form, that is, it cannot contain loops or conditional branches, so that the whole problem can be converted into a purely symbolic one and solved with computer algebra software. However, when expressions are complex, the problem of "expression swell" is prone to occur: the derivative expression grows much larger than the original.
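To make expression swell concrete, here is a toy symbolic differentiator. It is a minimal sketch, not how a real computer algebra system works: expressions are nested tuples, only `+` and `*` are supported, and no simplification is applied. All names (`diff`, `size`, `evaluate`, `cube`) are made up for this example.

```python
def diff(e):
    """Differentiate expression e with respect to x, without simplifying."""
    if e == "x":
        return 1
    if not isinstance(e, tuple):
        return 0                      # numeric constant
    op, a, b = e
    if op == "+":
        return ("+", diff(a), diff(b))
    if op == "*":                     # the product rule copies both operands
        return ("+", ("*", diff(a), b), ("*", a, diff(b)))
    raise ValueError(f"unsupported operator: {op}")

def size(e):
    """Count the nodes of an expression tree."""
    return 1 + size(e[1]) + size(e[2]) if isinstance(e, tuple) else 1

def evaluate(e, x):
    """Evaluate an expression tree at a numeric value of x."""
    if e == "x":
        return x
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    left, right = evaluate(a, x), evaluate(b, x)
    return left + right if op == "+" else left * right

cube = ("*", ("*", "x", "x"), "x")    # x * x * x
d_cube = diff(cube)

print("nodes in f: ", size(cube))
print("nodes in f':", size(d_cube))
print("f'(2) =", evaluate(d_cube, 2.0))
```

Every product doubles part of the tree, so without simplification the derivative of a long product is far larger than the product itself.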

The last is our protagonist: automatic differentiation. It is also the most widely used differentiation method in programs.

**2 Automatic differentiation**

Automatic differentiation rests on one observation: any differentiable computation is a composition of a finite sequence of differentiable elementary operations, so its derivative can be assembled from their derivatives with the chain rule.

We can regard the formula f(x1, x2) = ln(x1) + x1x2 - sin(x2) as a computational graph (it can also be viewed as a tree). During the forward calculation, we obtain the value of each node.
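The forward calculation can be sketched by giving each node of the graph its own variable. The node names v1 to v5 and the input values (2, 5) are arbitrary choices for this illustration.

```python
import math

def forward_trace(x1, x2):
    """Evaluate f(x1, x2) = ln(x1) + x1*x2 - sin(x2) one node at a time."""
    v1 = math.log(x1)    # ln(x1)
    v2 = x1 * x2         # x1 * x2
    v3 = math.sin(x2)    # sin(x2)
    v4 = v1 + v2         # ln(x1) + x1*x2
    v5 = v4 - v3         # output node f(x1, x2)
    return {"v1": v1, "v2": v2, "v3": v3, "v4": v4, "v5": v5}

trace = forward_trace(2.0, 5.0)
for name, value in trace.items():
    print(name, "=", value)
```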

Then we can express the computation of df/dx1 as a combination of the derivatives of these elementary operations. There are two ways to order the calculation: propagating derivatives from the inputs toward the output is called forward mode, and propagating them from the output back toward the inputs is called reverse mode.
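Both modes can be written out by hand for this f. The sketch below is only an illustration of the two orderings, not how a framework implements them; the function names and the `b`-prefixed adjoint variables are naming choices made here.

```python
import math

def f_forward(x1, x2, dx1, dx2):
    """Forward mode: carry (value, tangent) pairs from the inputs to the output."""
    v1, dv1 = math.log(x1), dx1 / x1
    v2, dv2 = x1 * x2, dx1 * x2 + x1 * dx2
    v3, dv3 = math.sin(x2), math.cos(x2) * dx2
    v4, dv4 = v1 + v2, dv1 + dv2
    v5, dv5 = v4 - v3, dv4 - dv3
    return v5, dv5

def f_reverse(x1, x2):
    """Reverse mode: forward pass for values, then adjoints from output to inputs."""
    v1 = math.log(x1)
    v2 = x1 * x2
    v3 = math.sin(x2)
    v4 = v1 + v2
    v5 = v4 - v3
    bv5 = 1.0                  # df/dv5
    bv4, bv3 = bv5, -bv5       # v5 = v4 - v3
    bv1, bv2 = bv4, bv4        # v4 = v1 + v2
    bx1 = bv1 / x1 + bv2 * x2              # x1 feeds v1 and v2
    bx2 = bv2 * x1 + bv3 * math.cos(x2)    # x2 feeds v2 and v3
    return v5, (bx1, bx2)

# Forward mode needs one pass per input direction.
_, df_dx1 = f_forward(2.0, 5.0, 1.0, 0.0)
_, df_dx2 = f_forward(2.0, 5.0, 0.0, 1.0)

# Reverse mode returns both partial derivatives from a single backward pass.
value, (rx1, rx2) = f_reverse(2.0, 5.0)
print(df_dx1, df_dx2)
print(rx1, rx2)
```

Forward mode was run twice here, once per input, while reverse mode produced the whole gradient in one sweep because f has a single output.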

The two modes compute the same gradient values, but because the order of calculation differs, so does the cost. Generally, if the Jacobian matrix is tall (many more outputs than inputs), forward mode is more efficient; if the Jacobian matrix is wide (many more inputs than outputs), reverse mode is more efficient.

**3 JVP, VJP and vmap**

If you have used PyTorch, you will have noticed that when y is a tensor rather than a scalar, you must pass a gradient tensor to y.backward() (this argument is named gradient; older versions of torch.autograd.backward called it grad_variables). The resulting x.grad has the same shape as x. Where is the Jacobian matrix?

The reason is that the backward pass in deep learning frameworks such as TensorFlow and PyTorch does not compute tensor-by-tensor derivatives; it only computes scalar-by-tensor derivatives. When we call y.backward() with grad_variables v, the framework effectively converts y into a weighted sum l = torch.sum(y * v), where l is a scalar, so x.grad naturally has the same shape as x. In other words, x.grad is the vector-Jacobian product (VJP) of v with the Jacobian of y with respect to x. This design makes sense because the loss in deep learning is almost always a scalar, and gradient descent requires the gradient to have the same shape as x.
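The equivalence between the weighted sum and the vector-Jacobian product can be checked without any framework. The sketch below uses plain Python rather than PyTorch: the function f, its hand-written Jacobian, and the values of x and v are all made up for this illustration.

```python
import math

def f(x):
    """Example vector-valued function from R^2 to R^3 (chosen for illustration)."""
    x0, x1 = x
    return [x0 * x1, math.sin(x0), x1 ** 2]

def jacobian(x):
    """Analytic 3x2 Jacobian of f: row i holds the derivatives of y[i]."""
    x0, x1 = x
    return [[x1, x0], [math.cos(x0), 0.0], [0.0, 2.0 * x1]]

def vjp(x, v):
    """Vector-Jacobian product v^T J: one entry per input, same shape as x."""
    J = jacobian(x)
    return [sum(v[i] * J[i][j] for i in range(len(v))) for j in range(len(x))]

def grad_weighted_sum(x, v, h=1e-6):
    """Central-difference gradient of the scalar l(x) = sum(y * v)."""
    def l(z):
        return sum(yi * vi for yi, vi in zip(f(z), v))
    grad = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        grad.append((l(xp) - l(xm)) / (2 * h))
    return grad

x = [1.5, -0.5]
v = [1.0, 2.0, 3.0]
from_vjp = vjp(x, v)
from_sum = grad_weighted_sum(x, v)
print(from_vjp)
print(from_sum)
```

The two results agree up to finite-difference error, and both have two entries, matching the shape of x rather than the 3x2 shape of the Jacobian.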

But what if we want to obtain the Jacobian matrix?

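The answer is to differentiate each element of y with respect to x separately, which yields the Jacobian one row at a time. In addition, Google's deep learning framework JAX takes a more advanced approach: it uses the vectorizing transformation vmap to batch these computations and speed them up.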
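The row-by-row idea can be sketched in plain Python: passing a one-hot vector v to a backward pass selects a single output, so each call returns one row of the Jacobian. The function y = [x0 * x1, sin(x0)] and its hand-written `vjp` are assumptions made for this example, standing in for what a framework's backward pass would compute.

```python
import math

def vjp(x, v):
    """Hand-written backward pass for y = [x0 * x1, sin(x0)]; returns v^T J."""
    x0, x1 = x
    return [v[0] * x1 + v[1] * math.cos(x0), v[0] * x0]

def jacobian_by_rows(x, num_outputs):
    """Build the Jacobian with one backward pass per output element."""
    rows = []
    for i in range(num_outputs):
        one_hot = [1.0 if k == i else 0.0 for k in range(num_outputs)]
        rows.append(vjp(x, one_hot))   # row i: derivatives of y[i] w.r.t. x
    return rows

J = jacobian_by_rows([2.0, 3.0], num_outputs=2)
for row in J:
    print(row)
```

The Python loop over one-hot vectors is the part that a vectorizing transformation such as vmap would run as a single batched computation.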

**4 Reference**