sangyx's Blog

[Week 3] Check-in

sangyx
Published: 06/21/2020

1. What did you do this week?

This week's job was to add test cases and revise the code according to my mentor's requests.

2. Difficulty

No difficulties this week.

3. What is coming up next?

The work for next week is to merge vjp and start working on jacobian.


[Week 2] Check-in

sangyx
Published: 06/14/2020

1. What did you do this week?

This week's main job was to implement backward propagation. With that done, the whole vjp is complete.

Because this platform has poor support for figures and code, see https://sangyx.com/1790 for more detail.

2. Difficulty

I overcomplicated this problem at the beginning. Later I found that imitating PyTorch's API is more intuitive and simpler.

In addition, because I am not very familiar with uarray, some implementations are not elegant.

3. What is coming up next?

The work for next week is to test vjp and work with my mentor to rewrite the code more elegantly.


[Week 1] Check-in

sangyx
Published: 06/11/2020

1. What did you do this week?

This week's main job was to build the calculation graph. The core of the automatic differentiation system is vjp, which consists of calculation-graph construction and gradient calculation.

2. Difficulty

The computational graph construction is not as simple as I thought. The main problem encountered in the process is that some basic differential operators need not only the input tensors but also some parameters of the original function, such as the axis argument of np.sum. This requires that the parameters needed by the differential operator be saved in advance when constructing the calculation graph. In addition, the parents of each node must be marked according to the calculation path, so that back propagation can follow the correct path.
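
As a rough illustration (a hypothetical sketch, not the project's actual code), such a node can store its parents together with the saved parameters:

    class Node:
        """One node of the calculation graph (illustrative only)."""
        def __init__(self, value, func=None, parents=(), kwargs=None):
            self.value = value          # result of the forward computation
            self.func = func            # the primitive that produced this node
            self.parents = parents      # parent nodes along the calculation path
            self.kwargs = kwargs or {}  # saved parameters, e.g. axis for np.sum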

3. What is coming up next?

The work for next week is to implement simple back propagation and complete a full differentiation operation.


[Community Bonding Period] What is Automatic Differentiation?

sangyx
Published: 05/25/2020

The optimization process of deep learning models is based on the gradient descent method. Deep learning frameworks such as PyTorch and TensorFlow can be divided into three parts: the model API, gradient calculation, and GPU acceleration. Gradient calculation plays an important role, and the core technology of this part is automatic differentiation.

1 Differentiation methods

There are four differentiation methods:

  • Manual differentiation

  • Numerical differentiation

  • Symbolic differentiation

  • Automatic differentiation

Manual differentiation means working out the derivative by hand using the rules of calculus and writing out the resulting formula. This method is accurate and effective; the only disadvantage is that it takes effort.

Numerical differentiation uses the definition of the derivative, approximating f'(x) by (f(x + h) - f(x)) / h for a small h. This method is simple to implement, but it has two serious problems: truncation error and round-off error. Still, it is a good way to check whether a gradient is accurate.
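
For a concrete sketch (my own example, assuming NumPy is available), a central-difference gradient check can be written as:

    import numpy as np

    def numerical_grad(f, x, h=1e-6):
        """Central-difference approximation of the gradient of a scalar function f at x."""
        x = np.asarray(x, dtype=float)
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e.flat[i] = h
            grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
        return grad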

Another method is symbolic differentiation, which transfers the work we did in manual differentiation to the computer. The problem with this method is that the expression must be closed-form, that is, it cannot contain loops or conditional expressions, so that the entire problem can be converted into a purely symbolic one and solved with computer algebra software. However, when expressions are complex, the problem of "expression swell" is prone to occur.
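
As a small example (not from the original post), SymPy can differentiate a closed-form expression of this kind symbolically:

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')
    f = sp.log(x1) + x1 * x2 - sp.sin(x2)   # a closed-form expression
    print(sp.diff(f, x1))                   # x2 + 1/x1
    print(sp.diff(f, x2))                   # x1 - cos(x2)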

The last is our protagonist: automatic differentiation. It is also the most widely used differentiation method in programs.

2 Automatic differentiation

Automatic differentiation builds on the essence of differential calculation: a differentiable computation is a composition of a finite series of differentiable operators.

We can regard the formula f(x1, x2) = ln(x1) + x1*x2 - sin(x2) as a calculation graph (it can also be regarded as a tree structure). In the process of forward calculation, we obtain the value of each node.
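
A minimal sketch of this forward pass (with arbitrary input values) evaluates one primitive per node:

    import numpy as np

    x1, x2 = 2.0, 5.0    # arbitrary inputs

    # Forward pass: one primitive operator per node of the graph
    v1 = np.log(x1)      # ln(x1)
    v2 = x1 * x2         # x1 * x2
    v3 = np.sin(x2)      # sin(x2)
    y  = v1 + v2 - v3    # f(x1, x2)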

Then we can express the derivation of df/dx1 as a combination of a series of differential operators. The calculation can be done in two ways: propagating derivatives from the inputs towards the output is called Forward Mode, and propagating them from the output back towards the inputs is called Reverse Mode.
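
Continuing the same example, a self-contained sketch of the reverse-mode pass starts from dy/dy = 1 and pushes derivatives back through the graph:

    import numpy as np

    x1, x2 = 2.0, 5.0                            # same inputs as above

    # y = v1 + v2 - v3 with v1 = ln(x1), v2 = x1*x2, v3 = sin(x2)
    dy_dv1, dy_dv2, dy_dv3 = 1.0, 1.0, -1.0
    dy_dx1 = dy_dv1 * (1.0 / x1) + dy_dv2 * x2   # df/dx1 = 1/x1 + x2
    dy_dx2 = dy_dv2 * x1 + dy_dv3 * np.cos(x2)   # df/dx2 = x1 - cos(x2)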

The gradient values calculated by the two modes are the same, but because the order of computation differs, the speed differs. Generally, if the Jacobian matrix is tall (more outputs than inputs), forward mode is more efficient; if it is wide (more inputs than outputs), reverse mode is more efficient.

3 JVP, VJP and vmap

If you have used PyTorch, you will find that when y is a tensor rather than a scalar, you are asked to pass a grad_variables argument to y.backward(), and the resulting derivative x.grad has the same shape as x. Where is the Jacobian matrix?

The reason is that deep learning frameworks such as TensorFlow and PyTorch do not allow differentiating a tensor with respect to a tensor directly; they only retain scalar-by-tensor derivatives. When we call y.backward() and pass a grad_variables v, the framework actually converts y into the weighted sum l = torch.sum(y * v), where l is a scalar, so the gradient x.grad naturally has the same shape as x. The reason for this design is that the loss in deep learning is always a scalar, and gradient descent requires the gradient to have the same shape as x.
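
A small PyTorch sketch of this behaviour (variable names are illustrative):

    import torch

    x = torch.randn(3, requires_grad=True)
    y = x ** 2                 # y is a vector, not a scalar
    v = torch.ones_like(y)     # the grad_variables argument

    y.backward(v)              # same effect as torch.sum(y * v).backward()
    print(x.grad)              # same shape as x; here it equals 2 * x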

But what if we want to obtain the Jacobian matrix?

The answer is to differentiate with respect to x once for each element of y, which produces the Jacobian one row at a time. In addition, Google's deep learning framework JAX uses a more advanced method, the vectorizing transformation vmap, to speed up this calculation.
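
A minimal JAX sketch (with a toy function f of my own, not from the post) builds the Jacobian row by row with vjp and then vectorizes that loop with vmap, which is roughly how jax.jacrev works:

    import jax
    import jax.numpy as jnp

    def f(x):                                 # toy function: R^2 -> R^2
        return jnp.array([x[0] * x[1], jnp.sin(x[1])])

    x = jnp.array([1.0, 2.0])
    y, vjp_fn = jax.vjp(f, x)

    # One VJP per output component gives one row of the Jacobian
    rows = [vjp_fn(row)[0] for row in jnp.eye(y.shape[0])]

    # vmap vectorizes the same loop in one call
    jac = jax.vmap(vjp_fn)(jnp.eye(y.shape[0]))[0]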

