I have a few questions on the section "Approximate Architecture Gradient" in the paper.

Why does evaluating the finite difference require only two forward passes for the weights and two backward passes for α, and why is the complexity reduced from O(|α||w|) to O(|α|+|w|)?
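For reference, my reading of the finite-difference approximation (reconstructed from the paper, so the notation may differ slightly) is:

$$
\nabla^{2}_{\alpha, w}\,\mathcal{L}_{\mathrm{train}}(w,\alpha)\,\nabla_{w'}\mathcal{L}_{\mathrm{val}}(w',\alpha)
\;\approx\;
\frac{\nabla_{\alpha}\mathcal{L}_{\mathrm{train}}(w^{+},\alpha)-\nabla_{\alpha}\mathcal{L}_{\mathrm{train}}(w^{-},\alpha)}{2\epsilon},
\qquad
w^{\pm}=w\pm\epsilon\,\nabla_{w'}\mathcal{L}_{\mathrm{val}}(w',\alpha)
$$

If I read this correctly, evaluating L_train at w⁺ and w⁻ gives the two forward passes for the weights, and differentiating each of those two losses with respect to α gives the two backward passes, so nothing of size |α|×|w| is ever formed, which is presumably where O(|α|+|w|) comes from. Please correct me if that reconstruction is wrong.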
Looking at equation 7, we have a second-order partial derivative that is computationally expensive to compute, and to avoid this the finite difference method is used. <-- How is the second-order partial derivative related to the finite difference method?
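To make my question concrete, here is a minimal PyTorch-style sketch of how I imagine the finite difference being evaluated (this is my own illustration, not the authors' code; `train_loss_fn`, `w`, `alpha`, `vector` and `eps_scale` are placeholder names I made up):

```python
import torch

def hessian_vector_product(train_loss_fn, w, alpha, vector, eps_scale=0.01):
    """Approximate (d^2 L_train / d alpha d w) @ vector by central finite differences.

    `w` and `alpha` are lists of parameter tensors; `vector` is dL_val/dw'
    (same shapes as `w`); `train_loss_fn()` recomputes L_train with the
    current weights. Only two gradient evaluations w.r.t. alpha are needed.
    """
    # Step size epsilon, scaled by the norm of the perturbation vector.
    eps = eps_scale / torch.cat([v.reshape(-1) for v in vector]).norm()

    # w+ = w + eps * vector, then one forward pass and one backward pass for alpha.
    with torch.no_grad():
        for p, v in zip(w, vector):
            p.add_(eps * v)
    grads_pos = torch.autograd.grad(train_loss_fn(), alpha)

    # w- = w - eps * vector, then the second forward pass and backward pass for alpha.
    with torch.no_grad():
        for p, v in zip(w, vector):
            p.sub_(2 * eps * v)
    grads_neg = torch.autograd.grad(train_loss_fn(), alpha)

    # Restore the original weights.
    with torch.no_grad():
        for p, v in zip(w, vector):
            p.add_(eps * v)

    # Central difference: (g+ - g-) / (2 * eps), one tensor per alpha parameter.
    return [(gp - gn) / (2 * eps) for gp, gn in zip(grads_pos, grads_neg)]
```

If that sketch is roughly right, the second-order term never has to be computed explicitly; it is replaced by the difference of two first-order gradients of L_train with respect to α, taken at the two perturbed weight settings.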
"We also note that when momentum is enabled for weight optimisation, the one-step unrolled learning objective in equation 6 is modified accordingly and all of our analysis still applies." <-- How is momentum directly related to the need to apply the chain rule to equation 6?
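For what it is worth, my guess (assuming plain SGD with momentum, with velocity v and momentum coefficient μ, symbols that are mine rather than the paper's) is that the one-step unroll becomes:

$$
w' = w - \xi \left( \mu\, v + \nabla_{w}\,\mathcal{L}_{\mathrm{train}}(w,\alpha) \right)
$$

Since ∇_w L_train(w, α) still depends on α, applying the chain rule to L_val(w', α) still produces the same second-order term as in equation 7, with only the evaluation point shifted by the momentum term. Is that the intended reading?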