Is Adam really the best optimizer? Some people think it's just the result of neural network evolution

2020-12-08 08:25:53 osc_ gkcftr6g

When it comes to optimizers, most people think of Adam. Since its introduction in 2015, Adam has been the "king" of the field. Recently, however, an assistant professor at Boston University put forward a hypothesis: Adam may not be the best optimizer in itself; rather, it is the training of neural networks that has made it the best.

Report from Almost Human. Authors: Du Wei, Devil.

Adam is one of the most popular optimizers in deep learning. It applies to many kinds of problems, including models with sparse or noisy gradients. It is easy to tune and gets good results quickly; in fact, the default parameter configuration usually works well. Adam combines the advantages of AdaGrad and RMSProp. It starts from a single global learning rate and adapts the effective step size of each parameter independently as training proceeds. Adam is also a momentum-based algorithm that uses the history of past gradients. Because of these properties, Adam is often the go-to choice when picking an optimization algorithm.

Recently, however, Francesco Orabona, an assistant professor at Boston University, put forward a hypothesis: "It is not that Adam is the best; it is that the training of neural networks has made it the best." He laid out the hypothesis in detail in an article, which reads as follows:

I have been working on online and stochastic optimization for some time. When Adam was proposed in 2015, I was already in this field. Adam was proposed by Diederik P. Kingma, now a senior research scientist at Google, and Jimmy Ba, now an assistant professor at the University of Toronto, in the paper "Adam: A Method for Stochastic Optimization".

Adam Algorithm
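
For reference, the update rule from Kingma and Ba's paper can be written as follows. Here g_t is the stochastic gradient at step t, α is the step size, m_0 = v_0 = 0, and squaring, square roots, and division are applied element-wise. The default values suggested in the paper are α = 0.001, β_1 = 0.9, β_2 = 0.999, and ε = 10^{-8}.

```latex
\begin{align}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align}
```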

The paper is very good, but it was not a breakthrough, and even less so by current standards. First, the theory is weak: it gives a regret guarantee for an algorithm that is supposed to handle stochastic optimization of nonconvex functions. Second, the experiments are also weak: exactly the same experiments would be flatly rejected today. Later, an error was discovered in the proof, and it was shown that Adam fails to converge on certain one-dimensional stochastic convex functions. Despite these problems, Adam is still considered the "king" of optimization algorithms.

So let us be clear: everybody knows that Adam does not always give the best performance. But most of the time, people believe they can use Adam on a deep learning problem and get, if not the best performance, at least the second best. In other words, Adam is considered the default optimizer for deep learning today. So what is the secret of Adam's success?

In recent years, a large number of papers have tried to explain Adam and its performance. From the "adaptive learning rate" (adaptive to what? Nobody really knows) to momentum and scale invariance, every aspect of Adam has a corresponding explanation. However, none of these analyses gives a final answer about its performance.

Clearly, most of these ingredients, such as the adaptive learning rate, benefit the optimization of any function. But we still do not know why this exact combination makes Adam the best algorithm. The balance among the ingredients is so delicate that even the small change needed to fix the non-convergence problem is considered to give slightly worse performance than Adam.

But how likely is all this? I mean, is Adam really the best optimization algorithm? In such a "young" field, how likely is it that we reached the best possible optimization for deep learning a few years ago? Could there be another explanation for Adam's remarkable performance?

So I propose a hypothesis. But before explaining it, we need to talk briefly about the applied deep learning community.

In a talk, Google machine learning researcher Olivier Bousquet described the deep learning community as a giant genetic algorithm: researchers in the community explore the space of all variants of algorithms and architectures in a semi-random way. Things that work consistently in large experiments are kept, and those that do not are discarded. Note that this process seems to be independent of whether papers are accepted or rejected: the community is so large and active that good ideas in rejected papers still survive and are turned into best practices within a few months, for example Loshchilov and Hutter's "Decoupled Weight Decay Regularization". Similarly, ideas in published papers are reproduced by hundreds of people, and those that cannot be reproduced are mercilessly discarded. This process has produced many heuristics that consistently give good results in experiments, and the emphasis here is on "consistently". Indeed, although it is based on a nonconvex formulation, deep learning performs very reliably. (Note that the deep learning community also has a strong bias towards "famous" people, so not every idea gets the same attention...)

So what is the link between this giant genetic algorithm and Adam? Looking carefully at how ideas are created in the deep learning community, I noticed a pattern: people usually try new architectures while keeping the optimization algorithm fixed, and most of the time that algorithm is Adam. This happens because Adam is the default optimizer.

Here is my hypothesis: Adam was a very good optimization algorithm for the neural network architectures that existed a few years ago, so people kept creating new architectures on which Adam works. We may not see architectures on which Adam fails, because such ideas have already been discarded! Such ideas would require designing a new architecture and a new optimizer at the same time, which is a very difficult task. In other words, most of the time the community evolves only one set of parameters (architectures, initialization strategies, hyperparameter search algorithms, and so on) while keeping the optimizer fixed to Adam.

I am sure many people will not buy this hypothesis. They will list all sorts of specific problems on which Adam is not the best optimization algorithm, for example problems where gradient descent with momentum is the best. However, I would like to point out two things:

  • I am not describing a law of nature, only a tendency of the community, one that may have influenced the co-evolution of some architectures and optimizers;
  • I have evidence to support this hypothesis .

If my claim is true, we would expect Adam to work very well on deep neural networks and poorly on other models. And this does happen! For example, Adam performs poorly on simple convex and nonconvex problems, as in the experiments of Vaswani et al. in "Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates".

It seems that the moment we move away from the specific setting of deep neural networks (their initialization, weight scale, loss function, and so on), Adam loses its adaptivity, and its magic default learning rate must be tuned again. Note that a linear predictor can be written as a one-layer neural network, yet Adam does not work well in that case either. So all the specific choices made in deep learning architectures may have evolved to make Adam work better and better, while the simple problems above have none of these properties, so Adam loses its shine on them.
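
As a toy illustration of a default learning rate needing retuning outside the deep learning setting, the sketch below fits a one-dimensional linear predictor y = w * x with Adam at its default settings and with plain gradient descent at a hand-picked step size. The data, step count, and the gradient descent step size are assumptions chosen for illustration; this is not one of the experiments from Vaswani et al., and it makes no claim about Adam on real models.

```python
import math


def loss_and_grad(w, xs, ys):
    """Mean squared-error loss and its gradient for the predictor y = w * x."""
    n = len(xs)
    residuals = [w * x - y for x, y in zip(xs, ys)]
    loss = sum(0.5 * r * r for r in residuals) / n
    grad = sum(r * x for r, x in zip(residuals, xs)) / n
    return loss, grad


def run_adam(xs, ys, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with the default hyperparameters from Kingma and Ba, starting at w = 0."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        _, g = loss_and_grad(w, xs, ys)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def run_gd(xs, ys, steps, lr):
    """Plain gradient descent with a fixed step size, starting at w = 0."""
    w = 0.0
    for _ in range(steps):
        _, g = loss_and_grad(w, xs, ys)
        w -= lr * g
    return w


# Hypothetical toy data generated by y = 5 * x, so the optimum is w = 5.
xs = [1.0, 2.0, 3.0]
ys = [5.0, 10.0, 15.0]

w_adam = run_adam(xs, ys, steps=200)
w_gd = run_gd(xs, ys, steps=200, lr=0.2)
adam_loss, _ = loss_and_grad(w_adam, xs, ys)
gd_loss, _ = loss_and_grad(w_gd, xs, ys)

print("Adam (default lr):", w_adam, adam_loss)
print("GD (lr=0.2):      ", w_gd, gd_loss)
```

In this setting the gradients keep the same sign and nearly the same magnitude, so each Adam step moves w by roughly the learning rate, and 200 steps at 0.001 leave it far from the optimum. Raising Adam's learning rate would fix this toy case, which is the point: the default has to be retuned.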

All in all, Adam may be the best optimizer only because the deep learning community is exploring a small region of the joint search space of architectures and optimizers. If so, that would be ironic for a community that abandoned convex methods because they focused on only a narrow region of possible machine learning algorithms. It is like what Facebook chief AI scientist Yann LeCun once said: "The keys were dropped in the dark, yet we look for them where the light is."

A "novel" hypothesis sparks heated discussion online

The assistant professor's hypothesis triggered a heated discussion on reddit, but the views expressed were inconclusive: no one could prove that the hypothesis holds.

One commenter thought the hypothesis may not be entirely correct but is interesting, and raised a further question: how does Adam compare with other methods on a simple MLP? Compared with the loss surfaces of general optimization problems, perhaps it is simply the loss surface of neural networks that makes them naturally suited to Adam. If Adam performs worse on an MLP, that would be further evidence.

Another commenter thought the hypothesis is plausible. Adam has been used in most papers since its introduction, and some of the efficient architectures people have found depend on it, especially architectures found with NAS or similar methods. In practice, though, many architectures also adapt well to other optimizers, and many new papers now use optimizers such as Ranger. Another remark about Adam is that if it were truly adaptive, we would not need learning rate finders and schedulers.

Link to the original text: https://parameterfree.com/2020/12/06/neural-network-maybe-evolved-to-make-adam-the-best-optimizer/

Reddit link: https://www.reddit.com/r/MachineLearning/comments/k7yn1k/d_neural_networks_maybe_evolved_to_make_adam_the/
