Is Adam really the best optimizer? Some people think it's just the result of neural network evolution
2020-12-08 08:25:53 【osc_gkcftr6g】
Mention optimizers, and most people think of Adam. Since its introduction in 2015, Adam has been the "king" of the field. But recently, an assistant professor at Boston University put forward a hypothesis: Adam may not be the best optimizer; rather, the way neural networks are trained has made it the best.
Reported by Almost Human; authors: Du Wei, Devil.
Adam is one of the most popular optimizers in deep learning. It applies to many kinds of problems, including models with sparse or noisy gradients. It is easy to tune and reaches good results quickly; in fact, its default parameter configuration usually works well. Adam combines the advantages of AdaGrad and RMSProp: it maintains a separate learning rate for each parameter and adapts each one independently as learning progresses. It is also a momentum-based algorithm that exploits the history of past gradients. Thanks to these properties, Adam tends to be the go-to choice when picking an optimization algorithm.
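The update rule described above can be sketched in a few lines. This is a hypothetical standalone implementation for illustration only (in practice you would use a framework's built-in optimizer); the function name and signature are made up here:

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (1-based)."""
    m = beta1 * m + (1 - beta1) * grad           # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # RMSProp-like EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the EMAs
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter step size
    return param, m, v
```

The division by `sqrt(v_hat)` is what gives each parameter its own effective learning rate, and `m` is the momentum term; these are the two ingredients inherited from RMSProp and momentum methods respectively.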
But recently, Boston University assistant professor Francesco Orabona put forward a hypothesis: "Adam is not the best; rather, the training of neural networks has made it the best." He explained the hypothesis in detail in a blog post; the original content follows:
I have been working on online and stochastic optimization for some time. When Adam was proposed in 2015, I was already in this field. Adam was introduced by Diederik P. Kingma (now a senior research scientist at Google) and Jimmy Ba (assistant professor at the University of Toronto) in the paper "ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION". (Figure: the Adam algorithm.)
The paper is very good, but it is not a breakthrough, even less so by current standards. First, the theory is weak: for an algorithm that is supposed to handle stochastic optimization of nonconvex functions, the paper proves a regret guarantee. Second, the experiments are also weak: today, exactly the same experiments would get the paper rejected outright. It was later discovered that the proof contains an error, and that Adam fails to converge even on certain one-dimensional stochastic convex functions. Despite all these problems, Adam is still regarded as the "king" of optimization algorithms.
So let's be clear: we all know that Adam does not always deliver the best performance, but most of the time, people believe that using Adam on a deep learning problem will yield at least near-optimal results. In other words, Adam is considered today's default optimizer for deep learning. So what is the secret of its success?
In recent years, a large number of papers have tried to explain Adam and its performance. From the "adaptive learning rate" (adaptive to what? nobody really knows) to momentum and scale invariance, every aspect of Adam has its own interpretation. Yet none of these analyses gives a final answer about its performance.
Clearly, most of these ingredients, such as the adaptive learning rate, are beneficial when optimizing any function. But we still do not know why this particular combination of ingredients makes Adam the best algorithm. The balance among them is so delicate that even the small change needed to fix the non-convergence issue is believed to yield slightly worse performance than Adam.
But how likely is all of this? I mean, is Adam really the best optimization algorithm? In such a young field, how likely is it that the best deep learning optimizer was found years ago? Is there another explanation for Adam's astonishing performance?
So I put forward a hypothesis. But before explaining it, it is worth talking briefly about the applied deep learning community.
In a talk, Google machine learning researcher Olivier Bousquet described the deep learning community as a giant genetic algorithm: community researchers explore variants of algorithms and architectures in a semi-random way. Variants that keep working in large-scale experiments are preserved; ineffective ones are discarded. Note that this process seems independent of whether a paper is accepted or rejected: the community is so large and active that a good idea, even if rejected, can survive and become best practice within a few months; an example is Loshchilov and Hutter's "Decoupled Weight Decay Regularization". Similarly, the ideas in published papers are tried by hundreds of people attempting to reproduce them, and those that cannot be reproduced are ruthlessly abandoned. This process produces many heuristics that reliably deliver excellent results in experiments, and the selection pressure is constant. Indeed, despite resting on nonconvex formulations, deep learning methods perform remarkably reliably. (Note that the deep learning community has a strong "celebrity" bias; not every idea receives the same attention...)
So what does this giant genetic algorithm have to do with Adam? Looking closely at how the deep learning community creates ideas, I noticed a pattern: the new architectures people create tend to keep the optimization algorithm fixed, and most of the time that algorithm is Adam. This is because Adam is the default optimizer.
Here is my hypothesis: years ago, Adam was already a good optimization algorithm for the neural network architectures that existed at the time, so people have kept creating new architectures on which Adam works. We may never see architectures on which Adam fails, because such ideas are abandoned early! An idea of that kind would require designing a new architecture and a new optimizer at the same time, which is a very difficult task. In other words, most of the time community researchers only improve one set of choices (architecture, initialization strategy, hyperparameter search algorithm, and so on) while keeping the optimizer fixed as Adam.
I am sure many people will not believe this hypothesis. They will list all the specific problems where Adam is not the best optimization algorithm, for example problems where momentum gradient descent is optimal. However, I want to point out two things:
- I am not describing a law of nature, only a community tendency, one that may have influenced the co-evolution of certain architectures and optimizers;
- I have evidence to support this hypothesis.
If my conclusion is true, then we should expect Adam to work well on deep neural networks but badly on other models. And that is exactly what happens! For example, Adam performs poorly on simple convex and nonconvex problems, such as in the experiments of "Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates" by Vaswani et al.:
It seems that as soon as we discard the settings specific to neural networks (initialization, weights, loss function, etc.), Adam loses its adaptivity, and its magical default learning rate has to be tuned again. Note that you can write a linear predictor as a one-layer neural network, yet Adam still performs poorly in this setting. So the evolution of all the specific design choices of deep learning architectures may have made Adam better and better, while the simple problems above lack these characteristics, and Adam loses its luster on them.
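The point about the default learning rate can be illustrated on a toy convex problem. This is a hypothetical sketch, not an experiment from the post or from Vaswani et al.: fitting a single weight w so that w * x matches y at the point (x, y) = (1.0, 3.0), plain gradient descent with a tuned step size gets far closer to the optimum in 50 steps than a hand-rolled Adam run with its usual default learning rate of 0.001:

```python
import math

def loss_grad(w):
    # Gradient of the loss (w * 1.0 - 3.0) ** 2 with respect to w.
    return 2.0 * (w - 3.0)

# Gradient descent with a step size tuned for this problem.
w_gd = 0.0
for _ in range(50):
    w_gd -= 0.1 * loss_grad(w_gd)

# Adam with its common default hyperparameters.
w_adam, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8
for t in range(1, 51):
    g = loss_grad(w_adam)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(abs(w_gd - 3.0) < abs(w_adam - 3.0))  # True: tuned GD got much closer
```

With consistent gradient signs, Adam's effective per-step movement is roughly the learning rate itself, so with lr = 0.001 it barely moves in 50 steps; nothing is wrong with Adam here, but its default learning rate, which works so often in deep learning, needs retuning on this plain convex problem.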
All in all, Adam is probably the best optimizer only because the deep learning community is exploring a small region of the joint architecture/optimizer search space. If so, that is ironic for a community that abandoned convex methods precisely for their narrow focus within machine learning. As Facebook's chief AI scientist Yann LeCun once put it: "The key fell in the dark, and we are looking for it under the light."
The "novel" hypothesis sparks heated discussion among netizens
The assistant professor's hypothesis triggered a heated discussion on reddit, but it remains a speculative viewpoint: nobody can prove that it holds.
One netizen thought the hypothesis may not be entirely correct but is interesting, and raised a further question: how does Adam compare with other methods on a simple MLP? Perhaps, compared with the loss surfaces of general optimization problems, it is precisely the loss surfaces of neural networks that naturally suit Adam. If Adam performed worse on an MLP, that would be further evidence.
Another netizen thought this is possible. Adam has been used in most papers since its release, and some of the efficient architectures people have found may depend on it, especially architectures discovered by NAS or similar methods. In practice, though, many architectures also adapt well to other optimizers, and many new papers now use optimizers such as Ranger. Besides, there is another point about Adam: if it were truly adaptive, we would not need a learning rate finder and scheduler.
Original link: https://parameterfree.com/2020/12/06/neural-network-maybe-evolved-to-make-adam-the-best-optimizer/