Yang You will introduce the LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments for Batch training) optimizers, which expose more parallelism in deep learning training.
Abstract: In the last three years, supercomputers have become increasingly popular at leading AI companies. Amazon built a High Performance Computing (HPC) cloud. Google released its first 100-petaFLOP supercomputer (TPU Pod). Facebook made a submission to the Top500 supercomputer list. Why do they like supercomputers? Because the computation of deep learning is very expensive. For example, even with 16 TPUs, BERT training takes more than 3 days. On the other hand, supercomputers can process 10^17 floating-point operations per second. So why don’t we just use supercomputers and finish the training of deep neural networks in a very short time? The reason is that deep learning does not have enough parallelism to make full use of the thousands or even millions of processors in a typical modern supercomputer. There are two directions for parallelizing deep learning: model parallelism and data parallelism. Model parallelism is very limited. For data parallelism, current optimizers cannot scale to thousands of processors because large-batch training tends to converge to sharp minima, which hurts accuracy. In this talk, I will introduce the LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments for Batch training) optimizers, which expose more parallelism in deep learning. They not only make deep learning systems scale well, but also help real-world applications achieve higher accuracy.
Bio: Yang You is a PhD candidate in the Computer Science Division at UC Berkeley. His advisor is Professor James Demmel. His research interests include parallel and distributed algorithms, high performance computing (HPC), and machine learning.
Events are free and open to the public unless otherwise noted.