
[deep learning] - Summary of common distributed training technologies

2020-12-08 14:44:52 ONEFLOW deep learning framework


Overview

Distributed systems, high concurrency, multithreading: three keywords a programmer seemingly can never escape. As soon as you move beyond a single machine / single node and involve 2 or more machines, you run into distributed computing. The same goes for deep learning. When you need to train on massive amounts of data or train a huge model, even a server with 8 TESLA V100 cards is not enough. To speed up training, you naturally expand to distributed training across 2, 4, or even more machine nodes...
In the CV field, to minimize the time needed to train on ImageNet, a Tencent team used 2048 Tesla P40 GPUs in 2018 and compressed the training of ResNet50 on ImageNet to 6.6 minutes; see the paper 《Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes》 for details. Zhihu also carried a related report: "Training ImageNet in 4 minutes! Tencent Jizhi sets a world record for AI training".
![tm.png](https://img-blog.csdnimg.cn/img_convert/954d9339bbaccc422b5e0637cd7107aa.png)
In the NLP field, ever bigger models keep appearing, such as BERT and the GPT series. Training the GPT-2 model used 256 Google Cloud TPU v3 cores, and GPT-3 reportedly burned through a huge number of GPUs and 12 million dollars of compute. There are related articles on Zhihu as well: How should we view the 175-billion-parameter GPT-3?
As the saying goes, in the martial arts world only speed is unbeatable. To be fast, we have to take the path of distributed training. Nowadays, every deep learning framework basically supports distributed training of deep learning models, so the questions arise: what technologies and frameworks are used for distributed training of deep learning, and how do they work? How easy is distributed training in each framework, which one trains faster, and what speedup ratios do they achieve?

This series of articles will answer and summarize the above questions; consider it a brick thrown out to attract jade, and your attention and feedback are welcome! **This article focuses on some common technologies and concepts in distributed deep learning training; the next article will sort out the distributed interfaces of each framework, how to use them, and compare their measured speeds.** If anything is missing or inaccurate, please point it out. Finally, **allow us to recommend our recent work, the DLPerf repository.** It contains speed evaluations of the frameworks mentioned above, along with detailed READMEs, so that you can easily reproduce distributed, multi-machine runs on the different frameworks.

1. Data parallelism or model parallelism

Usually, there are two approaches to distributed model training in deep learning:

  • Data parallelism (Data Parallel)
  • Model parallelism (Model Parallel)

Data parallelism diagram

In data parallelism, the sample data is partitioned; each partition is sent to a training node and processed by a complete copy of the model, and finally the results from the multiple nodes are merged, as shown in the figure below:
![](https://img-blog.csdnimg.cn/img_convert/0fcf954f2680319ad79df08165c078d7.png)

Model parallelism diagram

In model parallelism, the model is partitioned; the complete data is sent to the training nodes and processed by the partitioned model pieces, and finally the results from the multiple nodes are combined, as shown in the figure below:
![](https://img-blog.csdnimg.cn/img_convert/2aa37c33638c628028110388d62ab05d.png)

Grey denotes data, blue denotes the model.

1.1 Data parallelism

What is data parallelism?

Viewed from the GPU's perspective, data parallelism simply means: across the parallel training devices, the whole training dataset is divided into several shards; within the same training iteration, each GPU device trains the model with its own shard of data, after which the model gradients are aggregated and the model state is synchronized across GPUs. As a result, within one training iteration each GPU device trains the model on its own shard in parallel, which greatly speeds up training of the whole model.
![Data parallelism](https://img-blog.csdnimg.cn/img_convert/e9602bb39290f34b99fffa5527c0a590.png)
As can be seen from the figure above, with data parallelism each GPU device maintains an identical copy of the model, and a complete training step consists of the following 3 steps:
1. The CPU feeds different training data (mini-batches) to the GPU0 and GPU1 devices respectively;
2. The identical model copies on the different devices run forward and backward propagation with their own mini-batch;
3. The model weights on the different GPU devices are synchronized and updated.
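As a concrete illustration (not part of the original article), here is a minimal data-parallel training sketch, assuming PyTorch's DistributedDataParallel with the NCCL backend and one process per GPU; the toy model and the script name `train_ddp.py` are placeholders:

```python
# Minimal data-parallel sketch (assumption: PyTorch + NCCL backend, 1 process per GPU).
# Launch with e.g.: torchrun --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # join the process group
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])  # full replica per GPU
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(128, 1024).cuda()               # step 1: each rank gets its own mini-batch
        y = torch.randint(0, 10, (128,)).cuda()
        loss = torch.nn.functional.cross_entropy(model(x), y)  # step 2: forward
        optimizer.zero_grad()
        loss.backward()                                 # steps 2-3: backward; DDP all-reduces gradients
        optimizer.step()                                # step 3: every rank applies the same update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In a real job the mini-batches would come from a DistributedSampler over the dataset rather than random tensors.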

Why is data parallelism needed?

Simply put, only speed is unbeatable. As models and datasets get bigger and bigger, training a complete model takes longer and longer. To speed up training, we usually do not train on a single GPU device but on a single machine with multiple GPUs, or on multiple machines with multiple GPUs. That is, we use distributed training to reduce training time by expanding the number of devices, aiming for a near-linear speedup ratio.
Let's take an example. Given:
the training set is the 1.28-million-image ImageNet; the model is ResNet50; a single GPU's memory supports a maximum batch size of 128;
iterating over 1 batch (data loading + forward + backward gradient update) takes 7.2 seconds, and the more GPU devices, the faster (linear scaling).
Question:
How long does it take to train 1 epoch and 100 epochs on a single GPU? On a single machine with 8 GPUs? On 4 machines × 8 cards?
Answer:
Completing one epoch takes 1280000/128 * 7.2 = 72000 s, i.e., 20 hours. Iterating for 100 epochs on a single GPU therefore takes 20*100 = 2000 hours. Training the model on a single GPU would take almost 3 months (and all your hair would be gone ~).

With 1 machine and 8 cards, the theoretical training time is only 1/8 of the original, i.e., 250 hours; with 4 machines of 8 cards each (multi-machine, multi-card data parallelism), the time shrinks to 1/32 of the original, i.e., 2000/32 = 62.5 hours (which is acceptable).
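For convenience, the arithmetic above can be reproduced with a few lines of Python (assuming the same numbers and ideal linear scaling):

```python
# Reproduce the back-of-the-envelope numbers above (assumes perfectly linear scaling).
images, batch_size, sec_per_batch, epochs = 1_280_000, 128, 7.2, 100
epoch_hours = images / batch_size * sec_per_batch / 3600      # 20 hours per epoch
total_hours = epoch_hours * epochs                            # 2000 hours on 1 GPU
for n_gpus in (1, 8, 32):                                     # 1 GPU, 1x8 cards, 4x8 cards
    print(f"{n_gpus:>2} GPU(s): {total_hours / n_gpus:.1f} hours")
```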

Most of the time, we use data parallelism to train distributed models. In this mode, we hope to speed up model training with more GPUs or machine nodes; the goal is to achieve, through horizontal scaling, the ideal of a linear speedup ratio.

1.2 Model parallelism

What is model parallelism?

Model parallelism is analogous to data parallelism: the parameter matrices of different network layers of the model (or even of a single layer) are partitioned onto different GPU devices, and the model is trained across them.
![Model parallelism](https://img-blog.csdnimg.cn/img_convert/1ee076e02ae5c0e77ae02c649789505a.png)
As can be seen from the figure above, with model parallelism the complete model network is split across different devices (GPU0 and GPU1), and a training step consists of the following steps:
1. A mini-batch is fed to GPU0;
2. The data goes through the forward pass of the network part on GPU0;
3. The output of the previous step is fed to GPU1, where the forward pass continues through the network part on GPU1;
4. GPU1 runs its part of the backward pass;
5. The backward-pass data is passed back to GPU0, which continues the backward pass.
As can be seen, with model parallelism there is no need to synchronize model weight parameters across devices; instead, intermediate data flows between the model parts on each GPU.
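To make this flow concrete, here is a minimal model-parallel sketch (illustrative, not from the original article), assuming PyTorch and two visible GPUs, with the first half of the network on cuda:0 and the second half on cuda:1:

```python
# Minimal model-parallel sketch (assumption: PyTorch, 2 GPUs on one machine).
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part0 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")  # lives on GPU0
        self.part1 = nn.Linear(512, 10).to("cuda:1")                              # lives on GPU1

    def forward(self, x):
        h = self.part0(x.to("cuda:0"))        # steps 1-2: mini-batch forward on GPU0
        return self.part1(h.to("cuda:1"))     # step 3: activation moved to GPU1, forward continues

model = TwoGPUNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(128, 1024)
y = torch.randint(0, 10, (128,), device="cuda:1")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                               # steps 4-5: backward flows GPU1 -> GPU0 automatically
optimizer.step()
```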

Why is model parallelism needed?

In a few cases the model is so large that its parameters cannot fit into the memory of a single GPU (for example, some classification or face recognition networks have a very large num_classes, so the final FC (fully connected) layer has a huge number of parameters). In such cases we can only train via model parallelism, i.e., partition the parameter matrices of the model's layers (or even of a single layer) onto several GPUs for training.

Of course, besides the common data parallelism and model parallelism, there are also hybrid data-model parallelism, pipeline parallelism, and other parallel strategies. This series of articles focuses on distributed, data-parallel model training.

2. Distributed collective communication (Collective communication)

What is collective communication? To understand collective communication, first understand point-to-point (P2P) communication. P2P communication usually takes place between two different processes and is 1-to-1; correspondingly, collective communication is 1-to-many or many-to-many. In distributed systems there is a large amount of collective communication between nodes, and we can use the Message Passing Interface (MPI) to define lower-level message communication behaviors such as Reduce, All reduce, Scatter, Gather, and so on.

A brief history of MPI

Before the 1990s, programmers were not as lucky as we are. Writing concurrent programs for different computing architectures was difficult and tedious. At the time, many software libraries could help with writing concurrent programs, but there was no universally accepted standard for doing so.
At the time, most concurrent programs appeared only in scientific and research fields. The most widely accepted model was the message passing model. What is the message passing model? It simply means that programs accomplish tasks by passing messages between processes (a message can be understood as a data structure carrying some information and data). In practice, concurrent programs are particularly easy to implement with this model. For example, a manager process can assign a job to a worker process by sending it a message describing the job. Another example is a concurrent sorter that sorts the data visible to the current process (we call it local data) and then sends the sorted data to a neighbor process for merging. Almost all parallel programs can be described with the message passing model.
Many software libraries at the time used this message passing model, but with slight differences in their definitions. To solve this problem, the authors of these libraries and others defined a standard for a message passing interface at the Supercomputing 1992 conference — that is MPI. This standard interface allows concurrent programs written by programmers to run on all mainstream concurrency frameworks, and allows them to use the features and models of the popular libraries already in use at the time.
By 1994, a complete interface standard had been defined (MPI-1). We should remember that MPI is _only_ the definition of an interface; it then needs to be implemented by programmers for different architectures. Fortunately, just a year later a complete MPI implementation appeared. After that first implementation, MPI was widely adopted in message passing applications and is still the de facto _standard_ for writing such programs.
![](https://img-blog.csdnimg.cn/img_convert/137edb64f6c781bd14cf84a386e4e2c5.png)
A true portrayal of the first MPI programmers

Quoted from: https://mpitutorial.com/tutorials/mpi-introduction/zh_cn/

MPI, a veteran communication standard in the high performance computing field, defines a series of communication interfaces. They can be implemented on top of multiple programming languages (such as C/C++, Fortran, Java...), and there are popular communication library implementations such as MPICH2 and OpenMPI; these libraries use different code/algorithms to implement the communication patterns defined by the MPI interfaces. Common communication patterns include:

  • Send
  • Receive
  • Broadcast
  • Scatter
  • Gather
  • Reduce
  • All reduce

Below, we explain some of these communication patterns in text and figures; the illustrations come from: https://mpitutorial.com/tutorials/

2.1 Collective communication

Usually, as an algorithm developer you only need to know the top-level APIs that each framework provides; that is enough for distributed model training. But as a framework developer, understanding the common collective communication patterns and algorithms is absolutely necessary, because common distributed communication libraries such as OpenMPI and NCCL all implement a series of algorithms based on the MPI interfaces, so that the nodes in a distributed setting can communicate and transfer data quickly.

Send & Receive

MPI has synchronous, blocking message send and receive interfaces such as MPI_Send and MPI_Recv, as well as nonblocking interfaces such as MPI_Isend and MPI_Irecv. These interfaces define the send and receive methods of the P2P communication pattern. As the most basic communication interfaces in MPI, they can be carried over different underlying communication protocols (which can also be specified).

Note that Send & Receive belong to P2P communication; the reason they are introduced separately here is that they are very basic, and many collective communications can be implemented on top of Send & Receive.
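For illustration, a toy point-to-point exchange, assuming the mpi4py Python bindings (the lowercase send/recv are the pickle-based blocking calls):

```python
# Toy P2P example (assumption: mpi4py installed). Run with: mpirun -np 2 python p2p.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"step": 1, "msg": "hello"}, dest=1, tag=0)   # blocking send to rank 1
elif rank == 1:
    data = comm.recv(source=0, tag=0)                       # blocking receive from rank 0
    print("rank 1 received:", data)
```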

Broadcast & Scatter

In the figures, circles represent independent nodes (processes) in the distributed system, numbered 0-3, 4 nodes in total; the small squares represent data, and the same color means the same data.
![](https://img-blog.csdnimg.cn/img_convert/4369e3b0e3b4c63ec73114e6db539587.png)

broadcast is a broadcasting behavior: when broadcast is executed, data is broadcast from the root node to each of the other specified nodes. scatter is similar to broadcast but is a distribution: the root node's data is partitioned and the partitions are distributed to the other specified nodes.
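A toy example of both patterns, again assuming mpi4py:

```python
# Toy broadcast/scatter example (assumption: mpi4py). Run with: mpirun -np 4 python bcast_scatter.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# broadcast: rank 0's object is copied to every rank
cfg = {"lr": 0.01} if rank == 0 else None
cfg = comm.bcast(cfg, root=0)

# scatter: rank 0's list is split, one chunk per rank
chunks = [list(range(i * 4, i * 4 + 4)) for i in range(size)] if rank == 0 else None
my_chunk = comm.scatter(chunks, root=0)
print(f"rank {rank}: bcast={cfg}, scatter={my_chunk}")
```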

Gather

![](https://img-blog.csdnimg.cn/img_convert/480f4f26b4b24ffa8e35636d80f1c2ce.png)
gather behaves the opposite of scatter and corresponds to collection: the node executing gather collects the corresponding data from the other specified nodes.

All gather

![](https://img-blog.csdnimg.cn/img_convert/704a0b1e7539348483cc778dd91ecb3f.png)
all gather is an enhanced version of gather: it makes every node perform a gather.
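A toy gather / all gather example, assuming mpi4py:

```python
# Toy gather/allgather example (assumption: mpi4py). Run with: mpirun -np 4 python gather.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = rank * 10                       # each rank contributes one value
gathered = comm.gather(local, root=0)   # only the root ends up with the full list
everyone = comm.allgather(local)        # every rank ends up with the full list
print(f"rank {rank}: gather={gathered}, allgather={everyone}")
```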

Reduce

reduce, called a reduction operation, is a general term for a family of operations, which includes SUM, MIN, MAX, PROD, LOR, and so on. reduce means to reduce/condense: the operation takes an array of input elements from each process and, after applying the operation, produces a smaller, condensed set of elements. For example, the Reduce sum below:

![](https://img-blog.csdnimg.cn/img_convert/94a8d8585c413e515c2c3fcc18cfb330.png)
![](https://img-blog.csdnimg.cn/img_convert/6bb3cc376100068e573b2159f8421803.png)

All reduce

reduce is a general term for a family of operations; all reduce likewise applies the same reduce operation across all node processes.
All reduce sum:
![](https://img-blog.csdnimg.cn/img_convert/3b2023e75f73bd3a158e6af228119ead.png)
As the diagram shows, an all reduce operation can be completed by a reduce onto a single node followed by a broadcast.
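A toy reduce / all reduce sum matching the figures above, assuming mpi4py:

```python
# Toy reduce/allreduce example (assumption: mpi4py). Run with: mpirun -np 4 python reduce.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = rank + 1                                      # ranks hold 1, 2, 3, 4
on_root = comm.reduce(local, op=MPI.SUM, root=0)      # only the root gets the sum (10)
everywhere = comm.allreduce(local, op=MPI.SUM)        # every rank gets the sum (10)
print(f"rank {rank}: reduce={on_root}, allreduce={everywhere}")
```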

2.2 Communication libraries

Open MPI

In the words of its official website: the Open MPI project is an open source MPI (Message Passing Interface) implementation developed and maintained by a consortium of academic, research, and industry partners. Open MPI therefore combines the expertise, technologies, and resources from across the high performance computing community to build the best MPI library available.

Gloo

Gloo is a collective communication library open-sourced by Facebook. It provides collective communication algorithms useful in machine learning, such as barrier, broadcast, and allreduce.

NCCL

NCCL is a collective communication library released by NVIDIA for NVIDIA GPUs. As described on its official website: the NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. NCCL provides implementations of all-gather, all-reduce, broadcast, reduce, reduce-scatter, and so on, optimized to achieve high bandwidth and low latency over high-speed interconnects such as PCIe and NVLink.

Because MPI targets general distributed environments while NCCL is tailored by NVIDIA to its own hardware, NCCL can apply more targeted and more convenient optimizations; on NVIDIA hardware, NCCL therefore often performs better than traditional MPI.

Horovod

Horovod is not itself a communication library; it is a distributed training framework for deep learning that wraps the underlying communication libraries.
Horovod is open-sourced by Uber and is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Horovod's goal is to make distributed deep learning fast and easy to use, and its underlying data communication can be backed by MPI, Gloo, or NCCL. Usually, on NVIDIA GPUs the Horovod+NCCL combination delivers high performance and good speedup ratios for distributed deep learning training. Although Horovod has greatly simplified distributed training for deep learning, supporting it requires installing communication libraries such as MPI and NCCL, and some model training code still has to be modified by hand, as in the sketch below.
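A minimal Horovod + PyTorch sketch (assuming Horovod built with NCCL support; the toy model and learning-rate scaling are illustrative):

```python
# Minimal Horovod + PyTorch sketch (assumption: horovod with NCCL, CUDA GPUs available).
# Launch with e.g.: horovodrun -np 8 python train_hvd.py
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())   # scale lr with worker count

# Wrap the optimizer so gradients are averaged via allreduce, and sync the initial state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

x = torch.randn(128, 1024).cuda()
y = torch.randint(0, 10, (128,)).cuda()
loss = torch.nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()        # gradient allreduce happens under the hood
optimizer.step()
```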

3. Distributed training and All reduce

3.1 What is the relationship between them?

![Data parallelism](https://img-blog.csdnimg.cn/img_convert/e9602bb39290f34b99fffa5527c0a590.png)
In section 1.1, we saw that the main workflow of distributed (data parallel) deep learning training is roughly divided into 3 steps:

  • **Data partitioning**

      Different GPU devices are assigned different mini-batches as their training data.

  • **Forward + backward**

      Different GPU devices run the same model and train (forward and backward propagation) with the mini-batch each of them received.

  • **Gradient synchronization and update**

      Each GPU device obtains gradients (weight updates) after training on its mini-batch; these values need to be aggregated and then applied on every GPU device, so that after each iteration the model on every GPU device stays exactly the same.


We can see that step 3, gradient synchronization and update, covers the whole process of collecting gradients from each node, aggregating them, and updating every node. Taken together, this is exactly an all reduce process, more specifically an all reduce sum operation: through all reduce sum, the gradient values of all nodes are summed and the result is then distributed back to each node. It is clear that distributed deep learning training and all reduce are very closely related.
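As an illustration of step 3 only (DDP and Horovod do the equivalent automatically), here is a hand-rolled gradient all reduce, assuming torch.distributed has already been initialized:

```python
# Hand-rolled gradient synchronization (assumption: torch.distributed already initialized).
import torch.distributed as dist

def sync_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)   # sum gradients over all ranks
            param.grad /= world_size                            # average to match single-GPU semantics
```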

3.2 Which all reduce is best?

3.2.1 OpenMPI

MPI implementations contain all kinds of all reduce algorithms. In the code of the latest OpenMPI-4.0.5 (openmpi-4.0.5/ompi/mca/coll/tuned/coll_tuned_allreduce_decision.c), we can see 7 different all reduce algorithm implementations:

    {0, "ignore"},
    {1, "basic_linear"},
    {2, "nonoverlapping"},
    {3, "recursive_doubling"},
    {4, "ring"},
    {5, "segmented_ring"},
    {6, "rabenseifner"},

![all-reduce-alg.png](https://img-blog.csdnimg.cn/img_convert/d44df1fa75f63f65d7e3abd531ffc37e.png)
In the distributed training environment of deep learning, the ring all reduce algorithm performs particularly well: it makes full use of each node's bandwidth and reduces communication time. For a more detailed comparison and analysis of these algorithms, see the Tencent Jizhi team's share: "The past and present of AllReduce algorithms".
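To get a feel for the algorithm, here is a pure-Python toy simulation of ring all reduce (sum) on N nodes, each holding a vector split into N chunks. Real libraries overlap communication along an actual ring, so this is only an illustration of the reduce-scatter and all-gather phases:

```python
# Toy simulation of ring allreduce (sum); illustrative only, not how OpenMPI/NCCL implement it.
def ring_allreduce(chunks_per_node):
    n = len(chunks_per_node)                 # n nodes, each holding n chunks (plain numbers here)
    # Phase 1: reduce-scatter. After n-1 steps, node i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            dst, c = (i + 1) % n, (i - step) % n        # chunk passed along the ring this step
            chunks_per_node[dst][c] += chunks_per_node[i][c]
    # Phase 2: all-gather. After n-1 more steps, every node holds every fully reduced chunk.
    for step in range(n - 1):
        for i in range(n):
            dst, c = (i + 1) % n, (i + 1 - step) % n    # fully reduced chunk circulated around
            chunks_per_node[dst][c] = chunks_per_node[i][c]
    return chunks_per_node

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # every node ends with [12, 15, 18]
```

Each node sends and receives roughly 2 * (N - 1) * (message_size / N) data in total, independent of the number of nodes, which is why the ring pattern makes such good use of per-node bandwidth.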

3.2.2 NCCL All reduce

NVIDIA released NCCL in 2015, a collective communication library built for its own hardware (with both open and closed source components). Its basic algorithmic principles are largely similar to those of the MPI implementations, but because it is built entirely around NVIDIA's own hardware it can be optimized thoroughly, so on NVIDIA GPUs, NCCL delivers very strong performance.

NCCL vs. MPI

NCCL follows MPI in defining a series of communication primitives with different names but similar functionality. It heavily optimizes the interfaces commonly used for GPUs and does not implement the rest, so it is somewhat like a strong subset of MPI specialized for GPU communication, though it does not adopt MPI's unified interface. From the NCCL official documentation (Overview, and NCCL and MPI) we can see that:

1. NCCL is a library of collective communication primitives between many GPUs.
These primitives are topology-aware and easy to integrate into applications. NCCL's collective communication algorithms employ many coordinated processors to aggregate data;

2. NCCL is not a full-fledged collective communication framework; it is more like a lib
that provides primitives for implementing and accelerating collective communication. NCCL currently supports the following collective communication operations:

  • AllReduce
  • Broadcast
  • Reduce
  • AllGather
  • ReduceScatter

3. NCCL can easily be used in combination with MPI.
NCCL is similar to MPI, so creating NCCL communicators from an MPI communicator is straightforward. It is therefore easy to use MPI for CPU-to-CPU communication and NCCL for GPU-to-GPU communication (however, when using NCCL inside an MPI program, some MPI implementation details can cause problems, such as deadlocks).

3.2.3 Baidu Ring Allreduce

Although various MPI implementations (such as OpenMPI) have long contained the Ring All reduce algorithm, introducing it into deep learning was pioneered by Baidu. In 2016, in the article "Bringing HPC Techniques to Deep Learning", Baidu introduced Ring All reduce, a technique from high performance distributed computing, into deep learning (contributing code to TensorFlow with an MPI-based implementation of ring all reduce primitives) and achieved a significant performance improvement.
See also Zhihu: [translation] Bringing HPC Techniques to Deep Learning
Code :https://github.com/baidu-research/baidu-allreduce

3.2.4 Other all reduce variants

  • All reduce based on double binary trees

Two-Tree Algorithms for Full Bandwidth Broadcast, Reduction and Scan
Double binary trees were introduced into MPI in 2009, and the implementation was later also adopted in NCCL 2.4: https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/#ref3

  • Hierarchical ring all reduce

《Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes》

  • All reduce based on spanning trees

《Blink: Fast and Generic Collectives for Distributed ML》

Copyright notice
This article was written by [ONEFLOW deep learning framework]. When reposting, please include a link to the original. Thanks.
https://chowdera.com/2020/12/20201208144444898p.html