当前位置:网站首页>Implementation of multi GPU distributed training with horovod in Amazon sagemaker pipeline mode

Implementation of multi GPU distributed training with horovod in Amazon sagemaker pipeline mode

2020-11-07 20:15:43 InfoQ

Original URL: https://aws.amazon.com/cn/blogs/machine-learning/multi-gpu-and-distributed-training-using-horovod-in-amazon-sagemaker-pipe-mode/

At present , We can use a variety of techniques to train deep learning models with a small amount of data , It includes transfer learning for image classification tasks 、 Small sample learning and even one-time learning , It can also be based on pre training BERT or GPT2 Models fine tune language models . however , In some application cases, we still need to introduce a lot of training data . for example , If the current image and ImageNet The images in the dataset are completely different , Or is the current language corpus only for specific areas 、 It's not a generic type , So it's very difficult for transfer learning to bring about the ideal model performance . As a deep learning researcher , You may need to try new ideas or approaches from scratch . under these circumstances , We have to use large datasets to train large deep learning models ; Without finding the best way to train , The whole process can take a few days 、 Weeks, even months .

In this paper , We'll learn how to do it together Amazon SageMaker Run many on a single instance of GPU Training , And discuss how to do it in Amazon SageMaker On the implementation of more efficient GPU And multi node distributed training .

Link to the original text :【https://www.infoq.cn/article/0867pYEmzviBfvZxW37k】. Without the permission of the author , Prohibited reproduced .

版权声明
本文为[InfoQ]所创,转载请带上原文链接,感谢