Computational challenges of training LLMs
One of the most common issues you'll encounter when you try to train large language models is running out of memory. If you've ever tried training, or even just loading, your model on Nvidia GPUs, the CUDA out-of-memory error message might look familiar. CUDA, short for Compute Unified Device Architecture, is a collection of libraries and tools developed for Nvidia GPUs. Libraries such as PyTorch and TensorFlow use CUDA to boost performance on matrix multiplication and other operations common to deep learning. You'll encounter these out-of-memory issues because most LLMs are huge and require a ton of memory to store and train all of their parameters. Let's do some quick math to develop intuition about the scale of the problem.
A single parameter is typically represented by a 32-bit float, which is a way computers represent real numbers. You'll see more details about how numbers get stored in this format shortly. A 32-bit float takes up four bytes of memory. So to store one billion parameters you'll need four bytes times one billion parameters, or four gigabytes of GPU RAM at 32-bit full precision. This is a lot of memory, and note that it only accounts for the memory to store the model weights so far. If you want to train the model, you'll have to plan for additional components that use GPU memory during training. These include two Adam optimizer states, gradients, activations, and temporary variables needed by your functions. This can easily lead to 20 extra bytes of memory per model parameter. In fact, to account for all of this overhead during training, you'll actually require approximately six times the amount of GPU RAM that the model weights alone take up. To train a one billion parameter model at 32-bit full precision, you'll need approximately 24 gigabytes of GPU RAM. This is definitely too large for consumer hardware, and even challenging for hardware used in data centers, if you want to train with a single processor.
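The arithmetic above can be written out as a short Python sketch. The constants are just the rough per-parameter figures quoted in this section (4 bytes per weight, about 20 extra bytes during training), the function names are made up for illustration, and a gigabyte is taken as 10^9 bytes to match the rounding in the text.

```python
# Back-of-the-envelope GPU memory estimate using the figures quoted above.
BYTES_PER_FP32_WEIGHT = 4    # one 32-bit float per parameter
EXTRA_TRAINING_BYTES = 20    # rough overhead: optimizer states, gradients, activations, temporaries
GB = 1e9                     # decimal gigabyte (assumption, matches the text's rounding)


def weights_gb(num_params):
    """Memory needed just to hold the weights at FP32."""
    return num_params * BYTES_PER_FP32_WEIGHT / GB


def training_gb(num_params):
    """Rough memory needed to train at FP32, including per-parameter overhead."""
    return num_params * (BYTES_PER_FP32_WEIGHT + EXTRA_TRAINING_BYTES) / GB


one_billion = 1_000_000_000
print(weights_gb(one_billion))    # 4.0
print(training_gb(one_billion))   # 24.0
```

The ratio of the two results is the factor of six mentioned above. Real training runs vary, since activation memory depends on batch size and sequence length, so treat this as an order-of-magnitude estimate only.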
What options do you have to reduce the memory required for training? One technique that you can use to reduce the memory is called quantization. The main idea here is that you reduce the memory required to store the weights of your model by reducing their precision from 32-bit floating point numbers to 16-bit floating point numbers, or eight-bit integer numbers. The corresponding data types used in deep learning frameworks and libraries are FP32 for 32-bit full precision, FP16 or BFLOAT16 for 16-bit half precision, and INT8 for eight-bit integers. The range of numbers you can represent with FP32 goes from approximately -3*10^38 to 3*10^38. By default, model weights, activations, and other model parameters are stored in FP32. Quantization statistically projects the original 32-bit floating point numbers into a lower precision space, using scaling factors calculated based on the range of the original 32-bit floating point numbers.
Let's look at an example. Suppose you want to store Pi to six decimal places in different precisions. Floating point numbers are stored as a series of bits, zeros and ones. The 32 bits used to store numbers in full precision with FP32 consist of one bit for the sign, where zero indicates a positive number and one a negative number, then eight bits for the exponent of the number, and 23 bits representing the fraction of the number. The fraction is also referred to as the mantissa, or significand. It represents the precision bits of the number. If you convert the 32-bit floating point value back to a decimal value, you notice the slight loss in precision. For reference, here's the real value of Pi to 19 decimal places. Now, let's see what happens if you project this FP32 representation of Pi into the FP16, 16-bit lower precision space. The 16 bits consist of one bit for the sign, as you saw for FP32, but now FP16 only assigns five bits to represent the exponent and 10 bits to represent the fraction. Therefore, the range of numbers you can represent with FP16 is vastly smaller, from negative 65,504 to positive 65,504. The original FP32 value gets projected to 3.140625 in the 16-bit space. Notice that you lose some precision with this projection. There are only six places after the decimal point now. You'll find that this loss in precision is acceptable in most cases because you're trying to optimize for memory footprint. Storing a value in FP32 requires four bytes of memory. In contrast, storing a value in FP16 requires only two bytes of memory, so with quantization you have reduced the memory requirement by half.
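To make the bit layout concrete, here is a small sketch that uses only Python's standard `struct` module, whose `f` and `e` format codes pack a number as FP32 and FP16 respectively. The helper names `fp32_fields` and `round_trip` are made up for this illustration.

```python
import math
import struct


def fp32_fields(x):
    """Split the FP32 encoding of x into (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction


def round_trip(x, fmt):
    """Store x in the given struct format ('f' = FP32, 'e' = FP16) and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


sign, exponent, fraction = fp32_fields(math.pi)
print(sign, f"{exponent:08b}", f"{fraction:023b}")   # the three bit fields of Pi in FP32
print(round_trip(math.pi, "f"))                      # 3.1415927410125732
print(round_trip(math.pi, "e"))                      # 3.140625
print(struct.calcsize("f"), struct.calcsize("e"))    # 4 2 (bytes per value)
```

The FP32 round trip already differs from Pi after the seventh decimal place, and the FP16 round trip lands on 3.140625, the value quoted above.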
The AI research community has explored ways to optimize 16-bit quantization. One datatype in particular, BFLOAT16, has recently become a popular alternative to FP16. BFLOAT16, short for Brain Floating Point Format, was developed at Google Brain and has become a popular choice in deep learning. Many LLMs, including FLAN-T5, have been pre-trained with BFLOAT16. BFLOAT16, or BF16, is a hybrid between half precision FP16 and full precision FP32. BF16 significantly helps with training stability and is supported by newer GPUs such as NVIDIA's A100. BFLOAT16 is often described as a truncated 32-bit float, as it captures the full dynamic range of the full 32-bit float but uses only 16 bits. BFLOAT16 uses the full eight bits to represent the exponent, but truncates the fraction to just seven bits. This not only saves memory, but also increases model performance by speeding up calculations. The downside is that BF16 is not well suited for integer calculations, but these are relatively rare in deep learning.
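The "truncated 32-bit float" description can be sketched directly: keep the top 16 bits of the FP32 encoding and drop the rest. This is a simplified illustration with made-up helper names; it truncates rather than rounds, whereas real hardware and libraries typically round to nearest, so it should not be read as how any particular framework converts values.

```python
import math
import struct


def to_bf16_bits(x):
    """Return the 16 bits kept when an FP32 value is truncated to BF16 (no rounding)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16    # sign (1) + exponent (8) + top 7 fraction bits


def from_bf16_bits(bits16):
    """Expand 16 BF16 bits back to a float by zero-filling the dropped fraction bits."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


def fits_in_fp16(x):
    """True if x can be packed as FP16 without overflowing."""
    try:
        struct.pack("e", x)
        return True
    except OverflowError:
        return False


pi_bf16 = from_bf16_bits(to_bf16_bits(math.pi))
print(pi_bf16)                               # 3.140625

big = 1e38                                   # within FP32 range, far beyond FP16's 65,504
print(fits_in_fp16(big))                     # False
print(from_bf16_bits(to_bf16_bits(big)))     # still on the order of 1e38
```

With only seven fraction bits, Pi comes out as 3.140625 here, while a value near 10^38 that overflows FP16 survives the trip through BF16 because the eight exponent bits are untouched.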
For completeness, let's have a look at what happens if you quantize Pi from the 32-bit into the even lower precision eight-bit space. If you use one bit for the sign, INT8 values are represented by the remaining seven bits. This gives you a range from negative 128 to positive 127, and unsurprisingly Pi gets projected to 2 or 3 in the 8-bit lower precision space. This brings the memory requirement down from the original four bytes to just one byte, but obviously results in a pretty dramatic loss of precision.
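The idea of projecting floats into INT8 with a scaling factor derived from their range can be sketched as follows. This uses a single scale based on the largest absolute value, which is one common and simple scheme chosen here for illustration; it is an assumption, not necessarily what any given framework does, and the weight values are made up.

```python
import math


def quantize_int8(values):
    """Map floats to INT8 codes using one scale taken from the largest magnitude.

    Assumes at least one value is non-zero.
    """
    scale = 127 / max(abs(v) for v in values)    # scaling factor from the value range
    codes = [max(-128, min(127, round(v * scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from INT8 codes."""
    return [c / scale for c in codes]


weights = [math.pi, -1.0, 0.5, 10.0]             # made-up example values
codes, scale = quantize_int8(weights)
print(codes)                                     # [40, -13, 6, 127]
print(dequantize_int8(codes, scale))             # approximations of the original values
print(round(math.pi))                            # 3: a plain cast with no scaling
```

Each value now fits in one byte, and the recovered Pi is only accurate to within a few hundredths, which shows the trade between memory and precision described above.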
Let's summarize what you've learned here and emphasize the key points you should take away from this discussion. Remember that the goal of quantization is to reduce the memory required to store and train models by reducing the precision of the model weights. Quantization statistically projects the original 32-bit floating point numbers into lower precision spaces using scaling factors calculated based on the range of the original 32-bit floats. Modern deep learning frameworks and libraries support quantization-aware training, which learns the quantization scaling factors during the training process. The details of this process are beyond the scope of this course. But you've seen the key point here: you can use quantization to reduce the memory footprint of the model during training. BFLOAT16 has become a popular choice of precision in deep learning as it maintains the dynamic range of FP32, but reduces the memory footprint by half. Many LLMs, including FLAN-T5, have been pre-trained with BFLOAT16. Look out for a mention of BFLOAT16 in next week's lab.
Now let's return to the challenge of fitting models into GPU memory and take a look at the impact quantization can have. By applying quantization, you can reduce the memory required to store the model parameters to only two gigabytes using 16-bit half precision, a 50% saving, and you could further reduce the memory footprint by another 50% by representing the model parameters as eight-bit integers, which requires only one gigabyte of GPU RAM. Note that in all these cases you still have a model with one billion parameters. As you can see, the circles representing the models are the same size. Quantization will give you the same degree of savings when it comes to training. However, many models now have sizes in excess of 50 billion or even 100 billion parameters, meaning you'd need up to 500 times more memory capacity to train them, tens of thousands of gigabytes. These enormous models dwarf the one billion parameter model we've been considering, shown here to scale on the left.
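As a quick check on these numbers, the sketch below tabulates the memory needed just to store the weights at each precision, for the model sizes mentioned in this section. The dictionary and function names are illustrative, and a gigabyte is again taken as 10^9 bytes.

```python
# Bytes needed to store one parameter at each precision discussed above.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BFLOAT16": 2, "INT8": 1}
GB = 1e9    # decimal gigabyte (assumption, matches the text's rounding)


def storage_gb(num_params, dtype):
    """Memory needed to store the weights alone, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / GB


for num_params in (1_000_000_000, 50_000_000_000, 100_000_000_000):
    row = {dtype: storage_gb(num_params, dtype) for dtype in BYTES_PER_PARAM}
    print(f"{num_params:>15,d} params: {row}")
```

For one billion parameters this gives 4, 2, 2, and 1 gigabytes, and the figures scale linearly with parameter count, which is why even INT8 storage for a 100 billion parameter model runs to around 100 gigabytes before any training overhead is counted.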
As models scale beyond a few billion parameters, it becomes impossible to train them on a single GPU. Instead, you'll need to turn to distributed computing techniques to train your model across multiple GPUs. This could require access to hundreds of GPUs, which is very expensive, and is another reason why you won't pre-train your own model from scratch most of the time. However, an additional training process called fine-tuning, which you'll learn about next week, also requires storing all training parameters in memory, and it's very likely you'll want to fine-tune a model at some point. To help you understand more about the technical aspects of training across GPUs, we've prepared an optional video for you. It's very detailed, but it will help you understand some of the options that exist for developers like you to train larger models. You should feel free to skip this video. But if you're interested in learning more, I hope you'll check it out.