---
title: "Large Language Models"
date: 2024-04-03
show: false
---

## Decoding Strategies
Top k sampling is a decoding strategy in which, at each step, the model samples only from the k most probable tokens.

In top p sampling we select the most probable words until their cumulative probability reaches p, and the probability of the remaining words is set to zero. The value of p is usually set to 0.9 or 0.95. This prevents the model from sampling words with very low probability, which tends to produce gibberish text.
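
As a rough illustration, here is a minimal NumPy sketch of top p (nucleus) sampling; the toy probabilities and the cutoff `p = 0.9` are made up for the example.

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token id using top-p (nucleus) sampling."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]               # tokens sorted by probability, descending
    sorted_probs = probs[order]
    cutoff = np.searchsorted(np.cumsum(sorted_probs), p) + 1   # smallest prefix with mass >= p
    kept = order[:cutoff]
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize the nucleus
    return int(rng.choice(kept, p=kept_probs))

# Toy next-token distribution over a 6-token vocabulary
probs = np.array([0.45, 0.25, 0.15, 0.08, 0.05, 0.02])
print(top_p_sample(probs, p=0.9))
```
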
## Memory bound vs Latency bound layers

Let's take the example of the GPT OSS 120B model.
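
One common way to reason about this is the roofline model: compare a layer's arithmetic intensity (FLOPs per byte moved) with the GPU's FLOPs-to-bandwidth ratio; layers below that ridge point are limited by memory traffic rather than by math. The numbers in this sketch (FP16 weights, roughly H100-class peak throughput and bandwidth) are illustrative assumptions, not GPT OSS 120B measurements.

```python
def gemm_arithmetic_intensity(batch: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for y = x @ W with x: (batch, d_in) and W: (d_in, d_out)."""
    flops = 2 * batch * d_in * d_out                                   # each MAC counts as 2 FLOPs
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved

PEAK_FLOPS = 1000e12   # ~1 PFLOP/s FP16 (illustrative)
BANDWIDTH = 3.35e12    # ~3.35 TB/s HBM (illustrative)
ridge = PEAK_FLOPS / BANDWIDTH   # ~300 FLOPs/byte

for batch in (1, 16, 256, 1024):
    ai = gemm_arithmetic_intensity(batch, d_in=4096, d_out=4096)
    regime = "math limited" if ai > ridge else "memory limited"
    print(f"batch={batch:5d}  intensity={ai:7.1f} FLOPs/byte -> {regime}")
```
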
<!-- Let's take the example of LLAMA 7B model with $n_{layers} = 32$ , $d_{embed} = 4096$ and $N_{vocab} = 50000$. We can approximate the total number of parameters in the model as follows:

$R_i^T R_j = R_{j-i}$ is only dependent on the relative position of the words.

### Batch Norm
In a feed-forward layer, the output of a neuron is normalized across the batch. For images, one channel (an H×W feature map) is normalized across the batch. A running average of the mean and variance of each neuron or channel is kept for use at inference time.
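
A minimal NumPy sketch of train-time batch norm for an `(N, C, H, W)` image batch, just to make the reduction axes and the running statistics explicit; the learnable scale and shift are omitted.

```python
import numpy as np

def batch_norm_2d(x, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Batch norm at train time: statistics per channel, computed across the whole batch."""
    mean = x.mean(axis=(0, 2, 3))                     # one mean per channel
    var = x.var(axis=(0, 2, 3))                       # one variance per channel
    running_mean = (1 - momentum) * running_mean + momentum * mean   # kept for inference
    running_var = (1 - momentum) * running_var + momentum * var
    x_hat = (x - mean[None, :, None, None]) / np.sqrt(var[None, :, None, None] + eps)
    return x_hat, running_mean, running_var

x = np.random.default_rng(0).normal(size=(8, 3, 4, 4))   # batch of 8 images, 3 channels
y, rm, rv = batch_norm_2d(x, running_mean=np.zeros(3), running_var=np.ones(3))
```
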
#### Instance Norm
Similar to BatchNorm (normalization is done over a single channel) but over only one image instead of the whole batch. It keeps the samples' features independent of each other, which improves image variability. There is no feed-forward equivalent: without a batch dimension there is only a single neuron output to normalize. No running average needs to be kept.
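
Compared to the batch norm sketch above, the only change is the reduction axes: statistics are computed per sample and per channel, so no batch (and no running average) is needed.

```python
import numpy as np

def instance_norm_2d(x, eps=1e-5):
    """Instance norm: each sample and each channel is normalized on its own."""
    mean = x.mean(axis=(2, 3), keepdims=True)   # shape (N, C, 1, 1)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

y = instance_norm_2d(np.random.default_rng(0).normal(size=(8, 3, 4, 4)))
```
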
### Layer Norm
Normalization is done across the layer for one data sample, i.e. over the output of the feed-forward network. For images, it normalizes across all the channels of one sample, which is the same as instance norm applied across all channels at once. In a transformer, given a tensor of shape (B, N, D) where B is the batch size, N the number of tokens and D the dimension of each token, normalization is done across the D dimension, so the tokens don't interact with each other. Unlike instance norm and batch norm, it applies an element-wise affine operation to the normalized output, meaning each of the D positions in a token has its own learnable scale and shift. No running average needs to be kept.
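
A minimal sketch of the transformer case described above: a `(B, N, D)` tensor is normalized over its last dimension only, followed by the element-wise affine with `D` learnable scales and shifts.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm over the last dimension: each token is normalized independently."""
    mean = x.mean(axis=-1, keepdims=True)       # per-token mean, shape (B, N, 1)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta                 # element-wise affine over the D dimension

B, N, D = 2, 5, 8
x = np.random.default_rng(0).normal(size=(B, N, D))
y = layer_norm(x, gamma=np.ones(D), beta=np.zeros(D))
```
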
### Group Norm
Somewhere in between LayerNorm and InstanceNorm, it assumes that some channels share similar features and should be normalized together, instead of only one channel or all of them. The groups are simply adjacent channels, e.g. 32 channels can be split into groups of 8. It works well for small batch sizes, roughly $\in (1, 8)$. (In the usual illustration, H and W are flattened so the 4D tensor can be drawn as a 3D one.)
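
A minimal sketch: the channels are reshaped into adjacent groups and each group is normalized per sample. With a single group this reduces to LayerNorm over the image, and with one channel per group it reduces to InstanceNorm.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group norm on an (N, C, H, W) tensor with C divisible by num_groups."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)   # (N, G, C/G, H, W)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)          # statistics per sample, per group
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.default_rng(0).normal(size=(2, 32, 4, 4))
y = group_norm(x, num_groups=4)                            # 32 channels -> 4 groups of 8
```
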
### RMSNorm
Similar to LayerNorm, but the input is only divided by its RMS rather than being centered by the mean and scaled by the variance. If the output of a feed-forward layer is $A = [a_1, a_2, \dots, a_n]$, then the output of RMSNorm is:
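
Using the standard definition (with $g_i$ a learnable gain, analogous to LayerNorm's scale):

$$
\mathrm{RMSNorm}(a_i) = \frac{a_i}{\mathrm{RMS}(A)}\, g_i, \qquad \mathrm{RMS}(A) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} a_i^2}
$$
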

<!-- * Works on both Ampere and Hopper (Not optimized for Hopper)
* Missing support for features like Paged KV cache, not worth supporting -->

## Smooth Quant, llm.int8, AWQ
Smooth Quant is good for compute bound systems (high batch size), but edge inference ...
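
For intuition, here is a minimal NumPy sketch of the core SmoothQuant idea: per-channel scales `s_j = max|X_j|^alpha / max|W_j|^(1-alpha)` migrate the activation outliers into the weights while keeping the product mathematically unchanged. The toy data and `alpha = 0.5` are assumptions for the example.

```python
import numpy as np

def smoothquant_scales(X: np.ndarray, W: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel smoothing scales, one per column of X / row of W."""
    act_max = np.abs(X).max(axis=0)        # per-channel activation range, shape (d_in,)
    w_max = np.abs(W).max(axis=1)          # per-input-channel weight range, shape (d_in,)
    return (act_max ** alpha) / (w_max ** (1 - alpha))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)) * np.array([1.0, 50.0, 1.0, 1.0])   # one outlier channel
W = rng.normal(size=(4, 3))

s = smoothquant_scales(X, W)
X_smooth, W_smooth = X / s, W * s[:, None]    # Y = (X / s) @ (diag(s) @ W) == X @ W
assert np.allclose(X @ W, X_smooth @ W_smooth)
print(np.abs(X_smooth).max(axis=0))           # the outlier channel is now much tamer
```
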
## FP8 (W8A8)
## NVFP4 (W4A4)
NVFP4 does the GEMM as A4 x W4 --> O16. We need to quantize both weights and activations to FP4, and the representable range of FP4 (E2M1) is approximately `(-6, 6)`.

The scaling factor for quantization is `s = m / 6`, where `m = max(W)` is the max value of the weight/activation block (we use a block size of 16, so every 16 values share a scaling factor). We store `s` in `fp8 (e4m3)`, giving an average of `(4 * 16 + 8) / 16 = 4.5` bits per weight. (Using FP16 or FP32 for the scaling factor would make it 5-bit or 6-bit quantization respectively.)

Storing vanilla `s` leads to a large drop in precision, so we scale it as `s = SF * m / 6` before storing it in FP8. The idea is to keep `s` close to 1, where FP8 precision is highest. To quantize the weights we compute `outputScale = 1 / (s / SF)` and `x_4 = x * outputScale = x * SF / s`. To dequantize, we use `x = (s / SF) * x_4`.
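
A rough NumPy sketch of the blockwise scheme, assuming the FP4 (E2M1) representable magnitudes `{0, 0.5, 1, 1.5, 2, 3, 4, 6}` and ignoring the FP8 rounding of the scale itself:

```python
import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
BLOCK = 16

def quantize_block(x: np.ndarray):
    """Quantize one block of 16 values to FP4 with a shared scale s = max|x| / 6."""
    s = np.abs(x).max() / 6.0                     # blockwise scale (stored in e4m3 in practice)
    scaled = np.abs(x) / s                        # magnitudes now in [0, 6]
    idx = np.abs(scaled[:, None] - FP4_LEVELS[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_LEVELS[idx], s        # nearest representable FP4 value, plus scale

x = np.random.default_rng(0).normal(size=BLOCK).astype(np.float32)
q, s = quantize_block(x)
x_hat = q * s                                     # dequantize
print("max abs error:", np.abs(x - x_hat).max())
```
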
In TRTLLM, the weights are stored in FP4 along with `weight_scaling_factor_2 = SF` and `weight_scaling_factor`, which holds the blockwise FP8 scales. The activations coming from the previous layer are often in FP16, so we need to quantize them before the GEMM. The blockwise scaling factors `s` for the activations are calculated dynamically, i.e. we only have the `activation_scaling_factor` (aka `global_activation_scaling_factor`) beforehand. We also compute `alpha = activation_scaling_factor * weight_scaling_factor_2`.

The FP4 Tensor Core computes `A = W4 * A4 * weight_scaling_factor * blockwise_activation_scaling_factor`, followed by `A_16 = A * alpha`.

## GPT OSS 120B example
- Input GEMM (A4, B4 -> O16)
- QKNorm (I16 -> O16)
- Attention (QKV8 -> FP16) (How to convert from FP16 to FP8?)
- CVT FP8 to FP4
- OProj GEMM (A4, B4 -> O16)
- RMS Norm (I16 -> O4)
- Dense/MoE MLP Up (I4 -> O16)
- Silu & Mul (I16 -> O4)
- Dense/MoE MLP Down (I4 -> O16)
- RMS Norm (I16 -> O4)
- Repeat (replacing Dense with MoE MLP)

## FP8 vs INT8
Qualcomm [whitepaper](https://www.qualcomm.com/news/onq/2023/04/floating-point-arithmetic-for-ai-inference-hit-or-miss) shows that the hardware implementation of the FP8 format is somewhere between 50% and 180% less efficient than INT8 in terms of chip area and energy usage. This is because of the additional logic needed in the accumulation of FP formats versus integer formats. This seems like a broad range, but the actual efficiency depends on many hardware design choices that vary greatly. A similar conclusion was reached recently by Microsoft and Meta: floating-point arithmetic is just much less efficient than integer arithmetic.

This means that FP8 will have to be significantly more accurate than INT8 to be worthwhile from a hardware-efficiency perspective.

FP8 is only supported on H100 GPUs, but storing approximations in FP8 can be more accurate than vanilla INT8 quantization. The recent QLoRA paper explores different data types, 4-bit Float and 4-bit NormalFloat, which again are only used for storage and not for computation.
## Quantizing bias
Biases are not quantized because, to preserve the accuracy of a typical addmm operation, they would have to be quantized with a scale equal to the product of the input and weight scales. That product is a ridiculously small scale, which in turn requires a very high bitwidth to avoid clipping.
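
A tiny worked example with made-up scales shows why; the numbers are purely illustrative.

```python
input_scale, weight_scale = 0.02, 0.001    # assumed per-tensor scales
bias_scale = input_scale * weight_scale    # 2e-5: the scale the bias must share with the accumulator
bias_value = 3.0
print(round(bias_value / bias_scale))      # 150000 -> overflows int8/int16, needs int32 or wider
```
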