TL;DR
- LLM inference frameworks have hit the “memory wall”, which is a hardware-imposed speed limit on memory-bound code. That means LLM application developers don’t need to worry about evaluating all of the nuances of different frameworks. They just need to understand where the memory wall is for their system, pick a framework that gets close to it, and move on.
- Requests/s and tokens/s claims can be misleading. In the MLPerf benchmark, the server and offline scenarios achieve much higher requests/s than single stream. LLM developers should understand how requests/s is calculated when selecting an inference framework.
- There’s a lot of noise about inference optimizations like quantization and sparsity right now. While these are indeed two of the highest impact optimizations for deep learning, they should be used with caution. Aggressively quantizing or pruning LLMs can significantly reduce their accuracy, and you might be better off using a smaller model in the first place.
- We usually recommend using well-validated published models in the format that they are published in. For example, Meta publishes Llama 3.1 8B in bfloat16 format without sparsity; Mistral publishes Mixtral-8x22B-v0.1 in bfloat16 with 8-way sparsely gated mixture-of-experts sparsity. Expert teams may be able to quantize or prune more aggressively, but we don’t recommend that most users do this given the high difficulty.
- We built our inference engine with these considerations in mind.
- We run on AMD MI250 and MI300X GPUs and Nvidia H100 GPUs, so our single-stream memory wall for Llama 3.1 8B in bfloat16 is 200 tokens/sec, 331 tokens/sec, and 209 tokens/sec respectively.
- We optimize for the MLPerf Server scenario because it offers the best throughput and is the most flexible. We can also do Single Stream and Offline.
- We do quantization in an opinionated way choosing the format the model was published in when hardware acceleration exists, i.e., bfloat16 on MI250 and float8 on MI300X.
- Lamini enables memory tuning LLMs, which turns any LLM into a massive mixture of experts (MoME). Our inference engine seamlessly loads experts and runs them with high performance.
In order to run an LLM, you need an inference framework. There are many popular and emerging inference frameworks such as vLLM, TensorRT-LLM, llama.cpp, and llm.c, to name a few. The model is fixed during inference, so inference frameworks typically compete based on performance — speed, latency and throughput — or optimization methods — quantization, sparsity, flash attention, KV-Cache paging, and speculative decoding. It’s important to note that accuracy is determined by the model and not the inference framework which is why these frameworks compete on performance, not accuracy. In this post, we’ll discuss why some performance comparisons may not be as useful as you might think.
LLM inference frameworks have hit the “memory wall”, which is a hardware-imposed speed limit on memory-bound code. It turns out that the transformer models underlying LLMs are memory bound in the decoding phase. The speed of the system memory determines performance when a program is memory bound. So different inference frameworks running on the same hardware platform with the same memory will get approximately the same performance.
Since all frameworks are governed by the same hardware speed limit, LLM application developers have a simple choice. Understand where the memory wall is for your system, pick a framework that gets close to it, and move on. Of course there is no lower bound on how inefficient a particular piece of software can be, so you need to make sure that your inference framework is well tuned, but the memory wall imposes an upper bound. Once you get close to the upper bound, you are done.
This leaves room for more research into breaking down the memory wall for LLM inference. However, a research breakthrough may require departing from one of the most popular and useful models in history — the transformer.
What is the memory wall?
The memory wall refers to the large and growing gap between processor speed and memory bandwidth. The memory wall arises because the energy required to move data between a large-capacity memory such as HBM3 and the matrix cores on a GPU such as the MI300X is much larger than the energy required to move data through the matrix core circuit itself.
Figure 1 shows the memory wall in action. The gray line tracking GPU speed in FLOPs is growing faster than the green line tracking memory bandwidth. Note that the scale is logarithmic, so the gap is much bigger and growing much faster than it looks.
Why is transformer decoding memory bound?
Transformers have been the most successful deep learning model of the last decade, even challenging convolutional neural networks (CNNs) in vision. They dominate the landscape of LLMs, as evidenced by Llama 3.1, Claude 3.5, GPT-4, DeepSeek, and Mistral.
The key innovation in transformers is that they avoid the memory wall during training, but they still run straight into it for inference. During inference, transformers use an algorithm called autoregressive decoding. It means that the transformer must fully predict the previous token before it can begin working on the next one. For Llama 3.1 405B, it means that every single one of the 405 billion weights needs to be loaded from memory before the model can output each and every token. That operation of loading 405 billion weights makes LLM inference memory bound on most modern systems, e.g. GPUs with HBM.
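To make the per-token cost concrete, here is a minimal sketch of greedy autoregressive decoding, assuming a hypothetical `model` callable that returns next-token logits for a token sequence (the names are illustrative, not any particular framework's API). The point is that each loop iteration is a full forward pass, so every weight is read from memory once per generated token.

```python
# Minimal sketch of greedy autoregressive decoding (hypothetical model object).
# Each loop iteration is a full forward pass, so every weight in the model is
# streamed from memory once per generated token.

def greedy_decode(model, prompt_ids, max_new_tokens, eos_id):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(tokens)  # full forward pass: loads ALL weights from memory
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # argmax over vocab
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```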
How did transformers avoid the memory wall for training? Before transformers, the most popular models for language were recurrent neural networks (RNNs). RNNs are trained with the backpropagation through time (BPTT) algorithm. The BPTT algorithm loads every single weight in the neural network for every single token in the training data. It needs to do this to determine how the next token in a sequence is influenced by the previous token. This runs straight into the memory wall because the weights of the RNN need to be reloaded from memory over and over again.
Transformers tore down the memory wall for training by using a different training algorithm called “teacher forcing” that replaced BPTT. In teacher forcing, a huge batch of tokens is loaded all at once and used to update the weights of the LLM. It avoids the memory wall because the processor can load a weight once and perform thousands of floating point ops on it. Researchers had previously tried teacher forcing for RNNs, but the accuracy dropped significantly because the RNN didn’t have any way of figuring out how the next token was influenced by the previous token other than backpropagation through time.
The key to transformers is the insight that they could rely on attention (which is compute bound, not memory bound) to figure out how the next tokens were influenced by the previous tokens. So teacher forcing succeeded for transformers where it failed for RNNs.
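As a rough illustration of why teacher forcing is so much friendlier to the memory system, here is a sketch in PyTorch, assuming a hypothetical transformer `model` that maps a [batch, seq] tensor of token ids to [batch, seq, vocab] logits. The whole sequence is processed in one forward pass, so each weight is loaded once and reused across every token position in the batch.

```python
import torch
import torch.nn.functional as F

# Teacher forcing sketch (hypothetical transformer `model` mapping
# [batch, seq] token ids -> [batch, seq, vocab] logits).
# One parallel forward pass covers every position, so each weight is loaded
# once and reused for thousands of floating point operations.

def teacher_forcing_loss(model, token_ids):            # token_ids: [batch, seq]
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                             # one pass over all positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # [batch*(seq-1), vocab]
        targets.reshape(-1),                           # [batch*(seq-1)]
    )
```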
Different types of memory
If LLM inference is bottlenecked on memory, it should be possible to accelerate inference by using higher bandwidth memory, right? Yes, but only by changing the hardware: adopting a faster memory technology means adopting a different chip or system. The speed limit is therefore set by the hardware, not by the choice of inference framework or by software-level optimizations.
Inference systems from Cerebras and Groq have claimed huge speedups in tokens per second for single requests. These speedups are real, because these inference engines are using a different memory technology, SRAM.
Local SRAM is 100x lower energy (and therefore higher bandwidth) than DRAM. However, it is also much lower capacity. A typical CDNA 3 core on an MI300X has 512KB of Local SRAM. Across 38 cores (CUs) per die and 8 dies per MI300X GPU, that adds up to 152MB of Local SRAM. To fit a single Llama 3.1 70B model in Local SRAM, it would take 921 GPUs. So a leading-edge GPU packs 1263x less SRAM than HBM.
The GroqChip packs more Local SRAM per processor, reaching up to 230MB. This means that a cluster of just over 600 GroqChips could serve up a single Llama 3.1 70B parameter model in bfloat16. The Cerebras Wafer-Scale Engine 3 takes it further, packing 44GB of SRAM into a single processor. Cerebras could serve up Llama 3.1 70B with a cluster of four Wafer-Scale Engines.
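These device counts follow from simple capacity arithmetic; here is a back-of-the-envelope sketch using the approximate SRAM figures quoted above.

```python
import math

# Back-of-the-envelope device counts for holding Llama 3.1 70B (bfloat16)
# entirely in SRAM, using the approximate capacities quoted in the text.

model_bytes = 70e9 * 2                    # 70B weights x 2 bytes (bfloat16) ~= 140 GB

sram_capacity_bytes = {
    "MI300X Local SRAM": 152e6,           # ~152 MB across all CUs
    "GroqChip": 230e6,                    # ~230 MB
    "Cerebras WSE-3": 44e9,               # ~44 GB
}

for device, capacity in sram_capacity_bytes.items():
    print(f"{device}: ~{math.ceil(model_bytes / capacity)} devices")
# Prints roughly 922, 609, and 4 -- matching the estimates above up to rounding.
```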
Most inference systems today use GPUs, which typically store the LLM parameters in a type of memory called high bandwidth memory (HBM), a standard defined by the JEDEC memory standards body. GPUs like the MI300X and H100 include 192GB and 80GB of HBM3 respectively, which is enough to store a 96 billion or 40 billion parameter LLM in bfloat16 on a single GPU. As the name implies, HBM achieves higher bandwidth than most other types of DRAM by using 3D stacking and through-silicon vias. One way to improve bandwidth is to use more advanced HBM memory, e.g. moving from HBM3 to the upcoming HBM4. Another way is to build a system with more HBM modules, i.e. more GPUs.
You can greatly accelerate inference by changing the hardware. Adding more GPUs increases memory bandwidth. If you have enough GPUs to fit the LLM weights entirely in Local SRAM, you can get a huge speedup. Whether or not investing that many hardware resources (and $$) into a single model is worth it to drop inference latency is another story.
MLPerf Scenarios
The MLPerf inference benchmark is an industry standard that defines how to measure performance of deep learning applications, including LLM inference.
For LLM inference, we are typically thinking of a large GPU server that is processing LLM requests. The GPU server is large because the LLM is large. MLPerf defines four scenarios based on a survey of the most common uses of LLMs in industry.
The relevant scenarios for LLMs are typically single-stream, server, and offline.
Single stream
Single stream means that there is a single user interacting with the LLM. For example, if you bring up an LLM inference service on your system and one user talks to the LLM through a playground, that is single stream. Time to first token and tokens per second impact the user experience for this scenario. Time to first token impacts how long the user has to wait to see a response when they press enter. Tokens per second dictates how quickly the tokens are rendered to the screen. Tokens per second also affects how long an LLM agent takes to form a request and call a function or tool.
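If you want to measure these two numbers yourself, a rough sketch looks like the following, assuming a hypothetical streaming generator `stream_tokens(prompt)` that yields tokens as they are decoded (illustrative only, not a specific framework's API).

```python
import time

# Sketch of measuring single-stream latency metrics against a hypothetical
# streaming API `stream_tokens(prompt)` that yields tokens as they are decoded.

def measure_single_stream(stream_tokens, prompt):
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in stream_tokens(prompt):
        count += 1
        if first_token_time is None:
            first_token_time = time.perf_counter() - start   # time to first token
    total = time.perf_counter() - start
    decode_rate = (count - 1) / (total - first_token_time) if count > 1 else 0.0
    return {"ttft_s": first_token_time, "decode_tokens_per_s": decode_rate}
```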
Single stream runs straight into the memory wall because the LLM is decoding one token at a time. So the inference framework has no choice but to load the entire LLM from memory on every token.
Server
Server means that there is a single inference system that is receiving requests from many users simultaneously. This scenario gives the inference system much more freedom. If it can batch the requests together, it can process one token for each request in the batch simultaneously. This increases the amount of computation per weight loaded from the model. If the batch size is big enough, typically hundreds or thousands of requests, the inference system can overcome the memory wall. This is why GPU systems will have much higher tokens per second in the server scenario.
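A rough way to see the effect of batching: in the decode phase, each bfloat16 weight (2 bytes) is used for one multiply-accumulate (2 FLOPs) per request in the batch, so the arithmetic intensity in FLOPs per byte is roughly the batch size. The sketch below uses the MI250 turning point discussed later in this post; the turning point for a given GPU is its peak matrix throughput divided by its memory bandwidth.

```python
# Rough decode-phase arithmetic intensity: each bfloat16 weight (2 bytes) does
# one multiply-accumulate (2 FLOPs) per request in the batch, so intensity ~= batch size.

def decode_arithmetic_intensity(batch_size, bytes_per_weight=2):
    flops_per_weight = 2 * batch_size          # one multiply-add per batched request
    return flops_per_weight / bytes_per_weight # FLOPs per byte of weights loaded

def is_compute_bound(batch_size, turning_point):
    # turning point = peak matrix FLOP/s divided by memory bandwidth (roofline ridge)
    return decode_arithmetic_intensity(batch_size) >= turning_point

print(is_compute_bound(batch_size=1,   turning_point=113))   # False: memory bound (MI250)
print(is_compute_bound(batch_size=256, turning_point=113))   # True: compute bound
```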
Offline
The offline scenario is very similar to the server scenario. Many requests are processed at once. In fact, all of the requests are presented to the inference system immediately and it can process them in any way it sees fit. This enables extensive batching and straightforward scheduling. The offline scenario is almost always compute bound because the batch size can be chosen optimally.
If you look at the winning MLPerf submissions, inference systems perform best in the offline scenario, with the server scenario close behind it. Both of these scenarios can avoid the memory wall, so they will have much higher tokens/second than the single-stream scenario. Because the difference is so huge, when an inference framework reports a tokens/second result, you should carefully consider whether it is reporting the server/offline or single-stream scenario.
From an LLM application developer perspective, there is a clear advantage to casting your application as a server or offline scenario instead of the single-stream scenario. In the Lamini framework, we provide high level APIs that automatically map LLM calls onto the server scenario, achieving 52x faster inference than vLLM using a single-stream API for several real pipelines.
Use quantization and sparsity with caution
But aren’t there hundreds of papers that talk about inference optimizations?
Don’t I need quantization, sparsity, flash attention, KV-Cache paging, speculative decoding, etc, etc, etc?
Quantization and Sparsity
Quantization and sparsity are two of the highest impact optimizations for deep learning, including LLMs. They do help mitigate the memory wall because they shrink the neural network and therefore require moving less weight data from memory during inference. We expect optimized inference frameworks to use them but they should be used with extreme caution as blindly quantizing or pruning LLMs can reduce their accuracy.
One baseline that is often ignored in quantization or pruning benchmarks is simply shrinking the model. For example, moving from Llama 3.1 405B to Llama 3.1 70B reduces the amount of weight data that needs to be loaded from memory by roughly 5.8x. Moving from bfloat16 to FP8 or INT8 cuts it by 2x. Llama 3.1 405B achieves an MMLU score of 88.6, while Llama 3.1 70B achieves 86.0. A quantization process that converts Llama 3.1 405B from bfloat16 to FP8 would need to take extreme care that it does not drop MMLU (and every other metric) by more than a point or two. Otherwise it would be better, and much easier, to simply use Llama 3.1 70B.
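The footprint arithmetic behind that comparison is straightforward; here is a small sketch using nominal parameter counts (the exact ratio depends on the precise model sizes).

```python
# Weight-memory footprints for the baselines discussed above (nominal parameter counts).

def weight_bytes(params_billion, bytes_per_weight):
    return params_billion * 1e9 * bytes_per_weight

llama_405b_bf16 = weight_bytes(405, 2)   # ~810 GB
llama_70b_bf16  = weight_bytes(70, 2)    # ~140 GB
llama_405b_fp8  = weight_bytes(405, 1)   # ~405 GB

print(f"405B -> 70B (both bf16): {llama_405b_bf16 / llama_70b_bf16:.1f}x less weight traffic")  # ~5.8x
print(f"405B bf16 -> FP8:        {llama_405b_bf16 / llama_405b_fp8:.1f}x less weight traffic")  # 2.0x
```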
There is a huge validation effort that goes into the release of the best models, e.g. Llama 3.1. We usually recommend using well-validated published models in the format that they are published in. In addition to standard metrics, they typically go through red-team evaluations. For example, Meta publishes Llama 3.1 8B in bfloat16 format without sparsity; Mistral publishes Mixtral-8x22B-v0.1 in bfloat16 with 8-way sparsely gated mixture-of-experts sparsity. Expert teams may be able to quantize or prune more aggressively, but we don’t recommend that most users do this given the high validation difficulty. If you aren’t comfortable reproducing Meta’s MMLU metric for Llama 3.1, you probably shouldn’t be validating quantized or pruned models.
Attention Optimizations
In addition to the memory wall, transformers can also run into other bottlenecks. One of the biggest is handling very long sequences. The fundamental problem is that the default attention algorithm requires O(N^2) computation and memory in the sequence length N. For short sequences this is insignificant, but the O(N^2) term can dominate for longer sequences. Optimizations such as flash attention and KV-cache paging target this bottleneck.
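A rough sketch of that scaling, assuming the standard attention formulation with head dimension d (the constants are approximate, per layer and per head):

```python
# Rough scaling of vanilla attention with sequence length N (per layer, per head).
# QK^T and the attention-weighted sum over V are each about 2*N*N*d FLOPs,
# and the score matrix holds N*N entries.

def attention_cost(seq_len, head_dim, bytes_per_element=2):
    flops = 4 * seq_len * seq_len * head_dim          # QK^T + attention @ V
    score_matrix_bytes = seq_len * seq_len * bytes_per_element
    return flops, score_matrix_bytes

for n in (1_000, 10_000, 100_000):
    flops, mem = attention_cost(n, head_dim=128)
    print(f"N={n:>7}: ~{flops:.2e} FLOPs, ~{mem / 1e9:.2f} GB of attention scores")
```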
A good inference framework will include these optimizations, which will affect the maximum sequence length that can be processed. However, these optimizations do not address the memory wall. Even if the computational cost of attention was driven to zero, the system would still need to load every single weight for every single token for autoregressive decoding.
Where is the memory wall?
How do you know if your inference framework is optimized well enough that it gets close to the memory wall? The Berkeley Roofline model is a simple and insightful visual performance model that we can use to find the memory wall. A quick roofline calculation shows that the memory wall for the single-stream scenario of running Llama 3.1 8B on a single MI250 GPU is 200 tokens per second. If we time our inference framework and get close to that, we have hit the hardware limit and additional software optimization will not be effective.
In the decoding phase of the LLM, the arithmetic intensity is approximately equal to the batch size. The turning point of the roofline, i.e. peak matrix throughput divided by memory bandwidth, is a batch size of about 113 for MI250, 492 for MI300X, and 567 for H100.
Consider running Llama 3.1 8B in a single-stream scenario, which has a batch size of 1. Every token requires loading 8 billion weights from memory, and in bfloat16 each weight takes 2 bytes. On an MI250, memory bandwidth is 3.2 TB/s. So the decoder can run at most at 3.2e12 bytes/s ÷ (8e9 weights × 2 bytes/weight) = 200 tokens per second.
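The same arithmetic, written as a small roofline sketch, reproduces the single-stream ceilings quoted in the TL;DR. The peak bandwidth figures are approximate published numbers, so real systems will land somewhat below these ceilings.

```python
# Single-stream memory-wall ceiling:
# tokens/s <= memory bandwidth / bytes of weights streamed per decoded token.

PARAMS = 8e9                 # Llama 3.1 8B
BYTES_PER_WEIGHT = 2         # bfloat16
weight_bytes = PARAMS * BYTES_PER_WEIGHT     # 16 GB streamed per token

peak_bandwidth_bytes_per_s = {               # approximate published peaks
    "MI250":  3.2e12,
    "MI300X": 5.3e12,
    "H100":   3.35e12,
}

for gpu, bw in peak_bandwidth_bytes_per_s.items():
    print(f"{gpu}: {bw / weight_bytes:.0f} tokens/s ceiling")
# MI250: 200, MI300X: 331, H100: 209 -- matching the single-stream numbers above.
```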
Is it possible to tear down the memory wall?
The success of transformers and teacher forcing in training raises the question of whether a new model could tear down the memory wall for inference. It has been seven years since Ashish Vaswani et al. published the groundbreaking “Attention Is All You Need” paper that introduced the transformer, and for all of that time we have been living with the memory wall implied by autoregressive decoding.
So far researchers have come up short.
Here we review several potential research directions that are aimed at breaking down the memory wall. A breakthrough in these or another research area could lead to inference engines that run hundreds to thousands of times faster.
Quantization and Sparsity Optimization
We expect future advances in quantization and sparsity to chip away at the memory wall, but they are unlikely to provide the 567x reduction in network size needed to break the memory wall on an H100. We have been holding out for a breakthrough in sparsity for over a decade, yet cutting-edge models like Llama 3.1 are still released without any sparsity. State-of-the-art models are often released in bfloat16 today. New formats like FP8 and FP4 are out now, and there is a roadmap down to one bit and beyond, but it is expected to take several generations of hardware to work out how to train models reliably in these formats.
Mixture-of-experts models have provided some success. Lamini released the MoME model with up to one million experts, which cuts the number of active parameters used for memory operations by up to a million times, enough to overcome the memory wall. Mistral’s Mixtral family of 8-way MoE models cuts down the number of active parameters by up to 8x.
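The routing arithmetic behind that reduction is simple: with top-k routing over E experts, only the shared layers plus k of the E expert blocks are touched per token. The numbers in the sketch below are made up for illustration, not the actual Mixtral or MoME configurations.

```python
# Illustrative mixture-of-experts routing arithmetic (hypothetical sizes, not the
# actual Mixtral or MoME configurations): only the shared parameters plus the
# top-k selected experts are loaded per token.

def moe_params(shared_params, expert_params, num_experts, top_k):
    total = shared_params + num_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active

total, active = moe_params(shared_params=2e9, expert_params=1e9,
                           num_experts=8, top_k=1)
print(f"total {total / 1e9:.0f}B params, active {active / 1e9:.0f}B per token "
      f"({total / active:.1f}x fewer parameters loaded)")
```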
Parallel and Speculative Decoding
Another area of research is to make LLM decoding more parallel, effectively removing autoregressive decoding from transformers.
Speculative decoding takes the sequential autoregressive decoder and runs it in parallel anyway. It uses a small neural network with fewer parameters to predict most tokens, and only calls in the bigger LLM to check its work and correct it when it makes a mistake. Speculative decoding mitigates the memory wall, but models can typically only speculate a few tokens ahead without making a mistake, which limits the benefit. The small neural network is still bound by the memory wall; it just has fewer weights than the bigger network. Care also needs to be taken to make sure that the predictions made by the small network do not introduce errors into the decoding process.
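Here is a simplified sketch of the greedy variant of this loop, assuming hypothetical `draft_next` and `target_argmax` callables (real systems use probabilistic acceptance rules rather than exact matching, but the structure is similar).

```python
# Simplified greedy speculative decoding (hypothetical `draft_next` and
# `target_argmax` callables). The draft model proposes k tokens cheaply; the
# large model verifies them all in a single forward pass and keeps the
# proposals up to the first mismatch.

def speculative_step(draft_next, target_argmax, tokens, k):
    # 1. Draft k tokens autoregressively with the small model.
    proposed = []
    context = list(tokens)
    for _ in range(k):
        t = draft_next(context)
        proposed.append(t)
        context.append(t)

    # 2. One big-model pass scores all k positions at once:
    #    target_argmax returns the big model's greedy token after each prefix.
    verified = target_argmax(tokens, proposed)     # list of k+1 tokens

    # 3. Accept proposals until the first disagreement, then take the big
    #    model's token at that position.
    accepted = []
    for i, t in enumerate(proposed):
        if t == verified[i]:
            accepted.append(t)
        else:
            accepted.append(verified[i])
            break
    else:
        accepted.append(verified[k])               # all k accepted: bonus token
    return tokens + accepted
```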
Parallel decoding changes the transformer architecture so that it can output more than one token at a time, which allows reusing the weights for multiple tokens. If we found an approach that could output thousands of tokens at a time, it would break the memory wall. Changing the decoder head can allow generating 4 to 8 tokens at a time, as described in the paper “Better & Faster Large Language Models via Multi-token Prediction”. Diffusion forcing combines a wavefront algorithm with a diffusion process to output the entire output sequence simultaneously, as described here, which could potentially generate an unlimited number of tokens in one shot. Although some of these approaches are promising, they are new models, and so may require training new foundation models to be competitive with leading transformers like Llama 3.1. It would be an expensive mistake to train one of these models only to realize that it is more parallel, but less accurate, than a transformer.
In summary, it will take a lot more research to tear down the memory wall. Until the next breakthrough comes along, be wary of claims from inference frameworks promising top performance as they may be trading off accuracy for speed or comparing against unoptimized baselines.