One of the most effective ways to reduce latency in LLM inference is to fuse all computation and communication into a single megakernel — also known as a persistent kernel. In this design, the system launches just one GPU kernel to execute the entire model — from layer-by-layer computation to inter-GPU communication — without interruption. This approach offers several key performance advantages:
- Eliminates kernel launch overhead, even in multi-GPU settings, by avoiding repeated kernel invocations;
- Enables software pipelining across layers, allowing the kernel to begin loading data for the next layer while computing the current one;
- Overlaps computation and communication, as a megakernel can simultaneously execute compute operations and inter-GPU communication to hide latency.
Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.
Can we automate this process through compilation? Motivated by this question, our team from CMU, UW, Berkeley, NVIDIA, and Tsinghua developed Mirage Persistent Kernel (MPK) — a compiler and runtime system that automatically transforms multi-GPU LLM inference into a high-performance megakernel. MPK unlocks the benefits of end-to-end GPU fusion while requiring minimal manual effort from developers.
A key advantage of MPK is extremely low LLM inference latency, achieved by eliminating kernel launch overhead and maximally overlapping computation, data loading, and inter-GPU communication across layers.
Figure 1 illustrates a performance comparison between MPK and existing LLM inference systems on both single- and multi-GPU configurations. On a single NVIDIA A100 40GB GPU, MPK reduces per-token decoding latency from 14.5 ms — as achieved by optimized systems like vLLM and SGLang — to 12.5 ms, approaching the theoretical lower bound of 10 ms (based on loading 16 GB of weights with 1.6 TB/s memory bandwidth).
Beyond single-GPU optimization, MPK fuses computation and inter-GPU communication into a single megakernel. This design enables MPK to maximally overlap computation and communication. As a result, the performance improvements of MPK over current systems increase with the number of GPUs, making it particularly effective for multi-GPU deployments.
The rest of this blog dives deeper into how MPK works:
- Part 1 introduces the MPK compiler, which transforms an LLM’s computation graph into an optimized task graph;
- Part 2 covers the MPK runtime, which executes this task graph within a megakernel to achieve high throughput and low latency.
The computation performed by a large language model (LLM) is typically represented as a computation graph, where each node corresponds to a compute operation (e.g., matrix multiplication, attention) or a collective communication primitive (e.g., all-reduce), and edges denote data dependencies between operations. In existing systems, each operator is generally executed via a dedicated GPU kernel. However, this kernel-per-operator execution model often fails to exploit pipelining opportunities, since dependencies are enforced at a coarse granularity — across entire kernels — rather than the actual data units.
Consider a typical example: an allreduce operation following a matrix multiplication. In existing kernel-per-operator systems, the allreduce kernel must wait until the entire matmul kernel completes. In reality, though, each chunk of data for the allreduce only depends on a portion of the matmul output. This mismatch between logical and actual data dependencies limits the potential for overlapping computation and communication.
To address this issue, MPK introduces a compiler that automatically transforms the LLM’s computation graph into a fine-grained task graph. This task graph explicitly captures dependencies at the sub-kernel level, enabling more aggressive pipelining across layers.
In an MPK task graph:
- Each task (shown as a rectangle in Figure 2) represents a unit of computation or communication assigned to a single GPU streaming multiprocessor (SM).
- Each event (shown as a circle) represents a synchronization point between tasks.
- Each task has an outgoing edge to a triggering event, which is activated once all associated tasks complete.
- Each task also has an incoming edge from a dependent event, indicating that the task can start execution as soon as that event is activated.
Task graphs allow MPK to uncover pipelining opportunities that would be missed in computation graphs. For example, MPK can construct an optimized task graph where each allreduce task depends only on the corresponding matmul task that produces its input — enabling partial execution and overlap.
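To make this structure concrete, here is a minimal sketch of what task and event descriptors in such a graph might look like. The names and fields (TaskDesc, EventDesc, trigger_event, and so on) are illustrative assumptions for this post, not MPK's actual data structures:

```cuda
// Illustrative task-graph descriptors; names and fields are assumptions for
// this post, not MPK's real types. Each task records the event it triggers
// and the event it waits on; each event counts completions up to a threshold.

enum TaskType { TASK_MATMUL, TASK_ATTENTION, TASK_ALLREDUCE_CHUNK };

struct TaskDesc {
    TaskType type;        // which Mirage-generated task body to run
    int trigger_event;    // event whose counter this task increments on completion
    int dependent_event;  // event that must activate before this task may start
    // ... pointers to the task's input/output tiles would live here ...
};

struct EventDesc {
    int counter;     // completed predecessor tasks so far (updated atomically)
    int threshold;   // completions required before the event activates
    int first_task;  // tasks unblocked by activation: ids in [first_task, last_task)
    int last_task;
};
```

With dependencies encoded this way, the event an allreduce chunk waits on can have a threshold of one, so it activates as soon as the single matmul task producing that chunk finishes, enabling exactly the fine-grained overlap described above.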
In addition to generating an optimized task graph, MPK also automatically generates high-performance CUDA implementations for each task using the Mirage kernel superoptimizer. This ensures that each task runs efficiently on a GPU SM. (For more about the kernel superoptimizer, see this post.)
MPK includes an on-GPU runtime system that executes the task graph entirely within a single GPU megakernel, allowing for fine-grained control over task execution and scheduling without any kernel launches during inference.
To achieve this, MPK statically partitions all streaming multiprocessors (SMs) on a GPU into two roles: workers and schedulers. The number of worker and scheduler SMs is fixed at kernel launch time and matches the total number of physical SMs, avoiding any dynamic context switching overhead.
Workers
Each worker operates on an SM and maintains a dedicated task queue. It follows a simple but efficient execution loop:
- Fetch the next task from its queue.
- Execute the task (e.g., matrix multiplication, attention, or inter-GPU data transfers).
- Notify the triggering event upon task completion.
- Repeat.
This design ensures that workers remain fully utilized while enabling task execution to proceed asynchronously across layers and operations.
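As a rough illustration, a worker's loop inside the megakernel might look like the sketch below. It assumes the descriptor layout sketched earlier; run_task stands in for the Mirage-generated task bodies, notify_event for MPK's actual notification logic (sketched in the event-driven section below), and the single-producer queue is deliberately simplified:

```cuda
// Simplified persistent worker loop (one thread block pinned to one SM).
// Placeholder declarations for routines defined or sketched elsewhere.
__device__ void run_task(const TaskDesc& task);                // Mirage-generated task bodies
__device__ void notify_event(EventDesc* events, int event_id); // sketched below

__device__ void worker_loop(const TaskDesc* tasks, EventDesc* events,
                            const int* my_queue,               // task ids pushed by schedulers
                            const volatile int* my_queue_tail, // how many ids were pushed so far
                            int my_task_count) {               // total tasks this worker will receive
    int head = 0;                                              // next queue slot to consume
    while (head < my_task_count) {
        // 1. Fetch the next task, spinning until a scheduler has pushed one.
        while (*my_queue_tail <= head) { /* spin */ }
        int task_id = my_queue[head++];

        // 2. Execute the task with all threads of the block: a matmul tile,
        //    an attention tile, or a chunk of an inter-GPU transfer.
        run_task(tasks[task_id]);

        // 3. Notify the triggering event once the whole block has finished.
        __syncthreads();
        if (threadIdx.x == 0) notify_event(events, tasks[task_id].trigger_event);
    }
}
```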
Schedulers
Scheduling decisions are handled by MPK’s distributed schedulers, each of which runs on a single warp. Because each SM can accommodate multiple warps, up to four schedulers can run concurrently per SM. Each scheduler maintains a queue of activated events. It continuously:
- Dequeues activated events whose dependencies are satisfied (i.e., all prerequisite tasks have completed).
- Launches the set of tasks that depend on the activated event.
This decentralized scheduling mechanism minimizes coordination overhead while enabling scalable execution across SMs.
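Under the same assumptions, a scheduler warp's loop could be sketched as follows; the round-robin task assignment mirrors the simple policy mentioned later in this post, and the queue bookkeeping is again simplified relative to a real implementation:

```cuda
// Simplified scheduler loop (one warp). It consumes activated events and
// pushes the tasks they unblock onto per-worker queues.
#define MAX_TASKS_PER_WORKER 1024   // illustrative queue capacity

__device__ void scheduler_loop(const EventDesc* events,
                               const int* event_queue,                // ids of activated events
                               const volatile int* event_queue_tail,  // how many have been activated
                               int* worker_queues,       // [num_workers][MAX_TASKS_PER_WORKER], flattened
                               int* worker_queue_tails,  // one tail counter per worker
                               int num_workers, int num_events) {
    if (threadIdx.x % 32 != 0) return;   // only lane 0 of the scheduler warp does bookkeeping
    int next_event = 0;
    int next_worker = 0;
    while (next_event < num_events) {
        // 1. Wait for the next activated event.
        while (*event_queue_tail <= next_event) { /* spin */ }
        EventDesc e = events[event_queue[next_event++]];

        // 2. Launch every task unblocked by this event: push its id onto some
        //    worker's queue, assigning workers round-robin.
        for (int t = e.first_task; t < e.last_task; ++t) {
            int w = next_worker;
            next_worker = (next_worker + 1) % num_workers;
            int slot = atomicAdd(&worker_queue_tails[w], 1);
            worker_queues[w * MAX_TASKS_PER_WORKER + slot] = t;
        }
    }
}
```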
Event-Driven Execution
Figure 3 illustrates MPK’s execution timeline. Each rectangle represents a task running on a worker; each circle represents an event. As a task completes, it increments the counter for its corresponding triggering event. When the event counter reaches a pre-defined threshold, the event is considered activated and is enqueued into a scheduler’s event queue. The scheduler then launches any downstream tasks that depend on this event.
This design allows for fine-grained software pipelining and overlap between computation and communication. For example:
- Matmul tasks can execute in parallel with attention tasks from different layers.
- Allreduce communication can begin as soon as partial matmul results are available.
Because all scheduling and task transitions occur within a single kernel context, the overhead between tasks is extremely low — typically just 1–2 microseconds — enabling efficient execution of multi-layer, multi-GPU LLM workloads.
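A minimal sketch of this notification path, using the same illustrative descriptors and a single global queue of activated events, might look like this (memory-ordering details are simplified relative to what a real implementation would need):

```cuda
// Event notification: each completing task bumps its triggering event's
// counter; the task that supplies the final increment activates the event
// and enqueues it for the schedulers.
#define MAX_ACTIVE_EVENTS 65536                    // illustrative capacity
__device__ int g_event_queue[MAX_ACTIVE_EVENTS];   // ids of activated events
__device__ int g_event_queue_tail = 0;             // consumed by scheduler warps

__device__ void notify_event(EventDesc* events, int event_id) {
    EventDesc& e = events[event_id];
    int done = atomicAdd(&e.counter, 1) + 1;       // record this task's completion
    if (done == e.threshold) {
        // This task supplied the final increment: the event is now activated.
        int slot = atomicAdd(&g_event_queue_tail, 1);
        g_event_queue[slot] = event_id;
        __threadfence();   // a real implementation must also order the slot write
                           // against the tail update so schedulers never read an empty slot
    }
}
```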
Our vision for MPK is to make megakernel compilation both easy to use and highly performant. Currently you can compile an LLM into a megakernel with just a few dozen lines of Python code — mainly to specify the megakernel’s inputs and outputs. We’re excited about this direction, and there’s still much more to explore. Some of the key areas we’re actively working on include:
- Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.
- Handling workload dynamism. MPK currently builds a static task graph, which limits its ability to handle dynamic workloads such as mixture-of-experts (MoE) models. We’re developing new compilation strategies that allow MPK to support dynamic control flow and conditional execution inside megakernels.
- Advanced scheduling and task assignment. MPK unlocks a new level of fine-grained scheduling at the task level. While our current implementation uses simple round-robin scheduling to distribute tasks across SMs, we see exciting opportunities in advanced scheduling policies — such as priority-aware or throughput-optimized strategies — for use cases like latency-SLO-driven serving or hybrid batching.
We believe MPK represents a foundational shift in how LLM inference workloads are compiled and executed on GPUs, and we’re eager to collaborate with the community to push this vision forward.
To learn more about MPK and explore our code and documentation, please visit our project website: https://github.com/mirage-project/mirage.
We welcome feedback, contributions, and collaborations from the community!