Thursday, June 19, 2025
Techcratic
Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference | by Zhihao Jia | Jun, 2025

Hacker News by Hacker News
June 19, 2025

2025-06-19 15:20:00
zhihaojia.medium.com

One of the most effective ways to reduce latency in LLM inference is to fuse all computation and communication into a single megakernel — also known as a persistent kernel. In this design, the system launches just one GPU kernel to execute the entire model — from layer-by-layer computation to inter-GPU communication — without interruption. This approach offers several key performance advantages:

  1. Eliminates kernel launch overhead, even in multi-GPU settings, by avoiding repeated kernel invocations;
  2. Enables software pipelining across layers, allowing the kernel to begin loading data for the next layer while computing the current one;
  3. Overlaps computation and communication, as a megakernel can simultaneously execute compute operations and inter-GPU communication to hide latency.

Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.

Can we automate this process through compilation? Motivated by this question, our team from CMU, UW, Berkeley, NVIDIA, and Tsinghua developed Mirage Persistent Kernel (MPK) — a compiler and runtime system that automatically transforms multi-GPU LLM inference into a high-performance megakernel. MPK unlocks the benefits of end-to-end GPU fusion while requiring minimal manual effort from developers.

A key advantage of MPK is extremely low LLM inference latency, which it achieves by eliminating kernel launch overhead and maximally overlapping computation, data loading, and inter-GPU communication across layers.

Figure 1. Comparing LLM decoding latency between MPK and existing systems. We used a 39-token prompt and generated 512 tokens without speculative decoding.

Figure 1 illustrates a performance comparison between MPK and existing LLM inference systems on both single- and multi-GPU configurations. On a single NVIDIA A100 40GB GPU, MPK reduces per-token decoding latency from 14.5 ms — as achieved by optimized systems like vLLM and SGLang — to 12.5 ms, approaching the theoretical lower bound of 10 ms (based on loading 16 GB of weights with 1.6 TB/s memory bandwidth).
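The lower bound quoted above is simple arithmetic: if every decoded token requires reading all model weights from GPU memory once, the per-token time cannot fall below weight size divided by memory bandwidth. A minimal sketch of that calculation, using only the figures given in the text (the function and variable names are illustrative):

```python
def bandwidth_bound_ms(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read all weights from GPU memory once, in milliseconds."""
    return weight_bytes / bandwidth_bytes_per_s * 1e3


GB, TB = 1e9, 1e12

lower_bound_ms = bandwidth_bound_ms(16 * GB, 1.6 * TB)  # figures from the text
baseline_ms = 14.5  # per-token latency reported for vLLM / SGLang
mpk_ms = 12.5       # per-token latency reported for MPK

speedup = baseline_ms / mpk_ms
print(f"lower bound: {lower_bound_ms:.1f} ms")
print(f"MPK speedup over baseline: {speedup:.2f}x")
print(f"baseline / bound: {baseline_ms / lower_bound_ms:.2f}x, "
      f"MPK / bound: {mpk_ms / lower_bound_ms:.2f}x")
```

By this estimate, MPK is about 1.16x faster than the baseline and sits roughly 25% above the bandwidth bound, compared with 45% for the baseline.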

Beyond single-GPU optimization, MPK fuses computation and inter-GPU communication into a single megakernel. This design enables MPK to maximally overlap computation and communication. As a result, the performance improvements of MPK over current systems increase with the number of GPUs, making it particularly effective for multi-GPU deployments.

The rest of this blog dives deeper into how MPK works:

  • Part 1 introduces the MPK compiler, which transforms an LLM’s computation graph into an optimized task graph;
  • Part 2 covers the MPK runtime, which executes this task graph within a megakernel to achieve high throughput and low latency.

The computation of an LLM is typically represented as a computation graph, where each node is a compute operator (e.g., matrix multiplication, attention) or a collective communication primitive (e.g., allreduce), and edges denote data dependencies between operators. Existing systems generally launch a dedicated GPU kernel for each operator. However, this kernel-per-operator execution model often fails to exploit pipelining opportunities, because dependencies are enforced at a coarse granularity (across entire kernels) rather than at the level of the actual data units.

Consider a typical example: an allreduce operation following a matrix multiplication. In existing kernel-per-operator systems, the allreduce kernel must wait until the entire matmul kernel completes. In reality, though, each chunk of data for the allreduce only depends on a portion of the matmul output. This mismatch between logical and actual data dependencies limits the potential for overlapping computation and communication.
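To make the cost of this mismatch concrete, here is a toy timing model. It is a sketch under stated assumptions, not a measurement: the matmul output is split into equal chunks, compute and communication use separate resources, each resource handles one chunk at a time, and the durations are made-up numbers.

```python
def kernel_level_finish(n_chunks: int, t_matmul: float, t_comm: float) -> float:
    """Allreduce starts only after the whole matmul kernel has finished."""
    return n_chunks * t_matmul + n_chunks * t_comm


def chunk_level_finish(n_chunks: int, t_matmul: float, t_comm: float) -> float:
    """Allreduce of chunk i starts as soon as chunk i of the matmul is ready."""
    comm_free_at = 0.0
    for i in range(n_chunks):
        chunk_ready = (i + 1) * t_matmul  # matmul chunks are produced back to back
        comm_free_at = max(chunk_ready, comm_free_at) + t_comm
    return comm_free_at


# Hypothetical durations in arbitrary time units.
coarse = kernel_level_finish(n_chunks=4, t_matmul=2.0, t_comm=1.0)
fine = chunk_level_finish(n_chunks=4, t_matmul=2.0, t_comm=1.0)
print(coarse, fine)
```

With these numbers, the kernel-level schedule finishes at 12.0 time units and the chunk-level schedule at 9.0, because all but the last chunk's communication is hidden behind computation.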

Figure 2. The MPK compiler transforms an LLM’s computation graph (defined in PyTorch) into an optimized, fine-grained task graph that exposes maximum parallelism. The right-hand side illustrates an alternative — but suboptimal — task graph that introduces unnecessary data dependencies and global barriers, limiting pipelining opportunities across layers.

To address this issue, MPK introduces a compiler that automatically transforms the LLM’s computation graph into a fine-grained task graph. This task graph explicitly captures dependencies at the sub-kernel level, enabling more aggressive pipelining across layers.

In an MPK task graph:

  • Each task (shown as a rectangle in Figure 2) represents a unit of computation or communication assigned to a single GPU streaming multiprocessor (SM).
  • Each event (shown as a circle) represents a synchronization point between tasks.
  • Each task has an outgoing edge to a triggering event, which is activated once all associated tasks complete.
  • Each task also has an incoming edge from a dependent event, indicating that the task can start execution as soon as the event is activated.

Task graphs allow MPK to uncover pipelining opportunities that would be missed in computation graphs. For example, MPK can construct an optimized task graph where each allreduce task depends only on the corresponding matmul task that produces its input — enabling partial execution and overlap.
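The task and event structure described above can be sketched with two small Python classes. This is an illustrative data model only: the class names, field names, and helper functions are assumptions made for this example and are not MPK's actual API. It contrasts a fine-grained graph, where each allreduce task waits on its own matmul task, with a coarse one that places a single barrier after all matmul tasks.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Event:
    name: str
    threshold: int = 0  # number of tasks that must complete before activation
    dependents: List["Task"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Task:
    name: str
    waits_on: Optional[Event] = field(default=None, repr=False)  # incoming edge
    triggers: Optional[Event] = field(default=None, repr=False)  # outgoing edge


def connect(producers, event, consumers):
    """Producers trigger `event`; consumers may start once it is activated."""
    for task in producers:
        task.triggers = event
    for task in consumers:
        task.waits_on = event
    event.threshold += len(producers)
    event.dependents.extend(consumers)


def fine_grained(n_chunks):
    """One event per chunk: allreduce[i] depends only on matmul[i]."""
    matmuls = [Task(f"matmul[{i}]") for i in range(n_chunks)]
    allreduces = [Task(f"allreduce[{i}]") for i in range(n_chunks)]
    for i, (mm, ar) in enumerate(zip(matmuls, allreduces)):
        connect([mm], Event(f"chunk_ready[{i}]"), [ar])
    return matmuls, allreduces


def coarse_grained(n_chunks):
    """One global barrier: every allreduce waits for every matmul."""
    matmuls = [Task(f"matmul[{i}]") for i in range(n_chunks)]
    allreduces = [Task(f"allreduce[{i}]") for i in range(n_chunks)]
    connect(matmuls, Event("matmul_done"), allreduces)
    return matmuls, allreduces


_, fine = fine_grained(4)
_, coarse = coarse_grained(4)
fine_thresholds = [t.waits_on.threshold for t in fine]
coarse_thresholds = [t.waits_on.threshold for t in coarse]
print(fine_thresholds, coarse_thresholds)
```

In the fine-grained graph each allreduce task waits for a single producer, while in the coarse graph each one waits for all four, which is the unnecessary dependency the compiler aims to avoid.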

In addition to generating an optimized task graph, MPK also automatically generates high-performance CUDA implementations for each task using the Mirage kernel superoptimizer. This ensures that each task runs efficiently on a GPU SM. (For more about the kernel superoptimizer, see this post.)

MPK includes an on-GPU runtime system that executes the task graph entirely within a single GPU megakernel, allowing for fine-grained control over task execution and scheduling without any kernel launches during inference.

To achieve this, MPK statically partitions all streaming multiprocessors (SMs) on a GPU into two roles: workers and schedulers. The numbers of worker and scheduler SMs are fixed at kernel launch time and together equal the total number of physical SMs, avoiding any dynamic context switching overhead.
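As a small illustration of this static split, the sketch below computes the resulting worker and scheduler counts. The 108-SM total matches an NVIDIA A100, and the limit of four schedulers per SM comes from the scheduler description later in this post; reserving 4 SMs for scheduling is a hypothetical choice for this example, not a figure from MPK.

```python
def partition_sms(total_sms: int, scheduler_sms: int, schedulers_per_sm: int = 4) -> dict:
    """Split physical SMs into worker SMs and scheduler SMs, fixed at launch."""
    if not 0 < scheduler_sms < total_sms:
        raise ValueError("need at least one worker SM and one scheduler SM")
    return {
        "worker_sms": total_sms - scheduler_sms,
        "scheduler_sms": scheduler_sms,
        # Upper bound only: each scheduler occupies one warp.
        "max_schedulers": scheduler_sms * schedulers_per_sm,
    }


# Hypothetical split: 4 of an A100's 108 SMs reserved for schedulers.
layout = partition_sms(total_sms=108, scheduler_sms=4)
print(layout)
```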

Workers

Each worker operates on an SM and maintains a dedicated task queue. It follows a simple but efficient execution loop:

  1. Fetch the next task from its queue.
  2. Execute the task (e.g., matrix multiplication, attention, or inter-GPU data transfers).
  3. Notify the triggering event upon task completion.
  4. Repeat.

This design ensures that workers remain fully utilized while enabling task execution to proceed asynchronously across layers and operations.
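The worker loop can be sketched in a few lines of host-side Python. This only mirrors the control flow of the four steps; the real workers run as GPU code inside the megakernel, and the names `worker_loop`, `task_queue`, and `notify` are assumptions made for illustration.

```python
from collections import deque


def worker_loop(task_queue, notify):
    """Drain one worker's task queue.

    task_queue: deque of (run, event_id) pairs; run is a zero-argument callable.
    notify: callable invoked with event_id after each task completes.
    """
    # A persistent kernel would keep polling for new tasks instead of
    # exiting when the queue is empty; this sketch stops for simplicity.
    while task_queue:
        run, event_id = task_queue.popleft()  # 1. fetch the next task
        run()                                 # 2. execute the task
        notify(event_id)                      # 3. notify its triggering event
        # 4. repeat


log, notified = [], []
queue = deque([
    (lambda: log.append("matmul"), "e0"),
    (lambda: log.append("attention"), "e1"),
])
worker_loop(queue, notified.append)
print(log, notified)
```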

Schedulers

Scheduling decisions are handled by MPK’s distributed schedulers, each of which runs on a single warp. Because each SM can accommodate multiple warps, up to four schedulers can run concurrently per SM. Each scheduler maintains a queue of activated events. It continuously:

  1. Dequeues activated events whose dependencies are satisfied (i.e., all prerequisite tasks have completed).
  2. Launches the set of tasks that depend on the activated event.

This decentralized scheduling mechanism minimizes coordination overhead while enabling scalable execution across SMs.
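A scheduler's two steps can be sketched as follows. This is a simplified single-scheduler model in Python, with hypothetical names throughout. Tasks are assigned to worker queues round-robin, which the post later describes as the current policy; everything else about the real on-GPU implementation is abstracted away.

```python
from collections import deque


def run_scheduler(event_queue, dependents, worker_queues, next_worker=0):
    """Drain activated events and hand their dependent tasks to workers.

    event_queue: deque of activated event ids.
    dependents: dict mapping event id -> list of tasks waiting on that event.
    worker_queues: list of deques, one per worker.
    Returns the index of the worker that should receive the next task.
    """
    while event_queue:
        event = event_queue.popleft()              # 1. dequeue an activated event
        for task in dependents.get(event, []):     # 2. launch its dependent tasks
            worker_queues[next_worker].append(task)
            next_worker = (next_worker + 1) % len(worker_queues)
    return next_worker


events = deque(["e0", "e1"])
deps = {"e0": ["a", "b", "c"], "e1": ["d"]}
workers = [deque(), deque()]
cursor = run_scheduler(events, deps, workers)
print([list(w) for w in workers], cursor)
```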

Figure 3. The MPK runtime executes a task graph in a megakernel.

Event-Driven Execution

Figure 3 illustrates MPK’s execution timeline. Each rectangle represents a task running on a worker; each circle represents an event. As a task completes, it increments the counter for its corresponding triggering event. When the event counter reaches a pre-defined threshold, the event is considered activated and is enqueued into a scheduler’s event queue. The scheduler then launches any downstream tasks that depend on this event.
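The counter-and-threshold mechanism can be simulated sequentially to show when each task becomes eligible to run. The sketch below is a single-threaded Python model with a hypothetical graph and hypothetical names; on the GPU, counters would be updated concurrently by many workers.

```python
from collections import deque


def simulate(triggers, events, initial_tasks):
    """Run a task graph to completion and return the execution order.

    triggers: task name -> name of its triggering event (or None).
    events: event name -> (threshold, list of dependent task names).
    """
    counters = {name: 0 for name in events}
    ready = deque(initial_tasks)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)                 # stand-in for executing the task
        event = triggers.get(task)
        if event is None:
            continue
        counters[event] += 1               # task completion bumps the counter
        threshold, dependents = events[event]
        if counters[event] == threshold:   # event activated
            ready.extend(dependents)       # downstream tasks become runnable
    return order


# Two matmul chunks, each feeding its own allreduce; the next layer
# starts only after both allreduce tasks have completed.
triggers = {"mm0": "e0", "mm1": "e1", "ar0": "done", "ar1": "done", "next": None}
events = {"e0": (1, ["ar0"]), "e1": (1, ["ar1"]), "done": (2, ["next"])}
order = simulate(triggers, events, ["mm0", "mm1"])
print(order)
```

Note that `ar0` becomes runnable as soon as `mm0` completes, without waiting for `mm1`, while `next` is released only when the `done` counter reaches its threshold of 2.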

This design allows for fine-grained software pipelining and overlap between computation and communication. For example:

  • Matmul tasks can execute in parallel with attention tasks from different layers.
  • Allreduce communication can begin as soon as partial matmul results are available.

Because all scheduling and task transitions occur within a single kernel context, the overhead between tasks is extremely low — typically just 1–2 microseconds — enabling efficient execution of multi-layer, multi-GPU LLM workloads.

Our vision for MPK is to make megakernel compilation both easy to use and highly performant. Currently, you can compile an LLM into a megakernel with just a few dozen lines of Python code, mainly to specify the megakernel’s inputs and outputs. We’re excited about this direction, and there’s still much more to explore. Some of the key areas we’re actively working on include:

  • Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.
  • Handling workload dynamism. MPK currently builds a static task graph, which limits its ability to handle dynamic workloads such as mixture-of-experts (MoE) models. We’re developing new compilation strategies that allow MPK to support dynamic control flow and conditional execution inside megakernels.
  • Advanced scheduling and task assignment. MPK unlocks a new level of fine-grained scheduling at the task level. While our current implementation uses simple round-robin scheduling to distribute tasks across SMs, we see exciting opportunities in advanced scheduling policies, such as priority-aware or throughput-optimized strategies, for use cases like latency-SLO-driven serving or hybrid batching.

We believe MPK represents a foundational shift in how LLM inference workloads are compiled and executed on GPUs, and we’re eager to collaborate with the community to push this vision forward.

To learn more about MPK and explore our code and documentation, please visit our project website: https://github.com/mirage-project/mirage.

We welcome feedback, contributions, and collaborations from the community!

Source Link


Keep your files stored safely and securely with the SanDisk 2TB Extreme Portable SSD. With over 69,505 ratings and an impressive 4.6 out of 5 stars, this product has been purchased over 8K+ times in the past month. At only $129.99, this Amazon’s Choice product is a must-have for secure file storage.

Help keep private content private with the included password protection featuring 256-bit AES hardware encryption. Order now for just $129.99 on Amazon!


Start your free Amazon Prime trial
today and unlock unlimited streaming and more!





