
Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell | NVIDIA Technical Blog

As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users—from consumers to enterprises—to interact with AI more frequently, meaning that more tokens need to be generated. To serve these tokens at the lowest possible cost, AI platforms need to deliver the best possible token throughput per watt. 

Through extreme co-design across GPUs, CPUs, networking, software, power delivery, and cooling, NVIDIA continues to drive up token throughput per watt, which reduces cost per million tokens.

Additionally, NVIDIA continues to enhance its software stacks to achieve even greater levels of performance from existing platforms. This increases the value of the large installed base of NVIDIA GPUs across cloud service providers (CSPs), GPU clouds, model builders, enterprises, and others, enabling that infrastructure to remain productive for longer. 

In this post, we show how recent updates to the NVIDIA inference software stack—running on the NVIDIA Blackwell architecture—as well as use of the full capabilities available in the stack are enabling large performance gains across several scenarios on DeepSeek-R1, a state-of-the-art sparse mixture-of-experts (MoE) reasoning model.

Latest NVIDIA TensorRT-LLM software boosts reasoning inference performance

The NVIDIA GB200 NVL72 rack-scale platform connects 72 NVIDIA Blackwell GPUs using fifth-generation NVIDIA NVLink interconnect and NVLink Switch chips, providing 1,800 GB/s of bidirectional bandwidth to each GPU in the rack. This large scale-up domain is optimized for models based on sparse MoE architectures, which require frequent exchanges of data between experts to generate tokens. 

The Blackwell architecture also incorporates hardware acceleration for the NVFP4 data format, an NVIDIA-designed four-bit floating point format that better preserves accuracy compared to alternative FP4 formats. In addition, optimizations like disaggregated serving—which runs prefill operations on one set of GPUs and decode operations on a different set—also take advantage of the NVL72 architecture and NVLink Switch technology.
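
NVFP4 pairs four-bit floating-point (E2M1) values with fine-grained block scaling. As a rough illustration of the idea, here is a minimal NumPy sketch of block-scaled 4-bit quantization; the 16-element block size, the plain FP32 scale handling, and the rounding rule are simplifying assumptions for clarity, not the TensorRT-LLM or hardware implementation.

```python
# Illustrative NumPy sketch of NVFP4-style block quantization (simplified;
# not the hardware or TensorRT-LLM implementation).
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes
BLOCK = 16  # elements sharing one scale factor (assumed block size)

def quantize_dequantize_nvfp4_like(x: np.ndarray) -> np.ndarray:
    """Round each 16-element block to the nearest E2M1 value after scaling."""
    x = x.reshape(-1, BLOCK)  # assumes the tensor size is divisible by BLOCK
    # Per-block scale maps the block's max magnitude onto the largest E2M1 value (6.0).
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = x / scale
    # Nearest representable magnitude; the sign is restored afterwards.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(scaled) * E2M1_GRID[idx] * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
w_q = quantize_dequantize_nvfp4_like(w)
print("mean abs quantization error:", np.abs(w - w_q).mean())
```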

These architectural innovations enable NVIDIA GB200 NVL72 to deliver industry-leading performance on the latest open models, including DeepSeek-R1, a 671 billion-parameter sparse MoE that activates 37 billion parameters for each token.

[Chart: token throughput per GPU vs. interactivity for DeepSeek-R1 at 8K input/1K output, NVFP4 precision; the January 2026 TensorRT-LLM software curve sits above the October 2025 curve across the full range]
Figure 1. GB200 NVL72 DeepSeek-R1 token throughput using 8K/1K sequence length has increased substantially with the latest NVIDIA TensorRT-LLM software.

GB200 NVL72 had previously demonstrated leading per-GPU throughput on DeepSeek-R1 across the throughput/interactivity curves for both 1K/1K and 8K/1K input/output sequence lengths.

[Chart: token throughput per GPU vs. interactivity for DeepSeek-R1 at 1K input/1K output, NVFP4 precision; the January 2026 TensorRT-LLM software curve sits above the October 2025 curve across the full range]
Figure 2. GB200 NVL72 DeepSeek-R1 token throughput using 1K/1K sequence length has increased substantially with the latest NVIDIA TensorRT-LLM software.

The latest enhancements to the NVIDIA TensorRT-LLM open source library for optimizing LLM inference dramatically accelerate performance on the same platform, with the throughput of each Blackwell GPU increasing by up to 2.8x in the past three months. 

The optimizations behind these results include:

  • Expanded use of NVIDIA Programmatic Dependent Launch (PDL) to reduce kernel launch latencies, helping to increase throughput across the range of interactivity levels
  • Many low-level kernel optimizations to more efficiently utilize NVIDIA Blackwell Tensor Cores
  • Newly optimized implementation of all-to-all communication primitives that eliminates an intermediate buffer on the receiver side (a conceptual sketch of this exchange pattern follows this list)
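
To illustrate the all-to-all exchange referenced above, here is a conceptual expert-parallel token dispatch written with torch.distributed. The function name and splitting scheme are hypothetical; it simply shows the communication pattern, while the TensorRT-LLM optimization described above additionally removes an extra intermediate copy on the receiving side that naive implementations of this pattern often require.

```python
# Conceptual sketch of the all-to-all token exchange used in expert-parallel
# MoE inference. Assumes dist.init_process_group() has already been called
# (e.g., with the NCCL backend) and tensors live on the correct device.
import torch
import torch.distributed as dist

def dispatch_tokens_to_experts(local_tokens: torch.Tensor,
                               tokens_per_rank: list[int]) -> torch.Tensor:
    """Send each rank the tokens routed to the experts it hosts.

    local_tokens: [num_local_tokens, hidden], already sorted by destination rank.
    tokens_per_rank: how many of our tokens go to each rank (from the router).
    """
    # Exchange counts first so every rank can size its receive buffer.
    send_counts = torch.tensor(tokens_per_rank, dtype=torch.int64,
                               device=local_tokens.device)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    recv_buffer = local_tokens.new_empty(int(recv_counts.sum()), local_tokens.shape[1])
    dist.all_to_all_single(
        recv_buffer, local_tokens,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=tokens_per_rank,
    )
    return recv_buffer  # tokens this rank's experts should process
```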

TensorRT-LLM provides a high-level Python LLM API, and its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. These optimizations are available today in the latest version of TensorRT-LLM. 
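
For orientation, here is a minimal example of that Python LLM API. The model identifier and sampling settings are illustrative; a production DeepSeek-R1 deployment would configure multi-GPU parallelism rather than rely on defaults.

```python
# Minimal use of the TensorRT-LLM Python LLM API.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1")  # Hugging Face ID or local checkpoint path
params = SamplingParams(max_tokens=256, temperature=0.6)

outputs = llm.generate(
    ["Explain mixture-of-experts inference in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```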

Accelerating NVIDIA HGX B200 performance with multi-token prediction and NVFP4

The NVIDIA HGX B200 platform—which connects eight Blackwell GPUs using the fifth-generation NVLink interconnect and NVLink Switch—also achieves outstanding DeepSeek-R1 inference performance for air-cooled deployments. 

Two key technologies enable large increases in DeepSeek-R1 inference performance on HGX B200. The first is multi-token prediction (MTP), which provides a significant increase in throughput across the range of interactivity levels. This is observed across all three tested input/output sequence length combinations.
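
As a rough intuition for the MTP gains, the following sketch estimates the expected tokens emitted per decoding step when d draft tokens are each accepted with an assumed, independent probability p. The acceptance model is illustrative, not measured DeepSeek-R1 behavior.

```python
# Back-of-the-envelope sketch of why multi-token prediction (MTP) raises
# throughput: one token is always produced by the base model, and each
# draft token is kept only if all earlier draft tokens in the chain were
# also kept, giving 1 + p + p^2 + ... + p^d expected tokens per step.
def expected_tokens_per_step(num_draft_tokens: int, accept_prob: float) -> float:
    return 1.0 + sum(accept_prob ** k for k in range(1, num_draft_tokens + 1))

for d in (1, 2, 3):
    print(d, round(expected_tokens_per_step(d, accept_prob=0.8), 2))
```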

[Chart: token throughput per GPU vs. per-user interactivity; the curves shift up and to the right moving from FP8 without MTP, to FP8 with MTP, to NVFP4 with MTP]
Figure 3. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 1K/1K sequence length and aggregated serving.

The second is the use of NVFP4, which takes full advantage of the compute capabilities of the Blackwell GPU to boost performance while preserving accuracy.

[Chart: token throughput per GPU vs. per-user interactivity; the curves shift up and to the right moving from FP8 without MTP, to FP8 with MTP, to NVFP4 with MTP]
Figure 4. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 8K/1K sequence length and aggregated serving.

NVFP4 is enabled across the full NVIDIA software stack, including TensorRT-LLM and NVIDIA TensorRT Model Optimizer, to ensure both high performance and preservation of accuracy. That enables yet another large throughput boost at a given interactivity level, and once again makes even higher interactivity levels possible on the same HGX B200 platform.
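
For reference, post-training quantization with TensorRT Model Optimizer follows a quantize-with-calibration pattern along these lines. This is a hedged sketch: the exact NVFP4 configuration name, the illustrative checkpoint, and the tiny calibration loop are assumptions to verify against the installed modelopt release.

```python
# Hedged sketch of post-training quantization with TensorRT Model Optimizer
# (modelopt). mtq.quantize(model, config, forward_loop) is the library's
# standard entry point; the NVFP4 config name below is an assumption --
# check it against the modelopt release you have installed.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-R1"  # illustrative checkpoint; any causal LM works
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto",
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

def calibrate(m):
    # Push a small amount of representative text through the model so the
    # quantizer can collect activation statistics for scale selection.
    batch = tokenizer("Sample calibration text for NVFP4 quantization.",
                      return_tensors="pt")
    m(**batch)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)  # assumed config name
```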

[Chart: token throughput per GPU vs. per-user interactivity; the curves shift up and to the right moving from FP8 without MTP, to FP8 with MTP, to NVFP4 with MTP]
Figure 5. Throughput-versus-interactivity curves across FP8 without MTP, FP8 with MTP, and NVFP4 with MTP on HGX B200, with 1K/8K sequence length and aggregated serving.

By leveraging the full capabilities of the NVIDIA Blackwell platform, LLM deployments can serve more users and deliver significantly better experiences to each of them.

Delivering continuous performance gains

Through relentless optimization, NVIDIA continues to deliver higher performance across the entire technology stack. This drives up token throughput on the full range of AI models, both through an annual product cadence and through continued workload optimization that delivers more performance and value from existing products. 

The NVIDIA Blackwell architecture delivers industry-leading inference performance, and with the latest software innovations in TensorRT-LLM, NVIDIA is delivering yet another big inference boost for customers, partners, and the AI ecosystem at large. 

Please visit the NVIDIA Data Center Deep Learning Product Performance page to learn more about the industry-leading performance delivered by the NVIDIA full-stack platform. 
