NVIDIA TensorRT and Triton: Production LLM Inference
Deploying LLMs at scale requires more than loading a model and serving requests. Production systems must handle concurrent users, minimize latency, maximize GPU utilization, and provide observability. NVIDIA’s TensorRT-LLM and Triton Inference Server form a battle-tested stack for these requirements.
TensorRT-LLM became fully open-source in March 2025 and now delivers over 40,000 tokens per second on Blackwell B200 GPUs running Llama 4[^1]. The stack integrates with NVIDIA Dynamo for datacenter-scale orchestration across thousands of GPUs[^2].
This guide covers the complete deployment pipeline: building optimized TensorRT engines, configuring Triton for LLM workloads, and tuning for production performance.
TensorRT-LLM overview
TensorRT-LLM is NVIDIA’s open-source library for compiling and running large language models on NVIDIA GPUs. Built on PyTorch, it provides a Python API for model definition while generating highly optimized CUDA kernels for inference[^3]. The current release (January 2026) uses PyTorch 2.9.0, TensorRT 10.9, and CUDA 12.8.1[^4].
The library handles several optimization categories:
Quantization: TensorRT-LLM supports NVFP4, FP8, INT4 AWQ, and INT8 SmoothQuant quantization formats. NVFP4 is a 4-bit floating-point format introduced with Blackwell GPUs that reduces KV cache memory by 50% compared to FP8 while maintaining less than 1% accuracy loss on benchmarks including LiveCodeBench, MMLU-PRO, and MBPP[^5]. FP8 on Hopper and Blackwell architectures delivers 2.5-3x inference speed improvements compared to FP16[^6].
Kernel fusion: Multiple transformer operations combine into single CUDA kernels. LayerNorm, matrix multiplications, bias additions, and activation functions execute together instead of requiring separate kernel launches and memory transfers[^7].
Attention optimizations: Custom FlashAttention kernels, multi-head/multi-query/grouped-query attention support, and fused attention implementations reduce memory bandwidth requirements.
Parallelism: Tensor parallelism splits matrix multiplications across GPUs. Pipeline parallelism distributes model layers sequentially across devices. Wide expert parallelism (EP) enables efficient Mixture-of-Experts inference on models like DeepSeek-R1 and Llama 4[^8]. All strategies enable serving models larger than single-GPU memory[^9].
Speculative decoding: TensorRT-LLM supports EAGLE-3, multi-token prediction (MTP), and ReDrafter techniques for accelerated token generation. EAGLE-3 with speculative decoding achieves up to 4x speedup on Llama 4 Maverick[^10]. ReDrafter, developed by Apple and integrated into TensorRT-LLM, achieves up to 2.7x throughput improvements on H100 GPUs[^11].
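How much speculative decoding helps depends almost entirely on how often the target model accepts drafted tokens. A back-of-the-envelope sketch, using the standard expected-tokens-per-verification-step formula from the speculative sampling literature and assuming an i.i.d. per-token acceptance rate while ignoring the draft model's own cost (so real speedups land lower):
# Expected tokens accepted per target-model step with draft length k and
# per-token acceptance rate a: (1 - a**(k + 1)) / (1 - a).
def expected_tokens_per_step(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.8, 0.9):
    print(f"acceptance {a:.1f}: "
          f"k=3 -> {expected_tokens_per_step(a, 3):.2f} tokens/step, "
          f"k=5 -> {expected_tokens_per_step(a, 5):.2f} tokens/step")
At an 80% acceptance rate, a draft length of 3 already yields roughly 3 tokens per verification step; returns diminish as draft length grows, which is why acceptance rate matters more than draft depth.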
Building TensorRT engines for LLMs
TensorRT-LLM converts model weights into optimized TensorRT engines through a two-step process: checkpoint conversion and engine building.
Step 1: Quantize and convert checkpoints
The quantize.py script converts Hugging Face checkpoints to TensorRT-LLM format while applying quantization:
# NVFP4 quantization (Blackwell GPUs - recommended for B200/GB200)
python examples/quantization/quantize.py \
--model_dir /path/to/llama-70b \
--qformat nvfp4 \
--kv_cache_dtype nvfp4 \
--output_dir /output/llama-70b-nvfp4 \
--tp_size 4
# FP8 quantization (Hopper/Ada/Blackwell GPUs)
python examples/quantization/quantize.py \
--model_dir /path/to/llama-70b \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir /output/llama-70b-fp8 \
--tp_size 4
# INT4 AWQ quantization (all GPU generations)
python examples/quantization/quantize.py \
--model_dir /path/to/llama-70b \
--qformat int4_awq \
--awq_block_size 64 \
--output_dir /output/llama-70b-int4awq \
--tp_size 4
The --tp_size parameter specifies tensor parallelism degree. Set this to the number of GPUs you will use for inference. INT4 AWQ uses block-wise quantization; smaller block sizes (64 vs 128) provide better accuracy at marginal compute cost[^12]. NVFP4 uses block-wise quantization with size 16 and FP8 scaling factors for higher precision during dequantization[^5].
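Before picking a format, a quick weight-memory estimate helps with GPU sizing. A rough sketch for a 70B-parameter model (weights only, ignoring the KV cache, activations, and the small per-block scale overhead of the 4-bit formats):
# Weight-only memory estimate per quantization format; illustrative only.
PARAMS = 70e9                                     # 70B parameters
bytes_per_param = {"fp16": 2.0, "fp8": 1.0, "int4_awq": 0.5, "nvfp4": 0.5}
for fmt, b in bytes_per_param.items():
    total_gb = PARAMS * b / 1e9
    per_gpu_gb = total_gb / 4                     # --tp_size 4 shards weights
    print(f"{fmt:9s} ~{total_gb:5.0f} GB total, ~{per_gpu_gb:4.0f} GB per GPU")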
Step 2: Build the engine
The trtllm-build command compiles checkpoints into TensorRT engines:
trtllm-build \
--checkpoint_dir /output/llama-70b-fp8 \
--output_dir /engines/llama-70b-fp8 \
--gemm_plugin fp8 \
--gpt_attention_plugin fp8 \
--max_batch_size 64 \
--max_input_len 4096 \
--max_seq_len 8192 \
--use_fused_mlp enable \
--workers 4
Key build parameters:
| Parameter | Purpose |
|---|---|
| --gemm_plugin | Enables cuBLASLt for optimized matrix operations |
| --gpt_attention_plugin | Uses efficient attention kernels with in-place KV cache updates |
| --use_fused_mlp | Enables horizontal fusion in GatedMLP layers |
| --max_batch_size | Maximum concurrent sequences the engine supports |
| --max_input_len | Maximum input context length |
| --max_seq_len | Maximum total sequence length (input + output) |
For FP8 on Hopper GPUs, the GEMM + SwiGLU fusion in Gated-MLP combines two MatMul operations and one SwiGLU operation into a single kernel[^13].
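After a build it is worth confirming what actually got baked into the engine. trtllm-build writes a config.json next to the engine files; the key layout varies between releases, so the sketch below simply looks for the common build settings and falls back to the top level if build_config is absent:
import json
import pathlib

# Inspect the build settings recorded alongside the compiled engine.
cfg = json.loads(pathlib.Path("/engines/llama-70b-fp8/config.json").read_text())
build = cfg.get("build_config", cfg)
for key in ("max_batch_size", "max_input_len", "max_seq_len"):
    print(key, "=", build.get(key))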
Layer fusion patterns
TensorRT’s compiler identifies and fuses operation sequences automatically. Common fusion patterns include:
- RMSNorm fusion: Combines normalization with subsequent quantization
- Attention fusion: Merges Q/K/V projections with attention computation
- MLP fusion: Fuses gate and up projections with activation functions
- AllReduce fusion: Combines reduction with LayerNorm after multi-GPU communication
The --reduce_fusion enable flag eliminates extra copies from local buffers to shared buffers in communication kernels[^14].
Disaggregated serving
Disaggregated serving separates compute-intensive prefill operations from memory-bound decode operations onto specialized hardware clusters. This architecture addresses the fundamental mismatch between prefill (compute-bound, benefits from high FLOPS) and decode (memory-bound, benefits from high memory bandwidth) phases[^15].
TensorRT-LLM supports three disaggregated serving approaches:
trtllm-serve: A command-line utility that deploys OpenAI-compatible servers for each context and generation instance, with an orchestrator coordinating requests.
Dynamo integration: NVIDIA Dynamo orchestrates requests across prefill and decode workers with KV-cache-aware routing. The smart router determines optimal decode workers based on KV cache block availability[^2].
NIXL acceleration: The NVIDIA Inference Exchange Library (NIXL) accelerates KV cache transfer between GPUs with low-latency communication primitives.
Disaggregated architectures demonstrate up to 6.4x throughput improvements and 20x reduction in latency variance. Organizations report 15-40% infrastructure cost reductions through optimized hardware allocation[^16].
In-flight batching and paged attention
Traditional batching waits until a batch fills before processing. In-flight batching (also called continuous batching or iteration-level batching) processes new requests immediately without waiting for existing sequences to complete[^17].
How in-flight batching works
The TensorRT-LLM scheduler manages two phases:
- Context phase: Processing input prompts, computing initial KV cache entries
- Generation phase: Autoregressive token generation using cached keys/values
With in-flight batching, sequences in context phase process together with sequences in generation phase. When a generation sequence finishes, a new context sequence can immediately take its slot. This keeps GPU utilization high even with variable-length inputs and outputs.
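The scheduling idea is easier to see in a toy simulation than in prose. The sketch below is purely illustrative (it is not the TRT-LLM scheduler): each iteration, every active sequence produces one token, finished sequences free their slots, and queued requests are admitted immediately instead of waiting for the whole batch to drain:
import random
from collections import deque

MAX_SLOTS = 4                        # concurrent sequences the "engine" can hold
queue = deque(range(1, 11))          # 10 pending request IDs
active = {}                          # request_id -> tokens still to generate
random.seed(0)

step = 0
while queue or active:
    step += 1
    # Admit queued requests into any free slots (the "in-flight" part)
    while queue and len(active) < MAX_SLOTS:
        active[queue.popleft()] = random.randint(3, 8)
    # One decode iteration: every active sequence emits one token
    for rid in list(active):
        active[rid] -= 1
        if active[rid] == 0:
            del active[rid]          # slot freed mid-batch; refilled next step
    print(f"step {step:2d}: active={sorted(active)} queued={len(queue)}")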
Paged KV cache
Instead of allocating KV cache as contiguous memory blocks, paged attention splits the cache into smaller blocks that can be allocated non-contiguously[^18]. This approach:
- Eliminates memory fragmentation from variable sequence lengths
- Enables KV cache sharing between requests with common prefixes
- Supports block sizes of 8, 16, 32, 64, or 128 tokens per block
The paged KV cache allocates memory upfront in TensorRT-LLM rather than on-demand. Plan for approximately 60% additional VRAM beyond model weights for KV cache allocation[^19].
Chunked prefill
Long input contexts can delay generation for concurrent requests. Chunked prefill divides the context phase into smaller chunks, enabling better parallelization with decode operations[^20].
This feature allows GPU systems to handle longer contexts and higher concurrency by decoupling memory consumption from context length.
Triton Inference Server architecture
Triton provides the serving infrastructure around TensorRT-LLM engines. It handles request routing, model loading, batching, and metrics collection. The OpenAI-compatible frontend transitioned from beta to stable in 2025, enabling drop-in compatibility with OpenAI API clients[^21].
Core components
Model Repository: A directory structure containing model configurations and artifacts. Each model has a config.pbtxt file defining inputs, outputs, and backend settings.
Scheduler: Routes incoming requests to model instances. For LLMs, the scheduler typically delegates batching decisions to the TensorRT-LLM runtime.
Backend Manager: Loads and manages model backends. The TensorRT-LLM backend uses the C++ Executor API for optimal performance.
Metrics Server: Exposes Prometheus-compatible metrics on port 8002 by default.
Security note: Organizations should update to Triton version 25.07 or later to address CVE-2025-23319, CVE-2025-23320, and CVE-2025-23334, which when chained could allow unauthenticated remote code execution[^22].
Model ensemble structure
The TensorRT-LLM backend uses an ensemble of three models:
model_repository/
├── ensemble/
│ └── config.pbtxt
├── preprocessing/
│ ├── config.pbtxt
│ └── 1/model.py
├── tensorrt_llm/
│ ├── config.pbtxt
│ └── 1/
│ └── [engine files]
└── postprocessing/
├── config.pbtxt
└── 1/model.py
- Preprocessing: Tokenizes input text using the model’s tokenizer
- TensorRT-LLM: Runs inference on the optimized engine
- Postprocessing: Detokenizes output IDs back to text
The ensemble model chains these together automatically.
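A quick way to exercise the full preprocessing → inference → postprocessing chain is Triton's HTTP generate endpoint. A minimal sketch; the field names follow the default ensemble templates (text_input, max_tokens), so adjust them if your config.pbtxt differs:
import requests

# One round trip through the ensemble via the HTTP generate endpoint.
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={"text_input": "What is in-flight batching?", "max_tokens": 64},
    timeout=60,
)
print(resp.json()["text_output"])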
Configuring Triton for LLMs
tensorrt_llm model configuration
The core configuration for the TensorRT-LLM backend:
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 64
model_transaction_policy {
decoupled: true
}
input [
{ name: "input_ids" data_type: TYPE_INT32 dims: [-1] },
{ name: "input_lengths" data_type: TYPE_INT32 dims: [1] },
{ name: "request_output_len" data_type: TYPE_INT32 dims: [1] },
{ name: "end_id" data_type: TYPE_INT32 dims: [1] },
{ name: "pad_id" data_type: TYPE_INT32 dims: [1] },
{ name: "stream" data_type: TYPE_BOOL dims: [1] optional: true }
]
output [
{ name: "output_ids" data_type: TYPE_INT32 dims: [-1] },
{ name: "sequence_length" data_type: TYPE_INT32 dims: [1] }
]
parameters {
key: "gpt_model_path"
value: { string_value: "/engines/llama-70b-fp8" }
}
parameters {
key: "batching_type"
value: { string_value: "inflight_fused_batching" }
}
parameters {
key: "max_tokens_in_paged_kv_cache"
value: { string_value: "2048000" }
}
parameters {
key: "kv_cache_free_gpu_mem_fraction"
value: { string_value: "0.85" }
}
Key parameters explained:
| Parameter | Value | Purpose |
|---|---|---|
| decoupled: true | Required for streaming | Enables async responses |
| batching_type | inflight_fused_batching | Enables continuous batching |
| max_tokens_in_paged_kv_cache | Token count | Total KV cache capacity |
| kv_cache_free_gpu_mem_fraction | 0.0-1.0 | Fraction of free GPU memory for KV cache |
Instance configuration
The count field in instance_group controls how many model instances execute concurrently. For most workloads, set this to 5 or higher[^23]:
instance_group [
{
count: 5
kind: KIND_GPU
gpus: [0, 1, 2, 3]
}
]
Queue management
Control request queuing behavior:
parameters {
key: "max_queue_delay_microseconds"
value: { string_value: "100000" }
}
parameters {
key: "max_queue_size"
value: { string_value: "256" }
}
Setting max_queue_delay_microseconds above 0 improves batch formation by waiting for additional requests to arrive.
TensorRT-LLM backend setup walkthrough
1. Pull the container
docker pull nvcr.io/nvidia/tritonserver:25.10-trtllm-python-py3
2. Create model repository
# Clone the backend repository
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
# Copy template models
mkdir -p /triton_models
cp -r all_models/inflight_batcher_llm/* /triton_models/
3. Build the TensorRT-LLM engine
Follow the engine building steps from earlier, placing the output in /engines/.
4. Configure the models
Use the provided template filling script:
ENGINE_DIR=/engines/llama-70b-fp8
TOKENIZER_DIR=/models/llama-70b
MAX_BATCH=64
python3 tools/fill_template.py -i /triton_models/preprocessing/config.pbtxt \
tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${MAX_BATCH}
python3 tools/fill_template.py -i /triton_models/tensorrt_llm/config.pbtxt \
triton_backend:tensorrtllm,triton_max_batch_size:${MAX_BATCH},\
decoupled_mode:true,engine_dir:${ENGINE_DIR},\
batching_type:inflight_fused_batching
python3 tools/fill_template.py -i /triton_models/postprocessing/config.pbtxt \
tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${MAX_BATCH}
python3 tools/fill_template.py -i /triton_models/ensemble/config.pbtxt \
triton_max_batch_size:${MAX_BATCH}
5. Launch the server
python3 scripts/launch_triton_server.py \
--model_repo /triton_models \
--world_size 4
# Or directly with tritonserver (single-GPU engines only; multi-GPU engines
# require the MPI-based launch script above)
tritonserver \
--model-repository=/triton_models \
--http-port=8000 \
--grpc-port=8001 \
--metrics-port=8002
6. Verify deployment
curl -v localhost:8000/v2/health/ready
# HTTP 200 with an empty body indicates the server is ready
curl -X POST localhost:8000/v2/repository/index
# Lists all loaded models with their state (READY)
Performance tuning
Concurrency and batch size
The relationship between max_batch_size, engine build parameters, and runtime behavior:
- Engine max_batch_size: Hard limit set at build time
- Triton max_batch_size: Soft limit for request acceptance
- Runtime batch size: Determined by the TRT-LLM scheduler based on available requests and memory
The TRT-LLM scheduler can form batches larger than Triton’s max_batch_size when using in-flight batching. The scheduler optimizes based on available KV cache memory and pending requests[^24].
GPU memory allocation
Balance KV cache size against model memory requirements:
# Worst-case KV cache estimate (full multi-head attention, FP16 cache)
max_batch_size, max_seq_len = 64, 8192
num_layers, hidden_size, dtype_bytes = 80, 8192, 2
kv_cache_tokens = max_batch_size * max_seq_len
kv_cache_bytes = kv_cache_tokens * num_layers * 2 * hidden_size * dtype_bytes
print(f"{kv_cache_bytes / 1e12:.2f} TB")   # ~1.37 TB
For a 70B model with 80 layers, 8192 hidden size, and an FP16 KV cache:
- 64 batch * 8192 seq_len = 524,288 tokens
- 524,288 tokens * 80 layers * 2 (K and V) * 8192 * 2 bytes ≈ 1.37 TB, far beyond GPU memory
Use kv_cache_free_gpu_mem_fraction to limit allocation to available memory. Start with 0.85 and adjust based on OOM behavior.
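The estimate above assumes every attention head stores its own KV entries. Llama 3.1 70B uses grouped-query attention (8 KV heads of dimension 128), and an FP8 KV cache halves the element size again, so the realistic budget looks very different. A sketch, assuming roughly 70 GB of FP8 weights spread across 4x H100 80GB:
# Per-token KV cache for Llama 3.1 70B with grouped-query attention
num_layers, num_kv_heads, head_dim = 80, 8, 128
dtype_bytes = 1                                  # FP8 KV cache
bytes_per_token = num_layers * 2 * num_kv_heads * head_dim * dtype_bytes
print(bytes_per_token // 1024, "KiB per token")  # 160 KiB

# Rough budget: 4x 80 GB minus ~70 GB of FP8 weights, fraction 0.85
free_bytes = (4 * 80 - 70) * 1e9
kv_budget = 0.85 * free_bytes
print(int(kv_budget // bytes_per_token), "tokens of KV cache")  # ~1.3M tokens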
CUDA Graphs
TensorRT-LLM uses CUDA Graphs to reduce kernel launch overhead. The runtime captures execution graphs for common batch sizes and replays them without CPU intervention.
CUDA Graph padding handles mismatched batch sizes by padding to the nearest captured graph size. This trades minor compute waste for consistent low-latency execution[^25].
Multi-GPU configuration
For tensor parallelism across GPUs:
# 4-way tensor parallelism
trtllm-build --tp_size 4 ...
# Launch with matching world size
python3 launch_triton_server.py --world_size 4
The backend uses MPI to coordinate execution across GPUs. Leader mode uses one process to manage inference; Orchestrator mode distributes coordination[^26].
Monitoring and metrics
Prometheus metrics endpoint
Triton exposes metrics at http://localhost:8002/metrics by default. Enable with:
tritonserver --allow-metrics=true --allow-gpu-metrics=true
Key metrics for LLM workloads
The TensorRT-LLM backend exposes custom metrics:
| Metric | Description |
|---|---|
| nv_trt_llm_request_count | Total inference requests |
| nv_trt_llm_inflight_request_count | Currently processing requests |
| nv_trt_llm_kv_cache_block_usage | KV cache utilization |
| nv_trt_llm_generation_tokens_per_second | Output throughput |
Standard Triton metrics include:
| Metric | Description |
|---|---|
| nv_inference_request_success | Successful request count |
| nv_inference_compute_output_duration_us | Inference latency |
| nv_gpu_utilization | GPU compute utilization |
| nv_gpu_memory_used_bytes | GPU memory consumption |
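For quick checks without Grafana, the endpoint can be scraped directly. A minimal sketch; the metric names follow the tables above and may differ slightly between backend versions:
import requests

# Pull the Prometheus text format and print a few metrics of interest.
text = requests.get("http://localhost:8002/metrics", timeout=5).text
watch = ("nv_gpu_utilization", "nv_inference_request_success",
         "nv_trt_llm_inflight_request_count")
for line in text.splitlines():
    if line.startswith(watch):
        print(line)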
Grafana dashboard setup
NVIDIA provides pre-built Grafana dashboards in the tutorials repository[^27]. Import the JSON configuration:
# Download dashboard JSON
curl -O https://raw.githubusercontent.com/triton-inference-server/tutorials/main/\
Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/\
grafana_inference-metrics_dashboard.json
Configure Prometheus as a data source in Grafana, then import the dashboard JSON.
GenAI-Perf benchmarking
Measure throughput and latency with GenAI-Perf:
genai-perf profile \
-m ensemble \
--service-kind triton \
--backend tensorrtllm \
-u localhost:8001 \
--streaming \
--concurrency 32 \
--synthetic-input-tokens-mean 512 \
--output-tokens-mean 128
The tool reports time-to-first-token, inter-token latency, throughput, and percentile statistics[^28].
Comparison with alternatives
TensorRT-LLM
Best for maximum throughput when you can invest in setup complexity. January 2026 benchmarks show B200 delivering 60,000 tokens per second per GPU while sustaining 1,000 tokens per second per user on Llama 3.3 70B[^29]. On Llama 4 Scout, B200 achieves over 42,000 tokens per second, a 3.4x performance increase over H200[^1].
Strengths: Peak throughput, Tensor Core optimization, deep NVIDIA integration, EAGLE-3 speculative decoding, NVFP4 quantization
Trade-offs: Longer setup time (1-2 weeks for production tuning), engine rebuilds required for config changes
vLLM
Best for teams prioritizing development velocity and concurrency handling. PagedAttention provides a GPU-friendly memory layout without fragmentation[^30]. vLLM v0.14.1 (January 2026) adds W4A8 grouped GEMM on Hopper, MoE + LoRA support with AWQ Marlin, and Transformers v5 RoPE compatibility[^31].
Inferact Inc. launched in January 2026 to commercialize vLLM at an 800M valuation[^32]. The vLLM project now runs on over 400,000 GPUs worldwide.
Strengths: Easy setup (1-2 days), consistent latency under load, active community, broad hardware support (NVIDIA, AMD, Intel, TPU)
Trade-offs: Lower peak throughput than TensorRT-LLM on Blackwell GPUs
SGLang
SGLang has become a production standard, generating trillions of tokens daily across deployments at xAI, AMD, NVIDIA, LinkedIn, Cursor, and major cloud providers[^33]. On H100 hardware, SGLang achieves 16,215 tokens per second, matching LMDeploy and exceeding vLLM by 29%[^34].
Strengths: RadixAttention for prefix caching (50%+ hit rates in production), zero-overhead CPU scheduler, prefill-decode disaggregation, native C++ optimization
Trade-offs: Smaller ecosystem than vLLM
Text-Generation-Inference v3
TGI is now in maintenance mode, with Hugging Face recommending vLLM or SGLang for new deployments[^35]. TGI v3 achieves 13x speedups on prompts exceeding 200,000 tokens by reusing earlier token computations[^36].
Strengths: Prefix caching, long context handling, Hugging Face ecosystem, zero-config mode
Trade-offs: Maintenance mode limits new feature development
Decision matrix
| Requirement | Recommended Engine |
|---|---|
| Maximum throughput on Blackwell | TensorRT-LLM |
| Fast iteration cycles | vLLM or SGLang |
| High concurrency, consistent latency | vLLM |
| Prefix caching in production | SGLang |
| Long chat histories (200K+ tokens) | TGI v3 |
| Kubernetes-native deployment | Triton + TensorRT-LLM |
| Datacenter-scale orchestration | NVIDIA Dynamo |
| Multi-vendor hardware support | vLLM |
Complete deployment example
End-to-end deployment of Llama 3.1 70B on 4x H100 80GB:
Environment setup
# Pull containers
docker pull nvcr.io/nvidia/tritonserver:25.10-trtllm-python-py3
docker pull nvcr.io/nvidia/pytorch:25.10-py3
# Download model
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
--local-dir /models/llama-3.1-70b
Build engine
docker run --gpus all -v /models:/models -v /engines:/engines \
nvcr.io/nvidia/pytorch:25.10-py3 bash -c "
pip install tensorrt-llm
# quantize.py ships in the TensorRT-LLM repo
git clone https://github.com/NVIDIA/TensorRT-LLM.git
# Quantize to FP8
python TensorRT-LLM/examples/quantization/quantize.py \
--model_dir /models/llama-3.1-70b \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir /engines/llama-70b-ckpt \
--tp_size 4
# Build engine
trtllm-build \
--checkpoint_dir /engines/llama-70b-ckpt \
--output_dir /engines/llama-70b-engine \
--gemm_plugin fp8 \
--gpt_attention_plugin fp8 \
--max_batch_size 64 \
--max_input_len 4096 \
--max_seq_len 8192 \
--use_fused_mlp enable \
--workers 4
"
Deploy with Triton
docker run --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /engines:/engines \
-v /models:/models \
-v /triton_models:/triton_models \
nvcr.io/nvidia/tritonserver:25.10-trtllm-python-py3 bash -c "
# Configure models (using fill_template.py as shown earlier)
# ...
tritonserver \
--model-repository=/triton_models \
--allow-metrics=true \
--allow-gpu-metrics=true
"
Test inference
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
# Prepare inputs. The ensemble tokenizes internally, so raw text is sent;
# input names follow the default inflight_batcher_llm templates.
prompt = "Explain the attention mechanism in transformers:"
inputs = [
    grpcclient.InferInput("text_input", [1, 1], "BYTES"),
    grpcclient.InferInput("max_tokens", [1, 1], "INT32"),
    grpcclient.InferInput("bad_words", [1, 1], "BYTES"),
    grpcclient.InferInput("stop_words", [1, 1], "BYTES"),
    grpcclient.InferInput("stream", [1, 1], "BOOL"),
]
inputs[0].set_data_from_numpy(np.array([[prompt]], dtype=object))
inputs[1].set_data_from_numpy(np.array([[256]], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([[""]], dtype=object))
inputs[3].set_data_from_numpy(np.array([[""]], dtype=object))
inputs[4].set_data_from_numpy(np.array([[False]], dtype=bool))
# Single blocking request through the ensemble
result = client.infer("ensemble", inputs)
print(result.as_numpy("text_output").flatten()[0].decode())
Blackwell GPU deployment
NVIDIA Blackwell architecture (B200, GB200, GB300) delivers substantial inference improvements over Hopper:
| Metric | B200 vs H200 |
|---|---|
| Tokens per second (Llama 3.3 70B) | 4x higher throughput[^29] |
| Tokens per second (Llama 4 Scout) | 3.4x improvement[^1] |
| DeepSeek-R1 inference | 5x vs Hopper per GPU[^37] |
| Memory per GPU | Up to 288GB HBM3e (Blackwell Ultra) |
The GB200 NVL72 rack-scale platform connects 72 Blackwell GPUs using fifth-generation NVLink with 1,800 GB/s bidirectional bandwidth between all chips[^38]. Blackwell Ultra (GB300) delivers 45% higher DeepSeek-R1 throughput than GB200 with 1.5x more NVFP4 compute and 2x more attention-layer acceleration[^37].
Key Blackwell optimizations in TensorRT-LLM:
- NVFP4 quantization for weights and KV cache
- Programmatic dependent launch (PDL) for reduced kernel latencies
- Enhanced all-to-all communication primitives
- CuteDSL NVFP4 grouped GEMM integration
The January 2026 TensorRT-LLM optimizations delivered up to 2.8x throughput increases per Blackwell GPU over the previous three months[^39].
Summary
TensorRT-LLM and Triton Inference Server provide production-grade LLM serving with:
- Engine optimization: NVFP4/FP8 quantization, kernel fusion, custom attention kernels, and EAGLE-3 speculative decoding
- Efficient batching: In-flight batching and paged KV cache for high throughput
- Disaggregated serving: Separate prefill and decode phases for optimized resource utilization
- Flexible deployment: Multi-GPU, multi-node support with MPI coordination and NVIDIA Dynamo orchestration
- Observability: Prometheus metrics integration for monitoring
The setup complexity is higher than alternatives like vLLM or SGLang, but peak performance on Blackwell GPUs is also higher. For teams already in the NVIDIA ecosystem deploying at scale, this stack delivers consistent, optimized inference. Organizations prioritizing faster iteration should evaluate vLLM (broad hardware support, active development) or SGLang (production-proven prefix caching).
References
Footnotes
[^1]: NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick
[^5]: Introducing NVFP4 for Efficient Low-Precision Inference
[^8]: Delivering Massive Performance Leaps for MoE Inference on Blackwell
[^10]: Blackwell Breaks 1,000 TPS/User Barrier with Llama 4 Maverick
[^11]: TensorRT-LLM Supports Recurrent Drafting for LLM Inference
[^37]: NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf