Releases · NVIDIA/Model-Optimizer
ModelOpt 0.41.0 Release
Bug Fixes
- Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).
New Features
- Add support for Transformer Engine quantization for Megatron Core models.
- Add support for Qwen3-Next model quantization.
- Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
- Add KV cache quantization support to the vLLM FakeQuant PTQ script. See examples/vllm_serve/README.md for more details.
- Add support for subgraphs in ONNX autocast.
- Add support for parallel draft heads in Eagle speculative decoding.
- Add support for enabling a custom emulated quantization backend. See register_quant_backend for more details and tests/unit/torch/quantization/test_custom_backend.py for an example; a hedged sketch also follows this list.
- Add examples/llm_qad for QAD training with Megatron-LM.
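The custom backend hook above is mainly useful for emulating a numerical format that ModelOpt does not ship. The sketch below is illustrative only: the fake-quant function is a generic symmetric INT8 emulation, and the commented-out registration call is an assumption about how register_quant_backend is invoked; consult its API docs and the referenced unit test for the actual interface.

```python
import torch
import modelopt.torch.quantization as mtq  # where backend registration is assumed to live


def my_emulated_int8(tensor: torch.Tensor, amax: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Toy symmetric fake-quant standing in for a custom emulated backend kernel."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    return (tensor / scale).round().clamp(-qmax - 1, qmax) * scale


# Registration call and argument order are assumptions, not the documented API:
# mtq.register_quant_backend("my_emulated_int8", my_emulated_int8)
```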
Deprecations
- Deprecate the num_query_groups parameter in Minitron pruning (mcore_minitron). Use ModelOpt 0.40.0 or earlier if you still need to prune it.
Backward Breaking Changes
- Remove torchprofile as a default dependency of ModelOpt, since it is used only for FLOPs-based FastNAS pruning (computer vision models). It can be installed separately if needed.
0.41.0rc3
0.41.0rc2
0.41.0rc1
ModelOpt 0.40.0 Release
Bug Fixes
- Fix a bug in FastNAS pruning (computer vision models) where the model parameters were sorted twice, messing up the ordering.
- Fix Q/DQ/Cast node placement for tensors that must remain in FP32 within custom ops in the ONNX quantization workflow.
New Features
- Add MoE (e.g. Qwen3-30B-A3B, gpt-oss-20b) pruning support for the num_moe_experts, moe_ffn_hidden_size, and moe_shared_expert_intermediate_size parameters in Minitron pruning (mcore_minitron).
- Add a specdec_bench example to benchmark speculative decoding performance. See examples/specdec_bench/README.md for more details.
- Add FP8/NVFP4 KV cache quantization support for Megatron Core models.
- Add a KL divergence loss-based auto_quantize method. See the auto_quantize API docs for more details; a hedged usage sketch follows this list.
- Add support for saving and resuming auto_quantize search state. This speeds up the auto_quantize process by skipping the score estimation step if the search state is provided.
- Add the trt_plugins_precision flag in ONNX autocast to indicate custom op precision, similar to the existing flag in the quantization workflow.
- Add support for PyTorch Geometric quantization.
- Add per-tensor and per-channel MSE calibrator support.
- Add support for PTQ/QAT checkpoint export and loading for running fakequant evaluation in vLLM. See examples/vllm_serve/README.md for more details.
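For context on the auto_quantize additions above, here is a minimal sketch of the search flow. The toy model and data stand in for a real workload, and the keyword arguments (constraint spec, candidate formats, forward step, step counts) are assumptions drawn from the auto_quantize API docs rather than a verified signature; check the docs for your installed version.

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-ins so the sketch is self-contained.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
calib_batches = [torch.randn(4, 64) for _ in range(8)]

# Minimal sketch, not the definitive API: keyword names below are assumptions.
# auto_quantize searches per-layer quantization formats under a compression
# constraint, scoring candidates with a loss (e.g. the new KL divergence option).
# Depending on the version, a search state may also be returned for resuming.
model = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},                      # assumed constraint spec
    quantization_formats=["NVFP4_DEFAULT_CFG", "FP8_DEFAULT_CFG"],
    data_loader=calib_batches,
    forward_step=lambda m, batch: m(batch),
    num_calib_steps=8,
    num_score_steps=8,
)
```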
Documentation
- Deprecate examples/megatron-lm in favor of more detailed documentation in Megatron-LM/examples/post_training/modelopt.
Misc
- NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer. GitHub automatically redirects the old repository path (NVIDIA/TensorRT-Model-Optimizer) to the new one (NVIDIA/Model-Optimizer). The documentation URL has also changed to nvidia.github.io/Model-Optimizer.
- Bump the TensorRT-LLM test docker image to 1.2.0rc4.
- Bump minimum recommended transformers version to 4.53.
- Replace the ONNX simplification package onnxsim with onnxslim.
ModelOpt 0.39.0 Release
Deprecations
- Deprecated the modelopt.torch._deploy.utils.get_onnx_bytes API. Please use modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata instead to access the ONNX model bytes with external data. See examples/onnx_ptq/download_example_onnx.py for example usage.
New Features
- Added the op_types_to_exclude_fp16 flag in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating 'fp32' precision in trt_plugins_precision.
- Added LoRA mode support for MCore in a new peft submodule: modelopt.torch.peft.update_model(model, LORA_CFG). A hedged usage sketch follows this list.
- Supported PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See examples/vllm_serve for more details.
- Added support for nemotron-post-training-dataset-v2 and nemotron-post-training-dataset-v1 in examples/llm_ptq. Defaults to a mix of cnn_dailymail and nemotron-post-training-dataset-v2 (a gated dataset accessed using the HF_TOKEN environment variable) if no dataset is specified.
- Allowed specifying calib_seq in examples/llm_ptq to set the maximum sequence length for calibration.
- Added support for MCore MoE PTQ/QAT/QAD.
- Added support for multi-node PTQ and export with FSDP2 in examples/llm_ptq/multinode_ptq.py. See examples/llm_ptq/README.md for more details.
- Added support for Nemotron Nano VL v1 & v2 models in the FP8/NVFP4 PTQ workflow.
- Added the nodes_to_include and op_types_to_include flags in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
- Added support for torch.compile and benchmarking in examples/diffusers/quantization/diffusion_trt.py.
- Enabled native ModelOpt quantization support for FP8 and NVFP4 formats in SGLang. See the SGLang quantization documentation for more details.
- Added ModelOpt quantized checkpoints in vLLM/SGLang CI/CD pipelines (PRs are under review).
- Added support for exporting QLoRA checkpoints finetuned using ModelOpt.
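Following up on the new peft submodule noted above, here is a minimal sketch of attaching LoRA adapters to a Megatron Core model. The update_model call and the LORA_CFG name come from the release note itself; that LORA_CFG is importable directly from modelopt.torch.peft, and that `model` is an already-built MCore GPT model, are assumptions to verify against the peft docs.

```python
import modelopt.torch.peft as mtpf

# `model` is assumed to be an already-constructed Megatron Core GPT model;
# LORA_CFG is assumed to be exposed by the peft submodule (check its docs).
model = mtpf.update_model(model, mtpf.LORA_CFG)

# After conversion, only the LoRA adapter parameters are meant to be trained,
# so the usual fine-tuning loop can run on `model` unchanged.
```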
Documentation
- Added general guidelines for Minitron pruning and distillation. See examples/pruning/README.md for more details.
- Added example for exporting QLoRA checkpoints for vLLM deployment. Refer to examples/llm_qat/README.md for more details.
Additional Announcements
- ModelOpt will change its versioning from odd-only minor versions to consecutive versions starting with the next release. This means the next release will be named 0.40.0 instead of 0.41.0.
ModelOpt 0.37.0 Release
Deprecations
- Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM, or TensorRT docker image directly or refer to the installation guide for more details.
- Deprecated the quantize_mode argument in examples/onnx_ptq/evaluate.py to support strong typing. Use engine_precision instead.
- Deprecated TRT-LLM's TRT backend in examples/llm_ptq and examples/vlm_ptq. Support for the build and benchmark tasks is removed and replaced with quant. engine_dir is replaced with checkpoint_dir in examples/llm_ptq and examples/vlm_ptq. For performance evaluation, please use trtllm-bench directly.
- The --export_fmt flag in examples/llm_ptq is removed. By default, we export to the unified Hugging Face checkpoint format.
- Deprecated examples/vlm_eval as it depends on the deprecated TRT-LLM TRT backend.
New Features
- high_precision_dtype defaults to fp16 in ONNX quantization, i.e., quantized output model weights are now FP16 by default. A hedged sketch of overriding this default follows this list.
- Upgraded the TensorRT-LLM dependency to 1.1.0rc2.
- Support for Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in examples/vlm_ptq.
- Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
- Added a Minitron pruning example for the Megatron-LM framework. See examples/megatron-lm for more details.
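To make the high_precision_dtype change above concrete, the sketch below shows overriding the new FP16 default back to FP32 during ONNX PTQ. The entry point and parameter names (onnx_path, quantize_mode, high_precision_dtype, output_path) are assumptions based on the ONNX quantization docs; verify them for your installed version.

```python
from modelopt.onnx.quantization import quantize

# Parameter names below are assumptions; high_precision_dtype now defaults to
# "fp16", so pass "fp32" explicitly if the non-quantized weights must stay FP32.
quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    high_precision_dtype="fp32",
    output_path="model.quant.onnx",
)
```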
ModelOpt 0.35.1 Release
- Import fixes
ModelOpt 0.35.0 Release
Deprecations
- Deprecate torch<2.6 support.
- Deprecate NeMo 1.0 model support.
Bug Fixes
- Fix attention head ranking logic for pruning Megatron Core GPT models.
New Features
- ModelOpt now supports PTQ and QAT for GPT-OSS models. See examples/gpt_oss for an end-to-end PTQ/QAT example; a minimal PTQ sketch also follows this list.
- Add support for QAT with HuggingFace + DeepSpeed. See examples/gpt_oss for an example.
- Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See examples/gpt_oss for an example.
- ModelOpt provides convenient trainers such as QATTrainer, QADTrainer, KDTrainer, and QATSFTTrainer, which inherit from the corresponding Huggingface trainers and can be used as drop-in replacements for them. See usage examples in examples/gpt_oss, examples/llm_qat, or examples/llm_distill.
- (Experimental) Add quantization support for custom TensorRT ops in ONNX models.
- Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
- Add tree decoding support for Megatron Eagle models.
- For most VLMs, quantization is now explicitly disabled on the vision part, and those modules are added to excluded_modules during HF export.
- Add support for mamba_num_heads, mamba_head_dim, hidden_size, and num_layers pruning for Megatron Core Mamba or hybrid Transformer-Mamba models in mcore_minitron (previously mcore_gpt_minitron) mode.
- Add an example for QAT/QAD training with LLaMA Factory (https://github.com/hiyouga/LLaMA-Factory/tree/main). See examples/llm_qat/llama_factory for more details.
- Upgrade the TensorRT-LLM dependency to 1.0.0rc6.
- Add unified HuggingFace model export support for quantized NVFP4 GPT-OSS models.
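As a concrete starting point for the GPT-OSS PTQ/QAT support noted above, here is a minimal PTQ sketch using mtq.quantize. The tiny Sequential model and random calibration data are stand-ins for a real GPT-OSS checkpoint and dataset, and INT8_DEFAULT_CFG is just one of the available format configs; see examples/gpt_oss for the end-to-end recipe.

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-ins so the sketch runs on its own; replace with the real model/data.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
calib_data = [torch.randn(4, 64) for _ in range(8)]


def forward_loop(m):
    # Push calibration batches through the model so quantizers collect amax stats.
    for batch in calib_data:
        m(batch)


# Insert quantizers, calibrate, and return the (fake-)quantized model.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```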
ModelOpt 0.33.1 Release
Bug Fixes
- Fix a Qwen3 MoE model export issue.