Releases · NVIDIA/Model-Optimizer
ModelOpt 0.41.0 Release
Bug Fixes
- Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).
New Features
- Add support for Transformer Engine quantization for Megatron Core models.
- Add support for Qwen3-Next model quantization.
- Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
- Add KV cache quantization support to the vLLM FakeQuant PTQ script. See examples/vllm_serve/README.md for more details.
- Add support for subgraphs in ONNX autocast.
- Add support for parallel draft heads in Eagle speculative decoding.
- Add support for enabling a custom emulated quantization backend. See register_quant_backend for more details and tests/unit/torch/quantization/test_custom_backend.py for an example; a hedged sketch also follows this list.
- Add examples/llm_qad for QAD training with Megatron-LM.
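The custom backend hook above is mainly useful for emulating a numerical format that ModelOpt does not ship. The sketch below is illustrative only: the fake-quant function is a generic symmetric INT8 emulation, and the commented-out registration call is an assumption about how register_quant_backend is invoked; consult its API docs and the referenced unit test for the actual interface.

```python
import torch
import modelopt.torch.quantization as mtq  # where backend registration is assumed to live


def my_emulated_int8(tensor: torch.Tensor, amax: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Toy symmetric fake-quant standing in for a custom emulated backend kernel."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    return (tensor / scale).round().clamp(-qmax - 1, qmax) * scale


# Registration call and argument order are assumptions, not the documented API:
# mtq.register_quant_backend("my_emulated_int8", my_emulated_int8)
```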
Deprecations
- Deprecate the num_query_groups parameter in Minitron pruning (mcore_minitron). Use ModelOpt 0.40.0 or earlier if you still need to prune it.
Backward Breaking Changes
- Remove torchprofile as a default dependency of ModelOpt, since it is used only for FLOPs-based FastNAS pruning (computer vision models). It can be installed separately if needed.
0.41.0rc3
0.41.0rc2
0.41.0rc1
ModelOpt 0.40.0 Release
Bug Fixes
- Fix a bug in FastNAS pruning (computer vision models) where the model parameters were sorted twice, messing up the ordering.
- Fix Q/DQ/Cast node placement for tensors that must remain in FP32 within custom ops in the ONNX quantization workflow.
New Features
- Add MoE (e.g. Qwen3-30B-A3B, gpt-oss-20b) pruning support for the num_moe_experts, moe_ffn_hidden_size, and moe_shared_expert_intermediate_size parameters in Minitron pruning (mcore_minitron).
- Add a specdec_bench example to benchmark speculative decoding performance. See examples/specdec_bench/README.md for more details.
- Add FP8/NVFP4 KV cache quantization support for Megatron Core models.
- Add a KL divergence loss-based auto_quantize method. See the auto_quantize API docs for more details; a hedged usage sketch follows this list.
- Add support for saving and resuming auto_quantize search state. This speeds up the auto_quantize process by skipping the score estimation step if the search state is provided.
- Add the trt_plugins_precision flag in ONNX autocast to indicate custom op precision, similar to the existing flag in the quantization workflow.
- Add support for PyTorch Geometric quantization.
- Add per-tensor and per-channel MSE calibrator support.
- Add support for PTQ/QAT checkpoint export and loading for running fakequant evaluation in vLLM. See examples/vllm_serve/README.md for more details.
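For context on the auto_quantize additions above, here is a minimal sketch of the search flow. The toy model and data stand in for a real workload, and the keyword arguments (constraint spec, candidate formats, forward step, step counts) are assumptions drawn from the auto_quantize API docs rather than a verified signature; check the docs for your installed version.

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-ins so the sketch is self-contained.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
calib_batches = [torch.randn(4, 64) for _ in range(8)]

# Minimal sketch, not the definitive API: keyword names below are assumptions.
# auto_quantize searches per-layer quantization formats under a compression
# constraint, scoring candidates with a loss (e.g. the new KL divergence option).
# Depending on the version, a search state may also be returned for resuming.
model = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},                      # assumed constraint spec
    quantization_formats=["NVFP4_DEFAULT_CFG", "FP8_DEFAULT_CFG"],
    data_loader=calib_batches,
    forward_step=lambda m, batch: m(batch),
    num_calib_steps=8,
    num_score_steps=8,
)
```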
Documentation
- Deprecate examples/megatron-lm in favor of more detailed documentation in Megatron-LM/examples/post_training/modelopt.
Misc
- NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer. GitHub automatically redirects the old repository path (NVIDIA/TensorRT-Model-Optimizer) to the new one (NVIDIA/Model-Optimizer). The documentation URL has also changed to nvidia.github.io/Model-Optimizer.
- Bump the TensorRT-LLM test docker image to 1.2.0rc4.
- Bump minimum recommended transformers version to 4.53.
- Replace the ONNX simplification package onnxsim with onnxslim.
ModelOpt 0.39.0 Release
Deprecations
- Deprecated the modelopt.torch._deploy.utils.get_onnx_bytes API. Please use modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata instead to access the ONNX model bytes with external data. See examples/onnx_ptq/download_example_onnx.py for example usage.
New Features
- Added the op_types_to_exclude_fp16 flag in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating 'fp32' precision in trt_plugins_precision.
- Added LoRA mode support for MCore in a new peft submodule: modelopt.torch.peft.update_model(model, LORA_CFG). A hedged usage sketch follows this list.
- Supported PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See examples/vllm_serve for more details.
- Added support for nemotron-post-training-dataset-v2 and nemotron-post-training-dataset-v1 in examples/llm_ptq. Defaults to a mix of cnn_dailymail and nemotron-post-training-dataset-v2 (a gated dataset accessed using the HF_TOKEN environment variable) if no dataset is specified.
- Allowed specifying calib_seq in examples/llm_ptq to set the maximum sequence length for calibration.
- Added support for MCore MoE PTQ/QAT/QAD.
- Added support for multi-node PTQ and export with FSDP2 in examples/llm_ptq/multinode_ptq.py. See examples/llm_ptq/README.md for more details.
- Added support for Nemotron Nano VL v1 & v2 models in the FP8/NVFP4 PTQ workflow.
- Added the nodes_to_include and op_types_to_include flags in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
- Added support for torch.compile and benchmarking in examples/diffusers/quantization/diffusion_trt.py.
- Enabled native ModelOpt quantization support for FP8 and NVFP4 formats in SGLang. See the SGLang quantization documentation for more details.
- Added ModelOpt quantized checkpoints in vLLM/SGLang CI/CD pipelines (PRs are under review).
- Added support for exporting QLoRA checkpoints finetuned using ModelOpt.
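Following up on the new peft submodule noted above, here is a minimal sketch of attaching LoRA adapters to a Megatron Core model. The update_model call and the LORA_CFG name come from the release note itself; that LORA_CFG is importable directly from modelopt.torch.peft, and that `model` is an already-built MCore GPT model, are assumptions to verify against the peft docs.

```python
import modelopt.torch.peft as mtpf

# `model` is assumed to be an already-constructed Megatron Core GPT model;
# LORA_CFG is assumed to be exposed by the peft submodule (check its docs).
model = mtpf.update_model(model, mtpf.LORA_CFG)

# After conversion, only the LoRA adapter parameters are meant to be trained,
# so the usual fine-tuning loop can run on `model` unchanged.
```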
Documentation
- Added general guidelines for Minitron pruning and distillation. See examples/pruning/README.md for more details.
- Added example for exporting QLoRA checkpoints for vLLM deployment. Refer to examples/llm_qat/README.md for more details.
Additional Announcements
- ModelOpt will change its versioning from odd-only minor versions to consecutive versions starting with the next release. This means the next release will be named 0.40.0 instead of 0.41.0.
ModelOpt 0.37.0 Release
Deprecations
- Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM, or TensorRT docker image directly or refer to the installation guide for more details.
- Deprecated the quantize_mode argument in examples/onnx_ptq/evaluate.py to support strong typing. Use engine_precision instead.
- Deprecated TRT-LLM's TRT backend in examples/llm_ptq and examples/vlm_ptq. Support for the build and benchmark tasks is removed and replaced with quant. engine_dir is replaced with checkpoint_dir in examples/llm_ptq and examples/vlm_ptq. For performance evaluation, please use trtllm-bench directly.
- The --export_fmt flag in examples/llm_ptq is removed. By default, we export to the unified Hugging Face checkpoint format.
- Deprecated examples/vlm_eval as it depends on the deprecated TRT-LLM TRT backend.
New Features
- high_precision_dtype defaults to fp16 in ONNX quantization, i.e., quantized output model weights are now FP16 by default. A hedged sketch of overriding this default follows this list.
- Upgraded the TensorRT-LLM dependency to 1.1.0rc2.
- Support for Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in examples/vlm_ptq.
- Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
- Added a Minitron pruning example for the Megatron-LM framework. See examples/megatron-lm for more details.
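To make the high_precision_dtype change above concrete, the sketch below shows overriding the new FP16 default back to FP32 during ONNX PTQ. The entry point and parameter names (onnx_path, quantize_mode, high_precision_dtype, output_path) are assumptions based on the ONNX quantization docs; verify them for your installed version.

```python
from modelopt.onnx.quantization import quantize

# Parameter names below are assumptions; high_precision_dtype now defaults to
# "fp16", so pass "fp32" explicitly if the non-quantized weights must stay FP32.
quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    high_precision_dtype="fp32",
    output_path="model.quant.onnx",
)
```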
ModelOpt 0.35.1 Release
- Import fixes
ModelOpt 0.35.0 Release
Deprecations
- Deprecate torch<2.6 support.
- Deprecate NeMo 1.0 model support.
Bug Fixes
- Fix attention head ranking logic for pruning Megatron Core GPT models.
New Features
- ModelOpt now supports PTQ and QAT for GPT-OSS models. See examples/gpt_oss for an end-to-end PTQ/QAT example; a minimal PTQ sketch also follows this list.
- Add support for QAT with HuggingFace + DeepSpeed. See examples/gpt_oss for an example.
- Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See examples/gpt_oss for an example.
- ModelOpt provides convenient trainers such as QATTrainer, QADTrainer, KDTrainer, and QATSFTTrainer, which inherit from the corresponding Huggingface trainers and can be used as drop-in replacements for them. See usage examples in examples/gpt_oss, examples/llm_qat, or examples/llm_distill.
- (Experimental) Add quantization support for custom TensorRT ops in ONNX models.
- Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
- Add tree decoding support for Megatron Eagle models.
- For most VLMs, quantization is now explicitly disabled on the vision part, and those modules are added to excluded_modules during HF export.
- Add support for mamba_num_heads, mamba_head_dim, hidden_size, and num_layers pruning for Megatron Core Mamba or hybrid Transformer-Mamba models in mcore_minitron (previously mcore_gpt_minitron) mode.
- Add an example for QAT/QAD training with LLaMA Factory (https://github.com/hiyouga/LLaMA-Factory/tree/main). See examples/llm_qat/llama_factory for more details.
- Upgrade the TensorRT-LLM dependency to 1.0.0rc6.
- Add unified HuggingFace model export support for quantized NVFP4 GPT-OSS models.
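As a concrete starting point for the GPT-OSS PTQ/QAT support noted above, here is a minimal PTQ sketch using mtq.quantize. The tiny Sequential model and random calibration data are stand-ins for a real GPT-OSS checkpoint and dataset, and INT8_DEFAULT_CFG is just one of the available format configs; see examples/gpt_oss for the end-to-end recipe.

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-ins so the sketch runs on its own; replace with the real model/data.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
calib_data = [torch.randn(4, 64) for _ in range(8)]


def forward_loop(m):
    # Push calibration batches through the model so quantizers collect amax stats.
    for batch in calib_data:
        m(batch)


# Insert quantizers, calibrate, and return the (fake-)quantized model.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```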
ModelOpt 0.33.1 Release
Bug Fixes
- Fix a Qwen3 MoE model export issue.