15 changes: 13 additions & 2 deletions CHANGELOG.rst
@@ -1,7 +1,18 @@
NVIDIA Model Optimizer Changelog
================================

- 0.43 (2026-03-xx)
+ 0.44 (2026-05-xx)
^^^^^^^^^^^^^^^^^

**New Features**

- Support the full Transformer Engine spec for Minitron pruning (``mcore_minitron``); a custom ModelOpt spec is no longer needed. This does not change how the pruning workflow is used, but it makes pruning slightly faster and may produce a slightly different pruned model due to different kernels and numerics.

**Bug Fixes**

- Fix Minitron pruning (``mcore_minitron``) for MoE models. Previously, importance estimation hooks were incorrectly registered for MoE modules, causing the NAS step to hang.
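The MoE fix above concerns activation-importance hooks. As a hedged, much-simplified sketch in plain Python (illustrative only, not ModelOpt's internals), the bug class is: a hook registered on a module that never runs a forward pass (e.g. an MoE expert handled elsewhere) produces no activation record, so any step that waits for one record per hooked module hangs. The fix is to register hooks only on modules that will actually run:

```python
# Hypothetical names throughout; this only illustrates the failure mode.
class Layer:
    def __init__(self, name: str, active: bool = True):
        self.name = name
        self.active = active  # whether forward() is ever called for this layer
        self.hook = None

    def forward(self) -> None:
        if self.hook is not None:
            self.hook(self.name)


def estimate_importance(layers: list[Layer]) -> dict[str, int]:
    records: dict[str, int] = {}

    def hook(name: str) -> None:
        records[name] = records.get(name, 0) + 1  # toy importance score

    hooked = []
    for layer in layers:
        if layer.active:  # the fix: skip modules that never run
            layer.hook = hook
            hooked.append(layer.name)

    for layer in layers:
        if layer.active:
            layer.forward()

    # Every hooked layer produced a record, so a step that waits on
    # "one record per hooked module" completes instead of hanging.
    assert set(records) == set(hooked)
    return records


records = estimate_importance([Layer("mlp"), Layer("expert_3", active=False)])
```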

0.43 (2026-04-09)
^^^^^^^^^^^^^^^^^

**Bug Fixes**
@@ -39,7 +50,7 @@ NVIDIA Model Optimizer Changelog
- Migrated project metadata from ``setup.py`` to a fully declarative ``pyproject.toml``.
- Enable experimental Python 3.13 wheel support and unit tests in CI/CD.

- 0.42 (2026-02-xx)
+ 0.42 (2026-03-10)
^^^^^^^^^^^^^^^^^

**Bug Fixes**
6 changes: 3 additions & 3 deletions examples/megatron_bridge/README.md
@@ -16,17 +16,17 @@ This directory contains examples of using Model Optimizer with [NeMo Megatron-Br

## Pre-Requisites

- Running these examples requires many additional dependencies to be installed (e.g., Megatron-Bridge, Megatron-core, etc.), hence we strongly recommend directly using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`) which has all the dependencies installed.
+ Running these examples requires many additional dependencies to be installed (e.g., Megatron-Bridge, Megatron-core, etc.), hence we strongly recommend directly using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02.01`) which has all the dependencies installed.

- To get the latest ModelOpt features and examples scripts, mount your Model-Optimizer repo to the container.
+ To get the ModelOpt example scripts, mount your Model-Optimizer repo to the container as follows:

```bash
export MODELOPT_DIR=${PWD}/Model-Optimizer # or set to your local Model-Optimizer repository path if you have cloned it
if [ ! -d "${MODELOPT_DIR}" ]; then
git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
fi

- export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02
+ export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02.01
docker run \
--gpus all \
--shm-size=16GB \
8 changes: 5 additions & 3 deletions examples/megatron_bridge/prune_minitron.py
@@ -240,8 +240,9 @@ def main(args: argparse.Namespace):
"seq_length": args.seq_length,
},
init_model_parallel=True,
+ moe_grouped_gemm=False,
)
- print_rank_0(f"\nPruning {unwrapped_model=}")
+ print_rank_0(f"\nPruning model (showing PP rank0): {unwrapped_model}")
print_rank_0(
f"Original model params: {num2hrb(mtp.mcore_minitron.get_mcore_param_count(unwrapped_model))}"
)
@@ -264,10 +265,11 @@
}
if args.prune_target_params is not None:
# Restrict search space to a smaller set of candidates
+ # Allow more choices for MoE FFN as they are generally smaller
+ # NOTE: You can reduce the divisors and increase config['top_k'] to potentially find a better model.
ss_config = mtp.mcore_minitron.get_mcore_minitron_config(
hidden_size_divisor=256,
- ffn_hidden_size_divisor=512,
+ ffn_hidden_size_divisor=256 if (provider.num_moe_experts or 0) > 0 else 512,
mamba_head_dim_divisor=8,
num_moe_experts_divisor=8,
num_layers_divisor=2,
@@ -317,7 +319,7 @@ def score_func_mmlu(m):
else "hybrid_layer_pattern"
)
setattr(provider, hybrid_key, getattr(unwrapped_model, hybrid_key))
- print_rank_0(f"\nPruned {unwrapped_model=}")
+ print_rank_0(f"\nPruned model (showing PP rank0): {unwrapped_model}")
print_rank_0(
f"Pruned model params: {num2hrb(mtp.mcore_minitron.get_mcore_param_count(unwrapped_model))}"
)
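The divisor settings in the diff above bound the Minitron search space: each prunable dimension is only considered at multiples of its divisor. A hedged, standalone sketch in plain Python (illustrative only, not the ModelOpt API; the sizes are hypothetical) of why lowering `ffn_hidden_size_divisor` from 512 to 256 for MoE models enlarges the candidate set:

```python
def candidate_sizes(original: int, divisor: int) -> list[int]:
    """Candidate pruned widths: positive multiples of `divisor` up to `original`."""
    return list(range(divisor, original + 1, divisor))


# Dense FFN: a coarse divisor keeps the search space small.
dense = candidate_sizes(14336, 512)  # 28 candidates

# MoE expert FFNs are generally smaller, so at the same width a halved
# divisor doubles the number of choices the NAS step can explore.
moe = candidate_sizes(2048, 256)  # 8 candidates vs. 4 with divisor 512
```

Fewer, coarser candidates make the search faster; finer candidates (or a larger `config['top_k']`, as the NOTE in the diff suggests) may find a better pruned model at higher search cost.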
1 change: 1 addition & 0 deletions examples/pruning/README.md
@@ -64,6 +64,7 @@ bridge, provider, model, unwrapped_model, tokenizer = load_mbridge_model_from_hf
"pipeline_dtype": torch.bfloat16,
"seq_length": 4096,
},
+ moe_grouped_gemm=False,
)

# Set up the forward loop to run on 1024 train samples