Support features of cut cross entropy, TiledMLP and activation_offload #7129
base: main
Conversation
Summary of Changes

Hello @vx120, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the training framework by integrating advanced memory-optimization and acceleration features. It introduces cut cross entropy and TiledMLP to handle large models and long sequences more efficiently, and activation CPU offloading to further reduce GPU memory consumption. Additionally, it adds robust support for elastic training through DLRover, allowing for more flexible and resilient training workflows in dynamic environments.
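The TiledMLP idea mentioned in the summary can be illustrated with a small sketch. This is my own illustration of the general technique, not the PR's implementation: the MLP runs tile-by-tile along the token axis, so only one tile's hidden activations are alive at a time, while the final output is identical to the untiled computation.

```python
import numpy as np

def mlp(x, w1, w2):
    # Plain two-layer MLP; the hidden activation h is the memory-heavy part.
    h = np.maximum(x @ w1, 0.0)
    return h @ w2

def tiled_mlp(x, w1, w2, num_tiles=4):
    # Process the input in tiles along axis 0 so at most one tile's hidden
    # activations exist at a time, then stitch the outputs back together.
    tiles = np.array_split(x, num_tiles, axis=0)
    return np.concatenate([mlp(t, w1, w2) for t in tiles], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
w1 = rng.standard_normal((4, 16))
w2 = rng.standard_normal((16, 4))
assert np.allclose(mlp(x, w1, w2), tiled_mlp(x, w1, w2))
```

The saving matters in training frameworks because the hidden dimension of an LLM's MLP is several times the model dimension, so its activations dominate memory for long sequences.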
Code Review
This pull request introduces several significant features, including cut-cross-entropy (CCE), TiledMLP, activation offloading, and elastic training support. The changes are extensive, covering documentation, examples, and core library code. Overall, the implementation of these new features is robust and well-structured. The new documentation is comprehensive, although there are some minor formatting issues and errors in the provided examples and command descriptions that need attention. The core code additions, particularly for activation offloading and TiledMLP, are complex but appear to be correctly implemented, with thoughtful handling of different distributed training strategies like FSDP2 and DeepSpeed. My review focuses on enhancing documentation clarity, correcting errors in example scripts, and addressing minor code cleanup opportunities such as removing leftover comments. A critical issue involving a merge conflict marker in a documentation file has also been identified and must be resolved.
- 🔥use_liger_kernel: Whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory consumption. Defaults to False. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
- Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training. Currently, liger_kernel only supports `task_type='causal_lm'`.
- use_cce: Whether to enable the [cut-cross-entropy](https://github.com/apple/ml-cross-entropy) fused operator to reduce GPU memory usage and accelerate training. Defaults to `False`. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/cce).
=======
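For context on the `use_cce` flag above: cut-cross-entropy saves memory by never materializing the full `[tokens, vocab]` logit matrix. A rough NumPy sketch of the idea (my own illustration, not Apple's fused kernel; `chunk` is an arbitrary block size):

```python
import numpy as np

def cce_loss(h, w, targets, chunk=8):
    # h: [n, d] hidden states, w: [d, v] classifier, targets: [n] class ids.
    # Streaming log-sum-exp over vocab chunks: only [n, chunk] logits exist
    # at any moment instead of the full [n, v] matrix.
    n = h.shape[0]
    m = np.full(n, -np.inf)  # running per-token max logit
    s = np.zeros(n)          # running sum of exp(logit - m)
    for start in range(0, w.shape[1], chunk):
        logits = h @ w[:, start:start + chunk]
        cm = np.maximum(m, logits.max(axis=1))
        s = s * np.exp(m - cm) + np.exp(logits - cm[:, None]).sum(axis=1)
        m = cm
    lse = m + np.log(s)
    # The target logit needs only one column of w per token.
    target_logit = np.einsum('nd,dn->n', h, w[:, targets])
    return (lse - target_logit).mean()
```

The real kernel fuses this chunked reduction with the matmul on GPU; the sketch only shows why the peak memory drops from O(n·v) to O(n·chunk).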
--torch_dtype bfloat16 \
--per_device_train_batch_size 10 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing false \ // no need to checkpoint activations when offloading to CPU
The comment `// no need to checkpoint activations when offloading to CPU` uses a C-style `//` comment, which is not valid in shell scripts and will cause a syntax error. It is also placed after a line-continuation character `\`, which is invalid. The comment should be moved to its own line before this one, using `#`.
- --gradient_checkpointing false \ // no need to checkpoint activations when offloading to CPU
+ --gradient_checkpointing false \
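As the suggestion shows, the remark belongs on its own line with `#`: in POSIX shells a line-continuation backslash must be the last character on its line. A minimal runnable sketch (using `printf` as a stand-in for the actual `swift sft` invocation):

```shell
# No need to checkpoint activations when offloading to CPU.
printf '%s %s\n' \
    --gradient_checkpointing false
```

Anything after the backslash, including a comment, stops it from acting as a continuation and feeds the comment text to the command as arguments.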
deepspeed_config_or_type = the DeepSpeed type or the path to a config file, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json
dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
The path to sft.py is hardcoded to a specific conda environment. This approach is brittle and not easily portable. It would be more robust to use the -m flag of dlrover-run to execute the module directly.
- /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
+ -m swift.cli.sft --model $model \
- eval_generation_config: Model inference configuration during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`.
- use_flash_ckpt: Whether to enable the flash checkpoint of [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Defaults to `false`. When enabled, weights are first saved to shared memory and then persisted asynchronously; the safetensors format is not supported yet. It is recommended to set `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` alongside it to avoid CUDA OOM during training.
- use_flash_ckpt: Whether to enable the flash checkpoint of [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Defaults to `false`. When enabled, weights are first saved to shared memory and then persisted asynchronously. It is recommended to set `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` alongside it to avoid CUDA OOM during training.
- elastic: Whether to enable elasticity. Depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover): `pip install dlrover && pip install tornado && pip install kubernetes `. See the [example](../BestPractices/Elastic.md) for usage.
The formatting of the pip command is incorrect and may cause rendering issues: there is a stray trailing space inside the closing backtick.
- - elastic: Whether to enable elasticity. Depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover): `pip install dlrover && pip install tornado && pip install kubernetes `. See the [example](../BestPractices/Elastic.md) for usage.
+ - elastic: Whether to enable elasticity. Depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover): `pip install dlrover && pip install tornado && pip install kubernetes`. See the [example](../BestPractices/Elastic.md) for usage.
## Installing Dependencies

Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`
The sentence ends without terminal punctuation after the inline pip command; add a period so it renders as a complete sentence.
- Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`
+ Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`.
restartPolicy: Never
containers:
  - name: main
    image: #【Training image, needs to have deepspeed, dlrover, and swift installed】
The comment uses Chinese-style brackets 【】. For consistency in English documentation, it's better to use standard comment conventions, such as parentheses or plain text.
- image: #【Training image, needs to have deepspeed, dlrover, and swift installed】
+ image: # Training image, needs to have deepspeed, dlrover, and swift installed
- eval_limit: Number of samples from the evaluation dataset
- eval_generation_config: Model inference configuration during evaluation, in JSON format, default is `{'max_tokens': 512}`
- use_flash_ckpt: Whether to use [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Default is `false`. If enabled, checkpoints are saved to memory synchronously, then persisted to storage asynchronously; the safetensors format is not supported currently. It's recommended to use this with the environment variable `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` to avoid CUDA OOM.
elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`, For specific usage, refer to the [example](../BestPractices/Elastic.md)
The punctuation around the pip command is inconsistent: the comma after the closing backtick should be a period, since a new sentence follows.
- elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`, For specific usage, refer to the [example](../BestPractices/Elastic.md)
+ elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`. For specific usage, refer to the [example](../BestPractices/Elastic.md)
binded_tensor = ret
if isinstance(ret, tuple):
    binded_tensor = ret[0]
#即 GroupCommitFunction.apply,即调用算子GroupCommitFunction(binded_tensor),
This comment is in Chinese. For consistency with the rest of the codebase, it should be translated to English or removed if it is not essential for understanding the code.
- #即 GroupCommitFunction.apply,即调用算子GroupCommitFunction(binded_tensor),
+ # This is GroupCommitFunction.apply, which calls the operator GroupCommitFunction(binded_tensor).
handler.post_forward(model_self)
return out

#普通方法绑定为module 的方法,因此完成了原始module.forward的包裹
This comment is in Chinese. To maintain a consistent language in the codebase, please translate it to English or remove it if it's not necessary.
- #普通方法绑定为module 的方法,因此完成了原始module.forward的包裹
+ # Bind the plain method as a method of the module, thus wrapping the original module.forward.
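The translated comment refers to binding a plain function as an instance method so it replaces `module.forward` while keeping `self` bound. A minimal self-contained sketch of that technique (class names like `Handler` and `Toy` are placeholders, not the PR's classes):

```python
import types

class Handler:
    """Placeholder for the PR's activation-offload handler."""
    def __init__(self):
        self.events = []
    def pre_forward(self, module):
        self.events.append("pre")
    def post_forward(self, module):
        self.events.append("post")

class Toy:
    def forward(self, x):
        return x * 2

def make_wrapped_forward(orig_forward, handler):
    # A plain function; binding it below makes 'model_self' the instance.
    def wrapped_forward(model_self, *args, **kwargs):
        handler.pre_forward(model_self)
        out = orig_forward(*args, **kwargs)
        handler.post_forward(model_self)
        return out
    return wrapped_forward

m, h = Toy(), Handler()
# Bind the plain function as a method of this instance, replacing forward.
m.forward = types.MethodType(make_wrapped_forward(m.forward, h), m)
assert m.forward(3) == 6 and h.events == ["pre", "post"]
```

Because the original bound `m.forward` is captured in the closure before being replaced, the wrapper calls it without recursing into itself.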
# get_layers(model)
# if len(layers) < 3:
#     logger.warning(f"Find only {len(layers)} fsdp layers, not neccessary to enable async activation offloading")
#     return
PR type

PR information

It includes multiple features:
- cut cross entropy: commit 2dee5e5
- TiledMLP: commit e64c462
- activation_offload: commit 8b8deb6