Conversation

@vx120 vx120 commented Dec 19, 2025

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

This PR includes multiple features:

- cut cross entropy: commit 2dee5e5
- TiledMLP: commit e64c462
- activation_offload: commit 8b8deb6
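For context, here is a generic sketch of the tiled-MLP idea (an illustration of the technique, not necessarily the code in commit e64c462): because an MLP acts on each token independently, the sequence dimension can be split into tiles and processed one tile at a time, bounding the peak size of the intermediate hidden activation.

```python
import torch
from torch import nn

# Illustrative tiled MLP: the up-projection's hidden activation is only
# ever materialized for one tile of the sequence at a time. Since linear
# layers and ReLU are token-wise, tiling the sequence dim is exact.
class TiledMLP(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_tiles: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.num_tiles = num_tiles

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        tiles = x.chunk(self.num_tiles, dim=1)
        return torch.cat([self.down(torch.relu(self.up(t))) for t in tiles], dim=1)

mlp = TiledMLP(16, 64, num_tiles=4)
x = torch.randn(2, 8, 16)
out = mlp(x)
print(out.shape)  # torch.Size([2, 8, 16])
```

The real implementation additionally has to cooperate with gradient checkpointing and distributed wrappers; this sketch only shows why tiling caps activation memory.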

@gemini-code-assist

Summary of Changes

Hello @vx120, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the training framework by integrating advanced memory optimization and acceleration features. It introduces 'cut cross entropy' and 'Tiled MLP' to handle large models and long sequences more efficiently, and 'activation CPU offloading' to further reduce GPU memory consumption. Additionally, it adds robust support for elastic training through DLRover, allowing for more flexible and resilient training workflows in dynamic environments.

Highlights

  • Memory Optimization Techniques: Introduced support for 'cut cross entropy' (CCE) fused operator and 'Tiled MLP' for memory-efficient long sequence training, aiming to reduce GPU memory usage and accelerate training.
  • Activation Offloading: Added 'activation CPU offloading' as a memory optimization technique that moves activation tensors to CPU memory during the forward pass and reloads them for the backward pass, significantly reducing GPU memory footprint.
  • Elastic Training with DLRover: Implemented support for elastic training using DLRover, allowing for dynamic adjustment of training parameters like batch size based on available resources and enabling graceful exits.
  • Documentation and Examples: Provided comprehensive documentation in both Chinese and English for elastic training and activation offloading, along with example scripts for CCE, Tiled MLP, and activation offloading.
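As a concrete illustration of the offloading pattern in the highlights above, PyTorch ships a built-in saved-tensor hook that performs exactly this tensor movement. The sketch below is a minimal, generic example of the technique, not this PR's implementation:

```python
import torch
from torch import nn

# Activations saved for backward are packed to CPU during the forward
# pass and unpacked (reloaded) when backward needs them, trading device
# memory for copy traffic. On a CUDA device, pin_memory=True enables
# faster copies; it is left off here so the sketch runs anywhere.
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
x = torch.randn(8, 64, requires_grad=True)

with torch.autograd.graph.save_on_cpu(pin_memory=False):
    loss = model(x).pow(2).mean()

loss.backward()  # saved activations are brought back here
print(x.grad.shape)  # torch.Size([8, 64])
```

The PR's implementation is asynchronous and integrates with distributed strategies such as FSDP2 and DeepSpeed, but the underlying memory trade-off is the same.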


@gemini-code-assist bot left a comment

Code Review

This pull request introduces several significant features, including cut-cross-entropy (CCE), TiledMLP, activation offloading, and elastic training support. The changes are extensive, covering documentation, examples, and core library code. Overall, the implementation of these new features is robust and well-structured. The new documentation is comprehensive, although there are some minor formatting issues and errors in the provided examples and command descriptions that need attention. The core code additions, particularly for activation offloading and TiledMLP, are complex but appear to be correctly implemented, with thoughtful handling of different distributed training strategies like FSDP2 and DeepSpeed. My review focuses on enhancing documentation clarity, correcting errors in example scripts, and addressing minor code cleanup opportunities such as removing leftover comments. A critical issue involving a merge conflict marker in a documentation file has also been identified and must be resolved.

- 🔥use_liger_kernel: Whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory consumption. Defaults to False. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
- Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training. Currently, liger_kernel only supports `task_type='causal_lm'`.
- use_cce: Whether to enable the [cut-cross-entropy](https://github.com/apple/ml-cross-entropy) fused operator to reduce GPU memory usage and accelerate training. Defaults to `False`. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/cce).
=======

critical

This line ======= appears to be a merge conflict marker that was accidentally committed. It should be removed.

--torch_dtype bfloat16 \
--per_device_train_batch_size 10 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing false \ // no need to checkpoint activations when offloading to CPU

critical

The comment // no need to checkpoint activations when offloading to CPU uses a C-style comment //, which is not valid in shell scripts and will cause a syntax error. It is also placed after a line continuation character \, which is invalid. The comment should be moved to its own line before this one, using #.

Suggested change
--gradient_checkpointing false \ // no need to checkpoint activations when offloading to CPU
--gradient_checkpointing false \

deepspeed_config_or_type = the DeepSpeed type or the path to a config file, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json

dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
medium

The path to sft.py is hardcoded to a specific conda environment. This approach is brittle and not easily portable. It would be more robust to use the -m flag of dlrover-run to execute the module directly.

Suggested change
/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
-m swift.cli.sft --model $model \

- eval_generation_config: 评测时模型推理配置,json格式,默认为`{'max_tokens': 512}`
- use_flash_ckpt: 是否启用[DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover)的flash checkpoint。默认为`false`,启用后,权重会先保存至共享内存,之后异步持久化,目前暂不支持safetensors格式;建议搭配`PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` 一起使用,避免训练过程CUDA OOM。
- use_flash_ckpt: 是否启用[DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover)的flash checkpoint。默认为`false`,启用后,权重会先保存至共享内存,之后异步持久化;建议搭配`PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` 一起使用,避免训练过程CUDA OOM。
- elastic: 是否启用弹性,依赖[DLRover](https://github.com/intelligent-machine-learning/dlrover),`pip install dlrover && pip install tornado && pip install kubernetes `,具体使用参考[示例](../BestPractices/Elastic.md)

medium

The formatting for the pip command is incorrect, which may cause rendering issues. Additionally, there are some punctuation inconsistencies, such as using English commas instead of Chinese ones.

Suggested change
- elastic: 是否启用弹性,依赖[DLRover](https://github.com/intelligent-machine-learning/dlrover),`pip install dlrover && pip install tornado && pip install kubernetes `,具体使用参考[示例](../BestPractices/Elastic.md)
- elastic: 是否启用弹性,依赖[DLRover](https://github.com/intelligent-machine-learning/dlrover)`pip install dlrover && pip install tornado && pip install kubernetes`,具体使用参考[示例](../BestPractices/Elastic.md)


## Installing Dependencies

Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`

medium

The sentence ending with the pip install command is missing its terminating period. It should be added for proper rendering.

Suggested change
Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`
Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`.

restartPolicy: Never
containers:
- name: main
image: #【Training image, needs to have deepspeed, dlrover, and swift installed】

medium

The comment uses Chinese-style brackets 【】. For consistency in English documentation, it's better to use standard comment conventions, such as parentheses or plain text.

Suggested change
image: #Training image, needs to have deepspeed, dlrover, and swift installed
image: # Training image, needs to have deepspeed, dlrover, and swift installed

- eval_limit: Number of samples from the evaluation dataset
- eval_generation_config: Model inference configuration during evaluation, in JSON format, default is `{'max_tokens': 512}`
- use_flash_ckpt: Whether to use [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Default is `false`. If enabled, checkpoints are saved to memory synchronously, then persisted to storage asynchronously, the safetensors format is not supported currently. It's recommended to use this with the environment variable `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` to avoid CUDA OOM.
elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`, For specific usage, refer to the [example](../BestPractices/Elastic.md)

medium

This sentence runs on: the comma after the pip install command should be a period, splitting it into two sentences for readability.

Suggested change
elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`, For specific usage, refer to the [example](../BestPractices/Elastic.md)
elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`. For specific usage, refer to the [example](../BestPractices/Elastic.md)

binded_tensor = ret
if isinstance(ret, tuple):
binded_tensor = ret[0]
#即 GroupCommitFunction.apply,即调用算子GroupCommitFunction(binded_tensor),

medium

This comment is in Chinese. For consistency with the rest of the codebase, it should be translated to English or removed if it is not essential for understanding the code.

Suggested change
#GroupCommitFunction.apply,即调用算子GroupCommitFunction(binded_tensor),
# This is GroupCommitFunction.apply, which calls the operator GroupCommitFunction(binded_tensor).
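For readers following the review, the pattern referred to here can be sketched as a simplified stand-in for the `GroupCommitFunction` mentioned above: a custom `torch.autograd.Function` that is the identity in forward but fires a callback during backward, giving the offload handler a "commit point" in the backward pass. The names and callback below are illustrative, not the PR's code.

```python
import torch

# Simplified stand-in for the GroupCommitFunction pattern: identity in
# forward, but backward triggers a grouped side effect (e.g. waiting on
# queued CPU->GPU activation reloads) before gradients flow onward.
class GroupCommit(torch.autograd.Function):
    @staticmethod
    def forward(ctx, tensor, on_backward):
        ctx.on_backward = on_backward
        return tensor.view_as(tensor)  # identity; view avoids aliasing issues

    @staticmethod
    def backward(ctx, grad_output):
        ctx.on_backward()  # commit point during backward
        return grad_output, None  # no gradient for the callback argument

events = []
x = torch.ones(3, requires_grad=True)
y = GroupCommit.apply(x * 2, lambda: events.append("commit"))
y.sum().backward()
print(events, x.grad)  # ['commit'] tensor([2., 2., 2.])
```

Binding this op to a module's output tensor is what lets the handler synchronize its grouped offload work exactly when that module's backward runs.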

handler.post_forward(model_self)
return out

#普通方法绑定为module 的方法,因此完成了原始module.forward的包裹

medium

This comment is in Chinese. To maintain a consistent language in the codebase, please translate it to English or remove it if it's not necessary.

Suggested change
#普通方法绑定为module 的方法,因此完成了原始module.forward的包裹
# Bind the plain method as a method of the module, thus wrapping the original module.forward.
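The binding trick this comment describes can be shown in isolation: a plain function is bound to the module instance with `types.MethodType` so that it shadows and wraps the original `forward`. The `handler` object and its `pre_forward`/`post_forward` hooks below are hypothetical names mirroring the snippet above.

```python
import types
import torch
from torch import nn

# A plain function is turned into a bound method of this one instance,
# shadowing the class-level forward and wrapping the original one.
def make_wrapped_forward(orig_forward, handler):
    def wrapped_forward(model_self, *args, **kwargs):
        handler.pre_forward(model_self)       # e.g. prefetch / start offload
        out = orig_forward(*args, **kwargs)   # orig_forward is already bound
        handler.post_forward(model_self)      # e.g. commit offload
        return out
    return wrapped_forward

class Handler:
    def __init__(self):
        self.calls = []
    def pre_forward(self, module):
        self.calls.append("pre")
    def post_forward(self, module):
        self.calls.append("post")

layer = nn.Linear(4, 4)
handler = Handler()
layer.forward = types.MethodType(make_wrapped_forward(layer.forward, handler), layer)
out = layer(torch.randn(2, 4))
print(handler.calls)  # ['pre', 'post']
```

Because `nn.Module.__call__` looks up `self.forward` on the instance first, the bound wrapper is invoked transparently without subclassing the module.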

Comment on lines +578 to +581
# get_layers(model)
# if len(layers) < 3:
# logger.warning(f"Find only {len(layers)} fsdp layers, not neccessary to enable async activation offloading")
# return

medium

These lines appear to be commented-out debug code. They should be removed to improve code clarity and maintainability.
