Support features of cut cross entropy, TiledMLP and activation_offload #7129
base: main
Conversation
Summary of Changes

Hello @vx120, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the training framework by integrating advanced memory-optimization and acceleration features. It introduces cut cross entropy and TiledMLP to handle large models and long sequences more efficiently, and activation CPU offloading to further reduce GPU memory consumption. Additionally, it adds robust support for elastic training through DLRover, allowing for more flexible and resilient training workflows in dynamic environments.
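The TiledMLP idea mentioned in the summary can be illustrated with a small sketch. This is my own illustration of the general technique, not the PR's implementation: the MLP runs tile-by-tile along the token axis, so only one tile's hidden activations are alive at a time, while the final output is identical to the untiled computation.

```python
import numpy as np

def mlp(x, w1, w2):
    # Plain two-layer MLP; the hidden activation h is the memory-heavy part.
    h = np.maximum(x @ w1, 0.0)
    return h @ w2

def tiled_mlp(x, w1, w2, num_tiles=4):
    # Process the input in tiles along axis 0 so at most one tile's hidden
    # activations exist at a time, then stitch the outputs back together.
    tiles = np.array_split(x, num_tiles, axis=0)
    return np.concatenate([mlp(t, w1, w2) for t in tiles], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
w1 = rng.standard_normal((4, 16))
w2 = rng.standard_normal((16, 4))
assert np.allclose(mlp(x, w1, w2), tiled_mlp(x, w1, w2))
```

The saving matters in training frameworks because the hidden dimension of an LLM's MLP is several times the model dimension, so its activations dominate memory for long sequences.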
Code Review
This pull request introduces several significant features, including cut-cross-entropy (CCE), TiledMLP, activation offloading, and elastic training support. The changes are extensive, covering documentation, examples, and core library code. Overall, the implementation of these new features is robust and well-structured. The new documentation is comprehensive, although there are some minor formatting issues and errors in the provided examples and command descriptions that need attention. The core code additions, particularly for activation offloading and TiledMLP, are complex but appear to be correctly implemented, with thoughtful handling of different distributed training strategies like FSDP2 and DeepSpeed. My review focuses on enhancing documentation clarity, correcting errors in example scripts, and addressing minor code cleanup opportunities such as removing leftover comments. A critical issue involving a merge conflict marker in a documentation file has also been identified and must be resolved.
- 🔥use_liger_kernel: Whether to enable the [Liger](https://github.com/linkedin/Liger-Kernel) kernel to accelerate training and reduce GPU memory consumption. Defaults to False. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/liger).
- Note: Liger kernel does not support `device_map`. Use DDP or DeepSpeed for multi-GPU training. Currently, liger_kernel only supports `task_type='causal_lm'`.
- use_cce: Whether to enable the [cut-cross-entropy](https://github.com/apple/ml-cross-entropy) fused operator to reduce GPU memory usage and accelerate training. Defaults to `False`. Example shell script can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/cce).
=======
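For context on the `use_cce` flag above: cut-cross-entropy saves memory by never materializing the full `[tokens, vocab]` logit matrix. A rough NumPy sketch of the idea (my own illustration, not Apple's fused kernel; `chunk` is an arbitrary block size):

```python
import numpy as np

def cce_loss(h, w, targets, chunk=8):
    # h: [n, d] hidden states, w: [d, v] classifier, targets: [n] class ids.
    # Streaming log-sum-exp over vocab chunks: only [n, chunk] logits exist
    # at any moment instead of the full [n, v] matrix.
    n = h.shape[0]
    m = np.full(n, -np.inf)  # running per-token max logit
    s = np.zeros(n)          # running sum of exp(logit - m)
    for start in range(0, w.shape[1], chunk):
        logits = h @ w[:, start:start + chunk]
        cm = np.maximum(m, logits.max(axis=1))
        s = s * np.exp(m - cm) + np.exp(logits - cm[:, None]).sum(axis=1)
        m = cm
    lse = m + np.log(s)
    # The target logit needs only one column of w per token.
    target_logit = np.einsum('nd,dn->n', h, w[:, targets])
    return (lse - target_logit).mean()
```

The real kernel fuses this chunked reduction with the matmul on GPU; the sketch only shows why the peak memory drops from O(n·v) to O(n·chunk).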
--torch_dtype bfloat16 \
--per_device_train_batch_size 10 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing false \ // no need to checkpoint activations when offloading to CPU
The comment `// no need to checkpoint activations when offloading to CPU` uses a C-style `//` comment, which is not valid in shell scripts and will cause a syntax error. It is also placed after a line-continuation character `\`, which is invalid. The comment should be moved to its own line before this one, using `#`.
- --gradient_checkpointing false \ // no need to checkpoint activations when offloading to CPU
+ --gradient_checkpointing false \
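As the suggestion shows, the remark belongs on its own line with `#`: in POSIX shells a line-continuation backslash must be the last character on its line. A minimal runnable sketch (using `printf` as a stand-in for the actual `swift sft` invocation):

```shell
# No need to checkpoint activations when offloading to CPU.
printf '%s %s\n' \
    --gradient_checkpointing false
```

Anything after the backslash, including a comment, stops it from acting as a continuation and feeds the comment text to the command as arguments.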
deepspeed_config_or_type = the DeepSpeed type or the path to a config file, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json
dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
The path to sft.py is hardcoded to a specific conda environment. This approach is brittle and not easily portable. It would be more robust to use the -m flag of dlrover-run to execute the module directly.
- /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
+ -m swift.cli.sft --model $model \
- eval_generation_config: Model inference configuration during evaluation, in JSON format. Defaults to `{'max_tokens': 512}`.
- use_flash_ckpt: Whether to enable the flash checkpoint of [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Defaults to `false`. When enabled, weights are first saved to shared memory and then persisted asynchronously; the safetensors format is not supported yet. It is recommended to set `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` alongside it to avoid CUDA OOM during training.
- use_flash_ckpt: Whether to enable the flash checkpoint of [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Defaults to `false`. When enabled, weights are first saved to shared memory and then persisted asynchronously. It is recommended to set `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` alongside it to avoid CUDA OOM during training.
- elastic: Whether to enable elasticity. Depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover): `pip install dlrover && pip install tornado && pip install kubernetes `. See the [example](../BestPractices/Elastic.md) for usage.
The formatting of the pip command is incorrect and may cause rendering issues: there is a stray trailing space inside the closing backtick.
- - elastic: Whether to enable elasticity. Depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover): `pip install dlrover && pip install tornado && pip install kubernetes `. See the [example](../BestPractices/Elastic.md) for usage.
+ - elastic: Whether to enable elasticity. Depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover): `pip install dlrover && pip install tornado && pip install kubernetes`. See the [example](../BestPractices/Elastic.md) for usage.
## Installing Dependencies

Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`
The sentence ends without terminal punctuation after the inline pip command; add a period so it renders as a complete sentence.
- Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`
+ Deploy a K8S cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in the cluster, and install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`.
restartPolicy: Never
containers:
  - name: main
    image: #【Training image, needs to have deepspeed, dlrover, and swift installed】
The comment uses Chinese-style brackets 【】. For consistency in English documentation, it's better to use standard comment conventions, such as parentheses or plain text.
- image: #【Training image, needs to have deepspeed, dlrover, and swift installed】
+ image: # Training image, needs to have deepspeed, dlrover, and swift installed
- eval_limit: Number of samples from the evaluation dataset
- eval_generation_config: Model inference configuration during evaluation, in JSON format, default is `{'max_tokens': 512}`
- use_flash_ckpt: Whether to use [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Default is `false`. If enabled, checkpoints are saved to memory synchronously, then persisted to storage asynchronously; the safetensors format is not supported currently. It's recommended to use this with the environment variable `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` to avoid CUDA OOM.
elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`, For specific usage, refer to the [example](../BestPractices/Elastic.md)
The punctuation around the pip command is inconsistent: the comma after the closing backtick should be a period, since a new sentence follows.
- elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`, For specific usage, refer to the [example](../BestPractices/Elastic.md)
+ elastic: Whether to enable elasticity, which depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover), Install the required packages using `pip install dlrover && pip install tornado && pip install kubernetes`. For specific usage, refer to the [example](../BestPractices/Elastic.md)
binded_tensor = ret
if isinstance(ret, tuple):
    binded_tensor = ret[0]
#即 GroupCommitFunction.apply,即调用算子GroupCommitFunction(binded_tensor),
This comment is in Chinese. For consistency with the rest of the codebase, it should be translated to English or removed if it is not essential for understanding the code.
- #即 GroupCommitFunction.apply,即调用算子GroupCommitFunction(binded_tensor),
+ # This is GroupCommitFunction.apply, which calls the operator GroupCommitFunction(binded_tensor).
handler.post_forward(model_self)
return out

#普通方法绑定为module 的方法,因此完成了原始module.forward的包裹
This comment is in Chinese. To maintain a consistent language in the codebase, please translate it to English or remove it if it's not necessary.
- #普通方法绑定为module 的方法,因此完成了原始module.forward的包裹
+ # Bind the plain method as a method of the module, thus wrapping the original module.forward.
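The translated comment refers to binding a plain function as an instance method so it replaces `module.forward` while keeping `self` bound. A minimal self-contained sketch of that technique (class names like `Handler` and `Toy` are placeholders, not the PR's classes):

```python
import types

class Handler:
    """Placeholder for the PR's activation-offload handler."""
    def __init__(self):
        self.events = []
    def pre_forward(self, module):
        self.events.append("pre")
    def post_forward(self, module):
        self.events.append("post")

class Toy:
    def forward(self, x):
        return x * 2

def make_wrapped_forward(orig_forward, handler):
    # A plain function; binding it below makes 'model_self' the instance.
    def wrapped_forward(model_self, *args, **kwargs):
        handler.pre_forward(model_self)
        out = orig_forward(*args, **kwargs)
        handler.post_forward(model_self)
        return out
    return wrapped_forward

m, h = Toy(), Handler()
# Bind the plain function as a method of this instance, replacing forward.
m.forward = types.MethodType(make_wrapped_forward(m.forward, h), m)
assert m.forward(3) == 6 and h.events == ["pre", "post"]
```

Because the original bound `m.forward` is captured in the closure before being replaced, the wrapper calls it without recursing into itself.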
# get_layers(model)
# if len(layers) < 3:
#     logger.warning(f"Find only {len(layers)} fsdp layers, not neccessary to enable async activation offloading")
#     return
PR type

PR information

It includes multiple features:
- cut cross entropy: commit 2dee5e5
- TiledMLP: commit e64c462
- activation_offload: commit 8b8deb6