feat: Fast model loading for inference #125
Open
bayo-ibm wants to merge 2 commits into foundation-model-stack:main from
Conversation
bayo-ibm (Contributor, Author)
I added a new feature that enables fast model loading for inference.
fms_mo/prep.py
Outdated
| """Check if model is already quantized - do not want to quantize twice if so""" | ||
| return any(isinstance(m, quantized_modules) for m in model.modules()) | ||
|
|
||
| def swap_qbmm(model, qcfg): |
Collaborator
Need to add a docstring and type annotations to the function args.
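A minimal sketch of what that could look like; the exact type hints and docstring wording are assumptions, not the PR's actual code:

```python
import torch


def swap_qbmm(model: torch.nn.Module, qcfg: dict) -> torch.nn.Module:
    """Swap eligible bmm/matmul operations in the model for QBmm modules.

    Args:
        model: The model whose ops should be swapped.
        qcfg: Quantization config controlling which ops are swapped and how.

    Returns:
        The model with QBmm modules in place.
    """
```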
fms_mo/utils/qconfig_utils.py
Outdated
| """Read config in json format, work together with qconfig_save""" | ||
| config = get_recipe(fname) | ||
|
|
||
Collaborator
Dead spacing here. Delete it.
BrandonGroth requested changes on May 23, 2025
Collaborator
A few more nitpicks.
Also, please run the following and fix anything that lint or spellcheck flags. `tox -e fix` will automatically change files; you just have to add and commit them. If multiple changes are needed, package them up in one commit if possible.
```
tox -e fix
tox -e lint
tox -e spellcheck
```
Signed-off-by: omobayode.fagbohungbe <omobayode.fagbohungbe@ibm.com>
Description of the change
This PR enables faster loading of a quantized model by calling only the functions/sub-functions needed to load a model, while skipping the functions needed to quantize it. An inference argument was added to the fms_mo arguments to activate this path.
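A hypothetical usage sketch; the entry-point name and the exact flag spelling are assumptions based on this description, not the confirmed fms-mo API:

```python
# Hypothetical sketch: the entry point and flag spelling are assumptions
# based on the PR description, not the confirmed fms-mo API.
from fms_mo import qmodel_prep

# `model` and `qcfg` come from the usual fms-mo setup. With inference=True,
# prep runs only the steps needed to load an already-quantized model and
# skips quantization-time setup.
model = qmodel_prep(model, qcfg, inference=True)
```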
We need 2 new functions, which look like (a signature sketch follows the list):

- `fp8_model_load(<an fp8 checkpoint by llm-compressor>)`: loads an existing fp8 checkpoint into fms-mo, and proper quantizers should be configured when possible, i.e. it may need to parse the quantization block in `config.json` or an equivalent file.
- `fp8_model_save(<qmodel from fms-mo>)`: saves a compatible fp8 checkpoint that can be consumed by vLLM or the aiu-compiler.
Related issue number
None
How to verify the PR
The PR is validated by performing Direct Quantization with SmoothQuant and passing the inference argument along with the rest of the arguments. The validation was done with and without QBmm.
Was the PR tested