[pyTorch] CPU performance optimizations #2439
base: main
Conversation
/te-ci pytorch
```python
def fast_set_attr(self, name: str, value: Any) -> None:
    self.__dict__[name] = value
```
I assume we are separating out this function so we can manually avoid overheads from __setattr__ and dict? Doing some benchmarking:
- Dict read: 9 ns
- Dict write: 13 ns
- Dict in: 9 ns
- dict.get: 14 ns
- Function call: 9 ns
- Class attr read: 3 ns
- Class attr write: 5 ns
- Class custom getattr: 101 ns
- Class custom setattr: 134 ns
Benchmarking script
I ran the following on a GB200 node. For the dict times, I subtracted out the overhead from list reads. For the class getattr/setattr times, I subtracted out the overhead from range.
```python
import contextlib
import time


class Timer:
    """Measure time interval."""

    def __init__(self) -> None:
        self._start = None
        self._end = None

    def time(self) -> float:
        """CPU time interval in seconds."""
        return self._end - self._start

    @contextlib.contextmanager
    def context(self):
        """Context manager to capture time interval."""
        self._start = time.perf_counter()
        yield
        self._end = time.perf_counter()


def main() -> None:

    # Options
    iters = 1024 * 1024

    # Timer
    timer = Timer()

    # Dummy data
    str_list = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"]
    str_list = [str_list[i % len(str_list)] for i in range(iters)]
    str_dict = {s: len(s) for s in str_list}

    class PlainClass:
        def __init__(self) -> None:
            self.attr = 1

    class CustomGetattrSetattrClass:
        def __init__(self) -> None:
            self.attr = 1

        def __getattribute__(self, name):
            return super().__getattribute__(name)

        def __setattr__(self, name, val):
            super().__setattr__(name, val)

    # Timer overhead
    with timer.context():
        pass
    print(f"Timer overhead: {timer.time() * 1e9 / iters} ns/iter")

    # Range loop
    with timer.context():
        for _ in range(iters):
            pass
    print(f"Range loop: {timer.time() * 1e9 / iters} ns/iter")

    # List loop
    with timer.context():
        for _ in str_list:
            pass
    print(f"List loop: {timer.time() * 1e9 / iters} ns/iter")

    # Empty range+enumerate loop
    with timer.context():
        for i, j in enumerate(range(iters)):
            pass
    print(f"Range+enumerate loop: {timer.time() * 1e9 / iters} ns/iter")

    # Empty list+enumerate loop
    with timer.context():
        for i, s in enumerate(str_list):
            pass
    print(f"List+enumerate loop: {timer.time() * 1e9 / iters} ns/iter")

    # List reads
    with timer.context():
        for i in range(iters):
            str_list[i]
    print(f"List reads: {timer.time() * 1e9 / iters} ns/iter")

    # Dict reads
    with timer.context():
        for i in range(iters):
            str_dict[str_list[i]]
    print(f"Dict reads: {timer.time() * 1e9 / iters} ns/iter")

    # Dict get
    with timer.context():
        for i in range(iters):
            str_dict.get(str_list[i], None)
    print(f"Dict gets: {timer.time() * 1e9 / iters} ns/iter")

    # Dict writes
    with timer.context():
        for i in range(iters):
            str_dict[str_list[i]] = i
    print(f"Dict writes: {timer.time() * 1e9 / iters} ns/iter")

    # Dict membership
    with timer.context():
        for i in range(iters):
            str_list[i] in str_dict
    print(f"Dict membership: {timer.time() * 1e9 / iters} ns/iter")

    # Function call
    def func() -> None:
        pass
    with timer.context():
        for _ in range(iters):
            func()
    print(f"Function call: {timer.time() * 1e9 / iters} ns/iter")

    # Lambda call
    func = lambda: None
    with timer.context():
        for _ in range(iters):
            func()
    print(f"Lambda call: {timer.time() * 1e9 / iters} ns/iter")

    # Class attr read
    myobj = PlainClass()
    with timer.context():
        for _ in range(iters):
            _ = myobj.attr
    print(f"Class attr read: {timer.time() * 1e9 / iters} ns/iter")

    # Class attr write
    myobj = PlainClass()
    with timer.context():
        for i in range(iters):
            myobj.attr = i
    print(f"Class attr write: {timer.time() * 1e9 / iters} ns/iter")

    # getattr
    myobj = PlainClass()
    with timer.context():
        for _ in range(iters):
            getattr(myobj, "attr", None)
    print(f"getattr: {timer.time() * 1e9 / iters} ns/iter")

    # setattr
    myobj = PlainClass()
    with timer.context():
        for i in range(iters):
            setattr(myobj, "attr", i)
    print(f"setattr: {timer.time() * 1e9 / iters} ns/iter")

    # Class custom getattr
    myobj = CustomGetattrSetattrClass()
    with timer.context():
        for _ in range(iters):
            _ = myobj.attr
    print(f"Class custom getattr: {timer.time() * 1e9 / iters} ns/iter")

    # Class custom setattr
    myobj = CustomGetattrSetattrClass()
    with timer.context():
        for i in range(iters):
            myobj.attr = i
    print(f"Class custom setattr: {timer.time() * 1e9 / iters} ns/iter")


if __name__ == "__main__":
    main()
```

How much perf difference do you observe from fast_set_attr? I could see how it could save us ~1 us of overhead, but it would be good to make sure before making the code messier.
I don't want to comment too much on the perf results yet, since so far they all come from my machine and not a real cluster. That said, the anecdotal evidence is that a small test that just runs a BF16 Linear layer forward for many iterations goes from 9.2 s to 7.7 s with the proposed code changes. fast_set_attr alone brought it to ~8.4 s.
I will test it properly and report the timings in the description of the PR.
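For reference, a minimal sketch of the kind of microbenchmark being described here (the layer sizes, batch shape, and iteration count are assumptions, not the actual test):

```python
import time

import torch
import transformer_engine.pytorch as te


def bench_bf16_linear_forward(iters: int = 10000) -> float:
    """Time many BF16 Linear forward passes; CPU overhead dominates for small layers."""
    layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16).cuda()
    inp = torch.randn(32, 1024, dtype=torch.bfloat16, device="cuda")
    for _ in range(10):  # warmup, excludes one-time initialization costs
        layer(inp)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(inp)
    torch.cuda.synchronize()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"Total forward time: {bench_bf16_linear_forward():.2f} s")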
Now, about introducing the separate function: since this is ultimately an optimization you came up with at some point, there was already machinery to avoid the expensive Module.__setattr__ for some parameters. The problem I see is discoverability - if people do not study that code very carefully, they will not realize that they should not just do self.something = something. Therefore I think we should go a more explicit route: in the __setattr__ of the TE module, error out with a message to either use fast_set_attr for things we are sure are just small values (using the dict directly has some problems, by the way, since it e.g. bypasses properties), or use a new function - let's call it just set_attr - for anything where we need the full machinery.
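A minimal sketch of the split being proposed here, with hypothetical names fast_set_attr/set_attr and an _init_done flag (none of this is the actual TE implementation):

```python
from typing import Any

import torch


class _ProposalSketch(torch.nn.Module):
    """Illustrative stand-in for a TE module; not the real base class."""

    def fast_set_attr(self, name: str, value: Any) -> None:
        # For small plain values only: bypasses properties and
        # nn.Module's parameter/buffer/submodule bookkeeping.
        self.__dict__[name] = value

    def set_attr(self, name: str, value: Any) -> None:
        # Full nn.Module machinery for parameters, buffers, submodules.
        torch.nn.Module.__setattr__(self, name, value)

    def __setattr__(self, name: str, value: Any) -> None:
        # Once construction is done, reject plain assignment outright.
        if self.__dict__.get("_init_done", False):
            raise AttributeError(
                f"Use fast_set_attr() or set_attr() instead of 'self.{name} = ...'"
            )
        super().__setattr__(name, value)


m = _ProposalSketch()
m.attr_during_init = 1               # fine: strict mode not enabled yet
m.fast_set_attr("_init_done", True)  # flip on the strict behavior
m.fast_set_attr("counter", 0)        # fine: fast path
# m.counter = 1                      # would now raise AttributeError
```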
I'd prefer not to ban self.something = something. I think readability and safety are more important for non-performance-critical things like initialization and checkpointing. It would be better to make this function an advanced internal implementation with a name like _fast_setattr.
How would we then make sure that this does not resurface in the future?
I went with the explicit setattr calls and a warning issued when the regular __setattr__ is used. That way users can still use the regular attribute assignment if they want, but for internal development we make sure during testing that the warning does not trigger. To keep the code less ugly, we only turn the warning on after the constructor has finished - that way we can still use the nice syntax during construction (where most of the occurrences are), since we do not care about speed there.
Force-pushed 5eefe3e to 1c7d896
Force-pushed 948747b to c4e380f
/te-ci pytorch
Greptile Summary

This PR introduces several CPU performance optimizations targeting overhead reduction in the PyTorch modules.

Key Changes:
Architecture Notes:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant User as User Code
participant Linear as Linear.forward()
participant Base as TransformerEngineBaseModule
participant NVTX as NVTX Profiler
User->>Linear: forward(inp)
Linear->>Base: prepare_forward(inp)
Base->>Base: init_fp8_metadata()
Base->>NVTX: nvtx_range_push("Linear forward")
Base-->>Linear: return inp
Note over Linear: try block
Linear->>Linear: _get_weight_and_bias_tensors()
Linear->>Linear: _get_quantizers()
Linear->>Linear: linear_fn(*args)
Note over Linear: finally block
Linear->>Base: end_forward()
Base->>Base: restore_fp8_meta_tensors (if needed)
Base->>NVTX: nvtx_range_pop()
Linear-->>User: return out
```
5 files reviewed, 5 comments
Greptile Overview

Greptile Summary

This PR attempts to optimize CPU overhead by:
Critical Issues Found

1. NVTX Range Imbalance on Exceptions (HIGH SEVERITY): the refactoring from context managers to manual nvtx_range_push/nvtx_range_pop calls leaves ranges unclosed when forward raises.
2. setattr Doesn't Actually Optimize (HIGH SEVERITY): the new __setattr__ override still goes through the slow PyTorch path.
3. Multiple RuntimeWarning Violations (CRITICAL SEVERITY): six locations in base.py use direct attribute assignment after initialization (lines 965, 966, 1558, 1559, 1565, 1581, 1608). Since pytest.ini now treats RuntimeWarnings as errors, all tests will fail.

Positive Aspects
Recommendation

This PR cannot be merged in its current state due to the RuntimeWarning violations that will cause all tests to fail. The NVTX exception safety issue is also critical for production use.

Confidence Score: 0/5
Important Files Changed

File Analysis
Sequence Diagram

```mermaid
sequenceDiagram
participant User
participant Module as Linear/LayerNorm Module
participant Base as TransformerEngineBaseModule
participant NVTX as NVTX Range Stack
participant FP8 as FP8GlobalStateManager
User->>Module: forward(inp)
Module->>Base: prepare_forward(inp)
Base->>Base: init_fp8_metadata()
Base->>Base: get_amp_dtype()
Base->>FP8: Copy FP8 metadata (if recompute)
Base->>NVTX: nvtx_range_push("Module forward")
Note over NVTX: Range pushed - needs cleanup!
Base-->>Module: returns processed inp
alt Exception occurs during forward
Module->>Module: Compute quantizers/weights
Module--xModule: Exception raised!
Note over NVTX: ⚠️ BUG: nvtx_range_pop() never called!
Note over NVTX: Stack becomes imbalanced
else Normal execution
Module->>Module: Compute quantizers/weights
Module->>Module: Call linear_fn()
Module->>Base: end_forward()
Base->>FP8: Restore FP8 metadata (if recompute)
Base->>NVTX: nvtx_range_pop()
Note over NVTX: Range properly cleaned up
Base-->>Module: return
Module-->>User: return output
end
```
7 files reviewed, 7 comments
```diff
 def __setattr__(self, name: str, value: Any) -> None:
     if name in TransformerEngineBaseModule._fast_setattr_names:
         # torch.nn.Module has a custom __setattr__ that handles
         # modules, parameters, and buffers. This is unnecessary
         # overhead when setting plain attrs.
         self.__dict__[name] = value
     else:
         # Default case
-        super().__setattr__(name, value)
+        if "_initialized" in self.__dict__ and self._initialized:
+            warnings.warn(
+                """The default implementation of torch.nn.Module introduces significant CPU overhead
+                when setting attributes and is therefore not recommended. Please use the explicit
+                calls (fast_setattr for setting regular values and module_setattr for setting
+                parameters, children modules and buffers).""",
+                RuntimeWarning,
+            )
+        super().__setattr__(name, value)
```
The __setattr__ override doesn't actually optimize anything - it still calls super().__setattr__(name, value) after emitting the warning. This means every attribute assignment still goes through the slow PyTorch Module path, defeating the purpose of the optimization.
The warning is also emitted on EVERY attribute assignment after initialization, which would be extremely noisy and unhelpful for debugging.
Additionally, there are multiple places in the codebase that still use direct attribute assignment after initialization:
- Line 965: `self.tp_group = tp_group` in `set_tensor_parallel_group()`
- Line 966: `self.tp_group_initialized = True`
- Lines 1558, 1559, 1565, 1581, 1608: various debug-related attribute assignments

Since pytest.ini now treats RuntimeWarnings as errors, all of these will cause test failures.

Recommendation: either
- Remove the `__setattr__` override entirely and rely on developers remembering to use `fast_setattr`, or
- Update all the remaining direct assignments to use `fast_setattr` or `module_setattr` (sketched below).
Additional Comments (5)
8 files reviewed, 8 comments
The __setattr__ override emits a RuntimeWarning every time it's called after initialization, but it still calls super().__setattr__() to perform the assignment. This means:
- The warning is emitted but the "slow" operation still happens - there's no actual optimization here
- With pytest.ini treating RuntimeWarning as error, any accidental direct attribute assignment will cause test failures
- The warning message suggests using `fast_setattr` but doesn't prevent the slow path
If the goal is CPU optimization, this should either:
- Raise an exception instead of warning (fail-fast), OR
- Actually use the fast path by default after initialization
The current implementation warns about performance but doesn't fix it.
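To see locally what the stricter test configuration implies, RuntimeWarning can be promoted to an error the same way pytest's filterwarnings setting does (a sketch; the exact filter entry in this PR's pytest.ini is not quoted here):

```python
import warnings

# Mirror the "treat RuntimeWarning as an error" behavior described above.
warnings.simplefilter("error", RuntimeWarning)

try:
    warnings.warn("significant CPU overhead when setting attributes", RuntimeWarning)
except RuntimeWarning as exc:
    # Under the stricter configuration, this is how a warning becomes a test failure.
    print(f"Test would fail with: {exc}")
```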
Additional Comments (5)
- This method is called after module initialization in user code, so it will fail. The assignments should use `fast_setattr` instead.
6 files reviewed, 6 comments
```python
def fast_setattr(self, name: str, value: Any) -> None:
    """
    Fast version of the Module's set attribute function.
    Should be used for regular attributes, but not properties nor parameters/buffers.
    """
    self.__dict__[name] = value

def module_setattr(self, name: str, value: Any) -> None:
    """
    Regular version of the Module's set attribute function.
    Should be used only when the fast version cannot be used - for the properties,
    parameters and buffers.
    """
    super().__setattr__(name, value)

def __setattr__(self, name: str, value: Any) -> None:
    if name in TransformerEngineBaseModule._fast_setattr_names:
        # torch.nn.Module has a custom __setattr__ that handles
        # modules, parameters, and buffers. This is unnecessary
        # overhead when setting plain attrs.
        self.__dict__[name] = value
    else:
        # Default case
        super().__setattr__(name, value)
        if "_initialized" in self.__dict__ and self._initialized:
            warnings.warn(
                """The default implementation of torch.nn.Module introduces significant CPU overhead
                when setting attributes and is therefore not recommended. Please use the explicit
                calls (fast_setattr for setting regular values and module_setattr for setting
                parameters, children modules and buffers).""",
                RuntimeWarning,
            )
```
Design Issue: The setattr override doesn't actually optimize anything
The new __setattr__ implementation still calls super().__setattr__(), which means it still incurs the full PyTorch Module overhead. The warning is issued but the slow path is always taken.
To actually optimize, the code should:
- Skip the warning during `__init__` (check `if "_initialized" not in self.__dict__`)
- After init, require explicit use of `fast_setattr()` or `module_setattr()`
- Consider making `__setattr__` raise an error instead of a warning to enforce the pattern
Currently, this just adds overhead (checking _initialized + issuing warning) on top of the existing slow path.
```python
def train(self, mode: bool = True) -> Self:
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", ".*CPU overhead.*")
        return super().train(mode)
```
Workaround Needed: train() method silences the RuntimeWarning
This method filters out the "CPU overhead" warning because PyTorch's Module.train() internally uses __setattr__ to set the training flag on all child modules. This is a workaround for the fact that the base implementation can't be avoided in some cases.
This reveals a potential issue: other PyTorch internal methods that use __setattr__ will also trigger warnings. Consider:
- `to()` device transfers
- `half()`, `float()`, `bfloat16()` dtype conversions
- `register_module()`, `register_buffer()`, `register_parameter()`
These may all need similar workarounds.
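If more nn.Module entry points end up needing the same workaround as train(), one option (purely a sketch, not something in this PR) is to factor the filter into a small decorator instead of repeating the catch_warnings block in every override:

```python
import functools
import warnings

import torch


def _silence_cpu_overhead_warning(fn):
    """Run fn with the 'CPU overhead' RuntimeWarning filtered out."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", ".*CPU overhead.*")
            return fn(*args, **kwargs)
    return wrapper


class _WorkaroundSketch(torch.nn.Module):
    # Hypothetical usage: wrap the inherited methods that assign attributes internally.
    to = _silence_cpu_overhead_warning(torch.nn.Module.to)
    half = _silence_cpu_overhead_warning(torch.nn.Module.half)
    float = _silence_cpu_overhead_warning(torch.nn.Module.float)
```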
[P1] setattr warning defeats the optimization: The __setattr__ override emits a RuntimeWarning when _initialized=True, which pytest.ini treats as an error. However, __setattr__ still calls super().__setattr__(), meaning it doesn't actually prevent or optimize anything - it just warns.
If the goal is CPU optimization, the warning should either:
- Be accompanied by raising an exception to force using `fast_setattr()`, or
- Implement the fast path directly in `__setattr__` when appropriate
Currently, any test that triggers the warning will fail, but production code will just emit warnings with no actual optimization benefit from this implementation.
Additional Comments (4)
- These direct assignments will trigger the RuntimeWarning, since this method is called after module initialization.
- This assignment will also trigger the RuntimeWarning.
Additional Comments (6)
- This affects all forward methods in Linear, LayerNormLinear, LayerNormMLP, and GroupedLinear modules. Any exception (shape mismatch, CUDA OOM, assertion failure, etc.) will leave an unclosed NVTX range. The original context manager pattern guaranteed cleanup via its `__exit__` handling (see the sketch after this list).
- If the goal is CPU optimization, the warning should either raise an exception or actually take the fast path. Currently, any test that triggers the warning will fail, but production code will just emit warnings with no actual optimization benefit from this implementation.
- And at line 1571: if the warnings are genuinely problematic here, the underlying issue should be fixed rather than suppressed.
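A sketch of the exception-safety pattern being asked for, using stand-ins for prepare_forward()/end_forward() (the real TE signatures and NVTX helpers are not reproduced here):

```python
class _ForwardSketch:
    """Illustrative only; stands in for a TE module's forward plumbing."""

    def prepare_forward(self, inp):
        print("nvtx_range_push('forward')")  # placeholder for the real range push
        return inp

    def end_forward(self):
        print("nvtx_range_pop()")  # placeholder for the real pop + FP8 meta restore

    def forward(self, inp):
        inp = self.prepare_forward(inp)
        try:
            return self._compute(inp)  # the real work; may raise
        finally:
            # Runs on success and on exception alike, so the NVTX range
            # is never left open when forward throws.
            self.end_forward()

    def _compute(self, inp):
        return inp
```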
4 files reviewed, 4 comments
WARNING: setattr override defeats the optimization it's meant to enforce
The __setattr__ override emits a RuntimeWarning when _initialized=True, but then still calls super().__setattr__(name, value) which goes through PyTorch's slow path anyway. This means:
- You pay the cost of the warning check
- You still pay the full cost of PyTorch's `__setattr__`
- The warning is treated as an error by pytest, so any code path using it will fail tests
This doesn't actually optimize anything - it just breaks code that uses standard attribute assignment. The optimization only works when code explicitly calls fast_setattr().
Implications:
- Any code path (including external libraries or user code) that does `module.attribute = value` will fail in tests
- The `train()` method workaround (lines 1111-1114) shows this is already a problem
- Properties, descriptors, and any dynamic attribute access will trigger this warning
Recommended approach:
- Remove the warning from `__setattr__` - it creates more problems than it solves
- Keep `fast_setattr()` as an opt-in optimization
- Add comments/documentation encouraging use of `fast_setattr()` where performance matters
Additional Comments (2)
3 files reviewed, 3 comments
Additional Comments (3)
9 files reviewed, 9 comments
```python
def train(self, mode: bool = True) -> Self:
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", ".*CPU overhead.*")
        return super().train(mode)
```
[P2] The train() method suppresses the CPU overhead warning by filtering it out. However, this means PyTorch's __setattr__ is still being called for all the child module state changes during train()/eval() mode switching. This defeats the optimization goal since the expensive __setattr__ path is still taken - it just doesn't warn about it.
The proper fix would be to override train() to use fast_setattr for the training attribute on all child modules, rather than suppressing the warning.
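A sketch of that alternative - flipping the training flag through the fast path instead of suppressing the warning. It is a simplification (it skips the recursive train() calls that subclasses may rely on) and assumes a fast_setattr helper like the one in this PR:

```python
import torch


class _TrainSketch(torch.nn.Module):
    def fast_setattr(self, name, value):
        self.__dict__[name] = value

    def train(self, mode: bool = True):
        # 'training' lives in each module's instance __dict__, so writing it
        # directly avoids nn.Module.__setattr__ for self and all descendants.
        for module in self.modules():
            if isinstance(module, _TrainSketch):
                module.fast_setattr("training", mode)
            else:
                module.__dict__["training"] = mode  # plain nn.Module children
        return self
```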
[P2] The __setattr__ override still calls super().__setattr__(name, value) after emitting the warning, meaning the expensive PyTorch __setattr__ is always executed. This doesn't optimize anything - it only adds overhead by checking _initialized and potentially emitting a warning before doing the expensive operation anyway.
If the goal is CPU optimization, the __setattr__ should use fast_setattr for appropriate attributes instead of just warning. The current implementation only serves as a detection mechanism, not an optimization.
Additional Comments (3)
7 files reviewed, 7 comments
Additional Comments (3)
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Force-pushed f96abbd to 2fb6ee3
Additional Comments (1)
- transformer_engine/pytorch/module/base.py, line 1570 (link) - syntax: CRITICAL: RuntimeWarning will cause test failures
This direct attribute assignment will also trigger the RuntimeWarning.
17 files reviewed, 1 comment
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
for more information, see https://pre-commit.ci
Greptile's behavior is changing! From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
for more information, see https://pre-commit.ci
Additional Comments (1)
- transformer_engine/pytorch/module/base.py, lines 967-968 (link) - logic: `set_tensor_parallel_group` uses direct attribute assignment instead of `fast_setattr`. This public method is documented to be called after module initialization, which will trigger RuntimeWarning (now treated as error by pytest.ini).
19 files reviewed, 1 comment
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
way Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
/te-ci L1 pytorch
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
12 files reviewed, 1 comment
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
/te-ci L1 pytorch
Force-pushed 2ee3442 to cd11e67
Signed-off-by: Przemek Tredak <ptredak@nvidia.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Force-pushed cd11e67 to 028d03f
/te-ci pytorch L1
Description
This PR includes a few performance optimizations targeting the CPU overhead. The code, perf numbers etc. are WIP. The code gets kind of ugly though :-(.
For the prepare_forward changes I did not touch attention (@cyanguwa FYI), since it has multiple exit points from the forward and I was worried that I would miss something there - it would be great if we could refactor that part first to have a single return statement instead.
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: