🚀 The feature, motivation and pitch
Problem
ETDump builds its output flatbuffer inline, while inference is running. Every traced event (op profiling, intermediate-output logging, allocations) drives the flatbuffer builder synchronously: building tables, pushing size/stride vectors, interning strings, and so on. That serialization cost is paid per event, on the critical path, so enabling ETDump adds significant latency to the run. For per-op profiling, this overhead can be subtracted out in the inspector post-processing, but the measured end-to-end latency can still give a misleading indication of "framework tax," i.e., how much latency the framework itself adds to a run. Some users end up profiling a model twice, once with ETDump enabled and once without, just to get both end-to-end (E2E) and per-op numbers.
Proposed idea
Decouple data collection from serialization. During inference, record events into in-memory objects (cheap appends, no flatbuffer work). Then, after inference completes (or rather, when the user requests the data), walk the collected objects once and serialize the flatbuffer in a single pass. This keeps the flatbuffer format and downstream tooling unchanged. The downside is higher memory use, since events stay in memory until they are serialized, but that may be a worthwhile tradeoff for some users. I'd recommend making this an alternative implementation of ETDumpGen rather than a replacement.
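To make the collect-then-serialize split concrete, here is a minimal sketch in Python. It is not ETDump code: `ProfileEvent`, `DeferredEventLog`, and the byte layout are hypothetical names and a stand-in format chosen for illustration, and a real implementation would drive the flatbuffer builder in `serialize()` instead of `struct.pack`.

```python
import struct
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProfileEvent:
    # Hypothetical event record; the fields are illustrative and do not
    # mirror the ETDump schema.
    name: str
    start_ns: int
    end_ns: int


@dataclass
class DeferredEventLog:
    # Hypothetical collector: holds events in memory until requested.
    events: List[ProfileEvent] = field(default_factory=list)

    def record(self, name: str, start_ns: int, end_ns: int) -> None:
        # Hot path (during inference): one list append, no serialization.
        self.events.append(ProfileEvent(name, start_ns, end_ns))

    def serialize(self) -> bytes:
        # Cold path (after inference): walk the events once and emit bytes.
        # Stand-in layout, not flatbuffer: u32 event count, then per event
        # a u16 name length, the UTF-8 name, and two u64 timestamps.
        out = bytearray(struct.pack("<I", len(self.events)))
        for ev in self.events:
            encoded = ev.name.encode("utf-8")
            out += struct.pack("<H", len(encoded))
            out += encoded
            out += struct.pack("<QQ", ev.start_ns, ev.end_ns)
        return bytes(out)


log = DeferredEventLog()
log.record("add", 10, 20)   # cheap append while the model runs
log.record("mul", 20, 35)
blob = log.serialize()      # all encoding work happens here, off the critical path
```

The memory cost mentioned above shows up in `events`: it grows with the number of traced events until `serialize()` is called, which is the tradeoff for keeping the hot path to a single append.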
Alternatives
No response
Additional context
No response
RFC (Optional)
No response