diff --git a/docs/src/content/docs/configuration/fp8-storage.mdx b/docs/src/content/docs/configuration/fp8-storage.mdx
index e28b68e2400..65c76d3b380 100644
--- a/docs/src/content/docs/configuration/fp8-storage.mdx
+++ b/docs/src/content/docs/configuration/fp8-storage.mdx
@@ -23,7 +23,7 @@ There is no hardware requirement for FP8 *compute* — InvokeAI casts back to FP
 
 ## Hardware support tiers
 
-Because InvokeAI's FP8 path uses `enable_layerwise_casting` — storage in FP8, compute in BF16/FP16 — the practical benefit of toggling FP8 Storage depends on what your GPU can do natively. There are three tiers:
+InvokeAI's FP8 path stores weights in FP8 and casts them back to BF16/FP16 on each forward pass via its own `register_forward_pre_hook` / `register_forward_hook` wrappers (the same skip list as diffusers' `apply_layerwise_casting`, but applied to every `nn.Module` — including diffusers `ModelMixin` subclasses — so it composes correctly with InvokeAI's `CustomLinear` and partial loading). The practical benefit of toggling FP8 Storage depends on what your GPU can do natively. There are three tiers:
 
 ### RTX 30-series and older Ampere workstation cards — VRAM win only
 
@@ -31,11 +31,15 @@ The toggle works as advertised: the UNet / transformer drops by roughly 50% on t
 
 ### RTX 40-series, RTX 50-series, and Hopper — VRAM win today, compute win possible later
 
-These GPUs have native FP8 tensor cores. The toggle still buys you the same ~50% VRAM reduction today, because the forward pass still runs in BF16 under `enable_layerwise_casting`. If InvokeAI later wires up a true FP8 matmul path (e.g. via `torchao`), the same toggle will *also* unlock compute speedups on this hardware. Until then, treat the benefit as "VRAM only, same as Ampere".
+These GPUs have native FP8 tensor cores. The toggle still buys you the same ~50% VRAM reduction today, because the forward pass still runs in BF16 — the hook casts weights back up to compute precision before each layer. If InvokeAI later wires up a true FP8 matmul path (e.g. via `torchao`), the same toggle will *also* unlock compute speedups on this hardware. Until then, treat the benefit as "VRAM only, same as Ampere".
 
-### Pre-Ampere Nvidia, MPS, and CPU — no-op
+### Older CUDA cards — still a VRAM win
 
-FP8 Storage is silently disabled on anything that is not CUDA, and it is not meaningful on pre-Ampere CUDA cards either. On CPU PyTorch *technically* supports FP8 dtypes, but the cast operations are software-emulated and end up costing more than the memory savings buy back, so InvokeAI does not apply FP8 Storage on CPU. If you toggle it on unsupported hardware, the loader logs nothing and returns the model unchanged — the UI may also grey the toggle out or show a "not supported on this hardware" note.
+`float8_e4m3fn` is a pure storage dtype in PyTorch and works on any CUDA device, so pre-Ampere cards (GTX 16-series, RTX 20-series, etc.) get the same ~50% VRAM reduction as Ampere. There are no native FP8 tensor cores on these GPUs, so the throughput trade-off is the same as on the 30-series: cast in, compute in BF16/FP16, cast back out.
+
+### MPS and CPU — no-op
+
+FP8 Storage is silently disabled on anything that is not CUDA. On CPU PyTorch *technically* supports FP8 dtypes, but the cast operations are software-emulated and end up costing more than the memory savings buy back, so InvokeAI gates the entire path on `device.type == "cuda"`. If you toggle it on CPU or MPS, the loader skips the cast and returns the model unchanged with no log line.
 
 ## Enabling FP8 Storage
 
@@ -108,3 +112,17 @@ Disable FP8 Storage for that model in Model Manager and reload. If quality is re
 ### "RuntimeError: ... float8_e4m3fn ..."
 
 You're on a PyTorch version that predates FP8 support. Reinstall InvokeAI using the official launcher — the bundled torch version supports FP8.
+
+### Reporting an FP8 issue
+
+If FP8 Storage misbehaves — crash, quality regression, OOM that shouldn't happen — please [open a GitHub issue](https://github.com/invoke-ai/InvokeAI/issues/new/choose) and include:
+
+- **What you did**: the workflow / generation step that triggered the problem, and whether it reproduces every time.
+- **Model**: exact name and variant (e.g. "FLUX.2 Klein 9B Diffusers", "SDXL Base 1.0 single-file"), and whether the file is a full-precision checkpoint or already quantized (GGUF / NF4 / int8).
+- **LoRAs**: whether any LoRAs (or ControlLoRAs) are stacked on the model, and how many.
+- **Other toggles**: Low-VRAM mode on/off, any `cpu_only` text encoder setting, configured VRAM limit.
+- **GPU**: model and VRAM size (e.g. "RTX 3090 24 GB", "RTX 4070 Ti 12 GB").
+- **OS**: Windows or Linux, plus driver / CUDA version if you have it.
+- **Logs**: the InvokeAI log around the failure — in particular the `FP8 layerwise casting enabled for ` line (or its absence) and any traceback.
+
+A side-by-side image comparison (FP8 on vs. FP8 off, same seed) is extremely useful for quality regressions.
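
As a reading aid for the hardware tiers documented in this patch, here is a minimal sketch of how the three tiers could be told apart. `fp8_storage_tier` and its return strings are hypothetical names chosen for illustration, not InvokeAI code. The only rule taken from the patch is the gate on `device.type == "cuda"`; the compute capability threshold of 8.9 for native FP8 tensor cores (RTX 40-series and newer, Hopper) is an assumption added here.

```python
from typing import Optional, Tuple


def fp8_storage_tier(
    device_type: str,
    compute_capability: Optional[Tuple[int, int]] = None,
) -> str:
    """Hypothetical helper: map a device onto the tiers described in the docs."""
    # The patch states that the whole FP8 path is gated on device.type == "cuda",
    # so CPU and MPS fall through to a no-op.
    if device_type != "cuda":
        return "no-op"

    # Assumption (not from the patch): native FP8 tensor cores start at
    # compute capability 8.9, which covers RTX 40-series, Hopper (9.0),
    # and RTX 50-series.
    if compute_capability is not None and compute_capability >= (8, 9):
        return "vram-win-today-compute-win-possible-later"

    # Any other CUDA device (Ampere and older) stores weights in FP8 and
    # computes in BF16/FP16, so the benefit is VRAM only.
    return "vram-win-only"
```

On a machine with PyTorch installed, the capability tuple could be obtained from `torch.cuda.get_device_capability()`, which returns `(major, minor)`; the sketch deliberately takes it as a plain argument so it runs without a GPU.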