ARM Ethos-U: Operators not properly quantized when `cat` is present #20486

### 🐛 Describe the bug

When lowering this MWE for ARM Ethos-U55:
```python
from pathlib import Path

import torch  # needed for torch.ops, torch.cat, torch.zeros and torch.export below
import torch.nn as nn


class NoState1DequantPerChannelMWE(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv0 = nn.Conv2d(2, 2, kernel_size=(1, 1), bias=False)

    def forward(self, x):
        y = torch.ops.aten.hardtanh.default(self.conv0(x), 0.0, 6.0)
        y = torch.cat([x, y], dim=3)
        return y


x_cf = torch.zeros(1, 2, 1, 1)

no_state1_mwe = torch.export.export(
    NoState1DequantPerChannelMWE().eval(),
    (x_cf,),
    strict=True,
)
no_state1_mwe_path = Path("mwe.pt2")
torch.export.save(no_state1_mwe, str(no_state1_mwe_path))
```

with this lowering code:
```python
from pathlib import Path

import torch

from executorch.backends.arm.ethosu import EthosUCompileSpec, EthosUPartitioner
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from executorch.backends.cortex_m.passes.quantized_op_fusion_pass import (
    QuantizedOpFusionPass,
)
from executorch.backends.cortex_m.passes.replace_quant_nodes_pass import (
    ReplaceQuantNodesPass,
)
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.exir.passes.memory_planning_pass import MemoryPlanningPass
from executorch.extension.export_util.utils import save_pte_program
from torch.export import export
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

COMPILE_SPEC = EthosUCompileSpec(
    "ethos-u55-128",
    config_ini=str(Path.cwd() / "my_vela.ini"),
)

MWE_INPUT = torch.randn(1, 2, 1, 1)


def export_pt2_to_pte(pt2_path: str | Path) -> None:
    pt2_path = Path(pt2_path)
    output_path = pt2_path.with_suffix(".helper.pte")

    exported_program = torch.export.load(str(pt2_path))
    module = exported_program.module(check_guards=False)

    quantizer = EthosUQuantizer(COMPILE_SPEC)
    quantizer.set_global(get_symmetric_quantization_config())
    prepared_model = prepare_pt2e(module, quantizer)
    prepared_model(MWE_INPUT)
    quantized_model = convert_pt2e(prepared_model)

    quantized_exported_program = export(
        quantized_model,
        (MWE_INPUT,),
        strict=True,
    )
    edge_program_manager = to_edge_transform_and_lower(
        quantized_exported_program,
        partitioner=[EthosUPartitioner(COMPILE_SPEC)],
        compile_config=EdgeCompileConfig(_check_ir_validity=False),
    )
    edge_program_manager = edge_program_manager.transform(
        [ReplaceQuantNodesPass(), QuantizedOpFusionPass()]
    )

    executorch_program = edge_program_manager.to_executorch(
        config=ExecutorchBackendConfig(
            memory_planning_pass=MemoryPlanningPass(alloc_graph_input=False),
            extract_delegate_segments=False,
        )
    )

    save_pte_program(executorch_program, str(output_path))


if __name__ == "__main__":
    # Lower the MWE saved by the first script.
    export_pt2_to_pte("mwe.pt2")
```
the output contains a convolution operator and an activation that are neither quantized nor lowered to the U55. Lowering emits a warning that these nodes could not be lowered to the U55 because "One or more inputs were not quantized".
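
To see where quantization stops before the partitioner runs, it can help to list, for each compute node in the converted graph, which inputs do not come from a dequantize node. The helper below is a rough sketch, not part of ExecuTorch: the function name is made up, and it assumes the q/dq ops carry `quantize` / `dequantize` in their target names. It only reads standard `torch.fx` node attributes (`op`, `target`, `name`, `all_input_nodes`).

```python
def inputs_not_from_dequantize(graph_module):
    """Map each compute node name to the inputs not produced by a dequantize node.

    Hypothetical debugging helper. Assumes quantize/dequantize ops can be
    recognised by the substring "quantize" in str(node.target).
    """
    report = {}
    for node in graph_module.graph.nodes:
        if node.op != "call_function":
            continue  # skip placeholders, get_attr and output nodes
        if "quantize" in str(node.target):
            continue  # skip the quantize/dequantize nodes themselves
        missing = [
            inp.name
            for inp in node.all_input_nodes
            if not (inp.op == "call_function" and "dequantize" in str(inp.target))
        ]
        if missing:
            report[node.name] = missing
    return report
```

In the lowering script this would be called as `inputs_not_from_dequantize(quantized_model)` right after `convert_pt2e`. Treat the result as a hint only: an activation that the quantizer annotates together with its convolution is fed directly by the conv and will be listed even in a correctly quantized graph.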

~~Note: For this to reproduce, one has to set `Sram_write_latency=16` in `vela.ini` (default value is 32, which doesn't seem to reproduce this issue. Haven't tested other values.)~~
For some reason, this issue disappears when using the default `vela.ini` and reappears when passing `my_vela.ini`, which is simply a copy of `vela.ini` placed in the local directory. I don't understand what mechanism would cause this. If it doesn't reproduce for you, I would appreciate any help in narrowing down the cause.
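
One way to rule out an accidental difference between the two files is to compare them both byte-for-byte and setting-by-setting. The sketch below uses only the standard library; the function names are illustrative, and it assumes both files parse as plain INI (sections with `key=value` lines), which may not hold for every Vela configuration.

```python
import configparser
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex digest of the file contents, to check byte-for-byte equality."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def _load_ini(path):
    # interpolation=None keeps '%' literal; strict=False tolerates repeated keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # preserve key case, e.g. Sram_write_latency
    parser.read_string(Path(path).read_text())
    return parser


def ini_differences(path_a, path_b):
    """Return {(section, key): (value_a, value_b)} for every setting that differs."""
    a, b = _load_ini(path_a), _load_ini(path_b)
    diffs = {}
    for section in sorted(set(a.sections()) | set(b.sections())):
        keys_a = dict(a.items(section)) if a.has_section(section) else {}
        keys_b = dict(b.items(section)) if b.has_section(section) else {}
        for key in sorted(set(keys_a) | set(keys_b)):
            if keys_a.get(key) != keys_b.get(key):
                diffs[(section, key)] = (keys_a.get(key), keys_b.get(key))
    return diffs


# Example (both paths are placeholders for wherever the files actually live):
# print(sha256_of("vela.ini") == sha256_of("my_vela.ini"))
# print(ini_differences("vela.ini", "my_vela.ini"))
```

If the digests match and `ini_differences` returns an empty dict, the file contents are not the cause, and the difference would have to come from how the `config_ini` path itself is handled.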

## Expected Outcome
Fully quantized graph

## Actual Outcome
Conv and ReLU6 (hardtanh) remain FP32 ops

<img width="302" height="977" alt="Image" src="https://github.com/user-attachments/assets/7e7a0dc6-4e5c-4075-bc28-7ac7b84a818e" />



### Versions

Collecting environment information...
PyTorch version: 2.12.0+cpu
ExecuTorch version: 1.3.1+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.8 (++20240731025043+3b5b5c1ec4a3-1~exp1~20240731145144.92)
CMake version: version 4.1.2
Libc version: glibc-2.39

Python version: 3.10.15 (main, Oct 16 2024, 04:37:23) [Clang 18.1.8 ] (64-bit runtime)
Python platform: Linux-6.18.33.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani
