feat: Support configurable mesh locality and device count in TPU benchmarks by simrankaurb · Pull Request #125 · AI-Hypercomputer/accelerator-microbenchmarks

simrankaurb · 2026-06-04T18:35:29Z

This PR introduces dynamic parameterization for mesh locality and device counts. It allows benchmarks to configure whether they run on the local host or the full slice, while preserving the original full-slice behavior as the default.

Changes

1. Utility Changes (`Ironwood/src/benchmark_utils.py`)

Updated `create_mesh` to accept a `local_mesh` boolean parameter (defaults to `False`). If `True`, it restricts mesh creation to `jax.local_devices()`.
Updated the sharding helpers (`handle_per_device_based_on_sharding` and `handle_all_devices_based_on_sharding`) to accept an explicit `device_count` parameter.
Updated `handle_based_on_sharding` to accept an optional `device_count` parameter (defaults to `jax.device_count()`) and pass it down to the helpers.
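
A minimal sketch of the two utility changes, assuming a 1D mesh and a simplified helper signature. The names `create_mesh_sketch` and `handle_based_on_sharding_sketch` and the `NO_SHARDING` member are hypothetical; only `local_mesh`, `device_count`, and `ShardingStrategy.SHARDING_ON_ALL_DEVICES_WITH_M` appear in this PR, and the real functions in `benchmark_utils.py` may differ.

```python
import enum
from typing import Optional

import jax
import numpy as np
from jax.sharding import Mesh


class ShardingStrategy(enum.Enum):
    NO_SHARDING = enum.auto()  # hypothetical member, for illustration only
    SHARDING_ON_ALL_DEVICES_WITH_M = enum.auto()


def create_mesh_sketch(axis_name: str = "x", local_mesh: bool = False) -> Mesh:
    """Build a 1D mesh over this host's devices or over the full slice."""
    # local_mesh=True restricts the mesh to devices attached to this host;
    # the default (False) keeps the original full-slice behavior.
    devices = jax.local_devices() if local_mesh else jax.devices()
    return Mesh(np.array(devices), (axis_name,))


def handle_based_on_sharding_sketch(
    value: int,
    strategy: ShardingStrategy,
    device_count: Optional[int] = None,
) -> int:
    """Return the per-device share of `value` for the given strategy."""
    # Fall back to the global device count when the caller passes nothing,
    # so existing call sites keep reporting full-slice numbers.
    if device_count is None:
        device_count = jax.device_count()
    if strategy is ShardingStrategy.SHARDING_ON_ALL_DEVICES_WITH_M:
        return value // device_count
    return value
```

Because both new parameters have defaults, a call such as `create_mesh_sketch()` behaves as it would have before the change in this sketch.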

2. Benchmark API Changes (`Ironwood/src/benchmark_gemm.py`)

Exposed `run_on_local_node: bool = False` in all 5 GEMM benchmark signatures and their metrics calculation functions.
Passed `local_mesh=run_on_local_node` to `create_mesh`.
Dynamically determined the active `device_count` (`jax.local_device_count()` if `run_on_local_node` is `True`, else `jax.device_count()`) and passed it to `handle_based_on_sharding` to ensure accurate FLOPs/throughput calculation.
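
The device-count selection described above can be sketched as follows. `active_device_count` and `gemm_tflops` are hypothetical helpers, not functions from `benchmark_gemm.py`, and the sketch assumes `m`, `n`, and `k` are the global GEMM dimensions and that throughput is derived from the standard `2 * m * n * k` FLOP count.

```python
import jax


def active_device_count(run_on_local_node: bool = False) -> int:
    """Number of devices the benchmark actually runs on."""
    # Single-host runs count only this host's devices; the default counts
    # every device in the slice.
    if run_on_local_node:
        return jax.local_device_count()
    return jax.device_count()


def gemm_tflops(m: int, n: int, k: int, seconds: float, device_count: int):
    """Return (total TFLOPs/s, TFLOPs/s per device) for one GEMM."""
    total = 2 * m * n * k / seconds / 1e12
    # Dividing by the wrong device count would skew the per-device figure,
    # which is why the active count is passed in explicitly.
    return total, total / device_count
```

With `run_on_local_node=True`, the per-device figure is normalized by the host's device count rather than the slice's.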

Backward Compatibility & Safety

All other compute benchmarks (`benchmark_compute.py`, `benchmark_inference_compute.py`, etc.) remain untouched.
Because `local_mesh` defaults to `False` and `device_count` defaults to `None` (falling back to the global `jax.device_count()`), all untouched benchmarks continue to execute and report metrics across the full slice without any changes.

linamy85 · 2026-06-05T06:09:12Z

Thanks @simrankaurb. Have we verified the functionality and metrics correctness of this change?

Should we also cover HBM and H2D/D2H?

simrankaurb · 2026-06-08T07:55:35Z

Thanks for the comment, @linamy85! I have tested the change and pasted all logs in go/tpu-single-host-analysis. The HBM and H2D/D2H benchmarks already appear to run on local devices only. Please take a look and let me know if you have any concerns.
Thanks!!

linamy85 · 2026-06-09T06:02:39Z

            return value
        case ShardingStrategy.SHARDING_ON_ALL_DEVICES_WITH_M:
-            return value // jax.device_count()
+            return value // jax.local_device_count()


If we change this to `local_device_count()`, the full node-pool scenario might not work as intended. Should we use a parameter to control this switch? For example:

In `benchmark_utils.py`, we can accept `device_count` as a parameter.

In `benchmark_gemm.py`, we can add a new parameter (e.g. `run_on_local_node`, defaulting to `False`).

WDYT?

cc @simonleesyuan30
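
To make the concern concrete, here is a small worked example. The device counts (16 in the slice, 8 per host) are assumed for illustration, and `shard_m` is a hypothetical stand-in for the `SHARDING_ON_ALL_DEVICES_WITH_M` branch shown in the diff.

```python
def shard_m(value: int, device_count: int) -> int:
    """Per-device share of the M dimension when M is split across devices."""
    return value // device_count


GLOBAL_DEVICES = 16  # assumed: devices in the full slice
LOCAL_DEVICES = 8    # assumed: devices attached to one host

m = 16384

# Full-slice run: M is split across all 16 devices.
full_slice_rows = shard_m(m, GLOBAL_DEVICES)

# Hard-coding the local count on the same full-slice run would divide by 8
# and report a shard twice as large as the one each device really holds.
hard_coded_local_rows = shard_m(m, LOCAL_DEVICES)

print(full_slice_rows, hard_coded_local_rows)
```

Passing `device_count` in as a parameter lets each caller state which of the two divisors applies to its run.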

Thanks! Please let me know if the design proposed in this one pager sits well with the expected behaviour: https://docs.google.com/document/d/1y0K2VJT0BLYV2-ZqHlm0RDV6WD8jHvJ5eIr_KBbjohA

Implementation completed. PTAL!

Testing: Ran GEMM for single host and full slice. Here are the configs and values recorded: https://paste.googleplex.com/6202599689814016
The ratio of TFLOPs/s to TFLOPs/s/device gives the number of devices (8 and 16) in the two cases.
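
The check described above amounts to dividing the aggregate throughput by the per-device throughput. A minimal sketch, where `implied_device_count` is a hypothetical helper and the throughput values are illustrative placeholders, not the recorded measurements:

```python
def implied_device_count(total_tflops: float, per_device_tflops: float) -> int:
    """Recover the device count a run was normalized by."""
    return round(total_tflops / per_device_tflops)


# Illustrative numbers only: same per-device rate, different aggregate rate.
single_host = implied_device_count(total_tflops=800.0, per_device_tflops=100.0)
full_slice = implied_device_count(total_tflops=1600.0, per_device_tflops=100.0)

print(single_host, full_slice)
```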

simrankaurb added 2 commits June 8, 2026 14:17

Fix mesh creation to use local devices for single-host benchmarks

55d5c6e

Fix pylint errors in benchmark_collectives.py

c1da258

simrankaurb force-pushed the production-mesh-fix branch from 6b24c1a to c1da258 Compare June 8, 2026 14:18

linamy85 reviewed Jun 9, 2026

simrankaurb force-pushed the production-mesh-fix branch from 0cba43b to c3852dc Compare June 9, 2026 07:04

simrankaurb changed the title ~~Fix mesh creation to use local devices for single-host benchmarks~~ feat: Support configurable mesh locality and device count in TPU benchmarks Jun 9, 2026

fix: support configurable mesh locality and device count for GEMM

946541b

simrankaurb force-pushed the production-mesh-fix branch from c3852dc to 946541b Compare June 9, 2026 07:17

linamy85 merged commit f010c92 into AI-Hypercomputer:chs Jun 9, 2026
2 checks passed

linamy85 merged 3 commits into AI-Hypercomputer:chs from simrankaurb:production-mesh-fix

2 participants
