Make build throughput ignore skipped images#552

Open
simonrosenberg wants to merge 2 commits into main from fix/clarify-build-telemetry

Conversation


@simonrosenberg (Collaborator) commented Mar 22, 2026

Summary

  • make throughput mean built images/hour everywhere
  • keep built/skipped/failed counts explicit in batch and final logs
  • store a single built-only throughput metric in build-summary.json

Why

Skipped images should not inflate throughput. The build logs and summary should show skips clearly, but any throughput number used to reason about build efficiency should be computed from non-skipped images only.

Changes

  • batch logs now print built, skipped, failed, and a single built-only throughput
  • final logs now print built, skipped, failed, retried, and a single built-only throughput
  • build-summary.json now stores throughput_images_per_hour, computed from built images only
  • markdown summaries now show a single throughput value based on built images only
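The built-only metric described in the changes above can be sketched as follows (a minimal sketch; the function name and signature are assumptions, not the PR's exact code):

```python
def built_only_throughput(built: int, wall_clock_seconds: float) -> float:
    """Images built per hour, excluding skipped and failed images.

    Skips finish quickly, so counting them would inflate the metric;
    only genuinely built images contribute to the numerator.
    """
    if wall_clock_seconds <= 0:
        return 0.0
    return (built / wall_clock_seconds) * 3600
```

For example, 2 built images over a 45-second run yields 160 images/hour, regardless of how many images were skipped in the same window.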

Validation

  • python3 -m ruff check benchmarks/utils/build_manifest.py benchmarks/utils/build_utils.py tests/test_build_manifest.py tests/test_image_utils.py
  • python3 -m ruff format benchmarks/utils/build_manifest.py benchmarks/utils/build_utils.py tests/test_build_manifest.py tests/test_image_utils.py --check
  • PYTHONPATH=/Users/simonrosenberg/tmp/benchmarks-telemetry-clarity python3 -m pytest tests/test_build_manifest.py tests/test_image_utils.py -k 'build_manifest or built_images_for_throughput'

Notes

The repo's pre-commit hook could not run in this detached worktree because uv could not resolve the local openhands-sdk workspace member from this checkout shape. The targeted lint/tests above were run directly instead.

all-hands-bot's earlier review comment was marked as outdated.

@simonrosenberg changed the title from "Clarify build throughput and cache telemetry" to "Make build throughput ignore skipped images" on Mar 22, 2026

@all-hands-bot left a comment


🟡 Acceptable - Pragmatic fix for a real observability problem, but has inefficiency that should be cleaned up.

VERDICT: ✅ Worth merging after addressing the double-calculation issue

KEY INSIGHT: The change correctly separates "images processed" from "images built" in throughput metrics, preventing fast skips from masking slow builds. However, the implementation calculates throughput twice and throws away the first result.

Comment on lines 934 to 941
```diff
     [result.model_dump(mode="json") for result in results],
     manifest_files=1 if results else 0,
 )
 summary_file.write_text(summary.model_dump_json(indent=2), encoding="utf-8")
 overall_duration = time.monotonic() - overall_started_monotonic
-throughput = (
-    (len(base_images) / overall_duration) * 3600 if overall_duration else 0.0
-)
+summary.wall_clock_seconds = _round_duration(overall_duration)
+summary.throughput_images_per_hour = (
+    (built / overall_duration) * 3600 if overall_duration else 0.0
+)
```


🟠 Important: Wasteful double calculation - you call summarize_build_records() which parses timestamps and calculates throughput, then immediately overwrite those values with monotonic timer measurements.

Better approach: Pass wall_clock_seconds as an optional parameter to summarize_build_records() so it can skip the timestamp calculation when you already have accurate timing from monotonic timers.

```python
summary = summarize_build_records(
    [result.model_dump(mode="json") for result in results],
    manifest_files=1 if results else 0,
    wall_clock_seconds=overall_duration,  # Pass it in
)
```

Then in summarize_build_records(), only parse timestamps if wall_clock_seconds is None. This is the issue the previous reviewer flagged.

Comment on lines 204 to +206
```python
).total_seconds()
throughput_images_per_hour = None
if wall_clock_seconds and wall_clock_seconds > 0:
```


🟡 Suggestion: Consider whether storing both wall_clock_seconds and throughput_images_per_hour creates unnecessary redundancy. One can always be derived from the other + built count.

However, if this is for convenience (avoiding recalculation in consumers), that's pragmatic - just ensure they stay consistent. The current implementation is fine if you address the double-calculation issue above.
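If both fields are stored for convenience, a cheap consistency check keeps them honest (a sketch; the function name and tolerances are assumptions):

```python
import math


def throughput_fields_consistent(
    built: int,
    wall_clock_seconds: float,
    throughput_images_per_hour: float,
) -> bool:
    """Check the stored throughput re-derives from built count and wall clock."""
    if wall_clock_seconds <= 0:
        return throughput_images_per_hour == 0.0
    derived = (built / wall_clock_seconds) * 3600
    return math.isclose(
        derived, throughput_images_per_hour, rel_tol=1e-9, abs_tol=1e-9
    )
```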

Comment on lines +577 to +616
```python
@contextlib.contextmanager
def fake_prepare_cached_sdist():
    yield None

class FakeFuture:
    def __init__(self, result: BuildOutput):
        self._result = result

    def result(self) -> BuildOutput:
        return self._result

class FakeExecutor:
    def __init__(self, *args, **kwargs):
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False

    def submit(self, fn, **kwargs):
        base = kwargs["base_image"]
        if base == "base-2":
            result = BuildOutput(
                base_image=base,
                tags=[f"tag:{base}"],
                error=None,
                status="skipped_remote_exists",
                skip_reason="remote_image_exists",
                duration_seconds=1.0,
            )
        else:
            result = BuildOutput(
                base_image=base,
                tags=[f"tag:{base}"],
                error=None,
                status="built",
                duration_seconds=10.0,
            )
```


🟢 Acceptable: Heavy use of mocks (FakeFuture, FakeExecutor) tests implementation details rather than real behavior. Normally I'd say "mocks aren't tests," but given that the real code path runs Docker builds, this is a pragmatic trade-off to keep tests fast and deterministic.

The test does verify the actual calculation logic (2 built / 45s = 160 images/hour), which is what matters here.
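The cited figure checks out; here is a standalone version of the arithmetic (the statuses list and wall-clock value are hypothetical, mirroring the test scenario):

```python
statuses = ["built", "skipped_remote_exists", "built"]  # hypothetical batch
built = sum(1 for status in statuses if status == "built")
wall_clock_seconds = 45.0  # overall wall clock, not a sum of per-image durations
throughput = (built / wall_clock_seconds) * 3600  # 2 built / 45 s -> 160 images/hour
```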

