docs: compliance audit module redesign plan#332
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| result = verify_test04_dirs(reference_dir, test04_dir, threshold=threshold) | ||
| except (FileNotFoundError, ValueError) as e: | ||
| print(f"TEST04 ERROR: {e}", file=sys.stderr) | ||
| sys.exit(2) | ||
| verdict_path = write_verdict(result, output_dir or test04_dir) |
There was a problem hiding this comment.
The test04 CLI command only catches FileNotFoundError and ValueError when calling verify_test04_dirs. However, file reading inside verify_test04_dirs can raise other OSError subclasses like PermissionError. Furthermore, write_verdict is called outside the try-except block and can also raise OSError (e.g., if the directory is read-only or disk is full). Wrapping both calls in a try-except block that catches OSError ensures all setup/IO errors are handled gracefully and exit with code 2 as documented.
| try: | |
| result = verify_test04_dirs(reference_dir, test04_dir, threshold=threshold) | |
| except (FileNotFoundError, ValueError) as e: | |
| print(f"TEST04 ERROR: {e}", file=sys.stderr) | |
| sys.exit(2) | |
| verdict_path = write_verdict(result, output_dir or test04_dir) | |
| try: | |
| result = verify_test04_dirs(reference_dir, test04_dir, threshold=threshold) | |
| verdict_path = write_verdict(result, output_dir or test04_dir) | |
| except (OSError, ValueError) as e: | |
| print(f"TEST04 ERROR: {e}", file=sys.stderr) | |
| sys.exit(2) |
| report = Report.from_snapshot(json.loads(snapshot_path.read_bytes())) | ||
| qps = report.qps() |
There was a problem hiding this comment.
In _run_stats_from_dir, Report.from_snapshot is called directly on the loaded JSON. If the snapshot JSON is malformed or has an unexpected structure, Report.from_snapshot can raise KeyError or TypeError. These exceptions are not caught by verify_test04_dirs or the CLI's test04 command, leading to an unhandled crash with a traceback instead of a clean exit with code 2. Wrapping the call in a try-except block and raising a ValueError ensures these errors are handled gracefully.
try:
report = Report.from_snapshot(json.loads(snapshot_path.read_bytes()))
except (ValueError, KeyError, TypeError) as e:
raise ValueError(f"Failed to parse snapshot in {run_dir}: {e}") from e
qps = report.qps()…review Address gemini-code-assist review on PR mlcommons#332: - CLI catches OSError (PermissionError etc.) and write_verdict failures, not just FileNotFoundError/ValueError — all map to exit 2. - _audit_marker tolerates non-dict results.json (isinstance guards) instead of raising AttributeError. - _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| # # 1. Reference performance run | ||
| # inference-endpoint benchmark from-config \ | ||
| # examples/09_Wan22_VideoGen_Example/offline_wan22.yaml | ||
| # | ||
| # # 2. This TEST04 run (audit: test04 issues one fixed prompt) | ||
| # inference-endpoint benchmark from-config \ | ||
| # examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml | ||
| # | ||
| # # 3. Compare throughput (videos/s) and write the verdict | ||
| # python -m inference_endpoint.compliance test04 \ | ||
| # logs/wan22_video_generation_benchmark \ | ||
| # logs/wan22_video_generation_test04 |
There was a problem hiding this comment.
This doesn't seem the best flow to run audit tests. In Test04, only partial of the original queries will be sent (25/50?) and same in repeated tests, and you should be running 1 command instead of 2 to avoid underlying endpoints to change (back to back). You would be forced to run the whole 248 queries if you design this way
My thinking is that you should define either the loadgen (with some audit pattern) or datasets (with type: audit) to run this behavior, and inference endpoints will internally run 2 stages and compare the perf. Right now the workflow is segregated and seems clunky to run.
@arekay-nv @nv-alicheng to share your opinions as well
Update summaryAll review feedback has been addressed. Here is what changed since the original submission: Architecture (main concern)
Config shape
audit: "test04"
datasets:
- name: wan22_prompts
path: wan22_prompts.jsonl
type: "performance"
samples: 50 # reference phase query count (50–144)
- name: wan22_audit
path: wan22_prompts.jsonl
type: "audit"
samples: 25 # audit phase query count (25–50)
audit_sample_index: 0Robustness
Testing
Example config
|
9057190 to
b547f1d
Compare
cdbae64 to
eae1234
Compare
| - name: wan22_prompts | ||
| path: examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl | ||
| type: "audit" | ||
| ref_samples: 50 # reference phase query count |
There was a problem hiding this comment.
I thought we choose the sample ref_samples and audit_samples - is it intended to be 50 and 25?
There was a problem hiding this comment.
Sorry, I am still a bit confused - we are comparing the QPS of 50 dataset samples with 20 repeated samples? That doesn't seem fair - can you remind me when the decision was made?
There was a problem hiding this comment.
In wan22, ref samples is current set as 248 and audit samples is 20.
| @@ -0,0 +1,329 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
This is an odd file name. test_audit_test04 probably
b190d21 to
e0de06f
Compare
There was a problem hiding this comment.
change max_duration_ms to a reasonable number.
385630c to
c1e48bf
Compare
|
All review feedback has been addressed. Here's a summary of what changed: Architecture Sample counts & index SingleStream Durations Robustness fixes (Gemini)
Cleanup
|
| @@ -0,0 +1,60 @@ | |||
| # Offline TEST04 Compliance Run for WAN 2.2 (GB200/GB300) | |||
There was a problem hiding this comment.
The naming of the file is not consistent. some have offline/singlestream, some have audit_test04. Can you fix it?
nvzhihanj
left a comment
There was a problem hiding this comment.
Review Council — first-principles design review
Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough
Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.
| other_datasets = [d for d in config.datasets if d.type != DatasetType.AUDIT] | ||
|
|
||
| ref_perf_ds = audit_ds.model_copy( | ||
| update={"type": DatasetType.PERFORMANCE, "samples": audit_ds.ref_samples} |
There was a problem hiding this comment.
[Claude] high · bug — ref_samples is a dead write.
ref_perf_ds = audit_ds.model_copy(update={..., "samples": audit_ds.ref_samples}) sets the Dataset.samples field (schema.py:308), but that field is consumed nowhere in src/: DataLoaderFactory.create_loader reads path/format/parser/name only, and there is no .samples read anywhere. Meanwhile ref_config (the lines just below) never sets n_samples_to_issue. So with the shipped example (min_duration_ms: 600000) the reference phase runs duration-driven and silently ignores ref_samples, while the audit phase does honor audit_samples (it sets n_samples_to_issue=audit_n). The two phases the verdict compares therefore run mismatched counts, and the PR's "ref_samples = reference phase query count" is false. Fix: set n_samples_to_issue=ref_samples on ref_config.settings.runtime exactly as the audit phase does (and drop the no-op samples= updates).
|
|
||
| """MLPerf compliance checks for endpoint benchmarking.""" | ||
|
|
||
| from .test04 import ( |
There was a problem hiding this comment.
[Claude] high · design — No AuditTest abstraction; TEST04 is hardcoded end-to-end.
The compliance package exports only test04-named symbols, AuditMode has a single value, run_benchmark hardcodes if config.audit == AuditMode.TEST04: _run_benchmark_test04(...) (execute.py:1228), and the orchestrator is bespoke. Why re-design: the stated goal is a modular framework, but adding TEST01/TEST05 here means a cross-cutting edit of the enum + schema + load generator + verifier + orchestration for every new audit. Suggested shape: an AuditTest protocol in compliance/ — plan_runs(config) -> list[RunSpec] and verify(reports) -> Verdict — registered by AuditMode; run_benchmark dispatches audit.plan_runs(...) generically and never names a specific test.
|
|
||
| PERFORMANCE = "performance" | ||
| ACCURACY = "accuracy" | ||
| AUDIT = "audit" |
There was a problem hiding this comment.
[Claude] high · design — DatasetType.AUDIT is an abstraction leak.
It is a dataset type the normal loader deliberately ignores-with-warning (_load_datasets, execute.py:256), exists only to carry TEST04 scalars (ref_samples/audit_samples/audit_sample_index, schema.py:321-335) on the shared Dataset model used by perf/accuracy datasets too, and is immediately model_copy(update={"type": PERFORMANCE})-converted in _run_benchmark_test04. Why re-design: a fake dataset type the loader can't load, smuggling test-specific params onto the shared schema, forces every future audit to either piggyback more Dataset fields or invent another fake type. Suggested shape: drop DatasetType.AUDIT; move the audit params into a structured audit: sub-config block, and let the audit's plan_runs derive each phase from a normal PERFORMANCE dataset + that block.
|
|
||
| # kw_only so trailing defaults here don't force defaults on subclass fields | ||
| # (e.g. _RuntimeSettings.model in rulesets/mlcommons/rules.py). | ||
| test04: bool = field(default=False, kw_only=True) |
There was a problem hiding this comment.
[Claude] high · design — TEST04 leaks into the general-purpose load generator.
test04: bool and test04_sample_index: int are TEST04-specific fields on the frozen, general-purpose RuntimeSettings, and create_sample_order dispatches on settings.test04 (sample_order.py:128). Why re-design: each future audit would add its own boolean + factory branch into the hot-path config object and the ordering factory — TEST04 knowledge is now wired into core load-gen. Suggested shape: replace the per-test boolean with a generic ordering selector (e.g. a SampleOrderSpec strategy descriptor: WITHOUT_REPLACEMENT | SINGLE(index=...)); create_sample_order switches on that, and the audit populates it via its run plan. The load generator then knows nothing about "test04".
| # Defined as a constant so the sender (_run_benchmark_test04) and the | ||
| # receiver (setup_benchmark) stay in sync without relying on matching | ||
| # string literals. | ||
| _OVERRIDE_TEST04_SAMPLE_INDEX = "test04_sample_index" |
There was a problem hiding this comment.
[Claude] medium · design — _OVERRIDE_TEST04_SAMPLE_INDEX is a stringly-typed magic kwarg.
It is threaded through **runtime_overrides from _run_benchmark_test04 (line ~1201) into setup_benchmark, which reaches into the opaque dict via runtime_overrides.get(_OVERRIDE_TEST04_SAMPLE_INDEX, 0) (~line 492) and forwards the rest into RuntimeSettings.from_config. Why re-design: hidden coupling — caller and callee must agree on a dict key, the value bypasses the typed config path, and the signature gives no hint an audit index can arrive. Suggested shape: pass a typed run_spec: AuditRunSpec | None so the sample index / ordering strategy are explicit, type-checked fields.
| assert (tmp_path / "test04").is_dir() | ||
| assert "TEST04 audit: phase 1/2" in caplog.text | ||
| assert "TEST04 audit: phase 2/2" in caplog.text | ||
| verdict_written = (tmp_path / "test04_verdict.json").exists() |
There was a problem hiding this comment.
[Claude] medium · testing — The TEST04 orchestrator is effectively untested.
_run_benchmark_test04 (the model_copy surgery, ref_samples/audit_samples wiring, the pre-flight bounds loop, and the three phase-failure early-returns) has no unit test. The only coverage is this integration test, whose terminal assertion is lenient — assert verdict_written or comparison_errored — and it uses min_duration_ms=0, the one regime where the ref_samples dead-write doesn't bite. So it cannot catch the ignored-ref_samples bug, a swapped count, or a wrong verdict. Add unit tests asserting the reference phase issues ref_samples and the audit phase issues audit_samples, plus that each early-return path is hit.
| Literal[TestType.OFFLINE, TestType.ONLINE] | None, | ||
| cyclopts.Parameter(show=False), | ||
| ] = None | ||
| audit: Annotated[ |
There was a problem hiding this comment.
[Claude] medium · design — audit is a bare top-level enum with params scattered.
audit: AuditMode | None sits on BenchmarkConfig while its parameters live on the Dataset model (ref_samples/audit_samples/audit_sample_index) and its threshold is hardcoded in test04.py (DEFAULT_THRESHOLD = 0.10). Why re-design: as audits accumulate, a flat enum + dataset-borne params has nowhere to put per-test knobs and the YAML loader can't validate that the right params accompany the chosen audit. Suggested shape: a structured audit:/compliance: sub-config discriminated on the audit name, co-locating the selection + its parameters (threshold, counts, repeat index) — mirroring accuracy_config.
| ) | ||
|
|
||
|
|
||
| def verify_test04_from_reports( |
There was a problem hiding this comment.
[Claude] medium · design — QPS is compared across phases that issue different counts/contents.
The reference issues ref_samples distinct samples; the audit issues audit_samples of one repeated sample (example 50 vs 20). Upstream MLCommons TEST04 keeps the query set/count identical between the two runs. Comparing Report.qps() across runs that differ in count, content, and duration is not apples-to-apples — a short audit can post a higher instantaneous QPS for reasons unrelated to caching (ramp-up amortization, fewer tail effects). The completion-count guard (test04.py:128) only rejects the audit completing far fewer than the reference; it does nothing about the count/duration mismatch in the PASS direction. (Extends the maintainer's existing fairness thread; compounded by the ref_samples dead-write, which makes the reference duration-driven.) Suggested shape: drive both phases from one count (default audit_samples == ref_samples) and have the verdict assert that precondition.
| return qps, report.n_samples_completed | ||
|
|
||
|
|
||
| def verify_test04_dirs( |
There was a problem hiding this comment.
[Claude] low · design — Two parallel verifier entry points.
verify_test04_dirs (disk, used by the CLI) and verify_test04_from_reports (in-memory, used by the integrated flow) both end in verify_test04, but only the dirs path carries the swapped-directory audit-marker guard (lines 185-194). Why re-design: a future audit must reimplement both surfaces and behavior can drift (the guard already exists in one but not the other). Suggested shape: one verify(ref_stats, audit_stats) core taking (qps, completed) tuples, with thin Path/Report adapters; the directory-order guard becomes an explicit precondition on the adapter.
| f"audit test04 requires max_throughput (offline) or concurrency (single-stream); " | ||
| f"a paced load ({load_pattern_type.value}) caps throughput and hides caching." | ||
| ) | ||
| if not 0 <= test04_sample_index < n_samples: |
There was a problem hiding this comment.
[Claude] low · bug — audit_sample_index is bound-checked against requested counts, not the loaded dataset size, until phase 2.
_validate_audit_test04 checks the index against the loaded dataset size, but only runs for the audit phase (the ref phase has audit=None and returns early). The schema field (ge=0 only) and the pre-flight in _run_benchmark_test04 check against ref_samples/audit_samples (requested counts), never the actual loaded row count. So audit_sample_index valid vs audit_samples but >= the real dataset size only fails inside the audit phase's SingleSampleOrder construction — after a full reference run has already executed and wasted time. Consider validating against the loaded dataset size up front.
Review Council — Multi-AI Code Review (first-principles design review)Reviewed by: Claude · Depth: thorough Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led — what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline. 🔴 Re-design / Must-fix
🟡 Should-fix
🔵 Consider
Through-line: #1, #5, #6, #7 are all symptoms of the same root cause — TEST04 is bolted onto Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance. |
| target_concurrency: 1 # SingleStream: one request in-flight at a time | ||
|
|
||
| client: | ||
| num_workers: 1 |
There was a problem hiding this comment.
any specific reason to set num_workers?
can we use default instead?
(also in the other config yamls in this MR)
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5540585 to
9a0bb41
Compare
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
9a0bb41 to
4290b36
Compare
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4290b36 to
d34459b
Compare
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d34459b to
642ec6c
Compare
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
642ec6c to
b40b0ef
Compare
submission_wan22_offline.yaml drives all three from one config: the perf run, VBench accuracy scoring, and the additive TEST04 audit. The perf + accuracy portion mirrors offline_wan22_accuracy.yaml; the audit: block reflects the proposed schema in docs/compliance_audit_plan.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lure handling Verification against main found three doc issues: - min_duration_ms is NOT a duration floor — the session stop check honors sample count and max_duration only, so the '10-min compliance minimum' claim was false. Documented as a current limitation (AND-semantics is future work) and dropped the misleading example comments. - Specified verdict -> exit-code propagation (run_benchmark returns None today; the audit path must return the AuditVerdict for cli.py to sys.exit). - Specified that a failed phase aborts without verifying (exit 2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-litigated the comparison basis during plan review: keep equal-counts (both phases issue one shared sample count) over duration-normalized, and record it as an explicit decision with rationale (qps is rate-normalized so count-equality is a simplification not a correctness requirement; duration floor not yet supported), chosen defaults (offline 64, single-stream 20), and the upgrade path. Align generic examples to samples=64. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The design doc referred to _run_audit(config, test_mode); the implementation plan uses run_audit(config) (audit is performance-only, so test_mode is irrelevant, and the function is imported cross-module so a public name fits). Standardize the design doc on run_audit(config). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t config union Reverses the equal-count basis per design review: AuditConfig becomes a per-test discriminated union (Test04Config) carrying samples (reference) + audit_samples (audit, defaults to samples); verify drops the count-match guard in favor of a per-phase completion guard + qps rate-normalization. Flags the reviewer-conflict (this is the unequal-count pattern the maintainer questioned). Adds submission_wan22_singlestream.yaml; sweeps the doc for stale equal-count statements. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
submission_wan22_offline.yaml -> offline_wan22_submission.yaml submission_wan22_singlestream.yaml -> single_stream_wan22_submission.yaml Matches the existing offline_wan22.yaml / single_stream_wan22.yaml naming. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ep audit_samples capability Sets the WAN2.2 examples to equal counts (maintainer-aligned on fairness). audit_samples is retained as an opt-in to shorten the audit phase, justified by upstream TEST04 (SDXL 5000 reference / 500 audit per compliance/nvidia/ TEST04/audit.config). Drops the reviewer-conflict framing; verify keeps the per-phase completion guard (works for equal or unequal counts). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… header Review Council (Claude) findings on PR mlcommons#332: - examples hardcoded num_workers despite §8 claiming it was dropped; remove it (use endpoint default, per viraatc's request) so the traceability row is true - single-stream header said '(independent counts)' but uses equal 20/20; align to '(equal counts here)' matching the offline sibling and §5/§8 framing Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… PR) Review Council (Claude, 2nd pass) on PR mlcommons#332: three §8 traceability rows described the example YAMLs as future ('land/dropped at implementation', 'plan doc only'), but the PR now ships offline_wan22_submission.yaml and single_stream_wan22_submission.yaml. Reword to reflect they're included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
viraatc
left a comment
There was a problem hiding this comment.
lgtm, looking forward to impl!
| > **LLM nuance.** MLPerf exempts variable-length-input LLMs from TEST04 because prefix | ||
| > caching legitimately speeds up identical prompts. On an LLM endpoint, TEST04 will see | ||
| > real prefix-cache gains; the tolerance (and whether the audit run disables prefix cache) | ||
| > is a deliberate knob. We build it faithfully to the reference (±10% / ±20%) and expose |
There was a problem hiding this comment.
We shouldn't enable TEST04 for LLM.
| | LoadGen runs both phases of a test | a **generic orchestrator** runs `plan_runs()` back-to-back | | ||
| | compliance submission dir layout | mirrored under the run's report dir | | ||
|
|
||
| ### Feasibility note for the token-level tests (TEST06/09) |
There was a problem hiding this comment.
Let's remove this section, not relevant at the moment
| # TEST04 caching audit — additive post-step. Runs its OWN reference + fixed-sample | ||
| # phases at equal counts (the audit count may be lowered to shorten the phase). | ||
| audit: | ||
| test: "test04" |
There was a problem hiding this comment.
I would encourage that let's change the TEST04 to a more meaningful name (like output_caching_tests). It has been years when we talked about buzz word like TEST01 TEST04 and people are confused what we are doing
We can map the log file back to TEST04 legacy in loadgen, but endpoints will be the future way to go
There was a problem hiding this comment.
I agree to this idea, maybe output no-caching test. I can prepare some slides to discuss if needed (maybe targeting next week WG meeting.
| │ | ||
| │ 3. verify(runs) ; 4. write_verdict (atomic) | ||
| ▼ | ||
| verify_TEST04.txt + audit_verdict.json |
There was a problem hiding this comment.
If possible, unify the output to be json so it's easier to parse.
And do we need 2 files? Ideally one json should be enough (containing all information). On traditional MLPerf side we can always use scripts to make it pass
| │ back-to-back, same endpoint | ||
| ├─ Phase 2 "test04" ─ 64 × sample[0] ────────► RunArtifacts[1] (qps_audit) | ||
| │ | ||
| ├─ verdict = Test04Audit.verify([ref, audit]) |
There was a problem hiding this comment.
Not sure verdict is the right word to use (I know AI loves this word). Maybe auditResults will be more explicit and lower the entropy of understanding
|
|
||
| ```python | ||
| class AuditTestId(str, Enum): | ||
| TEST04 = "test04" |
There was a problem hiding this comment.
Again, recommend using an explicit name like output_caching_audit
Summary
This PR is a design proposal (
docs/compliance_audit_plan.md), not an implementation. The prior TEST04 implementation was reset after the design review on this PR; this redesign folds in that feedback plus the maintainer's workflow requirements before any code is written.The plan describes a modular, extensible compliance/audit framework with TEST04 as the first concrete test (targeting WAN2.2-T2V) and TEST01/06/07/09 as documented extensions.
What the design specifies
AuditTestprotocol (plan_runs+verify), resolved from a registry. A generic orchestrator runs the planned phases back-to-back and never names a specific test. Adding a test = one file + one enum value + one import line.audit:block runs the reference and audit phases back-to-back at a single shared sample count — addressing the fairness concern (no moreref=50/audit=20).test04boolean, noDatasetType.AUDIT; run modification is a generic typedSampleOrderSpec, and audit params live in a structuredAuditConfigsub-model (mirroringAccuracyConfig).RunSpecreplaces the stringly-typed override kwarg; all specs validated against the loaded dataset size, and paced loads rejected, before any run executes.from_dirre-check path.audit:is additive — a singletype: submissionYAML runs perf + accuracy + audit under onereport_dir.audit.configknobs (performance_issue_same_index=3, tolerance,min_query_count) mapped toAuditConfig.Diagrams & traceability
Status
Seeking design buy-in before implementation. Once acknowledged, the implementation lands following §6 (file-by-file) and §9 (success criteria).
🤖 Generated with Claude Code