
feat(evaluation): add VLMMetrics #545

Open
davidberenstein1957 wants to merge 57 commits into main from feat/metrics-vlm-support

Conversation

@davidberenstein1957
Member

Add ImageRewardMetric for evaluating image-text alignment using ImageReward library.

@davidberenstein1957 changed the title from feat(evaluation): add ImageRewardMetric to feat(evaluation): add VLMMetrics on Feb 21, 2026
Member

@begumcig left a comment


Thank you so much David! I have asked some questions because I am not really familiar with litellm and the requirements that using this specific framework brings. I think the metric classes themselves look really good, they just need some tweaks here and there! And really, apologies for the delay in reviewing 💞

@codacy-production

codacy-production bot commented Apr 1, 2026

Not up to standards ⛔

🔴 Issues 2 critical · 16 high · 19 medium · 63 minor

Alerts:
⚠ 100 issues (≤ 0 issues of at least minor severity)

Results:
100 new issues

| Category | Results |
| --- | --- |
| UnusedCode | 1 medium |
| Documentation | 62 minor |
| ErrorProne | 2 high |
| Security | 14 high |
| CodeStyle | 1 minor |
| Complexity | 2 critical, 18 medium |

View in Codacy

🟢 Metrics 462 complexity · 83 duplication

| Metric | Results |
| --- | --- |
| Complexity | 462 |
| Duplication | 83 |

View in Codacy

TIP: This summary will be updated as you push new changes.

Member

@begumcig left a comment


Hey David, great updates on these! I left some specific comments below, but the general theme is that there’s a bit of a mismatch between the current implementation and the original research methodologies.

The VLM infrastructure you’ve built is a great foundation, but we’re currently missing some of the logic that actually defines these benchmarks. Without those steps, our results might be hard to compare with the official papers. Let me know what you think!!

@davidberenstein1957 force-pushed the feat/metrics-vlm-support branch from f1d0d73 to 33d9135 on April 5, 2026 05:08
… support

- Add vlm_base.py with LitellmVLM and TransformersVLM
- Add metrics_vlm.py with VLM-based metrics:
  - VQAMetric
  - AlignmentScoreMetric
  - ImageEditScoreMetric
  - QAAccuracyMetric
  - TextScoreMetric
  - VieScoreMetric
- Uses litellm (default gpt-4o) or local transformers models
ARNIQA is not available in torchmetrics 1.7.4. Implementing
simplified version with optional pretrained weight loading.
- Use scores: List[float] instead of tensor total/count
- Add default_call_type and runs_on attributes
- Match SharpnessMetric pattern
The async version was returning a coroutine instead of the actual
response, causing all VLM metrics to silently fail.
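The missing-await failure mode described above can be reproduced in miniature. This is a hypothetical sketch (class and method names are illustrative, not the PR's actual API): the broken path hands back the coroutine object itself, so downstream code never sees a response.

```python
import asyncio


class BrokenVLM:
    async def _acall(self, prompt: str) -> str:
        await asyncio.sleep(0)  # stand-in for the real network call
        return f"answer to {prompt!r}"

    def score(self, prompt: str):
        # Bug: returns the coroutine object, never the actual response.
        return self._acall(prompt)


class FixedVLM(BrokenVLM):
    def score(self, prompt: str) -> str:
        # Fix: drive the coroutine to completion and return its result.
        return asyncio.run(self._acall(prompt))


broken = BrokenVLM().score("Is there a cat?")
print(asyncio.iscoroutine(broken))  # True: callers silently get no text
broken.close()  # suppress the "coroutine was never awaited" warning
print(FixedVLM().score("Is there a cat?"))
```

Because a coroutine object is truthy, such a bug tends to fail silently unless the return value is actually parsed, which matches the "silently fail" symptom above.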
- Add pydantic models for structured output (VQAnswer, ScoreOutput)
- LitellmVLM: Use response_format parameter for stable outputs
- TransformersVLM: Add outlines support for constrained decoding
- Add structured_output flag to all VLM metrics
- Add proper paper references (VQAScore, VieScore)
- Add pydantic>=2.0.0 to dependencies
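The structured-output models mentioned above might look roughly like this. The field names and constraints here are assumptions for illustration, not the PR's exact definitions; the point is that once the model (via litellm's `response_format` or outlines-constrained decoding) emits schema-conforming JSON, parsing reduces to a single pydantic call.

```python
from pydantic import BaseModel, Field


# Hypothetical shapes of the VQAnswer / ScoreOutput models described above.
class VQAnswer(BaseModel):
    answer: str = Field(description='Either "Yes" or "No".')


class ScoreOutput(BaseModel):
    score: int = Field(ge=0, le=10, description="Quality rating from 0 to 10.")


# With constrained decoding, the raw VLM response is valid JSON for the schema,
# so it can be parsed and range-checked in one step.
raw = '{"score": 7}'
parsed = ScoreOutput.model_validate_json(raw)
print(parsed.score)  # 7
```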
- Add docstrings to update/compute methods
- Fix type hints
- Add ruff fixes
- Add PIL import at top
- Fix type hints
- D205 docstring issues are from multi-line examples
The metrics_vlm module uses a different docstring pattern for VLM
parameters that doesn't fit numpydoc's PR01 check. Skip this check
for the new VLM metrics.
- Added detailed parameter descriptions to VQAnswer, ScoreOutput, and various metric classes in metrics_vlm.py.
- Updated docstrings in base classes of vlm_base.py to include parameter details and return types.
- Improved clarity and consistency across all metric-related docstrings.
- Added new metrics: AlignmentScoreMetric, ImageEditScoreMetric, QAAccuracyMetric, TextScoreMetric, VieScoreMetric, and VQAMetric for comprehensive evaluation of image-text alignment and quality.
- Implemented integration test script for VLM metrics, allowing testing against both Litellm and Transformers backends.
- Updated pyproject.toml to reflect new dependencies and changes in optional dependencies.
- Added documentation for prompt comparisons between Pruna and InferBench implementations.
…m docstrings

- VieScore: docstring arXiv:2312.14867, TIGER-AI-Lab/VIEScore
- Image Edit Score: docstring EditScore, ADIEE
- VQA: docstring arXiv:2404.01291, use_probability=True default
- vlm_base: full Parameters/Returns for score(), _score_with_logprobs

Made-with: Cursor
- Added docstrings to the update and compute methods for AlignmentScoreMetric, ImageEditScoreMetric, QAAccuracyMetric, TextScoreMetric, VieScoreMetric, and VQAMetric to improve clarity on their functionality.
- Updated the test suite to ensure compatibility with new metric requirements.
- Enhanced the type hints for the response_format parameter in BaseVLM, LitellmVLM, and TransformersVLM classes to include Literal types ("integer", "yes_no") alongside the existing Type[BaseModel].
- Updated docstrings to reflect the new response_format options, improving clarity on expected input types and usage.
- Join LongText-Bench list text_content before OCR scoring
- Reduce datamodule benchmark tests (category smoke, prompt aux merge)
- Trim VLM metric tests; drop slow mark on mocked GenEval task test

Made-with: Cursor
- Added detailed docstrings for class methods to clarify functionality and usage.
- Simplified error messages for unsupported model configurations.
- Improved file handling for loading configuration files with explicit encoding.
- Streamlined code formatting for better readability and consistency.

Made-with: Cursor
- Added detailed docstrings for functions and classes to clarify their purpose and usage.
- Updated version check functions to specify the required version of the `transformers` package.
- Introduced new classes for modified Llama attention and decoder layers to support bidirectional encoding.
- Improved error handling in the Llama encoder model for unsupported transformer versions.

Made-with: Cursor
…rkRegistry

- Updated type hints in the BenchmarkRegistry and LLM2Vec classes for better clarity and compatibility.
- Enhanced the batch_to_device function to accept both device strings and device types.
- Improved handling of optional parameters in LLM2Vec methods to prevent potential errors.
- Added type casting for better type safety in the bidirectional Llama model.

Made-with: Cursor
- Updated import path for LlamaBiModel to reflect new module structure.
- Improved docstrings across various classes and methods to provide clearer descriptions and parameter details.
- Ensured consistency in return type annotations and parameter specifications for better code readability.

Made-with: Cursor
- Renamed and refactored dataset setup functions for clarity and consistency, including the introduction of `_setup_oneig_subset_with_fixed_category`.
- Added new functions for loading specific OneIG datasets with fixed categories, improving usability.
- Introduced a new module for VLM benchmark integration, providing shared helpers and metrics for evaluation.
- Enhanced docstrings across various functions to clarify parameters and return types, ensuring better documentation and understanding.

Made-with: Cursor
@llcnt
Collaborator

llcnt commented Apr 10, 2026

@cursor review


@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: aggregation parameter stored but never applied in scoring
    • QAAccuracyMetric now validates and applies aggregation, using strict binary scoring for all_or_nothing instead of always averaging per-question scores.
  • ✅ Fixed: Chinese language heuristic misclassifies EN-only rows
    • The OneIG language heuristic now only infers Chinese from prompt-only rows when actual CJK characters are present, preventing EN prompt-only rows from being routed to _zh Q_D files.
  • ✅ Fixed: Text score uses unnormalized distance favoring short texts
    • TextScoreMetric now stores normalized Levenshtein distance by dividing by ground-truth character length so the reported mean matches character error rate behavior across varying text lengths.
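The normalization in the third fix can be sketched with a plain edit-distance implementation (a stand-in for the project's `levenshtein` helper; function names here are illustrative). Dividing by ground-truth length turns raw edit counts into a character-error-rate-style score that is comparable across texts of different lengths.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_edit_distance(ocr: str, gt: str) -> float:
    # Divide by ground-truth length so one error in a 5-char text (0.2)
    # is not scored the same as one error in a 100-char text (0.01).
    return levenshtein(ocr, gt) / max(len(gt), 1)


print(normalized_edit_distance("abxde", "abcde"))  # 0.2
print(normalized_edit_distance("ax", "ab"))        # 0.5
```

Without the division, a single error against a long ground truth and a single error against a short one both score 1, even though the former is a far better OCR result.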

Create PR

Or push these changes by commenting:

@cursor push 6e88aefdac
Preview (6e88aefdac)
diff --git a/src/pruna/data/datasets/prompt.py b/src/pruna/data/datasets/prompt.py
--- a/src/pruna/data/datasets/prompt.py
+++ b/src/pruna/data/datasets/prompt.py
@@ -139,11 +139,15 @@
     lang = row.get("language") or row.get("lang")
     if isinstance(lang, str) and lang.lower() in {"zh", "zh-cn", "zh_cn", "chinese", "cn"}:
         return True
-    if row.get("prompt_zh"):
+    if row.get("prompt_zh") or row.get("prompt_cn"):
         return True
     prompt = row.get("prompt")
     prompt_en = row.get("prompt_en")
-    return bool(prompt and not (isinstance(prompt_en, str) and prompt_en.strip()))
+    if not (isinstance(prompt, str) and prompt.strip()):
+        return False
+    if isinstance(prompt_en, str) and prompt_en.strip():
+        return False
+    return any("\u4e00" <= ch <= "\u9fff" for ch in prompt)
 
 
 def _oneig_qd_prefix(row: dict) -> str:

diff --git a/src/pruna/evaluation/metrics/metric_qa_accuracy.py b/src/pruna/evaluation/metrics/metric_qa_accuracy.py
--- a/src/pruna/evaluation/metrics/metric_qa_accuracy.py
+++ b/src/pruna/evaluation/metrics/metric_qa_accuracy.py
@@ -105,6 +105,11 @@
         self.call_type = get_call_type_for_single_metric(call_type, self.default_call_type)
         self.add_state("scores", [])
         self.aggregation = kwargs.pop("aggregation", "mean")
+        if self.aggregation not in {"mean", "all_or_nothing"}:
+            raise ValueError(
+                "qa_accuracy aggregation must be one of {'mean', 'all_or_nothing'}. "
+                f"Got: {self.aggregation!r}."
+            )
 
     def _extract_questions(self, gt: Any, n: int) -> List[List[str]]:
         if isinstance(gt, (list, tuple)) and len(gt) >= n:
@@ -151,7 +156,10 @@
                 ["Yes"] * len(questions),
                 response_format=self.response_format,
             )
-            score = float(np.mean(scores))
+            if self.aggregation == "all_or_nothing":
+                score = float(all(float(s) == 1.0 for s in scores))
+            else:
+                score = float(np.mean(scores))
             self.scores.append(score)
 
     def compute(self) -> MetricResult:

diff --git a/src/pruna/evaluation/metrics/metric_text_score.py b/src/pruna/evaluation/metrics/metric_text_score.py
--- a/src/pruna/evaluation/metrics/metric_text_score.py
+++ b/src/pruna/evaluation/metrics/metric_text_score.py
@@ -172,7 +172,7 @@
 @MetricRegistry.register("text_score")
 class TextScoreMetric(_BaseVLMOCRTextMetric):
     """
-    OCR then mean Levenshtein distance to ground truth (lower is better).
+    OCR then mean normalized Levenshtein distance (character error rate, lower is better).
 
     Registry: ``ocr_levenshtein`` (descriptive) and ``text_score`` (legacy).
 
@@ -240,7 +240,8 @@
     def _accumulate_sample(self, text_gt: str, ocr_text: str) -> None:
         norm_gt = normalize_text_simple(text_gt)
         norm_ocr = normalize_text_simple(ocr_text)
-        self.scores.append(levenshtein(norm_ocr, norm_gt))
+        gt_len = max(len(norm_gt), 1)
+        self.scores.append(float(levenshtein(norm_ocr, norm_gt) / gt_len))
 
     def _compute_result_value(self) -> float:
         if not self.scores:

diff --git a/tests/data/test_oneig_loader.py b/tests/data/test_oneig_loader.py
--- a/tests/data/test_oneig_loader.py
+++ b/tests/data/test_oneig_loader.py
@@ -34,6 +34,18 @@
     assert prompt_mod._oneig_qd_prefix(row) == "anime_zh"
 
 
+def test_oneig_qd_prefix_prompt_only_en_row_stays_en() -> None:
+    """Prompt-only EN rows must not be misclassified as Chinese."""
+    row = {
+        "category": "General_Object",
+        "id": "001",
+        "prompt": "a red apple on a table",
+        "prompt_en": "",
+        "class": "None",
+    }
+    assert prompt_mod._oneig_qd_prefix(row) == "object"
+
+
 def test_to_oneig_record_multilingualism_fills_questions() -> None:
     """Synthetic Multilingualism row resolves Q_D from merged index."""
     qb = {"multilingualism_zh_000": {"questions": {"1": "现场是不是颁奖典礼?"}, "dependencies": {"1": [0]}}}

diff --git a/tests/evaluation/test_vlm_metrics.py b/tests/evaluation/test_vlm_metrics.py
--- a/tests/evaluation/test_vlm_metrics.py
+++ b/tests/evaluation/test_vlm_metrics.py
@@ -146,6 +146,22 @@
 
 
 @pytest.mark.cpu
+def test_qa_accuracy_aggregation_modes() -> None:
+    mock_vlm = MagicMock(spec=BaseVLM)
+    mock_vlm.score.return_value = [1.0, 0.0]
+    images = _dummy_image(batch=1)
+    aux = [{"questions": {"1": "Q1", "2": "Q2"}}]
+
+    mean_metric = QAAccuracyMetric(vlm=mock_vlm, vlm_type="litellm", device="cpu", aggregation="mean")
+    mean_metric.update(["a prompt"], aux, images)
+    assert mean_metric.compute().result == pytest.approx(0.5)
+
+    strict_metric = QAAccuracyMetric(vlm=mock_vlm, vlm_type="litellm", device="cpu", aggregation="all_or_nothing")
+    strict_metric.update(["a prompt"], aux, images)
+    assert strict_metric.compute().result == pytest.approx(0.0)
+
+
+@pytest.mark.cpu
 def test_get_vlm_returns_custom() -> None:
     custom = MagicMock(spec=BaseVLM)
     out = get_vlm(vlm=custom, vlm_type="litellm", model_name="gpt-4o")
@@ -183,6 +199,19 @@
 
 
 @pytest.mark.cpu
+def test_text_score_uses_normalized_edit_distance() -> None:
+    mock_vlm = MagicMock(spec=BaseVLM)
+    mock_vlm.generate.side_effect = [["abxde"], ["ax"]]
+    metric = TextScoreMetric(vlm=mock_vlm, vlm_type="litellm", device="cpu")
+
+    metric.update(["p1"], ["abcde"], _dummy_image(batch=1))
+    metric.update(["p2"], ["ab"], _dummy_image(batch=1))
+
+    assert metric.scores == pytest.approx([0.2, 0.5])
+    assert metric.compute().result == pytest.approx(0.35)
+
+
+@pytest.mark.cpu
 def test_text_score_registry_aliases() -> None:
     from pruna.evaluation.metrics.registry import MetricRegistry

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 7435679. Configure here.

["Yes"] * len(questions),
response_format=self.response_format,
)
score = float(np.mean(scores))


aggregation parameter stored but never applied in scoring

High Severity

QAAccuracyMetric accepts and stores self.aggregation (e.g. "all_or_nothing" for GenEval) but never reads it. The update method always uses np.mean(scores) on line 154, giving partial credit to every image. For GenEval, where Task.from_benchmark explicitly passes aggregation="all_or_nothing", this produces inflated scores instead of the official binary pass/fail (1 only if every atomic question passes, 0 otherwise).

Additional Locations (2)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7435679. Configure here.

        return True
    prompt = row.get("prompt")
    prompt_en = row.get("prompt_en")
    return bool(prompt and not (isinstance(prompt_en, str) and prompt_en.strip()))


Chinese language heuristic misclassifies EN-only rows

Medium Severity

_oneig_alignment_language_zh falls through to a heuristic on line 146 that returns True (Chinese) when prompt is non-empty but prompt_en is absent or empty. Rows in the EN config (OneIG-Bench) that use prompt as their primary text field without a separate prompt_en column would be misclassified as Chinese, causing _oneig_qd_prefix to select *_zh question-dependency files instead of the English ones.
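The fix in the Bugbot diff gates the Chinese classification on the presence of actual CJK Unified Ideographs rather than on column absence alone. A minimal sketch of that check (the helper name is illustrative):

```python
def contains_cjk(text: str) -> bool:
    # Only infer Chinese when actual CJK Unified Ideographs (U+4E00-U+9FFF)
    # are present, instead of guessing from a missing prompt_en column.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


print(contains_cjk("a red apple on a table"))  # False: stays in the EN path
print(contains_cjk("现场是不是颁奖典礼?"))        # True: routed to _zh files
```

Note this range covers only the basic CJK Unified Ideographs block, which is sufficient as a routing heuristic here; extension blocks are not checked.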

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7435679. Configure here.

    def _accumulate_sample(self, text_gt: str, ocr_text: str) -> None:
        norm_gt = normalize_text_simple(text_gt)
        norm_ocr = normalize_text_simple(ocr_text)
        self.scores.append(levenshtein(norm_ocr, norm_gt))


Text score uses unnormalized distance favoring short texts

Medium Severity

TextScoreMetric._accumulate_sample stores raw Levenshtein edit distance without normalizing by text length. Since the metric is lower_is_better, longer ground-truth texts inherently produce higher (worse) scores than shorter texts with the same number of errors, making scores incomparable across samples of different length. The benchmark description mentions "mean character error rate" but the implementation returns raw edit counts.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7435679. Configure here.

@llcnt
Collaborator

llcnt commented Apr 10, 2026

The code change (>5K lines) is too large for me :( Could you point me to some place where my review can be particularly useful? I can definitely dedicate some time to review some files, but not all 😓

@davidberenstein1957
Member Author

Hi @llcnt, if you have time. You can take a look at the LLM2CLIP implementation :)

davidberenstein1957 and others added 3 commits April 10, 2026 16:21
…GenEval

self.aggregation was stored but never read — np.mean was always used.
Now all_or_nothing mode returns 1.0 only when every question scores ≥0.5,
matching the GenEval all-or-nothing semantics (arXiv:2310.11513).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit deletes the `benchmark_vlm_integration.py` file, which contained shared helpers and metrics for VLM benchmark integration. The removal is part of a broader effort to streamline the codebase and eliminate unused components.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Treat score == 0.5 as ambiguous/No in all_or_nothing aggregation and
remove redundant local imports from the two existing all_or_nothing tests.
Adds a boundary test to confirm the 0.5 edge case is handled correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
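Combining the two commits above, the aggregation semantics can be sketched as follows (a simplified stand-alone sketch, not the metric class itself; the mode names follow the PR's `aggregation` parameter). A score of exactly 0.5 counts as ambiguous/No in strict mode, per the boundary-case commit.

```python
import statistics


def aggregate(question_scores: list[float], mode: str = "mean") -> float:
    # Sketch of the two aggregation modes for per-image QA scores.
    if mode == "all_or_nothing":
        # GenEval-style binary pass/fail: 1.0 only if every atomic question
        # clearly passes; exactly 0.5 is ambiguous and treated as No.
        return float(all(s > 0.5 for s in question_scores))
    return statistics.mean(question_scores)


print(aggregate([1.0, 0.0]))                    # 0.5  (partial credit)
print(aggregate([1.0, 0.0], "all_or_nothing"))  # 0.0
print(aggregate([1.0, 0.5], "all_or_nothing"))  # 0.0  (0.5 is ambiguous)
print(aggregate([1.0, 0.9], "all_or_nothing"))  # 1.0
```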
@davidberenstein1957
Member Author


Code review

Found 2 issues:

  1. qa_accuracy ignores aggregation (including GenEval all_or_nothing). The metric stores aggregation in __init__ but update always uses np.mean(scores) per image and compute always uses np.mean(self.scores), so partial credit remains and the final score is not strict per-image all-or-nothing. This conflicts with the GenEval benchmark text and Task.from_benchmark wiring that pass aggregation="all_or_nothing".

self.aggregation = kwargs.pop("aggregation", "mean")

def _extract_questions(self, gt: Any, n: int) -> List[List[str]]:
    if isinstance(gt, (list, tuple)) and len(gt) >= n:
        out = []
        for i in range(n):
            v = gt[i]
            if isinstance(v, dict) and "questions" in v:
                qs = v["questions"]
                out.append(list(qs.values()) if isinstance(qs, dict) else list(qs))
            else:
                out.append([])
        return out
    return [[]] * n

def update(self, x: List[Any] | torch.Tensor, gt: torch.Tensor, outputs: torch.Tensor) -> None:
    """
    Update the metric with new batch data.

    Parameters
    ----------
    x : List[Any] | torch.Tensor
        The input data.
    gt : torch.Tensor
        The ground truth (questions per image).
    outputs : torch.Tensor
        The output images.
    """
    inputs = metric_data_processor(x, gt, outputs, self.call_type)
    images = _process_images(inputs[0])
    auxiliaries = inputs[1] if len(inputs) > 1 else []
    questions_per_image = self._extract_questions(auxiliaries, len(images))
    for i, image in enumerate(images):
        questions = questions_per_image[i] if i < len(questions_per_image) else []
        if not questions:
            aux = auxiliaries[i] if i < len(auxiliaries) else {}
            raise ValueError(
                "qa_accuracy requires 'questions' in auxiliaries. "
                "Use a benchmark that provides it (e.g. GenEval, DPG, OneIG). "
                f"Got aux keys: {list(aux.keys()) if isinstance(aux, dict) else 'not a dict'}."
            )
        scores = self.vlm.score(
            [image] * len(questions),
            questions,
            ["Yes"] * len(questions),
            response_format=self.response_format,
        )
        score = float(np.mean(scores))
        self.scores.append(score)

def compute(self) -> MetricResult:
    """
    Compute the QA accuracy score.

    Returns
    -------
    MetricResult
        The mean QA accuracy across all updates.
    """
    if not self.scores:
        return MetricResult(self.metric_name, self.__dict__, 0.0)
    return MetricResult(self.metric_name, self.__dict__, float(np.mean(self.scores)))

Benchmark(
    name="GenEval",
    description=(
        "Compositional text-to-image benchmark with 6 categories: single object, two object, "
        "counting, colors, position, color attributes. Uses atomic yes/no questions per prompt; "
        "``Task.from_benchmark`` wires ``qa_accuracy`` with strict per-image aggregation "
        "(all questions must pass) plus ``clip_score``. For holistic VQAScore-style scoring "
        "use GenAI Bench with ``vqa``."
    ),

  2. _extract_questions uses [[]] * n, which aliases the same inner list n times. If the fallback branch runs with n > 1, all indices share one list object (classic Python pitfall); use e.g. [[] for _ in range(n)].

def _extract_questions(self, gt: Any, n: int) -> List[List[str]]:
    if isinstance(gt, (list, tuple)) and len(gt) >= n:
        out = []
        for i in range(n):
            v = gt[i]
            if isinstance(v, dict) and "questions" in v:
                qs = v["questions"]
                out.append(list(qs.values()) if isinstance(qs, dict) else list(qs))
            else:
                out.append([])
        return out
    return [[]] * n
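The aliasing pitfall flagged in issue 2 takes three lines to demonstrate:

```python
# [[]] * 3 repeats ONE list object three times, so every slot aliases it.
aliased = [[]] * 3
aliased[0].append("q")
print(aliased)  # [['q'], ['q'], ['q']] -- all slots share one list

# A comprehension builds three independent lists.
independent = [[] for _ in range(3)]
independent[0].append("q")
print(independent)  # [['q'], [], []]
```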

(No CLAUDE.md in repo root or modified directories; review used PR head 7435679005957aaa13413714e7af9aca27938812.)

Generated with Claude Code

@davidberenstein1957
Member Author

Update on review #545 (comment)

  1. aggregation / GenEval all_or_nothing — Already implemented on this branch: in QAAccuracyMetric.update, per-image score is 1.0 only when every per-question score passes (all_or_nothing), otherwise mean over question scores for the default. Covered by tests/evaluation/test_vlm_metrics.py (test_qa_accuracy_all_or_nothing_*).

  2. _extract_questions fallback — Fixed: replaced return [[]] * n with return [[] for _ in range(n)] so the fallback path does not alias one empty list across n slots (src/pruna/evaluation/metrics/metric_qa_accuracy.py).

davidberenstein1957 and others added 14 commits April 10, 2026 16:33
… _parse_score

Replaces the private `_parse_score` regex method with the shared
`get_score_from_response` utility (already used by ImageEditScoreMetric),
which correctly handles FloatOutput pydantic objects, dicts, JSON strings,
and plain text. Scales the [0,1] return value back to [0,10] before the
geometric mean formula. Also removes the now-unused `re` import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
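The geometric-mean formula mentioned above follows the VIEScore design (arXiv:2312.14867): the overall score is the square root of semantic consistency times perceptual quality, each on a 0-10 scale, with min-aggregation over sub-scores. A rough sketch under those assumptions (function and variable names are illustrative, not the PR's API):

```python
import math


def viescore(semantic_scores: list[float], quality_scores: list[float]) -> float:
    # VIEScore-style overall rating: geometric mean of the weakest semantic
    # sub-score and the weakest perceptual-quality sub-score, both in [0, 10].
    sc = min(semantic_scores)
    pq = min(quality_scores)
    return math.sqrt(sc * pq)


print(round(viescore([8, 9], [10]), 2))  # 8.94  (sqrt(8 * 10))
```

This is also why the parsed [0,1] value has to be rescaled to [0,10] first: feeding 0.8 instead of 8 into the product would shrink the geometric mean by a factor of 10.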
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…LM prompt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d BenchmarkRegistry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erialization

Three bugs fixed across the VLM benchmark pipeline:

1. TransformersVLM._generate_standard: model.generate() returns the full
   input+output sequence; slicing output[0][input_len:] decodes only the
   new tokens, preventing prompt text from appearing in the VLM response.

2. OneIG Text_Rendering text_content: was using row_class ('PPT generation',
   14 chars) as OCR ground truth, making empty-OCR text_score spuriously
   0.86. Now extracts all quoted strings from the prompt for Text_Rendering
   rows, giving a ~236-char ground truth so empty OCR correctly scores 0.0.

3. vlm_benchmark_helpers._safe_json: bytes objects (source_image_bytes for
   editing benchmarks) fell through to str(obj) producing megabyte-long
   repr strings in JSON records. Now serialized as {"bytes_len": N}.

Also adds:
- source_image_bytes field for ImgEdit/GEditBench editing source images
- mine-replicate pyproject.toml extra for uv sync --extra mine-replicate
- vlm_benchmark_helpers module (shared test/mine logic)
- integration record tests for all 10 VLM benchmark jobs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
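The token-slicing fix in item 1 can be sketched without a real model. On decoder-only models, Hugging Face `generate()` returns the prompt ids followed by the continuation in one sequence, so decoding the whole sequence echoes the prompt back. The tokenizer below is a tiny stand-in so the sketch runs offline:

```python
class FakeTokenizer:
    # Minimal stand-in so the sketch runs without downloading a model.
    vocab = {1: "Is", 2: "there", 3: "a", 4: "cat?", 5: "Yes"}

    def decode(self, ids):
        return " ".join(self.vocab[i] for i in ids)


def decode_new_tokens(tokenizer, input_ids, output_ids) -> str:
    # generate() returns prompt + continuation; slicing off the first
    # input_len ids keeps only the newly generated answer.
    input_len = len(input_ids[0])
    return tokenizer.decode(output_ids[0][input_len:])


prompt_ids = [[1, 2, 3, 4]]
full_output = [[1, 2, 3, 4, 5]]  # what generate() hands back
print(decode_new_tokens(FakeTokenizer(), prompt_ids, full_output))  # Yes
```

Without the slice, the decoded string would be "Is there a cat? Yes", and any downstream yes/no parsing would match on the prompt text instead of the model's answer.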
…repr)

Regression test for the source_image_bytes serialization fix: ensures that
bytes values in auxiliary dicts produce {"bytes_len": N} in JSON records,
not the megabyte-long str(bytes) repr.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- run_benchmark_vlm_batch_full: add num_samples param (default 1) that
  iterates the dataloader N times, calling metric.update() per batch so
  state accumulates correctly before compute().

- run_benchmark_vlm_multibatch_with_preds: new helper for mine scripts
  that takes a list of pre-built pred tensors (one per sample), loads N
  dataset batches, and accumulates all into a single metric instance.
  Returns aggregated BenchmarkVlmBatchOutcome from one compute() call.

This enables proper multi-sample evaluation where each sample exercises
a separate update() call, validating stateful metric accumulation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix D407 in vlm_utils.py: add numpy-style dashed underline under Returns section
- Fix D301 in vlm_benchmark_helpers.py and test_oneig_alignment.py: use r""" for docstrings with escaped quotes
- Delete test_vlm_benchmark_e2e.py (all integration/slow marks, won't run in CI)
- Delete test_vlm_benchmark_integration_record.py: merge its 3 unit tests into test_vlm_metrics.py
- Delete tests/data/test_oneig_loader.py: dataset-specific loader test file removed per review
- Add missing D103 docstrings in test_vlm_metrics.py functions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>