OTel collector metric ccp_stat_checkpointer_sync_time missing value_type: double -> "Error scraping metrics" every collection interval #4514

@Korel

Overview

When the OpenTelemetry collector is enabled (spec.instrumentation, OpenTelemetryMetrics feature gate on) against a PostgreSQL 17 cluster, the sqlquery metrics receiver fails to emit ccp_stat_checkpointer_sync_time and logs an "Error scraping metrics" error on every collection interval (every 5s by default).

Further analysis with Claude showed that the bundled checkpointer query definitions tag the two double precision time columns inconsistently. write_time is correctly declared value_type: double, but sync_time has no value_type. The sqlquery receiver's value_type defaults to int (per the receiver README), so sync_time is parsed with strconv.Atoi. Once cumulative sync_time (milliseconds) grows large enough that the driver renders it in scientific notation (e.g. 2.774625e+06), Atoi fails.
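
To make the failure mode concrete, here is a minimal standalone Go sketch of the two parsing paths. It is not the receiver's code: parseAsInt and parseAsDouble are hypothetical helper names, and the only input used is the string from the log line in this issue.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseAsInt is a hypothetical stand-in for the int path: it calls
// strconv.Atoi, the function named in the error message.
func parseAsInt(s string) (int, error) {
	return strconv.Atoi(s)
}

// parseAsDouble is a hypothetical stand-in for a double path: it calls
// strconv.ParseFloat, which accepts scientific notation.
func parseAsDouble(s string) (float64, error) {
	return strconv.ParseFloat(s, 64)
}

func main() {
	const raw = "2.774625e+06" // value copied from the collector log

	if _, err := parseAsInt(raw); err != nil {
		fmt.Println("int parse failed:", err)
	}
	if v, err := parseAsDouble(raw); err == nil {
		fmt.Printf("double parse ok: %.0f ms\n", v)
	}
}
```

The int path rejects the string outright, while the float path reads it as 2774625, which is why declaring the column as double should be enough to avoid the error.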

Both pg_stat_checkpointer.write_time and pg_stat_checkpointer.sync_time are double precision (milliseconds) per the PostgreSQL 17 docs, so both should be value_type: double.

Source (both PG-version variants affected):

internal/collector/gte_pg17_fast_metrics.yaml#L52-L53:

      - metric_name: ccp_stat_checkpointer_write_time
        value_column: write_time
        value_type: double          # tagged correctly
...
      - metric_name: ccp_stat_checkpointer_sync_time
        value_column: sync_time
                                    # value_type: double missing

internal/collector/lt_pg17_fast_metrics.yaml#L51-L52 has the same omission. The generated artifact reflects the defect as well (value_type present on write_time, absent on sync_time): internal/collector/generated/gte_pg17_fast_metrics.json.

Environment

  • Platform: Azure Kubernetes Service
  • Platform Version: 1.32.10
  • PGO Image Tag: ubi9-5.8.5-0 (defect confirmed present in source on v5.8.5, v5.8.8, v6.0.2, and main as of 2026-06-09; likely all 5.8.x / 6.0.x)
  • Postgres Version: 17
  • Storage: disk.csi.azure.com, StandardSSD_LRS SKU
  • Collector: otelcol-contrib 0.139.0 (bundled in crunchydata/postgres-operator:ubi9-5.8.5-0)
  • Feature gate: OpenTelemetryMetrics enabled

Steps to Reproduce

REPRO

  1. Enable the OpenTelemetryMetrics feature gate on the operator.
  2. Create a PostgreSQL 17 PostgresCluster with spec.instrumentation set so the collector sidecar scrapes the bundled metrics queries.
  3. Run the cluster long enough (or generate enough checkpoint activity) for the cumulative pg_stat_checkpointer.sync_time value to grow large enough that the database driver renders it in scientific notation (e.g. 2.774625e+06).
  4. Check the collector container logs.

EXPECTED

ccp_stat_checkpointer_sync_time is emitted as a double, like its sibling ccp_stat_checkpointer_write_time, with no scrape error.

ACTUAL

The sync_time metric fails to emit, and the collector logs an "Error scraping metrics" error at every collection interval (every 5s by default).

Logs

The following error repeats at every collection interval:

error    scraperhelper@v0.139.0/obs_metrics.go:61    Error scraping metrics
{"otelcol.component.id": "sqlquery/5s", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics",
 "error": "row 0: rowToMetric: setDataPointValue: col \"sync_time\": error converting to integer:
  strconv.Atoi: parsing \"2.774625e+06\": invalid syntax"}

Additional Information

As mentioned in the overview, part of this analysis was done with Claude. As a workaround, I am currently removing the query with the following configuration:

spec:
  instrumentation:
    metrics:
      customQueries:
        remove:
          - ccp_stat_checkpointer_sync_time
