
OBSDOCS-3383: Document Vector metrics cardinality and monitoring impact #113290

Open
johnwilkins wants to merge 5 commits into openshift:standalone-logging-docs-main from johnwilkins:OBSDOCS-3383

Conversation


@johnwilkins johnwilkins commented Jun 12, 2026

Contributor

Summary

Documents the issue where Vector collector metrics exhibit high cardinality in multitenant environments, impacting Prometheus resource consumption and cluster stability.

Provides comprehensive guidance covering design, diagnosis, and remediation of high metrics cardinality caused by complex ClusterLogForwarder configurations.

Jira

Problem Addressed

In multitenant OpenShift clusters with many ClusterLogForwarder inputs, outputs, and pipelines, Vector collector metrics create high cardinality in Prometheus:

  • vector_component_received_events_count_bucket: 700K+ time series
  • Impacts Prometheus memory, CPU, and storage
  • Can destabilize the monitoring stack
  • Affects cluster operations (upgrades, HPA/VPA)
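The following sketch gives a rough sense of how a single histogram metric can reach this scale. It multiplies together the factors that each add series. The model is an illustrative assumption only: it supposes that every collector pod exposes one `_bucket` series for each `component_id` and each histogram bucket. The pod, component, and bucket counts are hypothetical and are not the figures from the support case. Real label sets can multiply the total further.

```python
def estimate_histogram_series(
    collector_pods: int,
    component_ids: int,
    buckets: int,
    extra_label_combos: int = 1,
) -> int:
    """Approximate the number of *_bucket series for one histogram metric.

    Illustrative model (an assumption, not documented Vector behavior):
    one series per collector pod x component_id x histogram bucket (le),
    multiplied by any additional label combinations.
    """
    factors = (collector_pods, component_ids, buckets, extra_label_combos)
    if any(f < 1 for f in factors):
        raise ValueError("every factor must be at least 1")
    return collector_pods * component_ids * buckets * extra_label_combos


if __name__ == "__main__":
    # Hypothetical cluster: 100 collector pods, 500 component_ids, 14 buckets.
    print(estimate_histogram_series(100, 500, 14))  # 700000
    # The same cluster after consolidating down to 50 component_ids.
    print(estimate_histogram_series(100, 50, 14))   # 70000
```

Under this model, series count grows linearly with the number of `component_id` values. That is why reducing ClusterLogForwarder inputs, outputs, and pipelines is the main lever for reducing cardinality.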

No existing documentation covered this issue.

Content Created

New Modules (3)

  1. modules/collector-metrics-cardinality-impact.adoc (CONCEPT - 143 lines)

    • Explains what metrics cardinality is
    • How ClusterLogForwarder configuration creates cardinality
    • Which Vector metrics are affected (component_id label)
    • Impact on Prometheus and cluster stability
    • When to be concerned
  2. modules/best-practices-multitenant-logging.adoc (REFERENCE - 312 lines)

    • Design patterns to minimize cardinality impact
    • Consolidate inputs using label selectors
    • Consolidate outputs to same destination
    • Example architectures: 40-60 components vs 400-500 components
    • Cardinality estimation formulas
  3. modules/troubleshooting-collector-metrics-cardinality.adoc (PROCEDURE - 204 lines)

    • Diagnostic steps using promtool and PromQL (from KB 7137995)
    • Identify problematic ClusterLogForwarder instances
    • Remediation options with code examples
    • Verification steps
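The best-practices module compares architectures of roughly 40-60 components with architectures of 400-500 components. The sketch below illustrates that comparison under a simplifying assumption: each input, output, and pipeline contributes exactly one `component_id`, and any internal components are ignored. `ForwarderShape` and the tenant counts are hypothetical. This is not the estimation formula used in the module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwarderShape:
    """Hypothetical size summary of a ClusterLogForwarder configuration."""

    inputs: int
    outputs: int
    pipelines: int

    def component_ids(self) -> int:
        # Simplifying assumption: one component_id per input, output, and
        # pipeline; any internal components the operator adds are ignored.
        return self.inputs + self.outputs + self.pipelines


# Hypothetical design A: one input, output, and pipeline per tenant (150 tenants).
per_tenant = ForwarderShape(inputs=150, outputs=150, pipelines=150)
# Hypothetical design B: tenants grouped into 15 label-selector inputs that
# feed 15 consolidated pipelines and outputs.
consolidated = ForwarderShape(inputs=15, outputs=15, pipelines=15)

if __name__ == "__main__":
    a, b = per_tenant.component_ids(), consolidated.component_ids()
    print(a, b, a // b)  # 450 45 10
```

If series scale linearly with `component_id` count, the consolidated design would produce roughly one tenth of the collector metric series of the per-tenant design, for the same pod count.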

Assembly Updated

  • configuring/cluster-logging-collector.adoc
    • Added new section: "Collector metrics and monitoring impact"
    • Integrated into existing advanced collector configuration assembly

Content Sources

  • Red Hat KB article 7137995 (diagnostic procedures)
  • OBSDA-1341 RFE description (problem statement)
  • Support case analysis (customer impact)

Validation

  • ✅ Vale: 0 errors, 0 warnings
  • ✅ Cross-references complete
  • ✅ Based on validated KB article procedures

Signed-off-by: John Wilkins <jowilkin@redhat.com>

Addresses the issue where Vector collector metrics exhibit high
cardinality in multitenant environments, impacting Prometheus
resource consumption and cluster stability.

Creates comprehensive guidance covering design, diagnosis, and
remediation of high metrics cardinality caused by complex
ClusterLogForwarder configurations.

New modules:
- collector-metrics-cardinality-impact.adoc (CONCEPT)
  * Explains what metrics cardinality is
  * How ClusterLogForwarder configuration creates cardinality
  * Which Vector metrics are affected (component_id label)
  * Impact on Prometheus (memory, CPU, storage, queries)
  * When to be concerned about cardinality

- best-practices-multitenant-logging.adoc (REFERENCE)
  * Design patterns to minimize cardinality impact
  * Consolidate inputs using label selectors
  * Consolidate outputs to same destination
  * Minimize pipeline count
  * Example multitenant architecture (40-60 vs 400-500 components)
  * Cardinality estimation formulas

- troubleshooting-collector-metrics-cardinality.adoc (PROCEDURE)
  * Diagnostic steps using promtool and PromQL
  * Identify problematic ClusterLogForwarder instances
  * Remediation options with examples
  * Verification steps
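On a live cluster, a PromQL query such as `count by (component_id) (vector_component_received_events_count_bucket)` groups the series by component and shows which components dominate. The sketch below does the equivalent offline for a saved scrape in the Prometheus text exposition format, which can help when Prometheus itself is under pressure. It assumes that label values contain no `}` character. The metric name comes from this PR, but the sample `component_id` values are hypothetical.

```python
import re
from collections import Counter

# metric_name{label="value",...} sample_value  (labels are optional)
_LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+\S+')
# key="value" pairs, allowing backslash-escaped characters inside the value
_LABEL = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<value>(?:[^"\\]|\\.)*)"')

METRIC = "vector_component_received_events_count_bucket"

# Hypothetical scrape excerpt; component IDs are made up for illustration.
SAMPLE = """
# TYPE vector_component_received_events_count histogram
vector_component_received_events_count_bucket{component_id="input_app",le="1"} 3
vector_component_received_events_count_bucket{component_id="input_app",le="+Inf"} 5
vector_component_received_events_count_bucket{component_id="output_es",le="1"} 2
vector_component_received_events_count_sum{component_id="output_es"} 7
"""


def series_per_component(exposition: str, metric: str) -> Counter:
    """Count series of `metric` per component_id in text-format exposition."""
    counts: Counter = Counter()
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _LINE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(m.group("key", "value") for m in _LABEL.finditer(match.group("labels") or ""))
        counts[labels.get("component_id", "<none>")] += 1
    return counts


if __name__ == "__main__":
    for component, n in series_per_component(SAMPLE, METRIC).most_common():
        print(f"{component}: {n}")
```

Sorting with `most_common()` puts the components that contribute the most series first. Those are the natural candidates for consolidation.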

Content based on:
- Red Hat KB article 7137995 (diagnostic procedures)
- OBSDA-1341 RFE description (problem statement)
- Support case analysis (customer impact)

Added new section to configuring/cluster-logging-collector.adoc:
"Collector metrics and monitoring impact"

This documentation addresses three jobs to be done (JTBDs):
1. Design multitenant logging with monitoring impact awareness
2. Diagnose if logging causes Prometheus issues
3. Remediate high cardinality issues

Related Jira: https://redhat.atlassian.net/browse/OBSDOCS-3383
Related RFE: https://redhat.atlassian.net/browse/OBSDA-1341

Signed-off-by: John Wilkins <jowilkin@redhat.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 12, 2026

openshift-ci-robot commented Jun 12, 2026


@johnwilkins: This pull request references OBSDOCS-3383 which is a valid jira issue.


In response to this:

JTBD Impact

Before: Score 2/10

  • No documentation on cardinality issue
  • Customers learn only through support cases
  • No self-service diagnosis or remediation

After: Score 8/10

  • ✅ Design guidance to prevent the issue
  • ✅ Diagnostic procedures to identify the issue
  • ✅ Remediation options to fix the issue

Validation

  • ✅ Vale: 0 errors, 16 warnings (callouts/block titles - acceptable)
  • ✅ Cross-references complete
  • ✅ Based on validated KB article procedures

🤖 Generated with Claude Code


@openshift-ci openshift-ci Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 12, 2026

ocpdocs-previewbot commented Jun 12, 2026


🤖 Fri Jun 12 23:03:53 - Prow CI generated the docs preview:

https://113290--ocpdocs-pr.netlify.app/openshift-logging/latest/configuring/cluster-logging-collector.html

johnwilkins and others added 4 commits June 12, 2026 13:03
Replaces numbered callouts (<1>, <2>, etc.) with DITA-compliant
approaches:
- Inline explanatory text after code blocks
- Definition lists for key-value explanations
- Bulleted lists for multi-point explanations

Callouts are not supported in DITA and cause build warnings.

Changes:
- best-practices-multitenant-logging.adoc
  * Removed 10 callouts from YAML examples
  * Replaced with inline explanations and bulleted lists

- troubleshooting-collector-metrics-cardinality.adoc
  * Removed 4 callouts from terminal output examples
  * Replaced with inline explanatory sentences

Vale validation: 0 errors, 6 warnings (block titles only)

Signed-off-by: John Wilkins <jowilkin@redhat.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed level 2 subheadings and inappropriate block titles from:
- modules/collector-metrics-cardinality-impact.adoc
- modules/best-practices-multitenant-logging.adoc

Changes:
- Converted level 2 subheadings (==) to inline content within main module text
- Removed inappropriate block titles (.Anti-pattern, .Recommended, etc.)
- Replaced block titles with inline introductory text
- Fixed PascalCase terms (LokiStack, LogQL) with backticks

DITA compliance verified with vale-check-assembly - 0 errors, 0 warnings on these modules.

Signed-off-by: John Wilkins <jowilkin@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…es module

Removed standalone '+' continuation markers that were causing visible
plus signs in HTML output. These markers were left over from removed
block titles and are not needed for standalone paragraphs after code blocks.

Signed-off-by: John Wilkins <jowilkin@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed RedHat.Using warnings:
- modules/collector-metrics-cardinality-impact.adoc:101
- modules/troubleshooting-collector-metrics-cardinality.adoc:153

All three modules now have 0 errors, 0 warnings.

Signed-off-by: John Wilkins <jowilkin@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

openshift-ci Bot commented Jun 12, 2026


@johnwilkins: all tests passed!

