[addon-operator] add queue head info metric and critical flag to module info#771

Draft
diyliv wants to merge 3 commits into main from feature/queue-head-info-metric
Contributor

@diyliv diyliv commented Jun 2, 2026

What this PR does

Adds a new metric and a new label that together let us replace the flat D8DeckhouseQueueIsHung alert with severity-differentiated alerts.

New metric: tasks_queue_head_info

A gauge (value=1) with labels queue, module, task_type, hook. Published every 5 seconds for each non-empty queue. Old series are expired when the head changes -> no phantom metrics remain.
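A minimal sketch of the expire-then-republish behavior described above. The `headInfoGroup` type is a hypothetical in-memory stand-in, not the real metrics storage, and the queue and hook names are made up for illustration. Each publish cycle drops the whole group before writing the current heads, so a series for a previous head cannot linger.

```go
package main

import "fmt"

// headLabels mirrors the four labels listed in the description.
type headLabels struct {
	Queue, Module, TaskType, Hook string
}

// headInfoGroup is a hypothetical in-memory stand-in for a grouped metric store.
type headInfoGroup struct {
	series map[headLabels]float64
}

// publish expires every series in the group, then sets value 1
// for the current head of each non-empty queue.
func (g *headInfoGroup) publish(heads []headLabels) {
	g.series = make(map[headLabels]float64, len(heads))
	for _, h := range heads {
		g.series[h] = 1
	}
}

func main() {
	g := &headInfoGroup{}

	// First cycle: a global task is at the head of the (assumed) "main" queue.
	g.publish([]headLabels{{Queue: "main", TaskType: "ConvergeModules"}})

	// Next cycle: the head changed, so the old series must disappear.
	g.publish([]headLabels{{Queue: "main", TaskType: "GlobalHookRun", Hook: "example-hook"}})

	fmt.Println(len(g.series)) // prints 1
}
```

In this sketch the group is rebuilt on every cycle, which trades a little churn for never having to track which label sets became stale.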

Label cleanup:

  • ParallelModuleRun synthetic names like "Parallel run for a, b, c" -> normalized to empty string (would otherwise produce a bad join with deckhouse_mm_module_info)
  • Global tasks (ConvergeModules, GlobalHookRun, DiscoverHelmReleases, ApplyKubeConfigValues) -> module is empty, which is correct since these are not module-specific
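The normalization rule above can be sketched as a small pure function. `normalizeModuleLabel` is a hypothetical helper name chosen for this example, and `node-manager` is only a sample module name; the prefix string is the one quoted in the description.

```go
package main

import (
	"fmt"
	"strings"
)

// parallelRunPrefix is the synthetic-name prefix quoted in the PR description.
const parallelRunPrefix = "Parallel run for "

// normalizeModuleLabel (hypothetical helper) maps synthetic ParallelModuleRun
// names to an empty label and leaves every other value untouched.
func normalizeModuleLabel(module string) string {
	if strings.HasPrefix(module, parallelRunPrefix) {
		return ""
	}
	return module
}

func main() {
	for _, name := range []string{"Parallel run for a, b, c", "node-manager", ""} {
		fmt.Printf("%q -> %q\n", name, normalizeModuleLabel(name))
	}
}
```

Keeping the rule in a standalone function like this would make it easy to cover with a table-driven test, independent of the metric storage.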

New label: critical on deckhouse_mm_module_info

Value "true" or "false" from BasicModule.GetCritical() (the critical: true property in module.yaml). Added additively -> existing queries are unaffected.

Why it's needed

The old D8DeckhouseQueueIsHung alert had two problems:

  • No way to see what's stuck -> only the queue name was visible, not the module, task type, or hook
  • Same severity for everything -> all hung queues alerted at severity 7 regardless of how critical the module was

With the new metric and the new label, we can create three separate alerts:

| Alert | Severity | Triggers for |
|---|---|---|
| D8DeckhouseQueueIsHungCritical | 4 | critical="true" modules |
| D8DeckhouseQueueIsHung | 6 | critical="false" modules |
| D8DeckhouseQueueIsHungGlobal | 4 | global tasks (module="") |
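To make the routing in the table concrete, here is a rough Go sketch of the decision it implies. This is not how the alerts are implemented (they would be alerting rules over the two metrics); `routeHungQueue`, `alertRoute`, and the sample module names are hypothetical and exist only to illustrate the mapping.

```go
package main

import "fmt"

// alertRoute pairs an alert name with its severity, as listed in the table.
type alertRoute struct {
	Name     string
	Severity int
}

// routeHungQueue (hypothetical) picks an alert from the head task's module
// label and a lookup of which modules are marked critical.
func routeHungQueue(module string, critical map[string]bool) alertRoute {
	switch {
	case module == "":
		// Global tasks carry an empty module label.
		return alertRoute{Name: "D8DeckhouseQueueIsHungGlobal", Severity: 4}
	case critical[module]:
		return alertRoute{Name: "D8DeckhouseQueueIsHungCritical", Severity: 4}
	default:
		return alertRoute{Name: "D8DeckhouseQueueIsHung", Severity: 6}
	}
}

func main() {
	// Assumed sample data: one critical module, one non-critical.
	critical := map[string]bool{"module-a": true, "module-b": false}
	for _, m := range []string{"", "module-a", "module-b"} {
		fmt.Printf("%q -> %+v\n", m, routeHungQueue(m, critical))
	}
}
```

Note that under this sketch a normalized ParallelModuleRun head also lands in the global bucket, because its module label is empty after normalization.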

@diyliv diyliv self-assigned this Jun 2, 2026
@diyliv diyliv marked this pull request as draft June 2, 2026 17:03
@diyliv diyliv changed the title add queue head info metric and critical flag to module info [addon-operator] add queue head info metric and critical flag to module info Jun 2, 2026
@diyliv diyliv force-pushed the feature/queue-head-info-metric branch from 2b91642 to 14ed834 Compare June 2, 2026 17:11
diyliv added 2 commits June 3, 2026 16:58
Signed-off-by: diyliv <onlogn081@gmail.com>
Signed-off-by: diyliv <onlogn081@gmail.com>
@diyliv diyliv force-pushed the feature/queue-head-info-metric branch from 14ed834 to 580232f Compare June 3, 2026 13:59
@diyliv diyliv added the release-note/enhancement (New feature or request) and publish/image/dev (Build and push dev image using PR number as docker tag) labels Jun 3, 2026
@github-actions github-actions Bot removed the publish/image/dev (Build and push dev image using PR number as docker tag) label Jun 3, 2026
@ldmonster ldmonster requested a review from Copilot June 5, 2026 13:31
Contributor

Copilot AI left a comment

Pull request overview

This PR enhances addon-operator observability for “hung queue” alerting by adding a new “queue head” metric (to show what’s actually stuck) and extending the module info metric with a critical label (to enable severity-differentiated alerts based on module criticality).

Changes:

  • Add tasks_queue_head_info gauge metric (published every 5s for non-empty queues) with labels: queue, module, task_type, hook, expiring old series when the head changes.
  • Extend deckhouse_mm_module_info (mm_module_info) metric with an additive critical={"true"|"false"} label derived from BasicModule.GetCritical().
  • Wire the new queue-head extraction into bootstrap and add unit tests for head-info publication/expiration behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| pkg/module_manager/module_manager.go | Adds critical label to module info metric series. |
| pkg/metrics/metrics.go | Introduces tasks_queue_head_info metric and publishes it alongside queue length updates. |
| pkg/metrics/metrics_test.go | Adds tests for queue head info metric creation, normalization, and expiration. |
| pkg/addon-operator/bootstrap.go | Provides a metadata extractor for deriving (module, hook) for the new head-info metric. |

Comment on lines +514 to +517
critical := "false"
if bm := mm.GetModule(module); bm != nil && bm.GetCritical() {
	critical = "true"
}
Contributor Author

fixed

Comment thread pkg/metrics/metrics.go
Comment on lines +618 to +639
func updateTasksQueueHeadInfo(tqs *queue.TaskQueueSet, metricStorage metricsstorage.Storage, headInfoExtractor func(metadata interface{}) (module, hook string)) {
	metricStorage.Grouped().ExpireGroupMetricByName("tasks_queue_head_info", TasksQueueHeadInfo)

	tqs.IterateSnapshot(context.TODO(), func(_ context.Context, q *queue.TaskQueue) {
		t := q.GetFirst()
		if t == nil {
			return
		}

		module, hook := headInfoExtractor(t.GetMetadata())

		// Normalize ParallelModuleRun synthetic module names:
		// "Parallel run for a, b, c" -> "" to avoid false joins with deckhouse_mm_module_info.
		if strings.HasPrefix(module, "Parallel run for ") {
			module = ""
		}

		metricStorage.Grouped().GaugeSet(
			"tasks_queue_head_info",
			TasksQueueHeadInfo,
			1,
			map[string]string{
Contributor Author

fixed

Signed-off-by: diyliv <onlogn081@gmail.com>
@diyliv diyliv added the enhancement (New feature or request) and go (Pull requests that update Go code) labels and removed the release-note/enhancement (New feature or request) label Jun 6, 2026