From 1827b2bb9f0a070d29ee26204a5ce9ca74b4fe23 Mon Sep 17 00:00:00 2001 From: Erik Darling <2136037+erikdarlingdata@users.noreply.github.com> Date: Tue, 16 Jun 2026 11:21:51 -0400 Subject: [PATCH] Fix low-disk alert re-firing every cooldown for standing full volumes (#754) The Volume Free Space alert treated free space as a level: while a volume stayed below threshold it re-fired every AlertCooldownMinutes (default 5) in both apps. Each cycle also wrote a fresh Alert-History row, so dismissing the alert appeared broken -- the dismissed row was replaced by an identical, newer one on the next cycle. Add a shared LowDiskAlertGate (PerformanceMonitor.Notifications) that fires only on a fresh breach or one that has worsened >= 1 percentage point of free space below the last-alerted level, and clear the per-server watermark when the volume recovers. Mirrors the failed-job watermark and the #1091 RollingCountAlertGate edge trigger. Wired identically into Dashboard and Lite. Covered by LowDiskAlertGateTests (9 cases). Both apps build clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 1 + Dashboard/MainWindow.xaml.cs | 14 ++++- Lite.Tests/LowDiskAlertGateTests.cs | 61 ++++++++++++++++++ Lite/MainWindow.xaml.cs | 14 ++++- .../LowDiskAlertGate.cs | 63 +++++++++++++++++++ 5 files changed, 151 insertions(+), 2 deletions(-) create mode 100644 Lite.Tests/LowDiskAlertGateTests.cs create mode 100644 PerformanceMonitor.Notifications/LowDiskAlertGate.cs diff --git a/CHANGELOG.md b/CHANGELOG.md index 5f523d85..c5ae05ad 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -16,6 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **Lite and Dashboard: Azure SQL Database shows its real product name in FinOps → Server Inventory** — the Edition column displayed the legacy `SQL Azure` value that `SERVERPROPERTY('Edition')` returns for Azure SQL DB; it now reads `Azure SQL Database ()` (e.g. `Azure SQL Database (General Purpose)`), derived from `DATABASEPROPERTYEX(DB_NAME(), 'Edition')`, for any engine-edition-5 instance. Normalized at every edition display/storage site across **both** apps — the live inventory queries (Lite + Dashboard) and the SQL-side collectors (`install/42`, `install/53`) plus Lite's `server_properties` collector — so the value is consistent app-wide; on-prem editions are unchanged. (The licensing-recommendation queries are left raw and identical in both apps: they only do an `Enterprise` substring check and never display the edition for Azure.) - **Dashboard: "Deadlocks Cleared" no longer flaps right after every deadlock** ([#1091]) — deadlock detection is edge-triggered off a delta against the cumulative perfmon counter, so the check immediately after a deadlock saw a zero delta and fired a *"Deadlocks Cleared"* notification ~one interval (≈60s) after every *"Deadlock Detected"*. The alert now stays active and clears only once a deadlock-quiet window (1 hour) has elapsed since the last new deadlock, so the detect/clear pair lines up with Lite, whose rolling 1-hour count drains about an hour after the last deadlock. Each new deadlock resets the window. The clear message is now *"No deadlocks in the last hour"* (was *"No deadlocks since last check"*). Covered by `DeadlockAlertClearPolicyTests` - **Lite: blocking and deadlock alerts no longer re-fire for the same events every cooldown** ([#1091]) — the overview alert engine treated the blocking and deadlock counts as a level: each check compared the rolling **1-hour** count against the threshold, so a single deadlock (or blocked-process report) kept the count above the threshold for the whole hour it lingered in the window, and the alert re-fired every cooldown (the reporter saw the same "2 deadlocks in the last hour" notification every five minutes for an hour). The Dashboard already edge-triggers off a delta; Lite now does too. Both alerts are gated by a new `RollingCountAlertGate` that fires only when the rolling count climbs above the count recorded at the last fired alert — a genuinely new event. The watermark decays as old events age out of the window (so a later rise re-alerts), resets when the window empties, and advances only when an alert actually fires (so an event arriving during a cooldown is reported once the cooldown elapses rather than being swallowed). Covered by `RollingCountAlertGateTests` +- **Lite and Dashboard: low-disk (Volume Free Space) alert no longer re-fires every cooldown for a standing full volume** — a breached volume is a sustained condition, but the alert engine treated free space as a level and re-fired every `AlertCooldownMinutes` (default 5) for as long as the volume stayed below threshold. Besides the repeated tray/email, every cycle wrote a fresh Alert-History row, so dismissing the alert appeared not to work — the dismissed row was immediately replaced by an identical, newer one. The alert is now gated by a shared `LowDiskAlertGate` that notifies only on a **fresh** breach or one that has **worsened** by at least 1 percentage point of free space below the last-alerted level, and clears its watermark when the volume recovers — mirroring the failed-job watermark and the [#1091] rolling-count edge trigger. Fixed identically in both apps. Covered by `LowDiskAlertGateTests` - **Lite: blocking/deadlock XE sessions now self-heal and failures are surfaced** ([#1086]) — the `PerformanceMonitor_BlockedProcess` and `PerformanceMonitor_Deadlock` Extended Events sessions were created only when a server tab was opened; the recurring background collection loop never created or retried them. A server monitored without an open tab (e.g. app minimized to tray after a restart), or a first attempt that failed (connection not ready, missing `ALTER ANY EVENT SESSION`), left blocking/deadlock capture permanently dead — while the collectors read the non-existent ring buffer, got zero rows, and reported **OK**. The session ensure now runs inside the collector itself on every cycle (cheap existence check once created), so both the tab-open path and the background loop create/start/retry it. A failed ensure can no longer be masked: it fails the collector run, shows in the status-bar collector health (including permission failures, which previously didn't count as "erroring"), and fires a one-time tray notification ("Capture Not Running") on the transition. The Azure SQL DB database-scoped sessions also gain `STARTUP_STATE = ON` so they restart automatically after a failover - **Dashboard: blocking/deadlock XE sessions self-heal, Azure SQL DB sessions are actually created, and a missing session raises a Capture Down alert** — same silent-failure family as [#1086], worse on the Dashboard side. (1) The server-scoped sessions were created once at install and never re-ensured: if later stopped or dropped, `collect.blocked_process_xml_collector` and `collect.deadlock_xml_collector` swallowed the missing-session error and logged `SUCCESS` with zero rows forever. Both procs now ensure (create/start) the session at the top of every run. (2) On Azure SQL DB, the code comments claimed the database-scoped sessions were "auto-created by the collection procedures" — nothing anywhere created them, so blocking/deadlock capture was 100% non-functional on Azure SQL DB; the procs now create and start them (`database_xml_deadlock_report` for deadlocks — the Azure read also filtered on the wrong event name and would have returned nothing even with a session present). (3) Honest logging: when the session is genuinely absent and can't be created (typically missing `ALTER ANY EVENT SESSION` on-prem / `CREATE ANY DATABASE EVENT SESSION` on Azure SQL DB), the run logs `SESSION_MISSING` with the real error instead of `SUCCESS`. (4) The alert engine reads that status and raises a **Capture Down** alert through the standard pipeline — snoozable tray notification, email, webhook, alert history, cooldown, and mute — with a **Capture Restored** clear when the session comes back. Note: on Azure SQL DB the blocked-process *threshold* cannot be set via `sp_configure` and Microsoft documents no default, so the blocked-process session may exist yet capture nothing there; deadlock capture has no such dependency - **Blocked-process and deadlock XML processors no longer loop on un-parseable events** — the second-phase parsers (`collect.process_blocked_process_xml` → `sp_HumanEventsBlockViewer`, and `collect.process_deadlock_xml` → `sp_BlitzLock`) only marked a captured event processed when the parse produced at least one row. Events that legitimately yield zero — a *self-block* or non-lock wait (e.g. a memory-grant `RESOURCE_SEMAPHORE` wait that tripped `blocked_process_threshold`, which SQL Server reports as a session blocked by itself), or a deadlock graph the parser can't reconstruct — were never marked, so every collection cycle re-ran the CPU-intensive parser over the same dead events and re-logged a perpetual `NO_RESULTS` while the staging table never drained. Both processors now mark events processed after any clean parse run and log `SUCCESS`; genuine parse failures still roll back and retry. Separately, the blocked-process processor's parse window was half-open (`event_time < @end_date`), so a batch of reports sharing one timestamp — the common case, since a blocked-process monitor loop emits every report at a single instant — fell outside `[MIN, MAX)` and was silently dropped; the upper bound is now inclusive (matching the deadlock processor). Covered by `tools/test_blocked_process_processor.sql` using real self-block and two-session samples diff --git a/Dashboard/MainWindow.xaml.cs b/Dashboard/MainWindow.xaml.cs index 251ab435..2c095671 100644 --- a/Dashboard/MainWindow.xaml.cs +++ b/Dashboard/MainWindow.xaml.cs @@ -99,6 +99,11 @@ public partial class MainWindow : Window private readonly ConcurrentDictionary _activeTempDbSpaceAlert = new(); private readonly ConcurrentDictionary _lastLowDiskAlert = new(); private readonly ConcurrentDictionary _activeLowDiskAlert = new(); + /* Worst free-% captured at the last low-disk alert per server (#754 follow-up): a full + volume is a standing condition, so without this the alert re-fired — and re-recorded an + alert-history row, defeating Dismiss — every cooldown. Gated by LowDiskAlertGate to notify + only on a fresh or worsening breach; removed when the condition resolves. */ + private readonly ConcurrentDictionary _lastAlertedLowDiskPercent = new(); private readonly ConcurrentDictionary _lastLongRunningJobAlert = new(); private readonly ConcurrentDictionary _activeLongRunningJobAlert = new(); private readonly ConcurrentDictionary _lastFailedJobAlert = new(); @@ -1979,11 +1984,17 @@ await _emailAlertService.TrySendAlertEmailAsync( { var worst = breachedVolumes[0]; _activeLowDiskAlert[serverId] = true; - if (!_lastLowDiskAlert.TryGetValue(serverId, out var lastAlert) || (now - lastAlert) >= alertCooldown) + double? lastLowDiskPercent = + _lastAlertedLowDiskPercent.TryGetValue(serverId, out var lowDiskPct) ? lowDiskPct : (double?)null; + /* #754 follow-up: notify only on a fresh or worsening breach, not every cooldown for a + standing full volume (which also re-recorded a history row and made Dismiss feel broken). */ + if (LowDiskAlertGate.ShouldAlert(worst.FreePercent, lastLowDiskPercent) + && (!_lastLowDiskAlert.TryGetValue(serverId, out var lastAlert) || (now - lastAlert) >= alertCooldown)) { var muteCtx = new AlertMuteContext { ServerName = serverName, MetricName = "Volume Free Space" }; bool isMuted = _muteRuleService.IsAlertMuted(muteCtx); _lastLowDiskAlert[serverId] = now; + _lastAlertedLowDiskPercent[serverId] = worst.FreePercent; var lowDiskContext = BuildVolumeFreeSpaceContext(breachedVolumes); var detailText = ContextToDetailText(lowDiskContext); var currentValue = $"{worst.MountPoint} {worst.FreePercent:F0}% free ({worst.FreeGb:F1} GB)"; @@ -2017,6 +2028,7 @@ await _emailAlertService.TrySendAlertEmailAsync( } else if (_activeLowDiskAlert.TryRemove(serverId, out var wasLowDisk) && wasLowDisk) { + _lastAlertedLowDiskPercent.TryRemove(serverId, out _); _notificationService?.ShowNotification("Volume Free Space Resolved", $"{serverName}: All volumes back above threshold"); _emailAlertService.RecordAlert(serverId, serverName, "Volume Free Space Resolved", diff --git a/Lite.Tests/LowDiskAlertGateTests.cs b/Lite.Tests/LowDiskAlertGateTests.cs new file mode 100644 index 00000000..cd316862 --- /dev/null +++ b/Lite.Tests/LowDiskAlertGateTests.cs @@ -0,0 +1,61 @@ +using PerformanceMonitor.Notifications; +using Xunit; + +namespace PerformanceMonitorLite.Tests; + +/// +/// Guards (#754 follow-up): a standing full volume must not +/// re-alert every cooldown — only a fresh or meaningfully-worsening breach fires. Without this, +/// the low-disk alert re-recorded an identical alert-history row every cooldown, which made +/// Alert-History "Dismiss" appear broken (the dismissed row was replaced by a newer one). The +/// gate is shared by Lite and Dashboard, so this one suite covers both apps' behaviour. +/// +public class LowDiskAlertGateTests +{ + [Fact] + public void FreshBreach_NoWatermark_Alerts() + { + Assert.True(LowDiskAlertGate.ShouldAlert(currentWorstFreePercent: 8.0, lastAlertedFreePercent: null)); + } + + [Fact] + public void UnchangedLevel_DoesNotReAlert() + { + Assert.False(LowDiskAlertGate.ShouldAlert(8.0, 8.0)); + } + + [Fact] + public void SlightlyImproved_DoesNotReAlert() + { + Assert.False(LowDiskAlertGate.ShouldAlert(8.4, 8.0)); + } + + [Fact] + public void WorsenedWithinJitterMargin_DoesNotReAlert() + { + // Dropped 0.5pp (< 1.0 default margin) — jitter, not a genuine decline. + Assert.False(LowDiskAlertGate.ShouldAlert(7.5, 8.0)); + } + + [Fact] + public void WorsenedExactlyMargin_ReAlerts() + { + // Dropped exactly the 1.0pp default margin. + Assert.True(LowDiskAlertGate.ShouldAlert(7.0, 8.0)); + } + + [Fact] + public void WorsenedBeyondMargin_ReAlerts() + { + Assert.True(LowDiskAlertGate.ShouldAlert(5.5, 8.0)); + } + + [Theory] + [InlineData(8.0, 8.0, 2.0, false)] // unchanged, custom larger margin + [InlineData(6.0, 8.0, 2.0, true)] // dropped exactly the custom margin + [InlineData(6.5, 8.0, 2.0, false)] // dropped 1.5pp (< 2.0 custom margin) + public void RespectsCustomMargin(double current, double last, double margin, bool expected) + { + Assert.Equal(expected, LowDiskAlertGate.ShouldAlert(current, last, margin)); + } +} diff --git a/Lite/MainWindow.xaml.cs b/Lite/MainWindow.xaml.cs index e16a344d..0a49ade9 100644 --- a/Lite/MainWindow.xaml.cs +++ b/Lite/MainWindow.xaml.cs @@ -69,6 +69,10 @@ public partial class MainWindow : Window private readonly Dictionary _activeLongRunningQueryAlert = new(); private readonly Dictionary _activeTempDbSpaceAlert = new(); private readonly Dictionary _activeLowDiskAlert = new(); + /* Worst free-% captured at the last low-disk alert per server (#754 follow-up): see the + Dashboard counterpart. Without it a standing full volume re-fired — and re-recorded an + alert-history row, defeating Dismiss — every cooldown. Gated by LowDiskAlertGate; removed on resolve. */ + private readonly Dictionary _lastAlertedLowDiskPercent = new(); private readonly Dictionary _activeLongRunningJobAlert = new(); private readonly Dictionary _lastFailedJobAlert = new(); /* Watermark of the most-recent failed-job run time already alerted per server. A failed run @@ -1919,11 +1923,18 @@ await _emailAlertService.TrySendAlertEmailAsync( { var worst = breached[0]; _activeLowDiskAlert[key] = true; - if (!suppressPopups && (!_lastLowDiskAlert.TryGetValue(key, out var lastLowDisk) || now - lastLowDisk >= alertCooldown)) + double? lastLowDiskPercent = + _lastAlertedLowDiskPercent.TryGetValue(key, out var lowDiskPct) ? lowDiskPct : (double?)null; + /* #754 follow-up: notify only on a fresh or worsening breach, not every cooldown for a + standing full volume (which also re-recorded a history row and made Dismiss feel broken). */ + if (!suppressPopups + && LowDiskAlertGate.ShouldAlert(worst.FreePercent, lastLowDiskPercent) + && (!_lastLowDiskAlert.TryGetValue(key, out var lastLowDisk) || now - lastLowDisk >= alertCooldown)) { var muteCtx = new AlertMuteContext { ServerName = summary.DisplayName, MetricName = "Volume Free Space" }; bool isMuted = _muteRuleService.IsAlertMuted(muteCtx); _lastLowDiskAlert[key] = now; + _lastAlertedLowDiskPercent[key] = worst.FreePercent; if (!isMuted) { @@ -1955,6 +1966,7 @@ await _emailAlertService.TrySendAlertEmailAsync( else if (_activeLowDiskAlert.TryGetValue(key, out var wasLowDisk) && wasLowDisk) { _activeLowDiskAlert[key] = false; + _lastAlertedLowDiskPercent.Remove(key); if (!suppressPopups) { _trayService.ShowNotification( diff --git a/PerformanceMonitor.Notifications/LowDiskAlertGate.cs b/PerformanceMonitor.Notifications/LowDiskAlertGate.cs new file mode 100644 index 00000000..797d9da0 --- /dev/null +++ b/PerformanceMonitor.Notifications/LowDiskAlertGate.cs @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2026 Erik Darling, Darling Data LLC + * + * This file is part of the SQL Server Performance Monitor. + * + * Licensed under the MIT License. See LICENSE file in the project root for full license information. + */ + +namespace PerformanceMonitor.Notifications; + +/// +/// Decides whether a low-disk ("Volume Free Space") alert should (re-)fire for a server, so a +/// standing breach does not re-notify — and re-record an alert-history row — every cooldown. +/// +/// A full volume is a sustained condition (disk does not free itself), so a plain per-cooldown +/// level check fires the alert every AlertCooldownMinutes for as long as the breach lasts. +/// Besides the repeated tray/email, that wrote a fresh alert-history row every cycle, which made +/// Alert-History "Dismiss" feel broken: each dismissed row was immediately replaced by an +/// identical, newer one. This mirrors the failed-job watermark (_lastAlertedFailedJobTime) +/// and the rolling-count edge trigger (RollingCountAlertGate, #1091): notify on a NEW or +/// WORSENING breach, stay quiet at an unchanged level. Shared by Lite and Dashboard so the rule +/// cannot drift between the two apps. +/// +/// +public static class LowDiskAlertGate +{ + /// + /// Minimum drop, in free-space percentage points below the last-alerted level, required to + /// re-alert. Keeps normal free-space jitter (logs / tempdb growing and shrinking a fraction + /// of a percent) from re-tripping the alert while still catching a genuine decline. + /// + public const double DefaultWorseningMarginPercent = 1.0; + + /// + /// Returns true when a low-disk alert should fire this cycle. + /// + /// + /// Free-space percentage of the worst (lowest-free) breached volume this cycle. + /// + /// + /// Free percentage captured when the alert last fired for this server, or null when + /// there is no active low-disk alert (a fresh breach — the caller clears its watermark when + /// the condition resolves). + /// + /// + /// Required worsening, in percentage points; defaults to . + /// + public static bool ShouldAlert( + double currentWorstFreePercent, + double? lastAlertedFreePercent, + double worseningMarginPercent = DefaultWorseningMarginPercent) + { + /* Fresh breach (no active alert) always notifies. */ + if (lastAlertedFreePercent is null) + { + return true; + } + + /* Otherwise only when free space has fallen at least the margin below the last-alerted + level — i.e. the breach got meaningfully worse. */ + return currentWorstFreePercent <= lastAlertedFreePercent.Value - worseningMarginPercent; + } +}