From 1827b2bb9f0a070d29ee26204a5ce9ca74b4fe23 Mon Sep 17 00:00:00 2001
From: Erik Darling <2136037+erikdarlingdata@users.noreply.github.com>
Date: Tue, 16 Jun 2026 11:21:51 -0400
Subject: [PATCH] Fix low-disk alert re-firing every cooldown for standing full
 volumes (#754)

The Volume Free Space alert treated free space as a level: while a volume
stayed below threshold it re-fired every AlertCooldownMinutes (default 5) in
both apps. Each cycle also wrote a fresh Alert-History row, so dismissing the
alert appeared broken -- the dismissed row was replaced by an identical, newer
one on the next cycle.

Add a shared LowDiskAlertGate (PerformanceMonitor.Notifications) that fires only
on a fresh breach or one that has worsened >= 1 percentage point of free space
below the last-alerted level, and clear the per-server watermark when the volume
recovers. Mirrors the failed-job watermark and the #1091 RollingCountAlertGate
edge trigger. Wired identically into Dashboard and Lite.

Covered by LowDiskAlertGateTests (9 cases). Both apps build clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                                  |  1 +
 Dashboard/MainWindow.xaml.cs                  | 14 ++++-
 Lite.Tests/LowDiskAlertGateTests.cs           | 61 ++++++++++++++++++
 Lite/MainWindow.xaml.cs                       | 14 ++++-
 .../LowDiskAlertGate.cs                       | 63 +++++++++++++++++++
 5 files changed, 151 insertions(+), 2 deletions(-)
 create mode 100644 Lite.Tests/LowDiskAlertGateTests.cs
 create mode 100644 PerformanceMonitor.Notifications/LowDiskAlertGate.cs
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5f523d85..c5ae05ad 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -16,6 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Lite and Dashboard: Azure SQL Database shows its real product name in FinOps → Server Inventory** — the Edition column displayed the legacy `SQL Azure` value that `SERVERPROPERTY('Edition')` returns for Azure SQL DB; it now reads `Azure SQL Database (<service tier>)` (e.g. `Azure SQL Database (General Purpose)`), derived from `DATABASEPROPERTYEX(DB_NAME(), 'Edition')`, for any engine-edition-5 instance. Normalized at every edition display/storage site across **both** apps — the live inventory queries (Lite + Dashboard) and the SQL-side collectors (`install/42`, `install/53`) plus Lite's `server_properties` collector — so the value is consistent app-wide; on-prem editions are unchanged. (The licensing-recommendation queries are left raw and identical in both apps: they only do an `Enterprise` substring check and never display the edition for Azure.)
 - **Dashboard: "Deadlocks Cleared" no longer flaps right after every deadlock** ([#1091]) — deadlock detection is edge-triggered off a delta against the cumulative perfmon counter, so the check immediately after a deadlock saw a zero delta and fired a *"Deadlocks Cleared"* notification ~one interval (≈60s) after every *"Deadlock Detected"*. The alert now stays active and clears only once a deadlock-quiet window (1 hour) has elapsed since the last new deadlock, so the detect/clear pair lines up with Lite, whose rolling 1-hour count drains about an hour after the last deadlock. Each new deadlock resets the window. The clear message is now *"No deadlocks in the last hour"* (was *"No deadlocks since last check"*). Covered by `DeadlockAlertClearPolicyTests`
 - **Lite: blocking and deadlock alerts no longer re-fire for the same events every cooldown** ([#1091]) — the overview alert engine treated the blocking and deadlock counts as a level: each check compared the rolling **1-hour** count against the threshold, so a single deadlock (or blocked-process report) kept the count above the threshold for the whole hour it lingered in the window, and the alert re-fired every cooldown (the reporter saw the same "2 deadlocks in the last hour" notification every five minutes for an hour). The Dashboard already edge-triggers off a delta; Lite now does too. Both alerts are gated by a new `RollingCountAlertGate` that fires only when the rolling count climbs above the count recorded at the last fired alert — a genuinely new event. The watermark decays as old events age out of the window (so a later rise re-alerts), resets when the window empties, and advances only when an alert actually fires (so an event arriving during a cooldown is reported once the cooldown elapses rather than being swallowed). Covered by `RollingCountAlertGateTests`
+- **Lite and Dashboard: low-disk (Volume Free Space) alert no longer re-fires every cooldown for a standing full volume** — a breached volume is a sustained condition, but the alert engine treated free space as a level and re-fired every `AlertCooldownMinutes` (default 5) for as long as the volume stayed below threshold. Besides the repeated tray/email, every cycle wrote a fresh Alert-History row, so dismissing the alert appeared not to work — the dismissed row was immediately replaced by an identical, newer one. The alert is now gated by a shared `LowDiskAlertGate` that notifies only on a **fresh** breach or one that has **worsened** by at least 1 percentage point of free space below the last-alerted level, and clears its watermark when the volume recovers — mirroring the failed-job watermark and the [#1091] rolling-count edge trigger. Fixed identically in both apps. Covered by `LowDiskAlertGateTests`
 - **Lite: blocking/deadlock XE sessions now self-heal and failures are surfaced** ([#1086]) — the `PerformanceMonitor_BlockedProcess` and `PerformanceMonitor_Deadlock` Extended Events sessions were created only when a server tab was opened; the recurring background collection loop never created or retried them. A server monitored without an open tab (e.g. app minimized to tray after a restart), or a first attempt that failed (connection not ready, missing `ALTER ANY EVENT SESSION`), left blocking/deadlock capture permanently dead — while the collectors read the non-existent ring buffer, got zero rows, and reported **OK**. The session ensure now runs inside the collector itself on every cycle (cheap existence check once created), so both the tab-open path and the background loop create/start/retry it. A failed ensure can no longer be masked: it fails the collector run, shows in the status-bar collector health (including permission failures, which previously didn't count as "erroring"), and fires a one-time tray notification ("Capture Not Running") on the transition. The Azure SQL DB database-scoped sessions also gain `STARTUP_STATE = ON` so they restart automatically after a failover
 - **Dashboard: blocking/deadlock XE sessions self-heal, Azure SQL DB sessions are actually created, and a missing session raises a Capture Down alert** — same silent-failure family as [#1086], worse on the Dashboard side. (1) The server-scoped sessions were created once at install and never re-ensured: if later stopped or dropped, `collect.blocked_process_xml_collector` and `collect.deadlock_xml_collector` swallowed the missing-session error and logged `SUCCESS` with zero rows forever. Both procs now ensure (create/start) the session at the top of every run. (2) On Azure SQL DB, the code comments claimed the database-scoped sessions were "auto-created by the collection procedures" — nothing anywhere created them, so blocking/deadlock capture was 100% non-functional on Azure SQL DB; the procs now create and start them (`database_xml_deadlock_report` for deadlocks — the Azure read also filtered on the wrong event name and would have returned nothing even with a session present). (3) Honest logging: when the session is genuinely absent and can't be created (typically missing `ALTER ANY EVENT SESSION` on-prem / `CREATE ANY DATABASE EVENT SESSION` on Azure SQL DB), the run logs `SESSION_MISSING` with the real error instead of `SUCCESS`. (4) The alert engine reads that status and raises a **Capture Down** alert through the standard pipeline — snoozable tray notification, email, webhook, alert history, cooldown, and mute — with a **Capture Restored** clear when the session comes back. Note: on Azure SQL DB the blocked-process *threshold* cannot be set via `sp_configure` and Microsoft documents no default, so the blocked-process session may exist yet capture nothing there; deadlock capture has no such dependency
 - **Blocked-process and deadlock XML processors no longer loop on un-parseable events** — the second-phase parsers (`collect.process_blocked_process_xml` → `sp_HumanEventsBlockViewer`, and `collect.process_deadlock_xml` → `sp_BlitzLock`) only marked a captured event processed when the parse produced at least one row. Events that legitimately yield zero — a *self-block* or non-lock wait (e.g. a memory-grant `RESOURCE_SEMAPHORE` wait that tripped `blocked_process_threshold`, which SQL Server reports as a session blocked by itself), or a deadlock graph the parser can't reconstruct — were never marked, so every collection cycle re-ran the CPU-intensive parser over the same dead events and re-logged a perpetual `NO_RESULTS` while the staging table never drained. Both processors now mark events processed after any clean parse run and log `SUCCESS`; genuine parse failures still roll back and retry. Separately, the blocked-process processor's parse window was half-open (`event_time < @end_date`), so a batch of reports sharing one timestamp — the common case, since a blocked-process monitor loop emits every report at a single instant — fell outside `[MIN, MAX)` and was silently dropped; the upper bound is now inclusive (matching the deadlock processor). Covered by `tools/test_blocked_process_processor.sql` using real self-block and two-session samples
diff --git a/Dashboard/MainWindow.xaml.cs b/Dashboard/MainWindow.xaml.cs
index 251ab435..2c095671 100644
--- a/Dashboard/MainWindow.xaml.cs
+++ b/Dashboard/MainWindow.xaml.cs
@@ -99,6 +99,11 @@ public partial class MainWindow : Window
         private readonly ConcurrentDictionary<string, bool> _activeTempDbSpaceAlert = new();
         private readonly ConcurrentDictionary<string, DateTime> _lastLowDiskAlert = new();
         private readonly ConcurrentDictionary<string, bool> _activeLowDiskAlert = new();
+        /* Worst free-% captured at the last low-disk alert per server (#754 follow-up): a full
+           volume is a standing condition, so without this the alert re-fired — and re-recorded an
+           alert-history row, defeating Dismiss — every cooldown. Gated by LowDiskAlertGate to notify
+           only on a fresh or worsening breach; removed when the condition resolves. */
+        private readonly ConcurrentDictionary<string, double> _lastAlertedLowDiskPercent = new();
         private readonly ConcurrentDictionary<string, DateTime> _lastLongRunningJobAlert = new();
         private readonly ConcurrentDictionary<string, bool> _activeLongRunningJobAlert = new();
         private readonly ConcurrentDictionary<string, DateTime> _lastFailedJobAlert = new();
@@ -1979,11 +1984,17 @@ await _emailAlertService.TrySendAlertEmailAsync(
             {
                 var worst = breachedVolumes[0];
                 _activeLowDiskAlert[serverId] = true;
-                if (!_lastLowDiskAlert.TryGetValue(serverId, out var lastAlert) || (now - lastAlert) >= alertCooldown)
+                double? lastLowDiskPercent =
+                    _lastAlertedLowDiskPercent.TryGetValue(serverId, out var lowDiskPct) ? lowDiskPct : (double?)null;
+                /* #754 follow-up: notify only on a fresh or worsening breach, not every cooldown for a
+                   standing full volume (which also re-recorded a history row and made Dismiss feel broken). */
+                if (LowDiskAlertGate.ShouldAlert(worst.FreePercent, lastLowDiskPercent)
+                    && (!_lastLowDiskAlert.TryGetValue(serverId, out var lastAlert) || (now - lastAlert) >= alertCooldown))
                 {
                     var muteCtx = new AlertMuteContext { ServerName = serverName, MetricName = "Volume Free Space" };
                     bool isMuted = _muteRuleService.IsAlertMuted(muteCtx);
                     _lastLowDiskAlert[serverId] = now;
+                    _lastAlertedLowDiskPercent[serverId] = worst.FreePercent;
                     var lowDiskContext = BuildVolumeFreeSpaceContext(breachedVolumes);
                     var detailText = ContextToDetailText(lowDiskContext);
                     var currentValue = $"{worst.MountPoint} {worst.FreePercent:F0}% free ({worst.FreeGb:F1} GB)";
@@ -2017,6 +2028,7 @@ await _emailAlertService.TrySendAlertEmailAsync(
             }
             else if (_activeLowDiskAlert.TryRemove(serverId, out var wasLowDisk) && wasLowDisk)
             {
+                _lastAlertedLowDiskPercent.TryRemove(serverId, out _);
                 _notificationService?.ShowNotification("Volume Free Space Resolved",
                     $"{serverName}: All volumes back above threshold");
                 _emailAlertService.RecordAlert(serverId, serverName, "Volume Free Space Resolved",
diff --git a/Lite.Tests/LowDiskAlertGateTests.cs b/Lite.Tests/LowDiskAlertGateTests.cs
new file mode 100644
index 00000000..cd316862
--- /dev/null
+++ b/Lite.Tests/LowDiskAlertGateTests.cs
@@ -0,0 +1,61 @@
+using PerformanceMonitor.Notifications;
+using Xunit;
+
+namespace PerformanceMonitorLite.Tests;
+
+/// <summary>
+/// Guards <see cref="LowDiskAlertGate"/> (#754 follow-up): a standing full volume must not
+/// re-alert every cooldown — only a fresh or meaningfully-worsening breach fires. Without this,
+/// the low-disk alert re-recorded an identical alert-history row every cooldown, which made
+/// Alert-History "Dismiss" appear broken (the dismissed row was replaced by a newer one). The
+/// gate is shared by Lite and Dashboard, so this one suite covers both apps' behaviour.
+/// </summary>
+public class LowDiskAlertGateTests
+{
+    [Fact]
+    public void FreshBreach_NoWatermark_Alerts()
+    {
+        Assert.True(LowDiskAlertGate.ShouldAlert(currentWorstFreePercent: 8.0, lastAlertedFreePercent: null));
+    }
+
+    [Fact]
+    public void UnchangedLevel_DoesNotReAlert()
+    {
+        Assert.False(LowDiskAlertGate.ShouldAlert(8.0, 8.0));
+    }
+
+    [Fact]
+    public void SlightlyImproved_DoesNotReAlert()
+    {
+        Assert.False(LowDiskAlertGate.ShouldAlert(8.4, 8.0));
+    }
+
+    [Fact]
+    public void WorsenedWithinJitterMargin_DoesNotReAlert()
+    {
+        // Dropped 0.5pp (< 1.0 default margin) — jitter, not a genuine decline.
+        Assert.False(LowDiskAlertGate.ShouldAlert(7.5, 8.0));
+    }
+
+    [Fact]
+    public void WorsenedExactlyMargin_ReAlerts()
+    {
+        // Dropped exactly the 1.0pp default margin.
+        Assert.True(LowDiskAlertGate.ShouldAlert(7.0, 8.0));
+    }
+
+    [Fact]
+    public void WorsenedBeyondMargin_ReAlerts()
+    {
+        Assert.True(LowDiskAlertGate.ShouldAlert(5.5, 8.0));
+    }
+
+    [Theory]
+    [InlineData(8.0, 8.0, 2.0, false)] // unchanged, custom larger margin
+    [InlineData(6.0, 8.0, 2.0, true)]  // dropped exactly the custom margin
+    [InlineData(6.5, 8.0, 2.0, false)] // dropped 1.5pp (< 2.0 custom margin)
+    public void RespectsCustomMargin(double current, double last, double margin, bool expected)
+    {
+        Assert.Equal(expected, LowDiskAlertGate.ShouldAlert(current, last, margin));
+    }
+}
diff --git a/Lite/MainWindow.xaml.cs b/Lite/MainWindow.xaml.cs
index e16a344d..0a49ade9 100644
--- a/Lite/MainWindow.xaml.cs
+++ b/Lite/MainWindow.xaml.cs
@@ -69,6 +69,10 @@ public partial class MainWindow : Window
     private readonly Dictionary<string, bool> _activeLongRunningQueryAlert = new();
     private readonly Dictionary<string, bool> _activeTempDbSpaceAlert = new();
     private readonly Dictionary<string, bool> _activeLowDiskAlert = new();
+    /* Worst free-% captured at the last low-disk alert per server (#754 follow-up): see the
+       Dashboard counterpart. Without it a standing full volume re-fired — and re-recorded an
+       alert-history row, defeating Dismiss — every cooldown. Gated by LowDiskAlertGate; removed on resolve. */
+    private readonly Dictionary<string, double> _lastAlertedLowDiskPercent = new();
     private readonly Dictionary<string, bool> _activeLongRunningJobAlert = new();
     private readonly Dictionary<string, DateTime> _lastFailedJobAlert = new();
     /* Watermark of the most-recent failed-job run time already alerted per server. A failed run
@@ -1919,11 +1923,18 @@ await _emailAlertService.TrySendAlertEmailAsync(
                 {
                     var worst = breached[0];
                     _activeLowDiskAlert[key] = true;
-                    if (!suppressPopups && (!_lastLowDiskAlert.TryGetValue(key, out var lastLowDisk) || now - lastLowDisk >= alertCooldown))
+                    double? lastLowDiskPercent =
+                        _lastAlertedLowDiskPercent.TryGetValue(key, out var lowDiskPct) ? lowDiskPct : (double?)null;
+                    /* #754 follow-up: notify only on a fresh or worsening breach, not every cooldown for a
+                       standing full volume (which also re-recorded a history row and made Dismiss feel broken). */
+                    if (!suppressPopups
+                        && LowDiskAlertGate.ShouldAlert(worst.FreePercent, lastLowDiskPercent)
+                        && (!_lastLowDiskAlert.TryGetValue(key, out var lastLowDisk) || now - lastLowDisk >= alertCooldown))
                     {
                         var muteCtx = new AlertMuteContext { ServerName = summary.DisplayName, MetricName = "Volume Free Space" };
                         bool isMuted = _muteRuleService.IsAlertMuted(muteCtx);
                         _lastLowDiskAlert[key] = now;
+                        _lastAlertedLowDiskPercent[key] = worst.FreePercent;
 
                         if (!isMuted)
                         {
@@ -1955,6 +1966,7 @@ await _emailAlertService.TrySendAlertEmailAsync(
                 else if (_activeLowDiskAlert.TryGetValue(key, out var wasLowDisk) && wasLowDisk)
                 {
                     _activeLowDiskAlert[key] = false;
+                    _lastAlertedLowDiskPercent.Remove(key);
                     if (!suppressPopups)
                     {
                         _trayService.ShowNotification(
diff --git a/PerformanceMonitor.Notifications/LowDiskAlertGate.cs b/PerformanceMonitor.Notifications/LowDiskAlertGate.cs
new file mode 100644
index 00000000..797d9da0
--- /dev/null
+++ b/PerformanceMonitor.Notifications/LowDiskAlertGate.cs
@@ -0,0 +1,63 @@
+/*
+ * Copyright (c) 2026 Erik Darling, Darling Data LLC
+ *
+ * This file is part of the SQL Server Performance Monitor.
+ *
+ * Licensed under the MIT License. See LICENSE file in the project root for full license information.
+ */
+
+namespace PerformanceMonitor.Notifications;
+
+/// <summary>
+/// Decides whether a low-disk ("Volume Free Space") alert should (re-)fire for a server, so a
+/// standing breach does not re-notify — and re-record an alert-history row — every cooldown.
+/// <para>
+/// A full volume is a sustained condition (disk does not free itself), so a plain per-cooldown
+/// level check fires the alert every <c>AlertCooldownMinutes</c> for as long as the breach lasts.
+/// Besides the repeated tray/email, that wrote a fresh alert-history row every cycle, which made
+/// Alert-History "Dismiss" feel broken: each dismissed row was immediately replaced by an
+/// identical, newer one. This mirrors the failed-job watermark (<c>_lastAlertedFailedJobTime</c>)
+/// and the rolling-count edge trigger (<c>RollingCountAlertGate</c>, #1091): notify on a NEW or
+/// WORSENING breach, stay quiet at an unchanged level. Shared by Lite and Dashboard so the rule
+/// cannot drift between the two apps.
+/// </para>
+/// </summary>
+public static class LowDiskAlertGate
+{
+    /// <summary>
+    /// Minimum drop, in free-space percentage points below the last-alerted level, required to
+    /// re-alert. Keeps normal free-space jitter (logs / tempdb growing and shrinking a fraction
+    /// of a percent) from re-tripping the alert while still catching a genuine decline.
+    /// </summary>
+    public const double DefaultWorseningMarginPercent = 1.0;
+
+    /// <summary>
+    /// Returns true when a low-disk alert should fire this cycle.
+    /// </summary>
+    /// <param name="currentWorstFreePercent">
+    /// Free-space percentage of the worst (lowest-free) breached volume this cycle.
+    /// </param>
+    /// <param name="lastAlertedFreePercent">
+    /// Free percentage captured when the alert last fired for this server, or <c>null</c> when
+    /// there is no active low-disk alert (a fresh breach — the caller clears its watermark when
+    /// the condition resolves).
+    /// </param>
+    /// <param name="worseningMarginPercent">
+    /// Required worsening, in percentage points; defaults to <see cref="DefaultWorseningMarginPercent"/>.
+    /// </param>
+    public static bool ShouldAlert(
+        double currentWorstFreePercent,
+        double? lastAlertedFreePercent,
+        double worseningMarginPercent = DefaultWorseningMarginPercent)
+    {
+        /* Fresh breach (no active alert) always notifies. */
+        if (lastAlertedFreePercent is null)
+        {
+            return true;
+        }
+
+        /* Otherwise only when free space has fallen at least the margin below the last-alerted
+           level — i.e. the breach got meaningfully worse. */
+        return currentWorstFreePercent <= lastAlertedFreePercent.Value - worseningMarginPercent;
+    }
+}