1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -16,6 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Lite and Dashboard: Azure SQL Database shows its real product name in FinOps → Server Inventory** — the Edition column displayed the legacy `SQL Azure` value that `SERVERPROPERTY('Edition')` returns for Azure SQL DB; it now reads `Azure SQL Database (<service tier>)` (e.g. `Azure SQL Database (General Purpose)`), derived from `DATABASEPROPERTYEX(DB_NAME(), 'Edition')`, for any engine-edition-5 instance. Normalized at every edition display/storage site across **both** apps — the live inventory queries (Lite + Dashboard) and the SQL-side collectors (`install/42`, `install/53`) plus Lite's `server_properties` collector — so the value is consistent app-wide; on-prem editions are unchanged. (The licensing-recommendation queries are left raw and identical in both apps: they only do an `Enterprise` substring check and never display the edition for Azure.)
- **Dashboard: "Deadlocks Cleared" no longer flaps right after every deadlock** ([#1091]) — deadlock detection is edge-triggered off a delta against the cumulative perfmon counter, so the check immediately after a deadlock saw a zero delta and fired a *"Deadlocks Cleared"* notification ~one interval (≈60s) after every *"Deadlock Detected"*. The alert now stays active and clears only once a deadlock-quiet window (1 hour) has elapsed since the last new deadlock, so the detect/clear pair lines up with Lite, whose rolling 1-hour count drains about an hour after the last deadlock. Each new deadlock resets the window. The clear message is now *"No deadlocks in the last hour"* (was *"No deadlocks since last check"*). Covered by `DeadlockAlertClearPolicyTests`
- **Lite: blocking and deadlock alerts no longer re-fire for the same events every cooldown** ([#1091]) — the overview alert engine treated the blocking and deadlock counts as a level: each check compared the rolling **1-hour** count against the threshold, so a single deadlock (or blocked-process report) kept the count above the threshold for the whole hour it lingered in the window, and the alert re-fired every cooldown (the reporter saw the same "2 deadlocks in the last hour" notification every five minutes for an hour). The Dashboard already edge-triggers off a delta; Lite now does too. Both alerts are gated by a new `RollingCountAlertGate` that fires only when the rolling count climbs above the count recorded at the last fired alert — a genuinely new event. The watermark decays as old events age out of the window (so a later rise re-alerts), resets when the window empties, and advances only when an alert actually fires (so an event arriving during a cooldown is reported once the cooldown elapses rather than being swallowed). Covered by `RollingCountAlertGateTests`
+- **Lite and Dashboard: low-disk (Volume Free Space) alert no longer re-fires every cooldown for a standing full volume** — a breached volume is a sustained condition, but the alert engine treated free space as a level and re-fired every `AlertCooldownMinutes` (default 5) for as long as the volume stayed below threshold. Besides the repeated tray/email, every cycle wrote a fresh Alert-History row, so dismissing the alert appeared not to work — the dismissed row was immediately replaced by an identical, newer one. The alert is now gated by a shared `LowDiskAlertGate` that notifies only on a **fresh** breach or one that has **worsened** by at least 1 percentage point of free space below the last-alerted level, and clears its watermark when the volume recovers — mirroring the failed-job watermark and the [#1091] rolling-count edge trigger. Fixed identically in both apps. Covered by `LowDiskAlertGateTests`
- **Lite: blocking/deadlock XE sessions now self-heal and failures are surfaced** ([#1086]) — the `PerformanceMonitor_BlockedProcess` and `PerformanceMonitor_Deadlock` Extended Events sessions were created only when a server tab was opened; the recurring background collection loop never created or retried them. A server monitored without an open tab (e.g. app minimized to tray after a restart), or a first attempt that failed (connection not ready, missing `ALTER ANY EVENT SESSION`), left blocking/deadlock capture permanently dead — while the collectors read the non-existent ring buffer, got zero rows, and reported **OK**. The session ensure now runs inside the collector itself on every cycle (cheap existence check once created), so both the tab-open path and the background loop create/start/retry it. A failed ensure can no longer be masked: it fails the collector run, shows in the status-bar collector health (including permission failures, which previously didn't count as "erroring"), and fires a one-time tray notification ("Capture Not Running") on the transition. The Azure SQL DB database-scoped sessions also gain `STARTUP_STATE = ON` so they restart automatically after a failover
- **Dashboard: blocking/deadlock XE sessions self-heal, Azure SQL DB sessions are actually created, and a missing session raises a Capture Down alert** — same silent-failure family as [#1086], worse on the Dashboard side. (1) The server-scoped sessions were created once at install and never re-ensured: if later stopped or dropped, `collect.blocked_process_xml_collector` and `collect.deadlock_xml_collector` swallowed the missing-session error and logged `SUCCESS` with zero rows forever. Both procs now ensure (create/start) the session at the top of every run. (2) On Azure SQL DB, the code comments claimed the database-scoped sessions were "auto-created by the collection procedures" — nothing anywhere created them, so blocking/deadlock capture was 100% non-functional on Azure SQL DB; the procs now create and start them (`database_xml_deadlock_report` for deadlocks — the Azure read also filtered on the wrong event name and would have returned nothing even with a session present). (3) Honest logging: when the session is genuinely absent and can't be created (typically missing `ALTER ANY EVENT SESSION` on-prem / `CREATE ANY DATABASE EVENT SESSION` on Azure SQL DB), the run logs `SESSION_MISSING` with the real error instead of `SUCCESS`. (4) The alert engine reads that status and raises a **Capture Down** alert through the standard pipeline — snoozable tray notification, email, webhook, alert history, cooldown, and mute — with a **Capture Restored** clear when the session comes back. Note: on Azure SQL DB the blocked-process *threshold* cannot be set via `sp_configure` and Microsoft documents no default, so the blocked-process session may exist yet capture nothing there; deadlock capture has no such dependency
- **Blocked-process and deadlock XML processors no longer loop on un-parseable events** — the second-phase parsers (`collect.process_blocked_process_xml` → `sp_HumanEventsBlockViewer`, and `collect.process_deadlock_xml` → `sp_BlitzLock`) only marked a captured event processed when the parse produced at least one row. Events that legitimately yield zero — a *self-block* or non-lock wait (e.g. a memory-grant `RESOURCE_SEMAPHORE` wait that tripped `blocked_process_threshold`, which SQL Server reports as a session blocked by itself), or a deadlock graph the parser can't reconstruct — were never marked, so every collection cycle re-ran the CPU-intensive parser over the same dead events and re-logged a perpetual `NO_RESULTS` while the staging table never drained. Both processors now mark events processed after any clean parse run and log `SUCCESS`; genuine parse failures still roll back and retry. Separately, the blocked-process processor's parse window was half-open (`event_time < @end_date`), so a batch of reports sharing one timestamp — the common case, since a blocked-process monitor loop emits every report at a single instant — fell outside `[MIN, MAX)` and was silently dropped; the upper bound is now inclusive (matching the deadlock processor). Covered by `tools/test_blocked_process_processor.sql` using real self-block and two-session samples
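
The rolling-count edge trigger described in the [#1091] Lite entry above can be pictured with a small sketch. This is not the project's `RollingCountAlertGate`: the class name `RollingCountGateSketch` and both method names are hypothetical, and the threshold and cooldown checks are assumed to stay in the caller.

```csharp
using System;

// Hypothetical illustration only; the real RollingCountAlertGate is not shown in this diff.
public sealed class RollingCountGateSketch
{
    // Rolling count recorded when an alert last fired (the "watermark").
    private int _watermark;

    // Returns true when the rolling count has climbed above the watermark.
    public bool IsNewEvent(int rollingCount)
    {
        if (rollingCount <= 0)
        {
            _watermark = 0;              // window emptied: reset
            return false;
        }

        if (rollingCount < _watermark)
        {
            _watermark = rollingCount;   // old events aged out: decay
        }

        return rollingCount > _watermark;
    }

    // Called by the caller only when an alert was actually fired.
    public void MarkAlerted(int rollingCount) => _watermark = rollingCount;
}
```

A caller would fire when `IsNewEvent` returns true and the cooldown has elapsed, then call `MarkAlerted`. Because the watermark advances only in `MarkAlerted`, an event that arrives during a cooldown keeps reporting true until it is finally alerted. One limitation of a count-only sketch like this: if one event ages out in the same interval that another arrives, the count is unchanged and the sketch stays quiet.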
14 changes: 13 additions & 1 deletion Dashboard/MainWindow.xaml.cs
@@ -99,6 +99,11 @@ public partial class MainWindow : Window
private readonly ConcurrentDictionary<string, bool> _activeTempDbSpaceAlert = new();
private readonly ConcurrentDictionary<string, DateTime> _lastLowDiskAlert = new();
private readonly ConcurrentDictionary<string, bool> _activeLowDiskAlert = new();
+/* Worst free-% captured at the last low-disk alert per server (#754 follow-up): a full
+volume is a standing condition, so without this the alert re-fired — and re-recorded an
+alert-history row, defeating Dismiss — every cooldown. Gated by LowDiskAlertGate to notify
+only on a fresh or worsening breach; removed when the condition resolves. */
+private readonly ConcurrentDictionary<string, double> _lastAlertedLowDiskPercent = new();
private readonly ConcurrentDictionary<string, DateTime> _lastLongRunningJobAlert = new();
private readonly ConcurrentDictionary<string, bool> _activeLongRunningJobAlert = new();
private readonly ConcurrentDictionary<string, DateTime> _lastFailedJobAlert = new();
@@ -1979,11 +1984,17 @@ await _emailAlertService.TrySendAlertEmailAsync(
{
var worst = breachedVolumes[0];
_activeLowDiskAlert[serverId] = true;
-if (!_lastLowDiskAlert.TryGetValue(serverId, out var lastAlert) || (now - lastAlert) >= alertCooldown)
+double? lastLowDiskPercent =
+_lastAlertedLowDiskPercent.TryGetValue(serverId, out var lowDiskPct) ? lowDiskPct : (double?)null;
+/* #754 follow-up: notify only on a fresh or worsening breach, not every cooldown for a
+standing full volume (which also re-recorded a history row and made Dismiss feel broken). */
+if (LowDiskAlertGate.ShouldAlert(worst.FreePercent, lastLowDiskPercent)
+&& (!_lastLowDiskAlert.TryGetValue(serverId, out var lastAlert) || (now - lastAlert) >= alertCooldown))
{
var muteCtx = new AlertMuteContext { ServerName = serverName, MetricName = "Volume Free Space" };
bool isMuted = _muteRuleService.IsAlertMuted(muteCtx);
_lastLowDiskAlert[serverId] = now;
+_lastAlertedLowDiskPercent[serverId] = worst.FreePercent;
var lowDiskContext = BuildVolumeFreeSpaceContext(breachedVolumes);
var detailText = ContextToDetailText(lowDiskContext);
var currentValue = $"{worst.MountPoint} {worst.FreePercent:F0}% free ({worst.FreeGb:F1} GB)";
@@ -2017,6 +2028,7 @@ await _emailAlertService.TrySendAlertEmailAsync(
}
else if (_activeLowDiskAlert.TryRemove(serverId, out var wasLowDisk) && wasLowDisk)
{
+_lastAlertedLowDiskPercent.TryRemove(serverId, out _);
_notificationService?.ShowNotification("Volume Free Space Resolved",
$"{serverName}: All volumes back above threshold");
_emailAlertService.RecordAlert(serverId, serverName, "Volume Free Space Resolved",
61 changes: 61 additions & 0 deletions Lite.Tests/LowDiskAlertGateTests.cs
@@ -0,0 +1,61 @@
using PerformanceMonitor.Notifications;
using Xunit;

namespace PerformanceMonitorLite.Tests;

/// <summary>
/// Guards <see cref="LowDiskAlertGate"/> (#754 follow-up): a standing full volume must not
/// re-alert every cooldown — only a fresh or meaningfully-worsening breach fires. Without this,
/// the low-disk alert re-recorded an identical alert-history row every cooldown, which made
/// Alert-History "Dismiss" appear broken (the dismissed row was replaced by a newer one). The
/// gate is shared by Lite and Dashboard, so this one suite covers both apps' behaviour.
/// </summary>
public class LowDiskAlertGateTests
{
    [Fact]
    public void FreshBreach_NoWatermark_Alerts()
    {
        Assert.True(LowDiskAlertGate.ShouldAlert(currentWorstFreePercent: 8.0, lastAlertedFreePercent: null));
    }

    [Fact]
    public void UnchangedLevel_DoesNotReAlert()
    {
        Assert.False(LowDiskAlertGate.ShouldAlert(8.0, 8.0));
    }

    [Fact]
    public void SlightlyImproved_DoesNotReAlert()
    {
        Assert.False(LowDiskAlertGate.ShouldAlert(8.4, 8.0));
    }

    [Fact]
    public void WorsenedWithinJitterMargin_DoesNotReAlert()
    {
        // Dropped 0.5pp (< 1.0 default margin) — jitter, not a genuine decline.
        Assert.False(LowDiskAlertGate.ShouldAlert(7.5, 8.0));
    }

    [Fact]
    public void WorsenedExactlyMargin_ReAlerts()
    {
        // Dropped exactly the 1.0pp default margin.
        Assert.True(LowDiskAlertGate.ShouldAlert(7.0, 8.0));
    }

    [Fact]
    public void WorsenedBeyondMargin_ReAlerts()
    {
        Assert.True(LowDiskAlertGate.ShouldAlert(5.5, 8.0));
    }

    [Theory]
    [InlineData(8.0, 8.0, 2.0, false)] // unchanged, custom larger margin
    [InlineData(6.0, 8.0, 2.0, true)]  // dropped exactly the custom margin
    [InlineData(6.5, 8.0, 2.0, false)] // dropped 1.5pp (< 2.0 custom margin)
    public void RespectsCustomMargin(double current, double last, double margin, bool expected)
    {
        Assert.Equal(expected, LowDiskAlertGate.ShouldAlert(current, last, margin));
    }
}
14 changes: 13 additions & 1 deletion Lite/MainWindow.xaml.cs
@@ -69,6 +69,10 @@ public partial class MainWindow : Window
private readonly Dictionary<string, bool> _activeLongRunningQueryAlert = new();
private readonly Dictionary<string, bool> _activeTempDbSpaceAlert = new();
private readonly Dictionary<string, bool> _activeLowDiskAlert = new();
+/* Worst free-% captured at the last low-disk alert per server (#754 follow-up): see the
+Dashboard counterpart. Without it a standing full volume re-fired — and re-recorded an
+alert-history row, defeating Dismiss — every cooldown. Gated by LowDiskAlertGate; removed on resolve. */
+private readonly Dictionary<string, double> _lastAlertedLowDiskPercent = new();
private readonly Dictionary<string, bool> _activeLongRunningJobAlert = new();
private readonly Dictionary<string, DateTime> _lastFailedJobAlert = new();
/* Watermark of the most-recent failed-job run time already alerted per server. A failed run
@@ -1919,11 +1923,18 @@ await _emailAlertService.TrySendAlertEmailAsync(
{
var worst = breached[0];
_activeLowDiskAlert[key] = true;
-if (!suppressPopups && (!_lastLowDiskAlert.TryGetValue(key, out var lastLowDisk) || now - lastLowDisk >= alertCooldown))
+double? lastLowDiskPercent =
+_lastAlertedLowDiskPercent.TryGetValue(key, out var lowDiskPct) ? lowDiskPct : (double?)null;
+/* #754 follow-up: notify only on a fresh or worsening breach, not every cooldown for a
+standing full volume (which also re-recorded a history row and made Dismiss feel broken). */
+if (!suppressPopups
+&& LowDiskAlertGate.ShouldAlert(worst.FreePercent, lastLowDiskPercent)
+&& (!_lastLowDiskAlert.TryGetValue(key, out var lastLowDisk) || now - lastLowDisk >= alertCooldown))
{
var muteCtx = new AlertMuteContext { ServerName = summary.DisplayName, MetricName = "Volume Free Space" };
bool isMuted = _muteRuleService.IsAlertMuted(muteCtx);
_lastLowDiskAlert[key] = now;
+_lastAlertedLowDiskPercent[key] = worst.FreePercent;

if (!isMuted)
{
@@ -1955,6 +1966,7 @@ await _emailAlertService.TrySendAlertEmailAsync(
else if (_activeLowDiskAlert.TryGetValue(key, out var wasLowDisk) && wasLowDisk)
{
_activeLowDiskAlert[key] = false;
+_lastAlertedLowDiskPercent.Remove(key);
if (!suppressPopups)
{
_trayService.ShowNotification(
63 changes: 63 additions & 0 deletions PerformanceMonitor.Notifications/LowDiskAlertGate.cs
@@ -0,0 +1,63 @@
/*
 * Copyright (c) 2026 Erik Darling, Darling Data LLC
 *
 * This file is part of the SQL Server Performance Monitor.
 *
 * Licensed under the MIT License. See LICENSE file in the project root for full license information.
 */

namespace PerformanceMonitor.Notifications;

/// <summary>
/// Decides whether a low-disk ("Volume Free Space") alert should (re-)fire for a server, so a
/// standing breach does not re-notify — and re-record an alert-history row — every cooldown.
/// <para>
/// A full volume is a sustained condition (disk does not free itself), so a plain per-cooldown
/// level check fires the alert every <c>AlertCooldownMinutes</c> for as long as the breach lasts.
/// Besides the repeated tray/email, that wrote a fresh alert-history row every cycle, which made
/// Alert-History "Dismiss" feel broken: each dismissed row was immediately replaced by an
/// identical, newer one. This mirrors the failed-job watermark (<c>_lastAlertedFailedJobTime</c>)
/// and the rolling-count edge trigger (<c>RollingCountAlertGate</c>, #1091): notify on a NEW or
/// WORSENING breach, stay quiet at an unchanged level. Shared by Lite and Dashboard so the rule
/// cannot drift between the two apps.
/// </para>
/// </summary>
public static class LowDiskAlertGate
{
    /// <summary>
    /// Minimum drop, in free-space percentage points below the last-alerted level, required to
    /// re-alert. Keeps normal free-space jitter (logs / tempdb growing and shrinking a fraction
    /// of a percent) from re-tripping the alert while still catching a genuine decline.
    /// </summary>
    public const double DefaultWorseningMarginPercent = 1.0;

    /// <summary>
    /// Returns true when a low-disk alert should fire this cycle.
    /// </summary>
    /// <param name="currentWorstFreePercent">
    /// Free-space percentage of the worst (lowest-free) breached volume this cycle.
    /// </param>
    /// <param name="lastAlertedFreePercent">
    /// Free percentage captured when the alert last fired for this server, or <c>null</c> when
    /// there is no active low-disk alert (a fresh breach — the caller clears its watermark when
    /// the condition resolves).
    /// </param>
    /// <param name="worseningMarginPercent">
    /// Required worsening, in percentage points; defaults to <see cref="DefaultWorseningMarginPercent"/>.
    /// </param>
    public static bool ShouldAlert(
        double currentWorstFreePercent,
        double? lastAlertedFreePercent,
        double worseningMarginPercent = DefaultWorseningMarginPercent)
    {
        /* Fresh breach (no active alert) always notifies. */
        if (lastAlertedFreePercent is null)
        {
            return true;
        }

        /* Otherwise only when free space has fallen at least the margin below the last-alerted
           level — i.e. the breach got meaningfully worse. */
        return currentWorstFreePercent <= lastAlertedFreePercent.Value - worseningMarginPercent;
    }
}
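
To see how the gate and the caller-owned watermark interact over several cycles, here is a small replay sketch. `LowDiskGateTrace` and `Replay` are hypothetical names used only for illustration; the private `ShouldAlert` repeats the comparison from `LowDiskAlertGate.ShouldAlert` so the sketch compiles on its own, and the cooldown, mute, and `suppressPopups` checks that the real callers also apply are deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of this change.
public static class LowDiskGateTrace
{
    // Same comparison as LowDiskAlertGate.ShouldAlert, repeated so this sketch stands alone.
    private static bool ShouldAlert(double current, double? last, double margin = 1.0) =>
        last is null || current <= last.Value - margin;

    // Replays one reading per cycle: the worst free-% among breached volumes,
    // or null when no volume is below threshold. Returns whether each cycle alerts.
    public static List<bool> Replay(IEnumerable<double?> worstFreePercentPerCycle)
    {
        double? watermark = null;   // free-% recorded at the last fired alert
        var fired = new List<bool>();

        foreach (var reading in worstFreePercentPerCycle)
        {
            if (reading is null)
            {
                watermark = null;   // condition resolved: clear the watermark
                fired.Add(false);
                continue;
            }

            bool fire = ShouldAlert(reading.Value, watermark);
            if (fire)
            {
                watermark = reading.Value;   // advance only when an alert fires
            }

            fired.Add(fire);
        }

        return fired;
    }
}
```

Replaying `8.0, 8.0, 7.5, 7.0, null, 8.0` alerts on the first, fourth, and last cycles only. The fourth cycle fires because the watermark stayed at 8.0 through the 7.5 reading, so a slow decline is measured against the last-alerted level rather than the previous cycle.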