v3.0.0#1132
Closed
erikdarlingdata wants to merge 304 commits into
Closed
Conversation
CREATE OR ALTER VIEW is re-run by the installer on every upgrade, so no migration is needed. With the Dashboard now defaulting CpuAlertMode to Total, the view's "today" no-date path (and MCP) now agree with the alert metric rather than reporting SQL-only counts. Users on SqlOnly mode still see Total-based counts on the no-date view path — the view can't be per-user since it's a database object. The date-parameterized path (DatabaseService.GetDailySummaryAsync with a date) honors per-user mode and is unchanged here.
…shows-combined-1004 Overview lanes: plot Total CPU alongside SQL CPU (#1004)
…lert-mode-port Dashboard: port CpuAlertMode (Total/SqlOnly) from Lite (#1004)
Dashboard EmailAlertService deduplicated alert emails with an in-memory
ConcurrentDictionary cooldown keyed {serverId}:{metricName}. It was
never seeded from history, so restarting Dashboard cleared every
cooldown and an alert email sent minutes before the restart could be
sent again immediately.
The first time each alert key is evaluated, the cooldown is now seeded
from the in-memory _alertLog (loaded from alert_history.json on
startup) via a new private GetLastEmailSentUtc helper — the most
recent AlertTime for that server/metric where NotificationType ==
"email" and SendError is empty, mirroring exactly when _cooldowns is
updated after SendEmailAsync. The in-memory dictionary stays
authoritative once seeded.
Brings Dashboard to parity with the Lite-side persistence introduced
in #981. Unlike Lite, Dashboard records email and webhook deliveries
as separate alert-log rows (no "email+webhook" composite), so the
filter is just NotificationType == "email".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alertservice-cooldown-persistence Persist Dashboard EmailAlertService cooldown across restarts (plan 4 PR (a))
Ports the alert-log piggyback pattern from PR #981 to the per-finding analysis notification cooldown on BOTH Lite and Dashboard. Without this, restarting either app cleared the in-memory cooldown dict and a high-severity finding that had just fired could re-spam immediately on next analysis cycle. Design follows plan 4 PR (b) §7. The cooldown dictionary stays on AnalysisNotificationService; persistence is mediated through a new public helper on EmailAlertService (M-2 redesign — avoids a ctor change that would have forced updates to 4 Lite call sites including 3 in Lite.Tests). Lite/Services/EmailAlertService.cs - Add public GetLastAlertTimeAsync(int serverId, string metricName). Mirrors private GetLastEmailSentUtcAsync, but with no notification_type or send_error filter — the analysis cooldown is stamped unconditionally, so the persisted equivalent is the latest row for that metric_name, period. Lite/Services/AnalysisNotificationService.cs - Lazy-seed _cooldowns from config_alert_log on first lookup per key. - Prune entries past 2x cooldown at the top of NotifyAsync so the dict stays bounded (analysis dict grows with distinct StoryPathHash, unlike EmailAlertService's which is bounded by config). - Refresh class-level comment on _cooldowns to match the new behavior (and to match the Dashboard equivalent verbatim). Dashboard/Services/EmailAlertService.cs - Add public GetLastAlertTime(string serverId, string metricName). LINQ over _alertLog under _alertLogLock, no channel/error filter (Dashboard records "email"/"webhook"/"tray" as separate rows — all three count toward the analysis cooldown seed). Dashboard/Services/AnalysisNotificationService.cs - Same lazy-seed + prune pattern. - Lift serverId + metricName resolution above the cooldown check so the seed and the EmailAlertService call use identical keys against alert_history.json (was computed inside the try block before). - Match the Lite _cooldowns comment. Lite.Tests/AnalysisNotificationTests.cs - NotifyAsync_DistinctFindings_EachNotifies used hashes that shared the same first 8 chars ("hash000000000001" / "hash000000000002" -> both "hash0000"), hitting the documented-and-accepted shortHash collision now that persistence keys on metric_name. Switched to distinct shortHashes so the test exercises the non-collision case (test intent preserved: "distinct findings each notify"). Inline comment explains the choice. Build + test: - dotnet build Lite/PerformanceMonitorLite.csproj: 0 warnings, 0 errors - dotnet build Dashboard/Dashboard.csproj: 0 warnings, 0 errors - dotnet test Lite.Tests: 292/292 pass - dotnet test Dashboard.Tests: 18/18 pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tLastEmailSentUtc PR #1007 added Dashboard EmailAlertService.GetLastEmailSentUtc (filters to successful "email"-channel sends for the email cooldown). This branch adds Dashboard EmailAlertService.GetLastAlertTime (no channel/error filter, for the analysis cooldown). Both methods coexist with distinct semantics; conflict was mechanical only. CHANGELOG.md: union both bullets — #1007's Dashboard email-cooldown parity entry first (it merged first), then this branch's analysis- cooldown entry. Dashboard/Services/EmailAlertService.cs: git auto-merged. Both GetLastEmailSentUtc (private, channel-filtered) and GetLastAlertTime (public, unfiltered) end up in the file with distinct xmldoc. Builds clean, all tests pass: - Lite.Tests: 292/292 - Dashboard.Tests: 18/18 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cation-cooldown-persistence Persist AnalysisNotificationService cooldown on both apps (plan 4 PR (b))
Move FactScorer, RelationshipGraph, InferenceEngine, AnalysisModels,
IFactCollector, IPlanFetcher, and BlockingChainReconstructor from
Dashboard/Analysis/ and Lite/Analysis/ into a new
PerformanceMonitor.Analysis project targeting net10.0 (no WPF, so future
non-WPF consumers can pick it up without a multi-target rewrite). Both
apps now reference the shared engine via <ProjectReference>; both
Lite.Tests and Dashboard.Tests reach internal members through
InternalsVisibleTo on the new csproj.
AnalysisContext split out of IFactCollector.cs into its own file.
AmplifierDefinition adopts the Dashboard-side `internal sealed`.
FactScorer.cs normalized to LF (the Lite-side CRLF was the only
Lite/Dashboard line-ending divergence).
BlockingChainReconstructor's Lite-specific header phrasing neutralized
and the Dashboard-side <remarks> sync-checker block dropped.
Consumer rewrite: 25 files updated. Four files that only touched
Group A types had their using directive swapped from
PerformanceMonitor{Lite,Dashboard}.Analysis to PerformanceMonitor.Analysis.
The other 21 retain the per-app using AND add the new shared using
because they still reference Group B/C types (AnalysisService,
BaselineProvider, BaselineBucket, etc.) that stay per-app. The 15
per-app staying files in Lite/Analysis/ and Dashboard/Analysis/
gained `using PerformanceMonitor.Analysis;` so their cross-references
to moved types resolve.
Resolves Plan 3's deferred follow-up:
Lite.Tests/BlockingChainReconstructorTests.cs ported to
Dashboard.Tests/BlockingChainReconstructorTests.cs. Dashboard.Tests
now runs the same 10 reconstruction tests as Lite.Tests.
Retires the user-level blocking-reconstructor-sync-checker agent —
there are no longer two copies to sync. PlanAnalyzer and its
planalyzer-sync-checker stay (Services/, outside this extraction's
scope). DB-bound adapters (*FactCollector, *DrillDownCollector,
*FindingStore, *AnomalyDetector, *BaselineProvider, *PlanFetcher)
and the AnalysisService orchestrator stay per-app — they bind to
DuckDBConnection vs SqlConnection.
Verification:
dotnet build PerformanceMonitor.sln -c Debug -> 0 warnings, 0 errors
dotnet test Lite.Tests -> 292 passed (unchanged)
dotnet test Dashboard.Tests -> 28 passed (was 18, +10 from the port)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-library-extraction Extract analysis engine to PerformanceMonitor.Analysis (plan 5)
Move ShowPlanParser, PlanAnalyzer, BenefitScorer, PlanLayoutEngine,
and PlanModels from Dashboard/{Services,Models} and Lite/{Services,Models}
into a new PerformanceMonitor.PlanAnalysis class library targeting
net10.0. Both apps now consume the shared plan group via
<ProjectReference>; ~5,100 LOC of byte-equivalent duplication
eliminated. Mirrors the pattern PR #1009 established for
PerformanceMonitor.Analysis. The two shared libraries are independent
(zero refs in either direction).
PlanAnalyzer.cs gains explicit `using System;`,
`using System.Collections.Generic;`, `using System.Linq;` at the top
because Lite's project-scoped GlobalUsings.cs isn't inherited by the
new project (which sets `<ImplicitUsings>disable</ImplicitUsings>`).
ScanImpact record adopts Dashboard's `sealed` modifier as the
reconciliation winner.
PlanIconMapper split to break an unanticipated incoming edge: the
plan classified PlanIconMapper as out-of-scope (WPF-bound, uses
BitmapImage), but ShowPlanParser calls PlanIconMapper.GetIconName at
parse time to populate PlanNode.IconName. The pure-data half
(IconMap dictionary + GetIconName lookup) is now `internal static
class IconNameMapper` inside the shared library; the per-app
PlanIconMapper.GetIcon (returns BitmapImage) stays per-app. The
per-app PlanIconMapper.GetIconName forwarder is removed — there
were no other callers and the user prefers no dead-code shims.
14 consumer files (7 Lite + 7 Dashboard) gain
`using PerformanceMonitor.PlanAnalysis;`. The two
ActualPlanExecutor-only consumers (Dashboard/Controls/
QueryPerformanceContent.Plans.cs and Lite/Controls/ServerTab.Plans.cs)
are intentionally not touched — ActualPlanExecutor stays per-app
this release because it calls ReproScriptBuilder (Class B, drifted
between Lite and Dashboard); both will be extracted in a follow-up
PR once ReproScriptBuilder is reconciled and a logging abstraction
is designed.
CHANGELOG updated. planalyzer-sync-checker agent retired (no copies
to sync).
Verified:
- dotnet build PerformanceMonitor.sln -c Debug: clean, 0 warnings.
- dotnet test Lite.Tests: 292 passed (matches origin/dev baseline).
- dotnet test Dashboard.Tests: 28 passed (matches origin/dev baseline).
- code-reviewer agent: no BLOCKING / MAJOR findings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hared-library-extraction Extract plan-analysis to PerformanceMonitor.PlanAnalysis (Class A)
…e v1 PR (a)) PR (a) of a planned two-PR workstream for the triage / action v1 feature. Adds the shared data layer that PR (b)'s rendering surfaces (email, Teams, Slack, in-app Alert Details window) will consume; ships zero user-visible advice content on its own. PerformanceMonitor.Analysis additions: - FactAdvice — operator-grade advice prose keyed by fact-key. 42 first-class blocks covering every fact-key the engine can emit (CPU, memory, blocking, I/O, latch, tempdb, query-level, DB config, jobs/disk, BAD_ACTOR, 9 ANOMALY_*), plus an ANOMALY_WAIT_* composer that wraps the inner wait's prose with an anomaly framing prelude. AdviceBlock record carries Headline / Investigation / Remediation, plus optional RemediationTsql populated by FactRemediation. - FactRemediation — generated sp_query_store_force_plan T-SQL for PLAN_REGRESSION findings only. Up to five EXEC blocks per finding, each with a header comment (database, query_id, plan hashes, regression factor), the USE [db]; + EXEC, and a commented back-out via sp_query_store_unforce_plan. PARAMETER_SENSITIVITY explicitly returns null — forcing the worst sensitive plan locks in the wrong plan for some parameter values; the advice prose steers toward OPTION(RECOMPILE) / OPTIMIZE FOR / plan guides instead. Drill-down extension (§6.2 of the plan): - Both Lite/Analysis/DrillDownCollector.cs (DuckDB) and Dashboard/Analysis/SqlServerDrillDownCollector.cs (SQL Server) — the PLAN_REGRESSION drill-down's plan_dedup CTE now projects MAX(plan_id), carried forward as best_plan_id (long) in the C# reader projection. Query Store retention purges older plans first, so MAX yields the most recent plan_id sharing the hash — least likely to be evicted at force time. The two collectors' plan_agg CTEs are structurally different (DuckDB groups by query_id + plan_id and harvests query_plan_hash via any_value; SQL Server groups by all four) but the MAX(plan_id) projection in plan_dedup is symmetric. Lite BuildContext precursor (§7.0 of the plan): - Lite/Services/AnalysisNotificationService.cs — BuildContext signature changed to (AnalysisFinding finding, double notifyThreshold), mirroring Dashboard. Now prepends a Diagnosis AlertDetailItem at Details[0] carrying Story / Severity / Notify threshold / Confidence / Facts / Database / Window before the drill-down items. DetailText takes the threshold as a parameter instead of reading App.AnalysisNotifySeverity inline. NotifyAsync threads the existing threshold local through to both call sites. This is the one user-visible change in PR (a) — Lite's email and webhook output will now include the Diagnosis card that Dashboard's already shows. Test updates: - Lite.Tests/AnalysisNotificationTests.cs — BuildContext/DetailText callers pass notifyThreshold, expectations account for the new Diagnosis-at-[0] shape, plus a new test confirming the Diagnosis carries the threshold through. Verification: - dotnet build clean on PerformanceMonitor.Analysis, Lite, Dashboard. - Lite.Tests 293/293 passed (including the updated BuildContext tests). - Dashboard.Tests 28/28 passed. - code-reviewer agent run on the diff — no blocking findings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The time-range header at the top of every Dashboard tab and the Query Performance heatmap x-axis used 12-hour `h:mm tt`, while everything else in the app (footers, DataGrid columns, slicer, tooltips, logs) already used 24-hour `HH:mm`. The AM/PM marker was also getting truncated in the column shown in the issue. Lite's heatmap had the same straggler. Normalized all four to `HH:mm`.
Normalize Dashboard/Lite time labels to 24-hour (#1012)
…ernal DMVs)
The previous prose told operators to run external DMV queries (sys.dm_os_schedulers,
sys.dm_io_virtual_file_stats, sys.dm_exec_query_memory_grants, etc.) and to use
external toolbox procedures (sp_BlitzCache, sp_WhoIsActive, sp_BlitzFirst) as
their primary investigation step. That orientation was wrong in two specific ways:
1. Transient state mismatch — the alert was raised from analysis over a
historical window. The DMVs the prose pointed at show live state right now,
which has no necessary relationship to what fired the alert.
2. Product premise inversion — PerformanceMonitor exists to capture this data
into persistent stores, surface it in in-app tabs, and expose it via MCP
tools. Pointing operators back at the live server admits the tool failed.
Every Investigation section is now ordered as:
1. In-app tab / sub-tab that shows the captured data for the analysis window,
by exact tab name (Wait Stats, CPU, Memory → Memory Grants, Resource Metrics
→ TempDB Stats, Blocking → Blocked Process Reports, etc.)
2. MCP tool name (verified against McpAnalysisTools.cs, McpWaitTools.cs,
McpBlockingTools.cs, McpCpuTools.cs, McpMemoryTools.cs, McpIoTools.cs,
McpQueryTools.cs, McpPlanTools.cs, etc. on the dev branch)
3. The drill-down field already attached to this finding, by exact field name
(top_cpu_queries, reconstructed_blocking_chains.apex_sleeping,
parameter_sensitive_queries.worker_ratio, regressed_queries.best_plan_id,
bad_actor_query.plan_analysis, etc.)
4. Persistent external stores (Query Store, msdb job history) only when the
captured data does not cover the question
5. Transient DMVs only with explicit "live confirmation" framing — defaulted
to none in this pass
Two specific corrections baked in per the user's flag:
- THREADPOOL: now treats BLOCKING_CHAIN and CXPACKET co-elevation as PEER
causes, not blocking-first. Names the parallel-thread-consumption pathway
explicitly. Cross-references the engine's THREADPOOL amplifier (FactScorer.
ThreadpoolAmplifiers) so reviewers can see the scoring agrees.
- CXPACKET: removes the "CXCONSUMER is benign on its own" claim, which was
wrong. CXCONSUMER specifically signals producer/consumer skew. The
collector's GroupParallelismWaits (in DuckDbFactCollector.cs /
SqlServerFactCollector.cs) groups it back into CXPACKET as an
implementation choice; the prose now names that grouping explicitly and
points the operator at plan-skew as a real diagnostic angle when the
per-execution numbers do not match what raw DOP would predict.
PARAMETER_SENSITIVITY keeps its existing "**Do not force a plan**" framing
(this was structurally correct in the prior prose).
The AdviceBlock record shape, GetForFactKey lookup logic, BAD_ACTOR_ and
ANOMALY_WAIT_ prefix handling, LCK_RANGE alias assignments, and the
ANOMALY_WAIT_ composer are unchanged. Only the prose strings inside each
new AdviceBlock(...) call have been rewritten. 42 of 42 blocks rewritten.
dotnet build clean (PerformanceMonitor.Analysis and full solution).
Lite.Tests 293/293 pass; Dashboard.Tests 28/28 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ER_SENSITIVITY, PLAN_REGRESSION, anomalies, etc.) The advice-rewrite in PR #1011 names specific MCP tools in the Investigation prose for each fact-key. Several fact-keys had no ByFactKey entry at all, so the next_tools array returned to MCP clients was empty for those findings. Add entries for the gaps and augment a few existing ones with tools the prose was already naming. Lite (Lite/Mcp/McpAnalysisTools.cs): - PAGEIOLATCH_EX: add get_tempdb_trend - WRITELOG: add get_perfmon_trend (Transactions/sec) - SCH_M: add get_running_jobs - PARAMETER_SENSITIVITY: new entry (get_top_queries_by_cpu, analyze_query_plan, get_query_trend, get_memory_grants) - PLAN_REGRESSION: new entry (analyze_query_store_plan, get_query_trend, get_query_store_top) - ANOMALY_SESSION_SPIKE: new entry (get_session_stats, get_active_queries, get_waiting_tasks) - ANOMALY_QUERY_DURATION: new entry (get_query_duration_trend, get_top_queries_by_cpu, analyze_query_plan) - ANOMALY_MEMORY_PRESSURE: new entry (get_memory_stats, get_memory_clerks, get_memory_pressure_events, get_memory_grants) - ANOMALY_BATCH_REQUESTS: new entry (get_perfmon_trend, get_top_queries_by_cpu, get_active_queries) Dashboard (Dashboard/Mcp/McpAnalysisTools.cs): - PAGEIOLATCH_EX: add get_tempdb_trend - WRITELOG: add get_perfmon_trend (Transactions/sec) - SCH_M: new entry (get_blocking, get_running_jobs) — was absent - PARAMETER_SENSITIVITY: new entry (uses get_resource_semaphore where Lite uses get_memory_grants) - PLAN_REGRESSION: new entry - ANOMALY_SESSION_SPIKE / _QUERY_DURATION / _MEMORY_PRESSURE / _BATCH_REQUESTS: new entries (no get_waiting_tasks or get_query_duration_trend in Dashboard — substituted equivalents) Not added (verified-but-not-applicable): - get_resource_semaphore on Lite: tool only exists on Dashboard; Lite uses get_memory_grants and is already mapped there. - get_resource_semaphore on Dashboard RESOURCE_SEMAPHORE: already present, no change needed.
The orientation rewrite correctly qualified 8 Lite-only / Dashboard-only
tool citations as "(Lite)" / "(Dashboard)" but missed 7 other blocks that
still cited Lite-only tools unqualified. Dashboard users following that
advice would be told to call MCP tools that do not exist on their server.
Mechanical fix only — no orientation, scoring, or prose-shape changes.
Blocks touched (Lite tool -> Dashboard substitute):
- THREADPOOL: get_waiting_tasks -> get_cpu_scheduler_pressure
(scheduler-side queue pressure)
- LCK: get_waiting_tasks -> get_blocking (live lock-detail view)
- LCK_RANGE: get_waiting_tasks -> get_blocking (same)
- BLOCKING_EVENTS: get_blocking_trend -> get_blocking_deadlock_stats
(combined blocking+deadlock trend)
- BLOCKING_CHAIN: get_blocked_process_reports -> get_blocked_process_xml
(parsed XML stream)
- ANOMALY_BLOCKING_SPIKE: same two substitutes (trend + parsed events)
- ANOMALY_DEADLOCK_SPIKE: get_deadlock_trend -> get_blocking_deadlock_stats
Cosmetic adjacent fix in CXPACKET: the parallelism-grouping file cite now
names both DuckDbFactCollector.GroupParallelismWaits (Lite) and
SqlServerFactCollector.GroupParallelismWaits (Dashboard) instead of only
the Lite file path.
Verification:
- dotnet build PerformanceMonitor.Analysis -c Debug: clean, 0 warnings.
- Lite.Tests: 293/293 passed.
- Dashboard.Tests: 28/28 passed.
- Sanity grep for unqualified Lite-only tool names in FactAdvice.cs
returns zero hits.
Orientation prose, THREADPOOL first half, CXCONSUMER framing,
PARAMETER_SENSITIVITY "do not force a plan" line, and all drill-down /
FactScorer references are unchanged. No blocks added or removed.
…1-pr-a Add advice + remediation data layer (triage v1 PR (a))
…tions-gaps Extend ToolRecommendations registry for missing fact-keys
…R (b)) Six rendering surfaces (Lite + Dashboard × dialog/email/webhook) now surface FactAdvice prose + FactRemediation T-SQL from PR #1011's data layer. AlertDetailItem gains Body + IsCodeBlock axes. Drill-down detail now persists via context_json (Lite v26 migration) / AlertLogEntry.ContextJson (Dashboard JSON) so the in-app dialog finally renders the same shape email/webhooks already show. Plan: ~/.claude/plans/triage-v1-pr-b-plan.md (two adversarial review rounds). Deviations from the plan (mechanism only, no design changes): - Added a co-located AlertContextSerializer static helper at the bottom of each app's AlertContext.cs (alongside the DTO records) to centralize the AlertContext<->DTO map/unmap. Used by the persistence write, the dialog read, and the round-trip test, so the three call sites cannot drift. Stays within "no new files." - The dialog ItemsControl binds to a projected bindable view-model (DetailItemView/FieldView, co-located in each code-behind) instead of the plan's literal `DetailItems.ItemsSource = ctx.Details`: WPF cannot data-bind to AlertDetailItem.Fields because List<(string,string)> exposes Item1/Item2 as fields, not properties. Same ItemsControl + DataTemplate + DataTrigger structure the plan specifies; only the bound type differs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…1-pr-b Render advice + remediation T-SQL in alert notifications across Lite and Dashboard (triage v1 PR (b))
…onsolidation precursor) New PerformanceMonitor.Notifications net10.0 lib holds IAlertSettings; a Lite AppAlertSettings adapter passes through to App.* statics (live, no cache). EmailAlertService / WebhookAlertService / AnalysisNotificationService now read settings through the interface instead of App.* directly. SendEmailAsync (static, live path) takes IAlertSettings as a parameter. Behaviour-preserving; no schema/XAML/ Dashboard changes. Gate for Plan E (extract the alert services into this lib). Plan: ~/.claude/plans/plan-d-settings-interface.md (1 review round). No deviations from the plan. (Build-ordering note only: Steps 3 and 4 were built together rather than separately, because the rewired ctors and the MainWindow construction sites live in the same Lite project and neither compiles without the other. The file set and per-step edits are exactly as the plan specifies.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-interface Introduce injectable IAlertSettings on Lite (Plan D — alert-service consolidation precursor)
Mirror of Plan D on the Dashboard side: EmailAlertService / WebhookAlertService / AnalysisNotificationService now read settings through IAlertSettings via a new DashboardAlertSettings adapter over IUserPreferencesService (+ credential statics for password/URLs). Clamp moved into the adapter. Test-email path uses a transient form-values adapter to preserve "test before save". Behaviour- preserving; no storage/schema change; Lite untouched. After E1 both apps satisfy IAlertSettings. Gate for E2 (storage interfaces) and E3 (extraction). Plan: ~/.claude/plans/plan-e-alert-service-extraction.md §5. Deviation: EmailAlertService keeps an IUserPreferencesService ctor parameter ALONGSIDE the new IAlertSettings one, solely for the two LogAlertDismissals reads in HideAlerts/HideAllAlerts. Those are a history-management logging toggle, not an alert setting, so the field is NOT on IAlertSettings. §5 said "ctor takes IAlertSettings instead of UserPreferencesService"; a strict literal swap would break those reads. HideAlerts/HideAllAlerts (and this field) move to JsonAlertHistoryStore in E3c (§4.2). The settings reads themselves all move to IAlertSettings exactly as specced. Minimal, behaviour- preserving. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rd-settings Dashboard IAlertSettings adapter (Plan E Stage E1)
…n E Stage E3a) AlertContext/AlertDetailItem/AlertContextDto/AlertContextSerializer + EmailTemplateBuilder move to PerformanceMonitor.Notifications; per-app copies deleted. Edition string + snooze hint parameterized via a shared AlertBranding record (Lite passes the snooze hint, Dashboard passes null -> omits it). Absorbs the PR (b) AlertContext/DTO duplication. Behaviour-preserving -- emails byte-identical per app (render invariant test). Gate for E3c. Plan: ~/.claude/plans/plan-e-alert-service-extraction.md 7.1. Deviations: - Lib csproj already had InternalsVisibleTo for both apps + both test projects (per plan section 1); no IVT change needed. - Most consumers already had `using PerformanceMonitor.Notifications;` from E1; only 4 files needed it added. - AlertsHistoryContent.xaml.cs (named in the plan) is NOT a consumer at dev@3239790 -- it references none of the moved types. - Added an internal `EmailAlertService.Branding` test seam (internal, test-only via IVT; no public surface) so the render-invariant test gates each app's actually-wired branding rather than a re-typed copy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ontext-template Plan E E3a: extract AlertContext family + EmailTemplateBuilder to shared lib
IAlertHistoryStore + IMuteRuleStore (+ ILogger) added to PerformanceMonitor.Notifications; per-app DuckDb*/Json* store impls wrap the existing DuckDB/JSON persistence verbatim. Alert-service bodies now persist + log through interfaces (still per-app). No schema or JSON-shape change — persisted rows/files byte-identical. Dashboard alert-history management API + AlertLogEntry move into JsonAlertHistoryStore. Gate for E3b/E3c. Plan: ~/.claude/plans/plan-e-alert-service-extraction.md §6. Deviations: - MuteRule (+ AlertMuteContext) moved to the lib in E2 (pulled forward from E3b) per user decision: IMuteRuleStore lives in the lib (§6) but §4.2's signature references MuteRule, which otherwise stays per-app until E3b — a blocking contradiction. Moving the model now keeps IMuteRuleStore non-generic exactly as §4.2 specifies. Deleted both per-app MuteRule.cs; added the lib using to ~10 consumer files. (Byte-identical merge per §7.2.) - Dashboard EmailAlertService ctor takes the concrete JsonAlertHistoryStore (which implements IAlertHistoryStore) rather than the bare interface, because it forwards the Dashboard-only management API (GetAlertHistory/Hide*/Save) — not on IAlertHistoryStore — to keep AlertsHistoryContent + McpAlertTools working unchanged until E3c repoints them. Thin forwarders marked transitional. - IUserPreferencesService param removed from Dashboard EmailAlertService: grep confirmed only HideAlerts/HideAllAlerts used it; those moved to the store, which now owns the param for the LogAlertDismissals reads. - Lite AnalysisNotificationService left untouched (not in §6's Lite-Modified list); both WebhookAlertServices untouched. Their logger swaps happen in E3c. - Per-app stores keep their native static loggers verbatim (Lite AppLogger, Dashboard Helpers.Logger) — they are per-app forever, not shared-bound, so persistence-error log text is preserved. Dashboard GetSmtpPassword/ SaveSmtpPassword statics likewise keep Helpers.Logger (relocated in E3c). - NoWarn: added CA2254 (Lite + Dashboard) and CA1848 (Dashboard) for the §4.3-mandated ILogger adoption (interpolated log messages); CA1848 was already suppressed in Lite. CA1725 fixed by renaming the store UpdateAsync param to match the interface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-interfaces Plan E E2: storage + logging interfaces for alert services
…thread Perf: offload FinOps/Recommendations + Dashboard ServerTab data path (P7)
The alert-check paths ran synchronous DuckDB queries directly on the WPF dispatcher: MainWindow.CheckPerformanceAlerts (poison waits, blocked-process reports, deadlocks, long-running queries, tempdb space, anomalous jobs) and ServerTab.RefreshAlertCountsAsync (alert counts). DuckDB.NET is synchronous, so `await _dataService.X()` completes on the calling thread; under load a single DuckDB connection open is ~766ms. These fire on the 60s overview auto-refresh and on every Blocking-tab refresh, causing intermittent ~1s UI freezes that landed on whatever tab the user happened to be on. Wrap the 9 calls in Task.Run(() => _dataService.X(...)) -- the same off-thread pattern used across the other refresh paths; the off-thread sweep (#1110-1120) missed the alert/overview paths. Root-caused with dotnet-trace (dotnet-sampled-thread-time); method-level Stopwatch instrumentation had ruled out chart render (0-3ms), GC (0), and grid binding (<30ms). Measured (Lite vs SQL2025 under HammerDB TPC-C, 75s spanning the overview auto-refresh): dispatcher latency MAX 1174ms -> 9.2ms, p99 2.0ms -> 1.5ms, zero stalls >16ms (was 2 over 250ms including one over 1s). Lite.Tests 434/434. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hart-render Perf: off-thread alert-check DuckDB queries (Lite UI freeze under load)
Dismiss All already fired AlertsDismissed to acknowledge the matching server-tab alert badge(s); Dismiss Selected did not, so the badge stayed lit after dismissing individual rows. Fire AlertsDismissed for each distinct server represented in the dismissed rows -- selected rows can span servers when the filter is "all" -- matching Dismiss All's behavior. Completes the #1092 badge-clearing work (core fixed in #1100). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lected-badge Lite: Dismiss Selected clears the server-tab badge too (#1092)
Mirrors the existing tempdb-space alert in both apps: a settings toggle +
two thresholds, a live check in the alert loop with cooldown, an alert-history
metric ("Volume Free Space"), and a notification/email context naming the
worst (lowest-free) volume.
Fires when any monitored volume's free space drops below a configurable
percentage OR a fixed amount in GB (either dimension set to 0 disables it).
One alert per server, naming the tightest volume and listing the breaching
ones. Not applicable to Azure SQL DB: the collector writes NULL volume stats
there, and the read filters volume_mount_point IS NOT NULL, so no rows -> no
alert.
New settings (both apps, default enabled to match tempdb-space):
AlertLowDiskEnabled / NotifyOnLowDisk = true
AlertLowDiskThresholdPercent / LowDiskThresholdPercent = 10
AlertLowDiskThresholdGb / LowDiskThresholdGb = 5
Data source: existing per-volume free space already collected into
database_size_stats (Lite v_database_size_stats; Dashboard
collect.database_size_stats). New reads aggregate files to distinct volumes,
worst-first. Also added "Volume Free Space" to both mute-rule dialogs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#768) Add optional --data-path / --log-path CLI flags so the installer can place the PerformanceMonitor data (.mdf) and log (.ldf) files in caller-specified server-side directories. When omitted, behavior is unchanged: the database is created using SERVERPROPERTY('InstanceDefaultDataPath'/'InstanceDefaultLogPath'). - 01_install_database.sql declares @data_path_override/@log_path_override and an inert comment token the installer replaces with SET statements. The CREATE DATABASE step uses the override when present, else the instance default. Only applies on first creation (guarded by IF DB_ID(...) IS NULL); ignored if the database already exists. Managed Instance (engine edition 8) is unaffected. - Injection safety is two-layer: PathValidation rejects relative paths, control characters (CR/LF/tab), and "<>|*? plus a length cap; the C# layer doubles single quotes when building the SET literal; and the SQL builds the FILENAME literal via REPLACE(@path, N'''', N''''''). BuildFilePathOverrideSql re-validates so the Core API is self-defending regardless of caller. - New PathValidation class (Installer.Core) with 34 unit tests; help text, usage doc, and README options table updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a user-toggleable "Failed Agent Job" alert to both apps. Existing job alerts cover duration (long-running / anomalous) but nothing on job FAILURE outcome. This alerts when a SQL Agent job run failed within a lookback window. Failure outcomes are not part of the collected running_jobs snapshot, so the check runs a live msdb query at alert-check time: msdb.dbo.sysjobs joined to msdb.dbo.sysjobhistory on step_id = 0 (job outcome) + run_status = 0 (Failed), converting run_date/run_time to a server-local datetime and filtering to the last N minutes. Degrades gracefully (returns empty, never faults the cycle) when the login lacks msdb / SQLAgentReaderRole access; skipped on Azure SQL DB (no SQL Agent) via engineEdition == 5. Settings (defaults: enabled, 60-minute lookback): - Lite: App.AlertFailedJobEnabled / AlertFailedJobLookbackMinutes, persisted as alert_failed_job_* and surfaced in SettingsWindow. - Dashboard: UserPreferences.NotifyOnFailedJobs / FailedJobLookbackMinutes. Dedup: one alert per server summarizing the failed job name(s); a per-server run-time watermark prevents re-firing the same failure every cooldown while still alerting on a strictly newer failure. Alert history metric "Failed Agent Job". Lite routes the live read through the collector's CreateConnectionAsync (MFA serialization / throttle / retry) rather than a raw connection; Dashboard runs it on the existing per-server MARS connection alongside the other alert queries. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ile-locations Installer: optional custom data/log file locations (#768)
…ace-alert Add low volume free space alert to Lite and Dashboard (#754)
Resolves the DatabaseService.NocHealth GetAlertHealthAsync fan-out conflict with #754 (low-disk alert) by keeping both volumeTask and failedJobTask in the Task.WhenAll array. Also addresses the security review of the failed-job query: the catch in GetRecentlyFailedJobsAsync (both apps) was too broad -- it swallowed timeouts and cancellation, so a genuine msdb read failure could masquerade as "no failed jobs". Now expected permission errors (229/297/300/916) log quietly and return empty; unexpected errors log at Warning and return empty (still not faulting the alert cycle); Lite lets OperationCanceledException propagate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…alert Add configurable failed SQL Agent job alert to Lite and Dashboard (#749)
- Bump Version/AssemblyVersion/FileVersion/InformationalVersion to 3.0.0 in Dashboard, Lite, Installer, and Installer.Core. The Installer/Installer.Core AssemblyVersion bump is load-bearing: ScriptProvider.FilterUpgrades only applies an upgrade folder whose ToVersion <= the installer's own assembly version, so without it the 2.11.0-to-3.0.0 folder would be silently skipped. - Rename upgrades/2.11.0-to-2.12.0 -> upgrades/2.11.0-to-3.0.0. - Write the [3.0.0] CHANGELOG section: an Important note plus the gaps the release audit flagged (code-review hardening sweep #1093-#1108, the off-UI-thread perf overhaul #1116/#1121, object/index-level collection #1103, the Recommendations/Apply-Fix engine, #1085, #1092/#1122, #1096, #754, #749, #768), and fix the stale "2.11.0 -> 2.12.0" upgrade-script reference. Validation: Installer unit tests 61/61; Installer integration tests 19/19 against live SQL2022 (idempotency, version detection, adversarial upgrade paths) at the 3.0.0 assembly version. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…FinOps object/index analysis) Lite collector table/count 24->25 with the new index_object_stats row; Dashboard collector count 32->33; FinOps description gains the per-object table/index size/growth/usage/locking analysis (#1103). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…arseable events The second-phase parsers (collect.process_blocked_process_xml -> sp_HumanEventsBlockViewer, collect.process_deadlock_xml -> sp_BlitzLock) only marked a captured event processed when the parse produced >= 1 row. Events that legitimately parse to zero rows -- a self-block or non-lock wait (e.g. a memory-grant RESOURCE_SEMAPHORE wait that tripped blocked_process_threshold, reported as a session blocking itself), or an unparseable deadlock graph -- were never marked, so every cycle re-ran the CPU-intensive parser over the same dead events and re-logged a perpetual NO_RESULTS; the staging table never drained. Both processors now mark events processed after any clean parse run and log SUCCESS. The XACT_STATE()/CATCH paths still roll back and retry genuine failures. The blocked-process processor's half-open parse window (event_time < @end_date) also dropped single-timestamp batches (MIN = MAX, the common case when a monitor loop emits all reports at one instant); the upper bound is padded +1s to match the deadlock processor, which already did this. Validated: sql2017's real 11-event self-block backlog drains (SUCCESS, no re-loop); sql2025's 733 real chains + 81k parsed deadlocks confirm the parse path is unchanged. Regression test tools/test_blocked_process_processor.sql (real self-block + two-session samples at a single timestamp) passes all assertions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…" edition
SERVERPROPERTY('Edition') returns the legacy "SQL Azure" string on Azure SQL
Database, which showed verbatim in the FinOps Server Inventory Edition column.
Both the Server Inventory live query (LocalDataService.FinOps.ServerProperties)
and the server_properties collector (RemoteCollectorService.ServerProperties)
now map engine edition 5 to "Azure SQL Database (<service tier>)" using
DATABASEPROPERTYEX(DB_NAME(),'Edition') -- e.g. "Azure SQL Database (General
Purpose)". On-prem editions are unchanged; the recommendation engine only
string-matches "Enterprise", which neither value contains, so logic is unaffected.
Verified live against an Azure SQL DB GP-serverless test instance: the Server
Inventory Edition column now reads "Azure SQL Database (General Purpose)".
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mirror the Lite edition fix (b144697) across the Dashboard and the SQL-side collectors so "SQL Azure" is never shown or stored for an Azure SQL DB anywhere: - Dashboard FinOps Server Inventory live query (DatabaseService.FinOps.Inventory) - install/53_collect_server_properties.sql (stored server_properties.edition) - install/42_scheduled_master_collector.sql (server inventory edition) All map engine edition 5 to "Azure SQL Database (<service tier>)" via DATABASEPROPERTYEX(DB_NAME(),'Edition'); on-prem editions take the unchanged ELSE branch. The licensing-recommendation queries (Lite + Dashboard) are left raw and identical -- they only do a Contains("Enterprise") check and never display the edition for Azure. Verified: Dashboard CASE returns "Azure SQL Database (General Purpose)" against a live Azure SQL DB; both install scripts deploy clean (0 errors) to SQL2019 and the Dashboard builds with 0 warnings/errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Release prep for v3.0.0 (version bump, upgrades rename, changelog)
…#754) The Volume Free Space alert treated free space as a level: while a volume stayed below threshold it re-fired every AlertCooldownMinutes (default 5) in both apps. Each cycle also wrote a fresh Alert-History row, so dismissing the alert appeared broken -- the dismissed row was replaced by an identical, newer one on the next cycle. Add a shared LowDiskAlertGate (PerformanceMonitor.Notifications) that fires only on a fresh breach or one that has worsened >= 1 percentage point of free space below the last-alerted level, and clear the per-server watermark when the volume recovers. Mirrors the failed-job watermark and the #1091 RollingCountAlertGate edge trigger. Wired identically into Dashboard and Lite. Covered by LowDiskAlertGateTests (9 cases). Both apps build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix low-disk alert re-firing every cooldown for standing full volumes (#754)
…749) The per-server tab badge was driven only by blocking/deadlock/CPU/memory, so a server whose only problem was a full volume or a failed Agent job showed no tab indicator -- the alert surfaced only as a one-shot tray toast and an Alert-History row. Both apps now fold the alert engine's active low-disk / failed-job state into the badge: it lights while the condition is active, auto-clears when the disk recovers or the failure ages out, and acknowledges/silences like the other badges. Persistent-indicator complement to the low-disk re-fire fix (#1127). Dashboard: HasLowDiskAlert/HasFailedJobAlert on ServerHealthStatus, injected in the UpdateTabBadge choke point from _activeLowDiskAlert/_activeFailedJobAlert, folded into HasAnyAlertCondition + the acknowledge AlertBaseline. Lite: its badge is fed by the blocking/deadlock tab refresh, a separate path from the alert sweep -- which use different server identities (GUID for tabs vs int DuckDB id for the sweep). The sweep resolves the ServerConnection via the existing deterministic-hash bridge, sets GUID-keyed badge flags, and re-renders via RefreshServerBadgeExtras with the last-known counts; UpdateAlertCounts ORs the two flags. Covered by AlertBadgeConditionTests (5 cases). Both apps build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This release was developed as 2.12.0 then rebumped to the 3.0.0 major; the embedded "2.12.0" strings were never updated. Most are user-facing: the Apply Fix handlers and the remediation confirm dialog told users to "Upgrade this server to 2.12.0 to enable audited Apply Fix" -- a version that will never ship (3.0.0 does create config.remediation_action_log, so the gate is satisfied; only the named version was wrong). Replaced 2.12.0 -> 3.0.0 in the 6 Apply Fix handler messages + RemediationConfirmWindow dialog text, two code comments, the Remove-OrphanedTraceFiles.ps1 doc comment, and the four upgrades/2.11.0-to-3.0.0 script headers (which still read "2.11.0 to 2.12.0"). The four Dashboard.Tests asserts pinning the literal message text were updated in lockstep. Found by the release-auditor pre-release gate. Dashboard builds clean; Dashboard.Tests 487/487 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace stale 2.12.0 version strings with 3.0.0
Correct stale 3.0.0 counts and document new 3.0 features in README: - Collector counts: download table 30->33, edition comparison 32/24->33/25 - MCP tool counts: 51-63->55-66 (highlights), 63/52->66/55 (comparison + MCP intro) - install/ script count 58->59 (folder structure) - Add "What You Get" bullet for the Recommendations / Apply Fix (advise-and-act) engine with informed-consent destructive fixes - Full Overview tab: add "recommendations"; add Lite "Recommendations" tab row - Alert Types: add Volume Free Space (10% or 5 GB; never on Azure SQL DB) and Failed Agent Job (60-min lookback; skipped on Azure SQL DB) rows - MCP tools table: add Object/Index Stats row (get_table_index_sizes, get_index_usage, get_object_locking; registered in both apps); remove erroneous Lite-only markers from the Diagnostic Analysis row (those six tools are registered in the Dashboard too) - CI badge: point at build.yml (ci.yml does not exist) so it renders Install-log cleanup: PerformanceMonitor_Install_*.txt logs are already untracked and covered by .gitignore (line 45); no git changes needed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs: sync README to 3.0.0 facts + note install-log cleanup
Pre-tag code review (3 agents) surfaced six fix-before-release defects in the alert-badge and XML-processor changes made this release. All fixed and verified. Lite server-tab badge parity (#1128): - Write the low-disk/failed-job badge flags ONCE per sweep from locals, not only inside the feature-enabled/online/msdb branches — a disabled feature or an offline server now CLEARS a stale badge instead of leaving it lit. - A false->true badge transition is a genuinely new condition, so it clears any acknowledgement (new AlertStateService.ClearAcknowledgementForNewCondition). The boolean low-disk/failed-job conditions carry no event timestamp, so UpdateAlertCounts could never re-show them after a dismiss — matching the Dashboard's re-show on a new disk/job condition. - Drop per-server badge state on tab close (dict leak + stale flash on reopen). - Cover the ack / re-show logic with AlertBadgeAckTests (4 tests). install/25_process_deadlock_xml.sql: - Log SUCCESS on any clean parse run (0 reconstructable graphs is success, not failure); ends the perpetual NO_RESULTS this proc logged AFTER it had already marked the rows processed. Mirrors process_blocked_process_xml. - Move the +1s window pad to the parser's LOCAL upper bound only; the mark uses the un-padded UTC @end_date so a deadlock inserted in that extra second is left for the next run, not marked unparsed. install/53_collect_server_properties.sql: - Include DATABASEPROPERTYEX Edition/ServiceObjective in the row_hash so an Azure SQL DB tier/SLO change with no vCore/memory delta is detected (SERVERPROPERTY Edition is the constant 'SQL Azure', so it alone never changed the hash). Verified: Lite build OK; Lite.Tests 447, Dashboard.Tests 487 pass; installer applies 55/55 on SQL2017 with 46 collectors running. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix #1128 review defects before v3.0.0 tag
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Release v3.0.0 — cut from
devat the fully-reviewed, all-green tip. Merges 304 commits of 3.0.0 work intomain. No breaking changes; existing installations upgrade in place viaupgrades/2.11.0-to-3.0.0/and the apps auto-update over the top.Highlights (see CHANGELOG.md for the full list)
code-review-*series (Fix shared-lib defects from Fable code review (chunk 1) #1093–Fix SQL schema/job/validation defects from Fable code review (chunk 7) #1108) across SQL schema/collectors/views, installer, Lite/Dashboard services, and shared libraries.ALTER DATABASE SETfixes plus destructive RCSI / clear-plan fixes behind informed two-sided consent.xp_delete_filefix ([BUG] Fix #951 — clean up Monitor_LongQueries_*.trc files in data_retention not working and job failing #972), and more.Pre-tag review fixes folded in (#1131, this branch's tip)
AlertBadgeAckTests.install/25deadlock processor logsSUCCESSon clean parse (ends perpetualNO_RESULTS); window pad corrected.install/53row_hash includes AzureDATABASEPROPERTYEXEdition/ServiceObjective so tier/SLO changes are detected.Validation
🤖 Generated with Claude Code