feat(tamanu): TAM-6782: lifecycle subcommands (start/stop/restart/status) #352
Merged
Annotate API and frontend as Critical (must always have one instance up), everything else expected Up as Background. Drives the upcoming restart subcommand's rolling vs bulk decision; ignored for start, stop, and status.
New tamanu-lifecycle feature gates a new module shared by the upcoming start/stop/restart/status subcommands. First primitive is match_names: a substring-based filter over the expectation set with union semantics across multiple names. Empty names = pass-through; any zero-match name bails with the available list so a typo in a multi-name invocation doesn't silently drop.
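The matching rules described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Expectation` struct here is a hypothetical stand-in with only a name, and the error is a plain `String` rather than whatever error type the real module uses.

```rust
// Hypothetical stand-in for the real expectation type; only the name matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Expectation {
    pub name: String,
}

/// Keep every expectation whose name contains at least one of `names` (union).
/// An empty `names` slice passes everything through. A name that matches
/// nothing is an error that lists the available names.
pub fn match_names(all: &[Expectation], names: &[&str]) -> Result<Vec<Expectation>, String> {
    if names.is_empty() {
        return Ok(all.to_vec());
    }
    // Check each pattern on its own first, so one typo among several names
    // fails loudly instead of being masked by the names that did match.
    for pat in names {
        if !all.iter().any(|e| e.name.contains(*pat)) {
            let available: Vec<&str> = all.iter().map(|e| e.name.as_str()).collect();
            return Err(format!(
                "no service matches {pat:?}; available: {}",
                available.join(", ")
            ));
        }
    }
    Ok(all
        .iter()
        .filter(|e| names.iter().any(|pat| e.name.contains(*pat)))
        .cloned()
        .collect())
}
```

Validating every pattern before filtering is what gives the typo safety: `["api", "typo"]` fails even though `api` alone would have matched.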
Discovery lifts the systemd/pm2 enumeration logic that doctor was keeping private. Instance carries the supervisor identifiers needed to build the eventual systemctl/pm2 commands (unit() / display()). group_by_expectation joins discovered instances onto the expectation they belong to, dropping anything not in the expected set.
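As a rough sketch of the join that group_by_expectation performs, assuming (hypothetically) that each discovered instance carries the name of the service it belongs to and that expectations are keyed by that same name. The real `Instance` type and join key may differ.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified instance: the real type carries supervisor
// identifiers and exposes unit() / display().
#[derive(Debug, Clone, PartialEq)]
pub struct Instance {
    pub service: String, // assumed join key
    pub unit: String,
}

/// Bucket discovered instances under the expectation they belong to.
/// Every expectation gets a (possibly empty) bucket; instances whose
/// service is not in the expected set are dropped.
pub fn group_by_expectation(
    expected: &[&str],
    found: Vec<Instance>,
) -> BTreeMap<String, Vec<Instance>> {
    let mut groups: BTreeMap<String, Vec<Instance>> = expected
        .iter()
        .map(|name| (name.to_string(), Vec::new()))
        .collect();
    for inst in found {
        if let Some(bucket) = groups.get_mut(&inst.service) {
            bucket.push(inst);
        }
    }
    groups
}
```

Seeding the map with an empty bucket per expectation means a service with no running instances still shows up in the result, which is what a missing-count report needs.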
A lighter cousin of tamanu doctor: enumerates services known to the supervisor and renders them against the canonical expectation set with running/missing counts. No HTTP probes, no DB queries. Exits non-zero if any Up expectation is short of its min_count or any Down expectation has a running instance. Takes a variadic NAMES positional matcher (matches via lifecycle::match_names, substring union). --json emits a serialisable wire shape for piping into other tools.
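The exit-status rule can be illustrated with a small sketch. The `Expected` and `Row` types below are assumptions made for the example and are not the PR's wire shape; only the rule itself (Up short of min_count, or Down with something running, means failure) comes from the description above.

```rust
// Hypothetical types for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Expected {
    Up { min_count: usize },
    Down,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub name: &'static str,
    pub expected: Expected,
    pub running: usize,
}

/// A row is healthy when an Up expectation has at least `min_count`
/// running instances, or a Down expectation has none.
pub fn is_healthy(row: &Row) -> bool {
    match row.expected {
        Expected::Up { min_count } => row.running >= min_count,
        Expected::Down => row.running == 0,
    }
}

/// Zero when every row is healthy, non-zero otherwise.
pub fn exit_code(rows: &[Row]) -> i32 {
    if rows.iter().all(is_healthy) { 0 } else { 1 }
}
```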
Idempotent bring-up: enumerates required units against discovery and issues a single systemctl start (or pm2 start) for whatever's missing. Self-elevates via sudo on systemd if not root. Waits for everything it started to become active before returning. Adds Instances::required_systemd_units for computing the canonical unit names a Single/NumericAtLeast/Named expectation requires. For pm2, bails if the deployment has fewer registered processes than the expectation needs; first-time pm2 setup stays in the ops playbook.
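A sketch of the "compute required units, diff against discovery" step, under stated assumptions: the `Kind` enum mirrors the Single/NumericAtLeast/Named names from the text, but the unit-name format (`base.service`, `base@N.service`) is invented for illustration and is not taken from the PR.

```rust
use std::collections::BTreeSet;

// Kinds named after the text; payloads are assumptions.
pub enum Kind {
    Single,
    NumericAtLeast(usize),
    Named(Vec<String>),
}

/// Unit names an expectation requires. The naming scheme is hypothetical.
pub fn required_units(base: &str, kind: &Kind) -> Vec<String> {
    match kind {
        Kind::Single => vec![format!("{base}.service")],
        Kind::NumericAtLeast(n) => (1..=*n).map(|i| format!("{base}@{i}.service")).collect(),
        Kind::Named(names) => names.iter().map(|s| format!("{base}@{s}.service")).collect(),
    }
}

/// Required units that discovery did not report as running. The result
/// would feed one batched start call; an empty result means nothing to do.
pub fn units_to_start(required: &[String], running: &BTreeSet<String>) -> Vec<String> {
    required
        .iter()
        .filter(|u| !running.contains(*u))
        .cloned()
        .collect()
}
```

Because only the missing units are passed on, running the command twice in a row leaves the second run with an empty list, which is where the idempotence comes from.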
Mirror of start: gathers every running instance under the matched expectations and issues a single supervisor stop call. Self-elevates under sudo on systemd. Waits for everything to be inactive before returning. Caddy is not touched; its upstreams just become unreachable which is the operator's intent for a maintenance window. No critical/background ordering — once the operator decides to bring things down, the supervisor's synchronous stop is enough.
Rolling restart that splits running instances by criticality: background services restart in a single bulk supervisor call, then critical services (api, frontend) roll one instance at a time with a per-instance HTTP probe + caddy reload + cooldown between each. The probe URL is derived from podman netavark on systemd (container IP + :3000) or pm2's PORT env var. Flags from #313: --cooldown (default 30s, jiff-parsed), --no-probe-http, --check-url for an end-to-end probe after the roll. Lifts reload_caddy, container_ip_for_unit, and the pm2 port lookup verbatim from #313's reload.rs into the lifecycle module, where they join restart_one, wait_running_one, and the new bulk_restart helper.
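The split by criticality can be pictured as building a plan before touching the supervisor. The sketch below only models the ordering (one bulk step for background services, then one step per critical instance); the `Step` type and the simplified `Instance` are hypothetical, and the probe, caddy reload, and cooldown that accompany each roll are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Criticality {
    Critical,
    Background,
}

// Hypothetical, simplified running instance.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Instance {
    pub unit: String,
    pub criticality: Criticality,
}

// Hypothetical plan step, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    /// Restart all listed units in a single supervisor call.
    BulkRestart(Vec<String>),
    /// Restart one unit, then probe / reload / cool down before the next.
    RollOne(String),
}

/// Background instances first, in one bulk step; then critical instances
/// one at a time, so at least one stays up while another restarts.
pub fn plan_restart(running: &[Instance]) -> Vec<Step> {
    let (critical, background): (Vec<&Instance>, Vec<&Instance>) = running
        .iter()
        .partition(|i| i.criticality == Criticality::Critical);
    let mut steps = Vec::new();
    if !background.is_empty() {
        steps.push(Step::BulkRestart(
            background.iter().map(|i| i.unit.clone()).collect(),
        ));
    }
    steps.extend(critical.iter().map(|i| Step::RollOne(i.unit.clone())));
    steps
}
```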
All four subcommands (status/start/stop/restart) are implemented with the planned shape, plus self-elevation, USAGE.md regen, and the criticality field on Expectation.
🤖
Closes #313.

Four new subcommands on top of the `services::expected()` model from #336 and the post-#341 Context API.

Subcommands

- `tamanu status [NAMES...]`: discovery render against the canonical expectation set. No sudo, no probes, no DB. Exits non-zero if any Up expectation is short of its min_count or any Down expectation has a running instance. `--json` emits a wire shape.
- `tamanu start [NAMES...]`: idempotent bring-up. Computes the canonical units a Single/NumericAtLeast/Named expectation requires, diffs against discovery, and issues a single batched `systemctl start` / `pm2 start` for whatever's missing. Bails if a pm2-side expectation has fewer registered processes than needed (first-time pm2 setup stays with the ops playbook).
- `tamanu stop [NAMES...]`: symmetric with start. Single batched stop call across every running matched instance. Caddy untouched.
- `tamanu restart [NAMES...]`: rolling restart split by criticality:
  - Background services restart in a single bulk supervisor call.
  - Critical services (api, frontend) roll one instance at a time: `wait_running_one` + per-instance HTTP probe + `caddy reload` + `resolvectl flush-caches` + cooldown between each, so caddy picks up the new netavark IP before the next probe lands.
  - `--cooldown` (default 30s, jiff-parsed), `--no-probe-http`, `--check-url URL` for a final end-to-end probe.

NAMES matcher

All four subcommands take a variadic positional `NAMES...`. Each name is a substring against the expectation name; an expectation matches if any name matches it (union). Empty `NAMES` = all expectations. Any zero-match name bails listing both the bad pattern and the available names (typo safety in multi-name invocations).

Implementation

- `Criticality` field on `Expectation` plus tests fixing the matrix per kind/supervisor.
- `lifecycle.rs` module holding: `config_and_expectations`, `match_names`, `discover` / `Instance` / `group_by_expectation`, `ensure_root_or_reexec` (sudo re-exec on systemd hosts), `restart_one`, `wait_running` / `wait_running_one` / `wait_stopped`, `reload_caddy`, `container_ip_for_unit`, `pm2_port_for`. The container-IP and caddy-reload helpers are lifted verbatim from #313's `reload.rs`.
- `tamanu-lifecycle` cargo feature; gated alongside the existing `tamanu-*` feature set.
- `logs.rs` is untouched in this PR. A follow-up reshapes it to use `lifecycle::match_names` and adds caddy-as-pseudo-service.

The plan file (`docs/plans/tamanu-lifecycle.md`) is committed as the base of the stack and will be unplanned at the end.