Wake-on-LAN magic packets can be sent to the configured targets automatically whenever this desktop boots.
- `script/common.sh`: shared ansible and server-id helpers for remote boot scripts
- `script/wake_targets.sh`: send magic packets for one or more named targets
- `script/run_remote_boot.sh`: boot entrypoint that loads config and runs the target script
- `script/create_test_container.sh`: create a temporary GPU container without touching the DB
- `script/delete_test_container.sh`: remove the temporary test container
- `script/check_server_boot_health.sh`: verify mount, GPU, and docker/container readiness; supports a host-only monitor mode
- `script/wait_for_priority_servers.sh`: retry health checks until timeout before waking the rest
- `script/restart_all_remote_containers.sh`: start stopped containers on selected servers with retry, then run per-container SSH/GPU post-checks; supports a limited monitor mode
- `script/run_remote_boot_monitor.sh`: periodically run host health checks and limited container self-heal checks against selected servers
- `script/integration_smoke_test.sh`: manual ansible/docker/GPU smoke test before enabling boot automation
- `script/dry_run_remote_boot.sh`: dry-run wrapper for wake, health, container, and full-flow simulations
- `script/test_slack_notification.sh`: send a real Slack test message using local config
- `script/reset_remote_boot_alert_state.sh`: clear stored alert suppression state so the same failure can notify again
- `script/install_remote_boot_service.sh`: install and enable the systemd boot service
- `script/install_remote_boot_monitor_timer.sh`: install and enable the 15-minute systemd monitor timer
- `config/remote_boot.local.env`: local defaults used at boot time
- Copy the example config and edit only your local file:

  ```shell
  cp config/remote_boot.example.env config/remote_boot.local.env
  ```

- Fill in server-specific values in `config/remote_boot.local.env`.
- Review `config/remote_boot.local.env`.
- Install the boot service:

  ```shell
  ./script/install_remote_boot_service.sh
  ```

- Reboot, or run once manually:

  ```shell
  sudo systemctl start remote-boot.service
  ```

Service registration and management:

```shell
# Install and enable at boot
./script/install_remote_boot_service.sh

# Install and start immediately
./script/install_remote_boot_service.sh --start-now

# Check status
systemctl status remote-boot.service

# Read logs
journalctl -u remote-boot.service -b
tail -f /var/log/remote-boot.log
```

Periodic monitor timer:
```shell
# Install and enable the 15-minute monitor timer
./script/install_remote_boot_monitor_timer.sh

# Install and start the service immediately once, then keep the timer active
./script/install_remote_boot_monitor_timer.sh --start-now

# Check timer/service status
systemctl status remote-boot-monitor.timer
systemctl status remote-boot-monitor.service

# Read logs
journalctl -u remote-boot-monitor.service -n 100
tail -f /var/log/remote-boot-monitor.log
```

The periodic monitor keeps its scope limited:
- it does not send WOL packets
- it does not create the temporary test container
- it does not run remount/reload/restart recovery actions on the host
- it starts containers that are in a stopped state
- it may run `service ssh start` inside containers when SSH is down
- it does not restart containers or restart the Docker daemon
- it checks mount, host GPU, Docker daemon reachability, container SSH, and GPU availability for `decs` containers
Alert suppression:
- the same failure alert is sent only once while its alert-state file exists
- when the matching check later succeeds, the alert state is cleared automatically
- you can also clear alert state manually with `./script/reset_remote_boot_alert_state.sh`
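The suppression pattern described above can be sketched with plain state files. This is a minimal illustration only: the state directory and the `<server>_<stage>` file naming below are hypothetical, while the real layout is controlled by `REMOTE_BOOT_ALERT_STATE_DIR`.

```shell
#!/bin/sh
# Sketch of file-based alert suppression (illustrative, not the real script).
STATE_DIR="$(mktemp -d)"   # stand-in for REMOTE_BOOT_ALERT_STATE_DIR

alert_once() {             # alert only while no state file exists for this failure
    state="$STATE_DIR/${1}_${2}"
    if [ ! -e "$state" ]; then
        echo "ALERT server=$1 stage=$2"   # real code would post to Slack here
        touch "$state"                     # suppress repeats of the same failure
    fi
}

clear_alert() {            # called when the matching check succeeds again
    rm -f "$STATE_DIR/${1}_${2}"
}

alert_once FARM1 mount_check   # first failure: alert is emitted
alert_once FARM1 mount_check   # repeat failure: suppressed
clear_alert FARM1 mount_check  # a later success clears the state
alert_once FARM1 mount_check   # next failure alerts again
```

The same idea scales to one file per server/stage pair, which is why a manual reset script only needs to delete the matching files.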
List available targets:
```shell
./script/wake_targets.sh --list-targets
```

Wake a group manually:

```shell
./script/wake_targets.sh all
```

Boot orchestration with staged wake-up:
- `REMOTE_BOOT_PRIORITY_TARGETS="FARM1 LAB1"` is sent first
- `REMOTE_BOOT_ENABLE_GATE=true` waits for priority servers to pass health checks
- the gate retries for up to `REMOTE_BOOT_GATE_TIMEOUT_SECONDS=360`
- once the gate passes, the remaining selected targets are sent
- if `REMOTE_BOOT_ENABLE_REMAINING_HEALTH_CHECK=true`, the remaining selected targets also run the same host health checks after wake-up
- finally, if `REMOTE_BOOT_ENABLE_CONTAINER_RESTART=true`, all selected servers start stopped containers only, then each container is checked for `ssh` and `nvidia-smi`
- when a recovery path still cannot fix the issue, the system tries to send a Slack webhook alert and falls back to a stub alert log if Slack is disabled or delivery fails
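The staged flow above maps onto a handful of variables. A minimal `config/remote_boot.local.env` fragment might look like this; the values are illustrative examples, using only variable names that appear above:

```shell
# Example staged wake-up configuration (illustrative values only).
REMOTE_BOOT_PRIORITY_TARGETS="FARM1 LAB1"        # woken and verified first
REMOTE_BOOT_ENABLE_GATE=true                     # remaining targets wait for these
REMOTE_BOOT_GATE_TIMEOUT_SECONDS=360             # gate retries for up to 6 minutes
REMOTE_BOOT_ENABLE_REMAINING_HEALTH_CHECK=true   # health-check the rest after wake-up
REMOTE_BOOT_ENABLE_CONTAINER_RESTART=true        # start stopped containers afterwards
```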
Standalone test container commands:
```shell
./script/create_test_container.sh --server-id FARM1
./script/delete_test_container.sh --server-id FARM1
```

Recommended manual integration test:

```shell
./script/integration_smoke_test.sh --scope priority
```

Manual periodic monitor run:

```shell
./script/run_remote_boot_monitor.sh
./script/run_remote_boot_monitor.sh FARM1 LAB1
./script/run_remote_boot_monitor.sh --dry-run
```

In monitor mode, host checks are limited to mount, host GPU, and Docker daemon availability. Container checks start stopped containers, verify SSH for every container, try `service ssh start` when needed, and verify GPU only for `decs` / `dguailab/decs` containers.
Manual alert-state reset:
```shell
./script/reset_remote_boot_alert_state.sh --all
./script/reset_remote_boot_alert_state.sh --server-id FARM1
./script/reset_remote_boot_alert_state.sh --server-id FARM1 --stage container_monitor
./script/reset_remote_boot_alert_state.sh --server-id LAB1 --stage mount_check
```

Dry-run entrypoints:
```shell
# 1. WOL call simulation
./script/dry_run_remote_boot.sh wake FARM1 LAB1

# 2. Host mount/GPU check plus test-container plan
./script/dry_run_remote_boot.sh health FARM1

# 3. Start-stopped-containers flow and per-container SSH/GPU plan
./script/dry_run_remote_boot.sh containers FARM1

# 4. Full orchestration
./script/dry_run_remote_boot.sh --scope priority full
./script/dry_run_remote_boot.sh full
```

Dry-run behavior:

- `wake` and `full` do not send WOL packets, sleep, create containers, restart Docker, or restart containers.
- `health` validates config and inventory, then prints the exact host checks, test-container create/delete commands, and automatic recovery commands that would be used.
- `containers` does not start or restart anything, but it does read the current remote container inventory so it can show which stopped containers would be started and which containers would receive SSH/GPU checks.
- For actual verification after a host is already up, use `./script/check_server_boot_health.sh --server-id FARM1` and `./script/restart_all_remote_containers.sh FARM1`.
`config/remote_boot.local.env` is grouped into these sections:

- Remote boot target groups: `REMOTE_BOOT_FARM_TARGETS`, `REMOTE_BOOT_LAB_TARGETS`, `REMOTE_BOOT_TARGETS`
- Boot order and gate behavior: `REMOTE_BOOT_PRIORITY_TARGETS`, `REMOTE_BOOT_ENABLE_GATE`, `REMOTE_BOOT_GATE_*`, `REMOTE_BOOT_ENABLE_REMAINING_HEALTH_CHECK`, `REMOTE_BOOT_SECONDARY_DELAY_SECONDS`
- Post-boot container start/post-check flow: `REMOTE_BOOT_ENABLE_CONTAINER_RESTART`, `REMOTE_BOOT_CONTAINER_RESTART_*`, `REMOTE_BOOT_CONTAINER_POST_RESTART_CHECK_*`
- Ansible / network: `REMOTE_BOOT_ANSIBLE_INVENTORY`, broadcast IPs
- Wake-on-LAN MAC addresses: `REMOTE_BOOT_MAC_<TARGET>`
- Host health-check requirements: required NFS mounts, `REMOTE_BOOT_HOST_SHARE_MOUNT_TEMPLATE`
- Temporary test container for health checks: `REMOTE_BOOT_TEST_*`
- Logging / alerts: `REMOTE_BOOT_ENABLE_HEALTH_LOGGING`, log paths, alert state paths, rotate count
- Periodic health monitor: `REMOTE_BOOT_MONITOR_TARGETS`, `REMOTE_BOOT_MONITOR_ENABLE_HOST_HEALTH_CHECK`, `REMOTE_BOOT_MONITOR_ENABLE_CONTAINER_CHECK`, `REMOTE_BOOT_MONITOR_ON_CALENDAR`, `REMOTE_BOOT_MONITOR_LOG_*`
Most commonly changed options:
- `REMOTE_BOOT_TARGETS`: default targets to boot
- `REMOTE_BOOT_PRIORITY_TARGETS`: first servers to wake and verify
- `REMOTE_BOOT_ENABLE_GATE`: whether the remaining servers wait for priority health checks
- `REMOTE_BOOT_ENABLE_REMAINING_HEALTH_CHECK`: whether the remaining servers also run host health checks after they wake
- `REMOTE_BOOT_ENABLE_CONTAINER_RESTART`: whether stopped containers are started after boot and all containers are post-checked
- `REMOTE_BOOT_MONITOR_TARGETS`: which already-running servers are checked by the 15-minute timer
- `REMOTE_BOOT_MONITOR_ENABLE_HOST_HEALTH_CHECK`, `REMOTE_BOOT_MONITOR_ENABLE_CONTAINER_CHECK`: whether the periodic timer runs host checks, container checks, or both
- `REMOTE_BOOT_TEST_IMAGE_REPOSITORY`, `REMOTE_BOOT_TEST_IMAGE`, `REMOTE_BOOT_TEST_VERSION`: the temporary health-check container image
- `REMOTE_BOOT_FARM_TARGETS`, `REMOTE_BOOT_LAB_TARGETS`, `REMOTE_BOOT_MAC_<TARGET>`: what exists in each group and how to wake it
- `REMOTE_BOOT_SLACK_ENABLED`, `REMOTE_BOOT_SLACK_WEBHOOK_URL`, `REMOTE_BOOT_SLACK_WEBHOOK_URL_FARM`, `REMOTE_BOOT_SLACK_WEBHOOK_URL_LAB`: whether real Slack alerts are sent and whether alerts route to a generic webhook, a FARM-specific webhook, or a LAB-specific webhook
- `REMOTE_BOOT_ALERT_STATE_DIR`: where alert suppression state files are stored so the same failure is not sent repeatedly until it is cleared or auto-reset by a later success
- In `config/remote_boot.local.env`, set:

  ```shell
  REMOTE_BOOT_SLACK_ENABLED=true
  REMOTE_BOOT_SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..."      # optional fallback
  REMOTE_BOOT_SLACK_WEBHOOK_URL_FARM="https://hooks.slack.com/services/..." # FARM alerts
  REMOTE_BOOT_SLACK_WEBHOOK_URL_LAB="https://hooks.slack.com/services/..."  # LAB alerts
  ```

- Send a test message:

  ```shell
  ./script/test_slack_notification.sh
  ./script/test_slack_notification.sh --server-id FARM1
  ./script/test_slack_notification.sh --server-id LAB1
  ```

You can also override the message text:

```shell
./script/test_slack_notification.sh --server-id FARM1 --message "remote_boot slack test (farm)"
./script/test_slack_notification.sh --server-id LAB1 --message "remote_boot slack test (lab)"
```

If Slack is configured, alerts containing `FARM*` server IDs go to `REMOTE_BOOT_SLACK_WEBHOOK_URL_FARM`, alerts containing `LAB*` server IDs go to `REMOTE_BOOT_SLACK_WEBHOOK_URL_LAB`, mixed alerts are sent to both when both are configured, and `REMOTE_BOOT_SLACK_WEBHOOK_URL` acts as a fallback.
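The routing rule just described can be sketched as simple pattern matches on the server IDs in an alert message. The function below is an illustration of the rule, not the script's actual implementation:

```shell
#!/bin/sh
# Illustrative sketch of the FARM*/LAB* webhook routing rule.
# Prints which configured webhook(s) an alert message would be sent to.
route_alert() {
    msg="$1"
    sent=""
    case "$msg" in *FARM*) echo "-> FARM webhook"; sent=yes ;; esac
    case "$msg" in *LAB*)  echo "-> LAB webhook";  sent=yes ;; esac
    # REMOTE_BOOT_SLACK_WEBHOOK_URL is the fallback when neither group matched.
    if [ -z "$sent" ]; then
        echo "-> fallback webhook"
    fi
}

route_alert "mount_check failed on FARM1"        # FARM webhook only
route_alert "gpu_check failed on FARM2 and LAB1" # mixed alert: both webhooks
```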
Manual health-check logs:
```shell
./script/check_server_boot_health.sh --server-id FARM1
```

This keeps terminal output and also writes a per-run log under `logs/health/` by default. Use `--log-file /path/to/file.log` to override the destination.

Log format:

```
2026-03-11T15:10:00+0900 [HEALTH] context=check_server_boot_health server=FARM1 stage=mount_check required_mount=...
```
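Because each log line is a flat `key=value` record after the tag, it can be filtered with standard tools. A small example, assuming only the field names shown in the sample line above:

```shell
#!/bin/sh
# Pull the server= and stage= fields out of [HEALTH] log lines.
# The input line mirrors the documented log format.
awk '/\[HEALTH\]/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^(server|stage)=/) printf "%s ", $i
    print ""
}' <<'EOF'
2026-03-11T15:10:00+0900 [HEALTH] context=check_server_boot_health server=FARM1 stage=mount_check required_mount=...
EOF
```

This prints `server=FARM1 stage=mount_check`, which makes it easy to group recurring failures by server and stage.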
- `config/remote_boot.local.env` is ignored by `.gitignore`
- commit `config/remote_boot.example.env` and keep real server-specific values, including MAC addresses, only in `config/remote_boot.local.env`
- when a server is added, update `REMOTE_BOOT_FARM_TARGETS` or `REMOTE_BOOT_LAB_TARGETS` plus the matching `REMOTE_BOOT_MAC_<TARGET>` value in `config/remote_boot.local.env`
- `wakeonlan` must be installed on this desktop.
- `wake_targets.sh` reads MAC addresses from `REMOTE_BOOT_MAC_<TARGET>` variables in `config/remote_boot.local.env`.
- `LAB*` targets use `192.168.1.255`, and `FARM*` targets use `192.168.2.255` by default.
- remote scripts can use `REMOTE_BOOT_ANSIBLE_INVENTORY`, or fall back to your existing `ansible.cfg` default inventory.
- host mount checks expect `100.100.100.100:/294t/dcloud/share` for LAB and `100.100.100.120:/volume1/share` for FARM.
- host NFS remount recovery uses `REMOTE_BOOT_HOST_SHARE_MOUNT_TEMPLATE`, which defaults to `/home/tako%s/share`.
- automatic recovery commands use `sudo -n` on the remote hosts; if passwordless sudo is not available there, recovery will not run and the failure will fall through to the alert stub log.
- boot health checks create a temporary GPU test container directly via Docker and remove it without writing to the DB.
- health-check runs can write per-run logs to `REMOTE_BOOT_HEALTH_LOG_DIR` when `REMOTE_BOOT_ENABLE_HEALTH_LOGGING=true`.
- service and orchestration logs use an ISO timestamp plus a tag format like `[BOOT]`, `[GATE]`, `[HEALTH]`, `[WAKE]`, `[CONTAINER]`, and `[SMOKE]`.
- unrecovered failures are written to `REMOTE_BOOT_ALERT_STUB_LOG_FILE` when Slack is disabled or Slack delivery fails.
- test container share mounts can use `REMOTE_BOOT_TEST_SHARE_SOURCE_TEMPLATE="/home/tako%s/share/user-share/"`; `%s` is replaced with the server number, so `FARM1` and `LAB1` both use `/home/tako1/share/user-share/`.
- the test container GPU launch uses `REMOTE_BOOT_TEST_DOCKER_RUNTIME="auto"` by default, so hosts without a registered `nvidia` runtime still run with `--gpus`.
- post-boot container handling starts only containers that are in a stopped state; after that, each container is checked for SSH, but GPU checks run only for containers whose image is `decs` or `dguailab/decs` with any tag. CPU-only containers are logged as skipped.
- post-restart per-container checks use `REMOTE_BOOT_CONTAINER_POST_RESTART_CHECK_TIMEOUT_SECONDS` and `REMOTE_BOOT_CONTAINER_POST_RESTART_CHECK_POLL_SECONDS`.
- if the network is not ready at boot, increase `REMOTE_BOOT_PRE_DELAY_SECONDS`.
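The `%s` templates above expand with ordinary `printf`. A minimal sketch, assuming (as the notes above imply) that the server number is simply the trailing digits of the server ID; the `share_path` helper is hypothetical:

```shell
#!/bin/sh
# Sketch of expanding a REMOTE_BOOT_TEST_SHARE_SOURCE_TEMPLATE-style value.
# Assumption: the server number is the digit suffix of the server ID.
TEMPLATE="/home/tako%s/share/user-share/"

share_path() {
    num=$(printf '%s' "$1" | sed 's/[^0-9]*//')   # FARM1 -> 1, LAB12 -> 12
    printf "${TEMPLATE}\n" "$num"                  # substitute into the %s slot
}

share_path FARM1   # -> /home/tako1/share/user-share/
share_path LAB1    # -> /home/tako1/share/user-share/ (same mount source)
```

This is why `FARM1` and `LAB1` resolve to the same path: only the numeric suffix feeds the template.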