Remote Boot

Wake-on-LAN magic packets can be sent to the configured targets automatically whenever this desktop boots.

Files

script/common.sh: shared ansible and server-id helpers for remote boot scripts
script/wake_targets.sh: send magic packets for one or more named targets
script/run_remote_boot.sh: boot entrypoint that loads config and runs the target script
script/create_test_container.sh: create a temporary GPU container without touching the DB
script/delete_test_container.sh: remove the temporary test container
script/check_server_boot_health.sh: verify mount, GPU, and docker/container readiness; supports a host-only monitor mode
script/wait_for_priority_servers.sh: retry health checks until timeout before waking the rest
script/restart_all_remote_containers.sh: start stopped containers on selected servers with retry, then run per-container SSH/GPU post-checks; supports a limited monitor mode
script/run_remote_boot_monitor.sh: periodically run host health checks and limited container self-heal checks against selected servers
script/integration_smoke_test.sh: manual ansible/docker/GPU smoke test before enabling boot automation
script/dry_run_remote_boot.sh: dry-run wrapper for wake, health, container, and full-flow simulations
script/test_slack_notification.sh: send a real Slack test message using local config
script/reset_remote_boot_alert_state.sh: clear stored alert suppression state so the same failure can notify again
script/install_remote_boot_service.sh: install and enable the systemd boot service
script/install_remote_boot_monitor_timer.sh: install and enable the 15-minute systemd monitor timer
config/remote_boot.local.env: local defaults used at boot time

Quick start

Copy the example config and edit only your local file

cp config/remote_boot.example.env config/remote_boot.local.env

Fill in server-specific values in config/remote_boot.local.env
Review config/remote_boot.local.env
Install the boot service

./script/install_remote_boot_service.sh

Reboot, or run once manually

sudo systemctl start remote-boot.service

Service registration and management:

# Install and enable at boot
./script/install_remote_boot_service.sh

# Install and start immediately
./script/install_remote_boot_service.sh --start-now

# Check status
systemctl status remote-boot.service

# Read logs
journalctl -u remote-boot.service -b
tail -f /var/log/remote-boot.log

Periodic monitor timer:

# Install and enable the 15-minute monitor timer
./script/install_remote_boot_monitor_timer.sh

# Install the timer, run the monitor service once immediately, and keep the timer active
./script/install_remote_boot_monitor_timer.sh --start-now

# Check timer/service status
systemctl status remote-boot-monitor.timer
systemctl status remote-boot-monitor.service

# Read logs
journalctl -u remote-boot-monitor.service -n 100
tail -f /var/log/remote-boot-monitor.log

The periodic monitor keeps its scope limited:

it does not send WOL packets
it does not create the temporary test container
it does not run remount/reload/restart recovery actions on the host
it starts containers that are in a stopped state
it may run service ssh start inside containers when SSH is down
it does not restart containers or restart the Docker daemon
it checks mount, host GPU, docker daemon reachability, container SSH, and GPU availability for decs containers

Alert suppression:

the same failure alert is sent only once while its alert-state file exists
when the matching check later succeeds, the alert state is cleared automatically
you can also clear alert state manually with ./script/reset_remote_boot_alert_state.sh
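
The suppression rules above can be sketched as a small state-file pattern. This is a minimal illustration, not the repository's actual code: ALERT_STATE_DIR, alert_key, notify_once, and clear_alert are hypothetical names, and the state-file naming is an assumption.

```shell
# Hypothetical sketch: one state file per (server, stage) pair.
# While the file exists, repeated failures for that pair are suppressed.
ALERT_STATE_DIR="${ALERT_STATE_DIR:-$(mktemp -d)}"

alert_key() {
  # Build the state-file path for a server ID and a stage name.
  printf '%s/%s__%s.state' "$ALERT_STATE_DIR" "$1" "$2"
}

notify_once() {
  # Print "sent" for the first failure and "suppressed" for repeats.
  local file
  file="$(alert_key "$1" "$2")"
  if [ -e "$file" ]; then
    echo "suppressed"
  else
    date +%s > "$file"   # record when the alert was first raised
    echo "sent"
  fi
}

clear_alert() {
  # Called when the matching check succeeds again, or on manual reset.
  rm -f "$(alert_key "$1" "$2")"
}
```

In this sketch a later success calls clear_alert, so the next failure of the same check notifies again.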

Manual usage

List available targets:

./script/wake_targets.sh --list-targets

Wake a group manually:

./script/wake_targets.sh all

Boot orchestration with staged wake-up:

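targets listed in REMOTE_BOOT_PRIORITY_TARGETS (for example REMOTE_BOOT_PRIORITY_TARGETS="FARM1 LAB1") are woken first
with REMOTE_BOOT_ENABLE_GATE=true, the remaining targets wait until the priority servers pass health checks
the gate retries health checks until REMOTE_BOOT_GATE_TIMEOUT_SECONDS elapses (for example REMOTE_BOOT_GATE_TIMEOUT_SECONDS=360)
once the gate passes, the remaining selected targets are woken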
if REMOTE_BOOT_ENABLE_REMAINING_HEALTH_CHECK=true, the remaining selected targets also run the same host health checks after wake-up
finally, if REMOTE_BOOT_ENABLE_CONTAINER_RESTART=true, all selected servers start stopped containers only, then each container is checked for ssh and nvidia-smi
when a recovery path still cannot fix the issue, the system tries to send a Slack webhook alert and falls back to a stub alert log if Slack is disabled or delivery fails
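
The gate step above amounts to a retry-until-timeout loop. The following is a rough sketch under stated assumptions: wait_for_gate is a hypothetical helper, not the interface of script/wait_for_priority_servers.sh, and the poll interval is assumed to be a whole number of seconds.

```shell
# Hypothetical sketch: retry a check command until it succeeds or the timeout elapses.
# Usage: wait_for_gate CHECK_COMMAND TIMEOUT_SECONDS POLL_SECONDS
wait_for_gate() {
  local check_cmd timeout poll elapsed
  check_cmd="$1"
  timeout="$2"
  poll="$3"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$check_cmd"; then
      return 0               # gate passed: safe to wake the remaining targets
    fi
    sleep "$poll"
    elapsed=$((elapsed + poll))
  done
  return 1                   # gate timed out: caller decides how to alert
}
```

A caller could pass a function that wraps the host health check for each priority server, then wake the remaining targets only when wait_for_gate returns 0.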

Standalone test container commands:

./script/create_test_container.sh --server-id FARM1
./script/delete_test_container.sh --server-id FARM1

Recommended manual integration test:

./script/integration_smoke_test.sh --scope priority

Manual periodic monitor run:

./script/run_remote_boot_monitor.sh
./script/run_remote_boot_monitor.sh FARM1 LAB1
./script/run_remote_boot_monitor.sh --dry-run

In monitor mode, host checks are limited to mount, host GPU, and docker daemon availability. Container checks start stopped containers, verify SSH for every container, try service ssh start when needed, and verify GPU only for decs / dguailab/decs containers.
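
The image rule for GPU checks can be illustrated with a small matcher. This is a sketch only: needs_gpu_check is a hypothetical name, and it assumes image references of the form name or name:tag without a registry host:port prefix or digest.

```shell
# Hypothetical sketch: decide whether a container image should get a GPU check.
# Returns 0 for decs or dguailab/decs with any tag, 1 for everything else.
needs_gpu_check() {
  local image
  image="${1%%:*}"           # strip the tag, e.g. dguailab/decs:1.0 -> dguailab/decs
  case "$image" in
    decs|dguailab/decs) return 0 ;;
    *) return 1 ;;           # treated as CPU-only and skipped
  esac
}
```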

Manual alert-state reset:

./script/reset_remote_boot_alert_state.sh --all
./script/reset_remote_boot_alert_state.sh --server-id FARM1
./script/reset_remote_boot_alert_state.sh --server-id FARM1 --stage container_monitor
./script/reset_remote_boot_alert_state.sh --server-id LAB1 --stage mount_check

Dry-run entrypoints:

# 1. WOL call simulation
./script/dry_run_remote_boot.sh wake FARM1 LAB1

# 2. Host mount/GPU check plus test-container plan
./script/dry_run_remote_boot.sh health FARM1

# 3. Start-stopped-containers flow and per-container SSH/GPU plan
./script/dry_run_remote_boot.sh containers FARM1

# 4. Full orchestration
./script/dry_run_remote_boot.sh --scope priority full
./script/dry_run_remote_boot.sh full

Dry-run behavior:

wake and full do not send WOL packets, sleep, create containers, restart Docker, or restart containers.
health validates config and inventory, then prints the exact host checks, test-container create/delete commands, and automatic recovery commands that would be used.
containers does not start or restart anything, but it does read the current remote container inventory so it can show which stopped containers would be started and which containers would receive SSH/GPU checks.
For actual verification after a host is already up, use ./script/check_server_boot_health.sh --server-id FARM1 and ./script/restart_all_remote_containers.sh FARM1.
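
A common way to get this kind of dry-run behavior in shell is to route every side-effecting command through one wrapper. The sketch below is illustrative and not taken from script/dry_run_remote_boot.sh; DRY_RUN and run_or_print are hypothetical names.

```shell
# Hypothetical sketch: print commands instead of running them when DRY_RUN=true.
DRY_RUN="${DRY_RUN:-false}"

run_or_print() {
  if [ "$DRY_RUN" = "true" ]; then
    printf '[dry-run] %s\n' "$*"   # show what would have been executed
  else
    "$@"                           # execute the command for real
  fi
}
```

With DRY_RUN=true, a call such as run_or_print docker start some_container only prints the command line; read-only inventory queries can bypass the wrapper so the plan reflects the current state.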

Config guide

config/remote_boot.local.env is grouped into these sections:

Remote boot target groups: REMOTE_BOOT_FARM_TARGETS, REMOTE_BOOT_LAB_TARGETS, REMOTE_BOOT_TARGETS
Boot order and gate behavior: REMOTE_BOOT_PRIORITY_TARGETS, REMOTE_BOOT_ENABLE_GATE, REMOTE_BOOT_GATE_*, REMOTE_BOOT_ENABLE_REMAINING_HEALTH_CHECK, REMOTE_BOOT_SECONDARY_DELAY_SECONDS
Post-boot container start/post-check flow: REMOTE_BOOT_ENABLE_CONTAINER_RESTART, REMOTE_BOOT_CONTAINER_RESTART_*, REMOTE_BOOT_CONTAINER_POST_RESTART_CHECK_*
Ansible / network: REMOTE_BOOT_ANSIBLE_INVENTORY, broadcast IPs
Wake-on-LAN MAC addresses: REMOTE_BOOT_MAC_<TARGET>
Host health-check requirements: required NFS mounts, REMOTE_BOOT_HOST_SHARE_MOUNT_TEMPLATE
Temporary test container for health checks: REMOTE_BOOT_TEST_*
Logging / alerts: REMOTE_BOOT_ENABLE_HEALTH_LOGGING, log paths, alert state paths, rotate count
Periodic health monitor: REMOTE_BOOT_MONITOR_TARGETS, REMOTE_BOOT_MONITOR_ENABLE_HOST_HEALTH_CHECK, REMOTE_BOOT_MONITOR_ENABLE_CONTAINER_CHECK, REMOTE_BOOT_MONITOR_ON_CALENDAR, REMOTE_BOOT_MONITOR_LOG_*

Most commonly changed options:

REMOTE_BOOT_TARGETS: default targets to boot
REMOTE_BOOT_PRIORITY_TARGETS: first servers to wake and verify
REMOTE_BOOT_ENABLE_GATE: whether the remaining servers wait for priority health checks
REMOTE_BOOT_ENABLE_REMAINING_HEALTH_CHECK: whether the remaining servers also run host health checks after they wake
REMOTE_BOOT_ENABLE_CONTAINER_RESTART: whether stopped containers are started after boot and all containers are post-checked
REMOTE_BOOT_MONITOR_TARGETS: which already running servers are checked by the 15-minute timer
REMOTE_BOOT_MONITOR_ENABLE_HOST_HEALTH_CHECK, REMOTE_BOOT_MONITOR_ENABLE_CONTAINER_CHECK: whether the periodic timer runs host checks, container checks, or both
REMOTE_BOOT_TEST_IMAGE_REPOSITORY, REMOTE_BOOT_TEST_IMAGE, REMOTE_BOOT_TEST_VERSION: the temporary health-check container image
REMOTE_BOOT_FARM_TARGETS, REMOTE_BOOT_LAB_TARGETS, REMOTE_BOOT_MAC_<TARGET>: what exists in each group and how to wake it
REMOTE_BOOT_SLACK_ENABLED, REMOTE_BOOT_SLACK_WEBHOOK_URL, REMOTE_BOOT_SLACK_WEBHOOK_URL_FARM, REMOTE_BOOT_SLACK_WEBHOOK_URL_LAB: whether real Slack alerts are sent and whether alerts route to a generic webhook, a FARM-specific webhook, or a LAB-specific webhook
REMOTE_BOOT_ALERT_STATE_DIR: where alert suppression state files are stored so the same failure is not sent repeatedly until it is cleared or auto-reset by a later success
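
For orientation, a fragment of config/remote_boot.local.env might look like the following. The variable names come from this guide, but every value is a placeholder or an assumption (in particular the space-separated list format for the group variables and the MAC addresses); config/remote_boot.example.env remains the authoritative template.

```shell
# Illustrative fragment only; values are placeholders, not real defaults.
REMOTE_BOOT_FARM_TARGETS="FARM1"                # assumed: space-separated target names
REMOTE_BOOT_LAB_TARGETS="LAB1"
REMOTE_BOOT_PRIORITY_TARGETS="FARM1 LAB1"       # woken and verified first
REMOTE_BOOT_ENABLE_GATE=true                    # remaining targets wait for priority health checks
REMOTE_BOOT_GATE_TIMEOUT_SECONDS=360
REMOTE_BOOT_ENABLE_REMAINING_HEALTH_CHECK=true
REMOTE_BOOT_ENABLE_CONTAINER_RESTART=true
REMOTE_BOOT_MAC_FARM1="00:00:00:00:00:01"       # placeholder MAC; replace with the real address
REMOTE_BOOT_MAC_LAB1="00:00:00:00:00:02"        # placeholder MAC; replace with the real address
REMOTE_BOOT_SLACK_ENABLED=false                 # unrecovered failures then go to the stub alert log
```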

Slack test

In config/remote_boot.local.env, set:

REMOTE_BOOT_SLACK_ENABLED=true
REMOTE_BOOT_SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..."        # optional fallback
REMOTE_BOOT_SLACK_WEBHOOK_URL_FARM="https://hooks.slack.com/services/..."   # FARM alerts
REMOTE_BOOT_SLACK_WEBHOOK_URL_LAB="https://hooks.slack.com/services/..."    # LAB alerts

Send a test message:

./script/test_slack_notification.sh
./script/test_slack_notification.sh --server-id FARM1
./script/test_slack_notification.sh --server-id LAB1

You can also override the message text:

./script/test_slack_notification.sh --server-id FARM1 --message "remote_boot slack test (farm)"
./script/test_slack_notification.sh --server-id LAB1 --message "remote_boot slack test (lab)"

If Slack is configured, alerts are routed by server ID: alerts containing FARM* server IDs go to REMOTE_BOOT_SLACK_WEBHOOK_URL_FARM, and alerts containing LAB* server IDs go to REMOTE_BOOT_SLACK_WEBHOOK_URL_LAB. Mixed alerts are sent to both webhooks when both are configured. REMOTE_BOOT_SLACK_WEBHOOK_URL acts as a fallback.
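
The routing rule can be sketched as a function that picks webhook variables from a list of server IDs. This is an illustration, not the repository's implementation: select_webhook_vars is a hypothetical name, and the sketch assumes the generic webhook is used only when no group-specific webhook applies.

```shell
# Hypothetical sketch: print the names of the webhook variables an alert would use.
# Usage: select_webhook_vars SERVER_ID...
select_webhook_vars() {
  local want_farm want_lab id out
  want_farm=0
  want_lab=0
  out=""
  for id in "$@"; do
    case "$id" in
      FARM*) want_farm=1 ;;
      LAB*)  want_lab=1 ;;
    esac
  done
  if [ "$want_farm" -eq 1 ] && [ -n "${REMOTE_BOOT_SLACK_WEBHOOK_URL_FARM:-}" ]; then
    out="$out REMOTE_BOOT_SLACK_WEBHOOK_URL_FARM"
  fi
  if [ "$want_lab" -eq 1 ] && [ -n "${REMOTE_BOOT_SLACK_WEBHOOK_URL_LAB:-}" ]; then
    out="$out REMOTE_BOOT_SLACK_WEBHOOK_URL_LAB"
  fi
  # Assumed fallback rule: use the generic webhook only if nothing else matched.
  if [ -z "$out" ] && [ -n "${REMOTE_BOOT_SLACK_WEBHOOK_URL:-}" ]; then
    out=" REMOTE_BOOT_SLACK_WEBHOOK_URL"
  fi
  echo "${out# }"            # drop the leading separator space
}
```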

Manual health-check logs:

./script/check_server_boot_health.sh --server-id FARM1

This prints to the terminal as usual and also writes a per-run log under logs/health/ by default. Use --log-file /path/to/file.log to override the destination.

Log format:

2026-03-11T15:10:00+0900 [HEALTH] context=check_server_boot_health server=FARM1 stage=mount_check required_mount=...
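
A line in this shape can be produced with a short helper. The sketch below is an assumption about how such a logger could look, not the helper in script/common.sh; log_line is a hypothetical name.

```shell
# Hypothetical sketch: emit "<ISO timestamp> [TAG] key=value ..." lines.
# Usage: log_line TAG key=value...
log_line() {
  local tag
  tag="$1"
  shift
  # %z yields a numeric offset such as +0900, matching the example line.
  printf '%s [%s] %s\n' "$(date +%Y-%m-%dT%H:%M:%S%z)" "$tag" "$*"
}
```

For example, log_line HEALTH context=check_server_boot_health server=FARM1 stage=mount_check prints one line with the current timestamp, the [HEALTH] tag, and the key=value fields.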

Git

config/remote_boot.local.env is ignored by .gitignore
commit config/remote_boot.example.env and keep real server-specific values, including MAC addresses, only in config/remote_boot.local.env
when a server is added, update REMOTE_BOOT_FARM_TARGETS or REMOTE_BOOT_LAB_TARGETS plus the matching REMOTE_BOOT_MAC_<TARGET> value in config/remote_boot.local.env

Notes

wakeonlan must be installed on this desktop.
wake_targets.sh reads MAC addresses from REMOTE_BOOT_MAC_<TARGET> variables in config/remote_boot.local.env.
LAB* targets use 192.168.1.255, and FARM* targets use 192.168.2.255 by default.
remote scripts can use REMOTE_BOOT_ANSIBLE_INVENTORY, or fall back to your existing ansible.cfg default inventory.
host mount checks expect 100.100.100.100:/294t/dcloud/share for LAB and 100.100.100.120:/volume1/share for FARM.
host NFS remount recovery uses REMOTE_BOOT_HOST_SHARE_MOUNT_TEMPLATE, which defaults to /home/tako%s/share.
automatic recovery commands use sudo -n on the remote hosts; if passwordless sudo is not available there, recovery will not run and the failure will fall through to the alert path (Slack, or the alert stub log when Slack is disabled or delivery fails).
boot health checks create a temporary GPU test container directly via Docker and remove it without writing to the DB.
health-check runs can write per-run logs to REMOTE_BOOT_HEALTH_LOG_DIR when REMOTE_BOOT_ENABLE_HEALTH_LOGGING=true.
service and orchestration logs use an ISO timestamp plus tag format like [BOOT], [GATE], [HEALTH], [WAKE], [CONTAINER], and [SMOKE].
unrecovered failures are written to REMOTE_BOOT_ALERT_STUB_LOG_FILE when Slack is disabled or Slack delivery fails.
test container share mounts can use REMOTE_BOOT_TEST_SHARE_SOURCE_TEMPLATE="/home/tako%s/share/user-share/"; %s is replaced with the server number, so FARM1 and LAB1 both use /home/tako1/share/user-share/.
test container GPU launch uses REMOTE_BOOT_TEST_DOCKER_RUNTIME="auto" by default, so hosts without a registered nvidia runtime can still launch the container with --gpus.
post-boot container handling starts only containers that are in a stopped state; after that, each container is checked for SSH, but GPU checks run only for containers whose image is decs or dguailab/decs with any tag. CPU-only containers are logged as skipped.
post-restart per-container checks use REMOTE_BOOT_CONTAINER_POST_RESTART_CHECK_TIMEOUT_SECONDS and REMOTE_BOOT_CONTAINER_POST_RESTART_CHECK_POLL_SECONDS.
If the network is not ready at boot, increase REMOTE_BOOT_PRE_DELAY_SECONDS.
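
Two of the notes above, the default broadcast address per group and the %s template substitution, can be sketched in a few lines of shell. The function names are hypothetical, and the sketch assumes the server number is simply the digits in the server ID.

```shell
# Hypothetical sketch: derive per-server values from a server ID such as FARM1 or LAB1.

broadcast_for() {
  # Default broadcast address per group, as listed in the notes.
  case "$1" in
    LAB*)  echo "192.168.1.255" ;;
    FARM*) echo "192.168.2.255" ;;
    *)     return 1 ;;       # unknown group
  esac
}

server_number() {
  # Keep only the digits: FARM1 -> 1, LAB12 -> 12.
  printf '%s' "$1" | tr -cd '0-9'
}

render_share_path() {
  # Usage: render_share_path TEMPLATE SERVER_ID
  # The template is used as the printf format, so %s becomes the server number.
  # shellcheck disable=SC2059
  printf "$1" "$(server_number "$2")"
}
```

With the template /home/tako%s/share/user-share/, both FARM1 and LAB1 render to /home/tako1/share/user-share/, which matches the behavior described for REMOTE_BOOT_TEST_SHARE_SOURCE_TEMPLATE.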
