feat(cli): daemon mode, stop, status, per-run logs (DIM-681, DIM-682, DIM-684, DIM-685)#1436
#1436 · spomichter wants to merge 16 commits into dev
Conversation
Greptile Summary

This PR introduces the full daemon infrastructure for the DimOS CLI: double-fork daemonization, stop/status commands, a run registry, and per-run logs. Key findings from the review:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant User
participant CLI as dimos CLI
participant Reg as RunRegistry
participant Coord as ModuleCoordinator
participant Daemon as Daemon (grandchild)
User->>CLI: dimos run --daemon <blueprint>
CLI->>Reg: cleanup_stale()
CLI->>Reg: check_port_conflicts()
CLI->>Coord: blueprint.build()
CLI->>Coord: health_check(timeout=30s)
Coord-->>CLI: workers alive ✓
CLI->>User: ✓ All modules started / Health check passed
CLI->>CLI: daemonize() — double fork
Note over CLI: Parent 1 exits (os._exit)<br/>Parent 2 exits (os._exit)
CLI->>Daemon: grandchild continues
Daemon->>Daemon: stdin→/dev/null, stdout/stderr→daemon.log
Daemon->>Reg: RunEntry.save() (pid=getpid())
Daemon->>Daemon: install_signal_handlers()
Daemon->>Coord: coordinator.loop()
User->>CLI: dimos status
CLI->>Reg: list_runs(alive_only=True)
Reg-->>CLI: [RunEntry, ...]
CLI->>User: Run ID / PID / Uptime / Log
User->>CLI: dimos stop
CLI->>Reg: get_most_recent()
CLI->>Daemon: SIGTERM
alt Process dies within 5s
Daemon->>Reg: entry.remove() (via signal handler)
else Still alive after 5s
CLI->>Daemon: SIGKILL
CLI->>Reg: entry.remove()
end
CLI->>User: Stopped.
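The daemonize() step in the diagram above can be sketched as a classic double fork. This is a minimal illustration, not the PR's actual implementation; the `log_path` parameter is an assumption:

```python
import os
import sys

def daemonize(log_path: str) -> None:
    """Minimal double-fork sketch: detach from the terminal and redirect stdio."""
    if os.fork() > 0:
        os._exit(0)              # parent 1 exits
    os.setsid()                  # new session, no controlling terminal
    if os.fork() > 0:
        os._exit(0)              # parent 2 exits; the grandchild is the daemon
    # stdin -> /dev/null, stdout/stderr -> the daemon log
    devnull = os.open(os.devnull, os.O_RDONLY)
    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.dup2(devnull, sys.stdin.fileno())
    os.dup2(log_fd, sys.stdout.fileno())
    os.dup2(log_fd, sys.stderr.fileno())
    os.close(devnull)            # close originals after dup2 to avoid fd leaks
    os.close(log_fd)
```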
Last reviewed commit: 7446f75
dimos/robot/cli/dimos.py
Outdated
typer.echo("✓ Health check passed")
typer.echo("✓ DimOS running in background\n")
typer.echo(f" Run ID: {run_id}")
typer.echo(f" PID: {os.getpid()}")
Daemon PID displayed before fork is incorrect
os.getpid() is called before daemonize() on line 179, so it prints the original parent process PID. After daemonize(), two fork() calls occur and both intermediate parents call os._exit(0). The grandchild (actual daemon) has a completely different PID. Users who try to kill <shown_pid> or use this PID for monitoring will be targeting a process that has already exited.
The registry entry on line 183 correctly records os.getpid() after daemonize(), so the fix is to echo the PID after daemonizing (or print a message telling users to use dimos status instead):
typer.echo(f"✓ All modules started ({n_modules} modules, {n_workers} workers)")
typer.echo("✓ Health check passed")
daemonize(log_dir)
entry = RunEntry(
run_id=run_id,
pid=os.getpid(),
...
)
entry.save()
# Now os.getpid() is the actual daemon PID — but stdout is already redirected,
# so communicate the PID via the registry and have the parent print it.

Since both stdout and stderr are redirected inside daemonize(), the cleanest fix is to print all the info (including PID) after the double-fork, which requires the grandchild to communicate back to the parent (e.g. via a pipe), or to print the run-id before the fork and let the user run dimos status to get the PID.
dimos/robot/cli/dimos.py
Outdated
typer.echo("✓ DimOS running in background\n")
typer.echo(f" Run ID: {run_id}")
typer.echo(f" PID: {os.getpid()}")
typer.echo(f" Log: {log_dir}/dimos.jsonl")
Log path shown to user does not exist
The path {log_dir}/dimos.jsonl is displayed as the log location, but this file is never created. The actual structured log file is written by logging_config.py at _get_log_directory() / f"dimos_{timestamp}_{pid}.jsonl" — a completely different directory and filename pattern.
For an installed package this resolves to ~/.local/state/dimos/logs/dimos_<ts>_<pid>.jsonl, not to the per-run log_dir. The daemonize() function only creates log_dir/stderr.log for redirected file descriptors. Pointing users to a non-existent path will make dimos log (once implemented) confusing or broken.
- close devnull/stderr_file after dup2 (fd leak)
- remove PID from pre-fork output (printed parent PID, not daemon PID)
- show log_dir not log_dir/dimos.jsonl (file doesn't exist yet)
- keep tests in tests/ (dimos/conftest.py breaks isolated tests)
dimos status — shows running instances with run-id, pid, blueprint, uptime, log dir
dimos stop — sends SIGTERM (or SIGKILL with --force), waits 5s, escalates if needed
- --pid to target specific instance, --all to stop everything
- cleans registry on stop
also adds get_most_recent() to run_registry. 8 new tests covering sigterm, sigkill, escalation, dead process cleanup, most-recent lookup.
lightweight 2-module blueprint (PingModule + PongModule) that needs no hardware, no LFS data, and no replay files. tests real forkserver workers and module deployment. covers:
- single worker lifecycle (build -> health check -> registry -> stop)
- multiple workers (2 workers, both alive)
- health check detects dead worker (SIGKILL -> detect failure)
- registry entry JSON field roundtrip
- stale entry cleanup (dead PIDs removed, live entries kept)
both stdout and stderr redirect to the same file after daemonize(), so stderr.log was misleading. daemon.log better describes the contents.
folds DIM-685 into daemon PR to avoid merge conflicts on dimos.py.
- set_run_log_dir() before blueprint.build() routes structlog to <log_base>/<run-id>/main.jsonl
- workers inherit DIMOS_RUN_LOG_DIR env var via forkserver
- FileHandler replaces RotatingFileHandler (multi-process rotation unsafe)
- fallback: env var check -> legacy per-pid files
- 6 unit tests for routing logic
setup_logger() runs at import time throughout the codebase, creating FileHandlers pointing to the legacy log path. set_run_log_dir() was resetting the global path but not updating these existing handlers, so main.jsonl was created but stayed empty (0 bytes). fix: iterate all stdlib loggers and redirect their FileHandlers to the new per-run path. verified: main.jsonl now receives structured JSON logs (1050 bytes, 5 lines in test run).
testpaths in pyproject.toml is ['dimos'], so tests/ at repo root wouldn't be picked up by CI. moved all 3 test files to dimos/core/ alongside existing core tests. all 41 tests pass with conftest active.
matches convention from test_worker.py — forkserver-based tests are marked slow so they run in CI but are skipped in local default pytest. local default: 36 tests (unit only) CI (-m 'not (tool or mujoco)'): 41 tests (unit + e2e)
@greptile
…682, DIM-684)

16 tests using typer CliRunner with real subprocess PIDs:
status (7 tests):
- no instances, running instance details, uptime formatting
- MCP port, multiple instances, dead PID filtering
stop (9 tests):
- default most recent, --pid, --pid not found
- --all, --all empty, --force SIGKILL
- already-dead cleanup, SIGTERM verification
dimos/core/test_e2e_daemon.py
Outdated
@pytest.fixture(autouse=True)
def _clean_registry():
    """Remove any leftover registry entries before/after each test."""
    for f in REGISTRY_DIR.glob("*.json"):
        f.unlink(missing_ok=True)
    yield
    for f in REGISTRY_DIR.glob("*.json"):
        f.unlink(missing_ok=True)
_clean_registry deletes entries from the real user registry
REGISTRY_DIR is imported directly from run_registry and resolves to ~/.local/state/dimos/runs. The autouse fixture therefore deletes all *.json files in the real registry both before and after every e2e test — including entries for any dimos instance actually running on the developer's machine at the time.
All unit tests in test_daemon.py correctly avoid this by monkeypatching REGISTRY_DIR to a tmp_path. This fixture should do the same:
@pytest.fixture(autouse=True)
def _clean_registry(tmp_path, monkeypatch):
    """Redirect registry to a temp dir and clean up after each test."""
    import dimos.core.run_registry as _reg
    monkeypatch.setattr(_reg, "REGISTRY_DIR", tmp_path / "runs")
    (tmp_path / "runs").mkdir()
    yield
    # monkeypatch restores the original REGISTRY_DIR automatically

Without this fix, running the e2e suite on a machine with a live dimos daemon will silently evict the daemon's registry entry, causing dimos status and dimos stop to lose track of it.
def test_creates_directory(self, tmp_path):
    from dimos.utils.logging_config import set_run_log_dir

    log_dir = tmp_path / "run-001"
    set_run_log_dir(log_dir)
    assert log_dir.is_dir()

def test_sets_env_var(self, tmp_path):
    from dimos.utils.logging_config import set_run_log_dir

    log_dir = tmp_path / "run-002"
    set_run_log_dir(log_dir)
    assert os.environ["DIMOS_RUN_LOG_DIR"] == str(log_dir)

def test_get_run_log_dir_returns_path(self, tmp_path):
    from dimos.utils.logging_config import get_run_log_dir, set_run_log_dir

    log_dir = tmp_path / "run-003"
    set_run_log_dir(log_dir)
    assert get_run_log_dir() == log_dir
Module-level globals not reset between tests, causing state bleed
set_run_log_dir() modifies two module-level globals in logging_config: _RUN_LOG_DIR and _LOG_FILE_PATH. The _clean_env fixture only resets the DIMOS_RUN_LOG_DIR environment variable, leaving these globals polluted between tests.
TestLogFilePathRouting.test_routes_via_env_var and test_falls_back_to_legacy correctly use monkeypatch.setattr(logging_config, "_RUN_LOG_DIR", None) before relying on env-var or fallback behaviour. But test_creates_directory, test_sets_env_var, and test_get_run_log_dir_returns_path all call set_run_log_dir() without ever resetting those globals, so whichever test runs last will leave a non-None _RUN_LOG_DIR that affects subsequent test modules.
The _clean_env fixture should also reset the module-level state:
@pytest.fixture(autouse=True)
def _clean_env(monkeypatch):
    """Remove DIMOS_RUN_LOG_DIR from env and reset module globals between tests."""
    from dimos.utils import logging_config
    monkeypatch.delenv("DIMOS_RUN_LOG_DIR", raising=False)
    monkeypatch.setattr(logging_config, "_RUN_LOG_DIR", None)
    monkeypatch.setattr(logging_config, "_LOG_FILE_PATH", None)

else:
    typer.echo(" Still alive after 5s, sending SIGKILL...")
    try:
        os.kill(entry.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass

entry.remove()
typer.echo(" Stopped.")
No confirmation wait after SIGKILL escalation
After sending SIGKILL (line 316), the code falls straight through to entry.remove() and prints "Stopped." with no delay. SIGKILL delivery is immediate in the kernel, but the process remains in the process table (potentially as a zombie) until its parent calls wait(). In the brief window between SIGKILL and the kernel fully reaping the process, a subsequent dimos run attempt that calls check_port_conflicts() → is_pid_alive() → os.kill(pid, 0) can still return True for the supposedly-stopped PID, producing a spurious port-conflict error.
A short poll (matching the existing pattern) would close this window:
try:
    os.kill(entry.pid, signal.SIGKILL)
except ProcessLookupError:
    pass
else:
    # Wait briefly for SIGKILL to be processed
    for _ in range(20):  # up to 2 seconds
        if not is_pid_alive(entry.pid):
            break
        time.sleep(0.1)
entry.remove()
typer.echo(" Stopped.")
dimos/utils/logging_config.py
Outdated
for logger_name in list(logging.Logger.manager.loggerDict):
    logger_obj = logging.getLogger(logger_name)
    for handler in logger_obj.handlers:
        if isinstance(handler, logging.FileHandler) and handler.baseFilename != str(new_path):
            handler.close()
            handler.baseFilename = str(new_path)
            handler.stream = handler._open()
Handler migration uses private _open() method
handler._open() is a CPython implementation detail of logging.FileHandler — the leading underscore marks it as internal. While it has been stable across Python versions this works against in practice, relying on private APIs makes the code fragile against future changes.
The public way to reopen a FileHandler to a new path is to close it, set baseFilename, and then let the handler reopen itself on the next emit() call (which calls _open() internally via StreamHandler.emit). Alternatively, construct a fresh FileHandler and swap it into the logger's handler list. Either approach avoids the private-API dependency.
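A sketch of the swap approach, mirroring the loop structure of the code under review (the function name is hypothetical):

```python
import logging

def redirect_file_handlers(new_path: str) -> None:
    """Replace each logger's FileHandlers with fresh ones pointing at new_path,
    avoiding the private FileHandler._open() API."""
    for logger_name in list(logging.Logger.manager.loggerDict):
        logger_obj = logging.getLogger(logger_name)
        for old in list(logger_obj.handlers):
            # baseFilename is stored as an absolute path, so pass new_path absolute
            if isinstance(old, logging.FileHandler) and old.baseFilename != new_path:
                fresh = logging.FileHandler(new_path)
                fresh.setLevel(old.level)
                fresh.setFormatter(old.formatter)
                logger_obj.removeHandler(old)
                logger_obj.addHandler(fresh)
                old.close()
```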
…684)

3 new e2e tests that exercise dimos status and stop against a live PingPong blueprint with real forkserver workers:
- status shows live blueprint details (run_id, PID, blueprint name)
- registry entry findable after real build, workers alive
- coord.stop() kills workers, registry cleaned
from datetime import datetime, timezone
import os
System imports should be at the top.
dimos/robot/cli/dimos.py
Outdated
from dimos.utils.logging_config import setup_exception_handler

setup_exception_handler()
logger.info("Starting DimOS")
This line was placed before everything else because imports take time and we want the CLI to look responsive.
dimos/robot/cli/dimos.py
Outdated
typer.echo("✓ DimOS running in background\n")
typer.echo(f" Run ID: {run_id}")
typer.echo(f" Log: {log_dir}")
typer.echo(" MCP: http://localhost:9990/mcp\n")
MCP isn't necessarily started and could be at a different host/port.
dimos/robot/cli/dimos.py
Outdated
n_workers = len(coordinator._client.workers) if coordinator._client else 0
n_modules = len(coordinator._deployed_modules)
You shouldn't access internal variables. There's this on WorkerManager; you probably need to add it to the coordinator:
@property
def workers(self) -> list[Worker]:
return list(self._workers)
dimos/robot/cli/dimos.py
Outdated
typer.echo("No running DimOS instances")
return

for entry in entries:
We don't support running multiple instances.
dimos/core/run_registry.py
Outdated
    return removed


def check_port_conflicts(mcp_port: int = 9990, grpc_port: int = 9877) -> RunEntry | None:
We have many places where we hardcode 9990 as the MCP port. It should probably be in a single place: probably GlobalConfig. (We probably also need a host which can be localhost as the default.)
I think all of these default values should be removed, and only have one place where we provide them.
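A minimal sketch of such a single source of defaults. The review mentions GlobalConfig; the field names and the frozen-dataclass shape here are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlobalConfig:
    """Single place for network defaults instead of scattered literals."""
    mcp_host: str = "localhost"
    mcp_port: int = 9990
    grpc_port: int = 9877

    @property
    def mcp_url(self) -> str:
        return f"http://{self.mcp_host}:{self.mcp_port}/mcp"
```

Call sites would then take a GlobalConfig (or read a shared instance) rather than repeating 9990 as a default argument.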
dimos/core/test_cli_stop_status.py
Outdated
_entry("mcp-port-test", proc.pid, mcp_port=9990)

result = CliRunner().invoke(main, ["status"])
assert "9990" in result.output
Since 9990 is the default, it's more useful to test a non-default port like 24623.
dimos/core/test_cli_stop_status.py
Outdated
from typer.testing import CliRunner

from dimos.robot.cli.dimos import main
These imports should be at the top because they're repeated in every test.
dimos/core/daemon.py
Outdated
if _active_coordinator is not None:
    try:
        _active_coordinator.stop()
    except Exception:
        logger.error("Error during coordinator stop", exc_info=True)
if _active_entry is not None:
    _active_entry.remove()
There's no need to check for None and use global variables if you inline _shutdown_handler into install_signal_handlers
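Inlined as a closure, the handler might look like this (a sketch; the coordinator/entry parameter shapes are assumptions):

```python
import signal

def install_signal_handlers(coordinator, entry) -> None:
    """Register a SIGTERM/SIGINT handler that closes over its dependencies,
    so no module-level globals or None checks are needed."""
    def _shutdown(signum, frame):
        try:
            coordinator.stop()
        except Exception:
            pass  # the real implementation would log here
        entry.remove()
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, _shutdown)
    signal.signal(signal.SIGINT, _shutdown)
```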
REGISTRY_DIR = Path.home() / ".local" / "state" / "dimos" / "runs"
LOG_BASE_DIR = Path.home() / ".local" / "state" / "dimos" / "logs"
We should rely on the XDG_DATA_HOME standard. This is used in _get_user_data_dir in dimos/utils/data.py. You should probably extract that function and use it:
REGISTRY_DIR = get_user_data_dir() / "runs"
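A sketch of the extracted helper, following XDG_DATA_HOME with the usual fallback (the real _get_user_data_dir in dimos/utils/data.py may differ in details):

```python
import os
from pathlib import Path

def get_user_data_dir() -> Path:
    """XDG-aware per-user data directory for dimos."""
    xdg = os.environ.get("XDG_DATA_HOME")
    base = Path(xdg) if xdg else Path.home() / ".local" / "share"
    return base / "dimos"

REGISTRY_DIR = get_user_data_dir() / "runs"
LOG_BASE_DIR = get_user_data_dir() / "logs"
```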
    pass


def _entry(run_id: str, pid: int, blueprint: str = "test", **kwargs) -> RunEntry:
This should be a fixture so you can clean it up after it's done.
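One way to turn the helper into a self-cleaning factory fixture (a sketch only; RunEntryStub stands in for the project's RunEntry, and the generator is shown undecorated so the sketch stays self-contained — in the test module it would be registered with @pytest.fixture):

```python
class RunEntryStub:
    """Stand-in for dimos.core.run_registry.RunEntry (illustration only)."""
    def __init__(self, run_id: str, pid: int, blueprint: str = "test"):
        self.run_id, self.pid, self.blueprint = run_id, pid, blueprint
        self.saved = False

    def save(self) -> None:
        self.saved = True

    def remove(self) -> None:
        self.saved = False

def make_entry():
    """Factory-as-fixture: every entry created through the factory is removed
    on teardown, even when the test body fails."""
    created = []

    def _make(run_id: str, pid: int, **kwargs) -> RunEntryStub:
        entry = RunEntryStub(run_id, pid, **kwargs)
        entry.save()
        created.append(entry)
        return entry

    yield _make
    for entry in created:
        entry.remove()
```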
- move system imports to top of run(), logger.info before heavy imports
- remove hardcoded MCP port line from daemon output
- add n_workers/n_modules properties to ModuleCoordinator (public API)
- single-instance model: remove --all/--pid from stop, simplify status
- use _get_user_data_dir() for XDG-compliant registry/log paths
- remove mcp_port from RunEntry (should be in GlobalConfig)
- inline _shutdown_handler as closure in install_signal_handlers
- add SIGKILL wait poll (2s) to avoid zombie race with port conflict check
- replace handler._open() private API with new FileHandler construction
- fix e2e _clean_registry to monkeypatch REGISTRY_DIR
- reset logging_config module globals in test fixture
- move test imports to module level
try:
    p.kill()
    p.wait(timeout=2)
except Exception:
    pass
It's probably better to do p.kill() and p.wait() in separate try-excepts, or at least check whether the process is still running first. At least one of the tests has already killed the process.
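A sketch of that cleanup, with a liveness check and separate error handling for kill and wait (the function name is hypothetical):

```python
import subprocess

def cleanup_process(p: subprocess.Popen) -> None:
    """Kill only if still running; always attempt a reap so no zombie is left."""
    if p.poll() is None:           # skip kill if a test already stopped it
        try:
            p.kill()
        except ProcessLookupError:
            pass                   # exited between poll() and kill()
    try:
        p.wait(timeout=2)          # reap regardless of who killed it
    except subprocess.TimeoutExpired:
        pass
```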
dimos/core/test_per_run_logs.py
Outdated
"""set_run_log_dir() configures per-run logging."""

def test_creates_directory(self, tmp_path):
    from dimos.utils.logging_config import set_run_log_dir
Should be moved to the top because it's repeated in every test.
structlog FileHandler writes to main.jsonl — daemon.log only ever captured signal shutdown messages. no useful content.
dimos/core/daemon.py
Outdated
(meaning the underlying process exited). It also fails if there are
zero workers (nothing to monitor).
"""
client = coordinator._client
This function accesses _client (WorkerManager), which is an indicator that this function actually belongs as a method on WorkerManager.
But I don't think we need the polling. .build() works synchronously, so every module should be started after it finishes. I think we should only check once that all the workers are alive.
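The single-check version suggested here might look like this (a sketch; worker objects are assumed to expose is_alive()):

```python
def check_workers_alive(workers) -> None:
    """One-shot health check: build() is synchronous, so every worker should
    already be running; fail on zero workers or any dead worker."""
    workers = list(workers)
    if not workers:
        raise RuntimeError("No workers to monitor")
    dead = [w for w in workers if not w.is_alive()]
    if dead:
        raise RuntimeError(f"{len(dead)} worker(s) not alive after build()")
```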
…port fixes mypy name-defined error from import reorganization
- simplify health_check to single alive check (build is synchronous)
- remove --health-timeout flag (no longer polling)
- add workers property to ModuleCoordinator (public API)
- separate try-excepts for kill/wait in sleeper fixture cleanup
- move repeated imports to top in test_per_run_logs
"""Remove registry entries for dead processes. Returns count removed."""
REGISTRY_DIR.mkdir(parents=True, exist_ok=True)
removed = 0
for f in REGISTRY_DIR.glob("*.json"):
You need sorted() or list() like in list_runs because glob() is a generator and mutating while iterating leads to issues.
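The fix amounts to snapshotting the generator before unlinking (a sketch with a hypothetical should_remove predicate in place of the real dead-PID check):

```python
from pathlib import Path

def cleanup_stale(registry_dir: Path, should_remove) -> int:
    """Remove matching registry files; sorted() materializes the glob()
    generator, so unlinking while iterating is safe."""
    registry_dir.mkdir(parents=True, exist_ok=True)
    removed = 0
    for f in sorted(registry_dir.glob("*.json")):
        if should_remove(f):
            f.unlink(missing_ok=True)
            removed += 1
    return removed
```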
    _stop_entry(entry, force=force)


def _stop_entry(entry, force: bool = False) -> None:  # type: ignore[no-untyped-def]
I think this belongs in dimos.core.run_registry as well.
Also, please don't add # type: ignore.
# Process should be dead
assert proc.poll() is not None

def test_stop_sigterm_kills_process(self, sleeper):
This takes 13 seconds. It should be marked with @pytest.mark.slow.
time.sleep(6)
assert proc.poll() is not None
Instead of sleeping a fixed amount, you should poll for the condition to be true. If the condition is still false after the timeout, fail the test.
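A small polling helper along those lines (the name and defaults are assumptions):

```python
import time

def wait_until(predicate, timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll predicate until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check at the deadline
```

A test would then write `assert wait_until(lambda: proc.poll() is not None, timeout=6)` instead of `time.sleep(6)` followed by a single check.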
result = CliRunner().invoke(main, ["stop"])
assert result.exit_code == 1

def test_stop_default_most_recent(self, sleeper):
This test takes 7 seconds. It should be marked with @pytest.mark.slow.
assert result.exit_code == 0
assert "Stopping" in result.output
assert "stop-default" in result.output
time.sleep(0.3)
I assume this is because the path isn't cleaned up right away. You should poll here too until the condition is true. These fixed sleeps often lead to flaky tests, especially in CI where machines are slower.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Tests for DIM-681: daemon mode, run registry, and health checks."""
"""Tests for DIM-681: daemon mode, run registry, and health checks."""
conflict = check_port_conflicts(grpc_port=9877)
assert conflict is not None

def test_port_conflict_no_false_positive(self, tmp_registry: Path):
Isn't this the same test as above?
    return tmp_path


def _make_entry(
This should be a fixture so it can clean up afterwards regardless of success status.
import pytest

# Skip sysctl interactive prompt in CI / headless
os.environ["CI"] = "1"
ಠ_ಠ
This makes all tests run as if in CI because it overrides the env at import time.
adds daemon infrastructure for dimos cli:
- dimos run --daemon: double-fork daemonization with health check that polls worker.pid before backgrounding
- dimos stop: SIGTERM with escalation to SIGKILL after 5s. supports --force, --all, --pid
- dimos status: lists running instances with uptime, PID, blueprint, MCP port
- per-run logs: <log_base>/<run-id>/main.jsonl via env var inheritance through forkserver
- run registry: ~/.local/state/dimos/runs/<run-id>.json with port conflict detection and stale cleanup
- daemon stdout/stderr redirected to <log_dir>/daemon.log

Closes DIM-681, DIM-682, DIM-684, DIM-685
tests: 57 total (e2e tests marked @pytest.mark.slow)