
feat(cli): daemon mode, stop, status, per-run logs (DIM-681, DIM-682, DIM-684, DIM-685)#1436

Open
spomichter wants to merge 16 commits into dev from
feat/dim-681-daemon-mode

Conversation

@spomichter
Contributor

@spomichter spomichter commented Mar 6, 2026

adds daemon infrastructure for dimos cli:

  • dimos run --daemon: double-fork daemonization with health check that polls worker.pid before backgrounding
  • dimos stop: SIGTERM with escalation to SIGKILL after 5s. supports --force, --all, --pid
  • dimos status: lists running instances with uptime, PID, blueprint, MCP port
  • per-run log directory: all modules write to <log_base>/<run-id>/main.jsonl via env var inheritance through forkserver
  • run registry at ~/.local/state/dimos/runs/<run-id>.json with port conflict detection and stale cleanup
  • FileHandler replaces RotatingFileHandler (multi-process rotation unsafe with forkserver)
  • signal handlers clean registry on SIGTERM/SIGINT
  • daemon stdout/stderr captured to <log_dir>/daemon.log

Closes DIM-681, DIM-682, DIM-684, DIM-685
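The run-registry mechanics described above (JSON entry per run, liveness via signal 0, stale-entry sweep) can be sketched roughly like this. This is a minimal illustration, not the PR's code: a temp directory stands in for the real registry path, and the real RunEntry has more fields.

```python
import json
import os
import subprocess
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

# Temp dir stands in for ~/.local/state/dimos/runs in this sketch.
REGISTRY_DIR = Path(tempfile.mkdtemp())


@dataclass
class RunEntry:
    """Illustrative registry entry; the actual RunEntry differs."""
    run_id: str
    pid: int
    blueprint: str

    def save(self) -> Path:
        path = REGISTRY_DIR / f"{self.run_id}.json"
        path.write_text(json.dumps(asdict(self)))
        return path


def is_pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0 checks existence without signalling
    except ProcessLookupError:
        return False
    return True


def cleanup_stale() -> int:
    """Drop entries whose PID is dead; list() materialises the glob so
    nothing is deleted while the directory scan is still open."""
    removed = 0
    for f in list(REGISTRY_DIR.glob("*.json")):
        if not is_pid_alive(json.loads(f.read_text())["pid"]):
            f.unlink()
            removed += 1
    return removed


# Demo: an entry pointing at an already-reaped subprocess is stale.
proc = subprocess.Popen(["true"])
proc.wait()
RunEntry("stale-demo", proc.pid, "pingpong").save()
RunEntry("live-demo", os.getpid(), "pingpong").save()
removed = cleanup_stale()
```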

tests: 57 total

  • 30 unit tests (registry, health check, daemon, signal handlers)
  • 6 logging unit tests (set_run_log_dir, env var routing, legacy fallback)
  • 16 CLI integration tests (status + stop via typer CliRunner with real PIDs)
  • 5 e2e tests with real forkserver workers (marked @pytest.mark.slow)

@greptile-apps
Contributor

greptile-apps bot commented Mar 6, 2026

Greptile Summary

This PR introduces the full daemon infrastructure for the DimOS CLI: double-fork daemonization (dimos run --daemon), a JSON run-registry for tracking live processes, dimos stop with SIGTERM→SIGKILL escalation, and dimos status. It also routes per-run structured logs to <log_base>/<run-id>/main.jsonl via a DIMOS_RUN_LOG_DIR env var inherited by forkserver workers, and replaces RotatingFileHandler with plain FileHandler to avoid multi-process rotation hazards. The overall design is solid and well-tested (41 tests, unit + e2e).

Key findings from the review:

  • E2E test registry isolation bug (test_e2e_daemon.py): The _clean_registry autouse fixture globs and deletes files from the real REGISTRY_DIR (~/.local/state/dimos/runs) instead of redirecting it to a tmp_path, as the unit tests do. Running the e2e suite on a machine with a live dimos daemon will silently destroy that daemon's registry entry.
  • Test global-state bleed (test_per_run_logs.py): The _clean_env fixture only resets DIMOS_RUN_LOG_DIR; it leaves _RUN_LOG_DIR and _LOG_FILE_PATH module-level globals set by set_run_log_dir() unreset between tests, which can cause ordering-dependent failures.
  • No wait after SIGKILL escalation (dimos.py): After escalating from SIGTERM to SIGKILL, _stop_entry removes the registry entry and prints "Stopped." immediately. In the brief window before the kernel reaps the process, a subsequent dimos run could get a spurious port-conflict error.
  • Private _open() usage (logging_config.py): Handler migration in set_run_log_dir() calls the private logging.FileHandler._open() method to reopen the stream after updating baseFilename.

Confidence Score: 3/5

  • Safe to merge for production code; the e2e test isolation bug should be fixed before running tests on machines with live dimos instances.
  • The production daemon, registry, CLI, and logging code are well-structured and correct. The confidence deduction comes from the test suite: the e2e fixture operates on the real user registry (destructive on dev machines), and per-run-log tests leave module globals dirty between runs. These are test reliability issues, not runtime bugs, but they lower overall confidence in the test coverage providing the safety net it's intended to provide.
  • dimos/core/test_e2e_daemon.py (real registry deletion) and dimos/core/test_per_run_logs.py (global state bleed) need attention before the test suite can be trusted in developer environments.

Important Files Changed

Filename Overview
dimos/core/daemon.py New file implementing double-fork daemonization, health-check polling, and SIGTERM/SIGINT signal handlers. Logic is sound; file descriptors are properly closed after dup2.
dimos/core/run_registry.py New run-registry module with CRUD for RunEntry JSON files, stale-PID cleanup, port-conflict detection, and alphabetical-sort-based get_most_recent. Implementation is clean and correct.
dimos/robot/cli/dimos.py Adds run --daemon, stop, and status CLI commands. Core flow is correct; minor issue: no brief wait after SIGKILL escalation before reporting "Stopped." and cleaning registry.
dimos/utils/logging_config.py Switches from RotatingFileHandler to FileHandler and adds per-run log directory routing via env var and module globals. Handler migration uses the private _open() method; otherwise solid.
dimos/core/test_e2e_daemon.py E2E tests with real forkserver workers. Critical isolation bug: _clean_registry fixture operates on the real ~/.local/state/dimos/runs directory instead of a temp path, destructively deleting live registry entries during test runs.
dimos/core/test_per_run_logs.py Tests for per-run log directory routing. The _clean_env autouse fixture only resets the env var but not the _RUN_LOG_DIR and _LOG_FILE_PATH module-level globals, causing state bleed between tests.
dimos/core/test_daemon.py Comprehensive unit tests for registry CRUD, health check, daemonize, signal handlers, and stop/status commands. Proper use of monkeypatching for isolation.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as dimos CLI
    participant Reg as RunRegistry
    participant Coord as ModuleCoordinator
    participant Daemon as Daemon (grandchild)

    User->>CLI: dimos run --daemon <blueprint>
    CLI->>Reg: cleanup_stale()
    CLI->>Reg: check_port_conflicts()
    CLI->>Coord: blueprint.build()
    CLI->>Coord: health_check(timeout=30s)
    Coord-->>CLI: workers alive ✓
    CLI->>User: ✓ All modules started / Health check passed
    CLI->>CLI: daemonize() — double fork
    Note over CLI: Parent 1 exits (os._exit)<br/>Parent 2 exits (os._exit)
    CLI->>Daemon: grandchild continues
    Daemon->>Daemon: stdin→/dev/null, stdout/stderr→daemon.log
    Daemon->>Reg: RunEntry.save() (pid=getpid())
    Daemon->>Daemon: install_signal_handlers()
    Daemon->>Coord: coordinator.loop()

    User->>CLI: dimos status
    CLI->>Reg: list_runs(alive_only=True)
    Reg-->>CLI: [RunEntry, ...]
    CLI->>User: Run ID / PID / Uptime / Log

    User->>CLI: dimos stop
    CLI->>Reg: get_most_recent()
    CLI->>Daemon: SIGTERM
    alt Process dies within 5s
        Daemon->>Reg: entry.remove() (via signal handler)
    else Still alive after 5s
        CLI->>Daemon: SIGKILL
        CLI->>Reg: entry.remove()
    end
    CLI->>User: Stopped.

Last reviewed commit: 7446f75

typer.echo("✓ Health check passed")
typer.echo("✓ DimOS running in background\n")
typer.echo(f" Run ID: {run_id}")
typer.echo(f" PID: {os.getpid()}")
Contributor

Daemon PID displayed before fork is incorrect

os.getpid() is called before daemonize() on line 179, so it prints the original parent process PID. After daemonize(), two fork() calls occur and both intermediate parents call os._exit(0). The grandchild (actual daemon) has a completely different PID. Users who try to kill <shown_pid> or use this PID for monitoring will be targeting a process that has already exited.

The registry entry on line 183 correctly records os.getpid() after daemonize(), so the fix is to echo the PID after daemonizing (or print a message telling users to use dimos status instead):

        typer.echo(f"✓ All modules started ({n_modules} modules, {n_workers} workers)")
        typer.echo("✓ Health check passed")

        daemonize(log_dir)

        entry = RunEntry(
            run_id=run_id,
            pid=os.getpid(),
            ...
        )
        entry.save()
        # Now os.getpid() is the actual daemon PID — but stdout is already redirected,
        # so communicate the PID via the registry and have the parent print it.

Since both stdout and stderr are redirected inside daemonize(), the cleanest fix is to print all the info (including PID) after the double-fork, which requires the grandchild to communicate back to the parent (e.g. via a pipe), or to print the run-id before the fork and let the user run dimos status to get the PID.
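The pipe approach mentioned above could look roughly like this (a hypothetical sketch, not the PR's daemonize(): the pre-fork parent keeps the read end of a pipe, the grandchild writes its real PID before the file descriptors are redirected, and the parent can echo it before exiting):

```python
import os


def daemonize_with_pid_report() -> int:
    """Hypothetical daemonize() variant that returns the daemon's PID
    to the original (pre-fork) process via a pipe."""
    r, w = os.pipe()
    pid = os.fork()
    if pid > 0:                 # original process: read the daemon PID
        os.close(w)
        daemon_pid = int(os.read(r, 32).decode())
        os.close(r)
        os.waitpid(pid, 0)      # reap the intermediate child
        return daemon_pid       # now safe to echo to the user
    # first child: detach from the session, fork again
    os.close(r)
    os.setsid()
    if os.fork() > 0:
        os._exit(0)             # intermediate parent exits
    # grandchild (the daemon): report PID before redirecting stdout/stderr
    os.write(w, str(os.getpid()).encode())
    os.close(w)
    os._exit(0)                 # stand-in for the real daemon loop


reported_pid = daemonize_with_pid_report()
```

The single small write is atomic (below PIPE_BUF), so the parent's blocking read sees the whole PID.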

typer.echo("✓ DimOS running in background\n")
typer.echo(f" Run ID: {run_id}")
typer.echo(f" PID: {os.getpid()}")
typer.echo(f" Log: {log_dir}/dimos.jsonl")
Contributor

Log path shown to user does not exist

The path {log_dir}/dimos.jsonl is displayed as the log location, but this file is never created. The actual structured log file is written by logging_config.py at _get_log_directory() / f"dimos_{timestamp}_{pid}.jsonl" — a completely different directory and filename pattern.

For an installed package this resolves to ~/.local/state/dimos/logs/dimos_<ts>_<pid>.jsonl, not to the per-run log_dir. The daemonize() function only creates log_dir/stderr.log for redirected file descriptors. Pointing users to a non-existent path will make dimos log (once implemented) confusing or broken.

- close devnull/stderr_file after dup2 (fd leak)
- remove PID from pre-fork output (printed parent PID, not daemon PID)
- show log_dir not log_dir/dimos.jsonl (file doesn't exist yet)
- keep tests in tests/ (dimos/conftest.py breaks isolated tests)
dimos status — shows running instances with run-id, pid, blueprint, uptime, log dir
dimos stop — sends SIGTERM (or SIGKILL with --force), waits 5s, escalates if needed
  --pid to target specific instance, --all to stop everything
  cleans registry on stop

also adds get_most_recent() to run_registry.
8 new tests covering sigterm, sigkill, escalation, dead process cleanup, most-recent lookup.
lightweight 2-module blueprint (PingModule + PongModule) that needs no
hardware, no LFS data, and no replay files. tests real forkserver workers
and module deployment.

covers:
- single worker lifecycle (build -> health check -> registry -> stop)
- multiple workers (2 workers, both alive)
- health check detects dead worker (SIGKILL -> detect failure)
- registry entry JSON field roundtrip
- stale entry cleanup (dead PIDs removed, live entries kept)
both stdout and stderr redirect to the same file after daemonize(),
so stderr.log was misleading. daemon.log better describes the contents.
folds DIM-685 into daemon PR to avoid merge conflicts on dimos.py.

- set_run_log_dir() before blueprint.build() routes structlog to
  <log_base>/<run-id>/main.jsonl
- workers inherit DIMOS_RUN_LOG_DIR env var via forkserver
- FileHandler replaces RotatingFileHandler (multi-process rotation unsafe)
- fallback: env var check -> legacy per-pid files
- 6 unit tests for routing logic
setup_logger() runs at import time throughout the codebase, creating
FileHandlers pointing to the legacy log path. set_run_log_dir() was
resetting the global path but not updating these existing handlers,
so main.jsonl was created but stayed empty (0 bytes).

fix: iterate all stdlib loggers and redirect their FileHandlers to
the new per-run path. verified: main.jsonl now receives structured
JSON logs (1050 bytes, 5 lines in test run).
testpaths in pyproject.toml is ['dimos'], so tests/ at repo root
wouldn't be picked up by CI. moved all 3 test files to dimos/core/
alongside existing core tests. all 41 tests pass with conftest active.
matches convention from test_worker.py — forkserver-based tests are
marked slow so they run in CI but are skipped in local default pytest.

local default: 36 tests (unit only)
CI (-m 'not (tool or mujoco)'): 41 tests (unit + e2e)
@spomichter spomichter changed the title feat(cli): add daemon mode for dimos run (DIM-681) feat(cli): daemon mode, stop, status, per-run logs (DIM-681, DIM-682, DIM-684, DIM-685) Mar 6, 2026
@spomichter
Contributor Author

@greptile

…682, DIM-684)

16 tests using typer CliRunner with real subprocess PIDs:

status (7 tests):
- no instances, running instance details, uptime formatting
- MCP port, multiple instances, dead PID filtering

stop (9 tests):
- default most recent, --pid, --pid not found
- --all, --all empty, --force SIGKILL
- already-dead cleanup, SIGTERM verification
Comment on lines +98 to +105
@pytest.fixture(autouse=True)
def _clean_registry():
    """Remove any leftover registry entries before/after each test."""
    for f in REGISTRY_DIR.glob("*.json"):
        f.unlink(missing_ok=True)
    yield
    for f in REGISTRY_DIR.glob("*.json"):
        f.unlink(missing_ok=True)
Contributor

_clean_registry deletes entries from the real user registry

REGISTRY_DIR is imported directly from run_registry and resolves to ~/.local/state/dimos/runs. The autouse fixture therefore deletes all *.json files in the real registry both before and after every e2e test — including entries for any dimos instance actually running on the developer's machine at the time.

All unit tests in test_daemon.py correctly avoid this by monkeypatching REGISTRY_DIR to a tmp_path. This fixture should do the same:

@pytest.fixture(autouse=True)
def _clean_registry(tmp_path, monkeypatch):
    """Redirect registry to a temp dir and clean up after each test."""
    import dimos.core.run_registry as _reg
    monkeypatch.setattr(_reg, "REGISTRY_DIR", tmp_path / "runs")
    (tmp_path / "runs").mkdir()
    yield
    # monkeypatch restores the original REGISTRY_DIR automatically

Without this fix, running the e2e suite on a machine with a live dimos daemon will silently evict the daemon's registry entry, causing dimos status and dimos stop to lose track of it.

Comment on lines +33 to +52
def test_creates_directory(self, tmp_path):
    from dimos.utils.logging_config import set_run_log_dir

    log_dir = tmp_path / "run-001"
    set_run_log_dir(log_dir)
    assert log_dir.is_dir()

def test_sets_env_var(self, tmp_path):
    from dimos.utils.logging_config import set_run_log_dir

    log_dir = tmp_path / "run-002"
    set_run_log_dir(log_dir)
    assert os.environ["DIMOS_RUN_LOG_DIR"] == str(log_dir)

def test_get_run_log_dir_returns_path(self, tmp_path):
    from dimos.utils.logging_config import get_run_log_dir, set_run_log_dir

    log_dir = tmp_path / "run-003"
    set_run_log_dir(log_dir)
    assert get_run_log_dir() == log_dir
Contributor

Module-level globals not reset between tests, causing state bleed

set_run_log_dir() modifies two module-level globals in logging_config: _RUN_LOG_DIR and _LOG_FILE_PATH. The _clean_env fixture only resets the DIMOS_RUN_LOG_DIR environment variable, leaving these globals polluted between tests.

TestLogFilePathRouting.test_routes_via_env_var and test_falls_back_to_legacy correctly use monkeypatch.setattr(logging_config, "_RUN_LOG_DIR", None) before relying on env-var or fallback behaviour. But test_creates_directory, test_sets_env_var, and test_get_run_log_dir_returns_path all call set_run_log_dir() without ever resetting those globals, so whichever test runs last will leave a non-None _RUN_LOG_DIR that affects subsequent test modules.

The _clean_env fixture should also reset the module-level state:

@pytest.fixture(autouse=True)
def _clean_env(monkeypatch):
    """Remove DIMOS_RUN_LOG_DIR from env and reset module globals between tests."""
    from dimos.utils import logging_config
    monkeypatch.delenv("DIMOS_RUN_LOG_DIR", raising=False)
    monkeypatch.setattr(logging_config, "_RUN_LOG_DIR", None)
    monkeypatch.setattr(logging_config, "_LOG_FILE_PATH", None)

Comment on lines +313 to +321
else:
    typer.echo("  Still alive after 5s, sending SIGKILL...")
    try:
        os.kill(entry.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass

entry.remove()
typer.echo("  Stopped.")
Contributor

No confirmation wait after SIGKILL escalation

After sending SIGKILL (line 316), the code falls straight through to entry.remove() and prints "Stopped." with no delay. SIGKILL delivery is immediate in the kernel, but the process remains in the process table (potentially as a zombie) until its parent calls wait(). In the brief window between SIGKILL and the kernel fully reaping the process, a subsequent dimos run attempt that calls check_port_conflicts() → is_pid_alive() → os.kill(pid, 0) can still return True for the supposedly-stopped PID, producing a spurious port-conflict error.

A short poll (matching the existing pattern) would close this window:

            try:
                os.kill(entry.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass
            else:
                # Wait briefly for SIGKILL to be processed
                for _ in range(20):  # up to 2 seconds
                    if not is_pid_alive(entry.pid):
                        break
                    time.sleep(0.1)

    entry.remove()
    typer.echo("  Stopped.")

Comment on lines +60 to +66
for logger_name in list(logging.Logger.manager.loggerDict):
    logger_obj = logging.getLogger(logger_name)
    for handler in logger_obj.handlers:
        if isinstance(handler, logging.FileHandler) and handler.baseFilename != str(new_path):
            handler.close()
            handler.baseFilename = str(new_path)
            handler.stream = handler._open()
Contributor

Handler migration uses private _open() method

handler._open() is a CPython implementation detail of logging.FileHandler — the leading underscore marks it as internal. While it has been stable across the Python versions this code targets, relying on private APIs makes the code fragile against future changes.

The public way to reopen a FileHandler to a new path is to close it, set baseFilename, and then let the handler reopen itself on the next emit() call (which calls _open() internally via StreamHandler.emit). Alternatively, construct a fresh FileHandler and swap it into the logger's handler list. Either approach avoids the private-API dependency.
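The fresh-handler swap mentioned above could be sketched like this (illustrative only, using public logging APIs; it carries over level, formatter, and filters but not a custom mode or encoding):

```python
import logging
import tempfile
from pathlib import Path


def redirect_file_handlers(new_path: Path) -> None:
    """Replace each FileHandler with a freshly constructed one aimed at
    new_path, avoiding the private _open() call entirely."""
    for name in list(logging.Logger.manager.loggerDict):
        logger_obj = logging.getLogger(name)
        for old in list(logger_obj.handlers):
            if isinstance(old, logging.FileHandler) and old.baseFilename != str(new_path):
                replacement = logging.FileHandler(new_path)
                replacement.setLevel(old.level)
                replacement.setFormatter(old.formatter)
                replacement.filters = list(old.filters)
                logger_obj.removeHandler(old)
                old.close()
                logger_obj.addHandler(replacement)


# Demo: a logger initially writing to a legacy path follows the redirect.
legacy = Path(tempfile.mkdtemp()) / "legacy.jsonl"
per_run = Path(tempfile.mkdtemp()) / "main.jsonl"
demo = logging.getLogger("redirect-demo")
demo.setLevel(logging.INFO)
demo.addHandler(logging.FileHandler(legacy))
redirect_file_handlers(per_run)
demo.info("routed to per-run log")
for h in demo.handlers:
    h.flush()
```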

…684)

3 new e2e tests that exercise dimos status and stop against a live
PingPong blueprint with real forkserver workers:

- status shows live blueprint details (run_id, PID, blueprint name)
- registry entry findable after real build, workers alive
- coord.stop() kills workers, registry cleaned
Comment on lines +111 to +112
from datetime import datetime, timezone
import os
Contributor

System imports should be at the top.

from dimos.utils.logging_config import setup_exception_handler

setup_exception_handler()
logger.info("Starting DimOS")
Contributor

This line was before anything because imports take time and we want to look responsive.

typer.echo("✓ DimOS running in background\n")
typer.echo(f" Run ID: {run_id}")
typer.echo(f" Log: {log_dir}")
typer.echo(" MCP: http://localhost:9990/mcp\n")
Contributor

MCP isn't necessarily started and could be at a different host/port.

Comment on lines +172 to +173
n_workers = len(coordinator._client.workers) if coordinator._client else 0
n_modules = len(coordinator._deployed_modules)
Contributor

You shouldn't access internal variables. There's this on WorkerManager, you probably need to add it to coordinator:

    @property
    def workers(self) -> list[Worker]:
        return list(self._workers)

typer.echo("No running DimOS instances")
return

for entry in entries:
Contributor

We don't support running multiple instances.

return removed


def check_port_conflicts(mcp_port: int = 9990, grpc_port: int = 9877) -> RunEntry | None:
Contributor

We have many places where we hardcode 9990 as the MCP port. It should probably be in a single place: probably GlobalConfig. (We probably also need a host which can be localhost as the default.)

I think all of these default values should be removed, and only have one place where we provide them.

Comment on lines +138 to +141
_entry("mcp-port-test", proc.pid, mcp_port=9990)

result = CliRunner().invoke(main, ["status"])
assert "9990" in result.output
Contributor

Since 9990 is the default it's more useful to test a random port like 24623

Comment on lines +133 to +136
from typer.testing import CliRunner

from dimos.robot.cli.dimos import main

Contributor

These imports should be at the top because they're repeated in every test.

Comment on lines +150 to +156
if _active_coordinator is not None:
    try:
        _active_coordinator.stop()
    except Exception:
        logger.error("Error during coordinator stop", exc_info=True)
if _active_entry is not None:
    _active_entry.remove()
Contributor

There's no need to check for None and use global variables if you inline _shutdown_handler into install_signal_handlers
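The suggested refactor could look roughly like this (a sketch with illustrative names: the handler becomes a closure over coordinator and entry, so neither module globals nor None checks are needed):

```python
import signal
import types


def install_signal_handlers(coordinator, entry) -> None:
    """Install SIGTERM/SIGINT handlers closing over the live objects."""
    def _shutdown_handler(signum, frame):
        try:
            coordinator.stop()
        except Exception:
            pass  # real code: logger.error(..., exc_info=True)
        entry.remove()
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, _shutdown_handler)
    signal.signal(signal.SIGINT, _shutdown_handler)


# Demo with stand-in objects: invoke the installed handler directly.
calls = []
coordinator = types.SimpleNamespace(stop=lambda: calls.append("stop"))
entry = types.SimpleNamespace(remove=lambda: calls.append("remove"))
install_signal_handlers(coordinator, entry)
handler = signal.getsignal(signal.SIGTERM)
try:
    handler(signal.SIGTERM, None)
except SystemExit:
    pass
```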

Contributor Author

yes

Comment on lines +30 to +31
REGISTRY_DIR = Path.home() / ".local" / "state" / "dimos" / "runs"
LOG_BASE_DIR = Path.home() / ".local" / "state" / "dimos" / "logs"
Contributor

We should rely on the XDG_DATA_HOME standard. This is used in _get_user_data_dir in dimos/utils/data.py. You should probably extract that function and use it:

REGISTRY_DIR = get_user_data_dir() / "runs"
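An XDG-aware helper along those lines might look like this (a sketch; the repo's _get_user_data_dir in dimos/utils/data.py is the real source of truth, and the spec default for XDG_DATA_HOME is ~/.local/share):

```python
import os
from pathlib import Path


def get_user_data_dir() -> Path:
    """Honour XDG_DATA_HOME when set, else fall back to ~/.local/share."""
    base = os.environ.get("XDG_DATA_HOME")
    root = Path(base) if base else Path.home() / ".local" / "share"
    return root / "dimos"


REGISTRY_DIR = get_user_data_dir() / "runs"
LOG_BASE_DIR = get_user_data_dir() / "logs"

# Demo: the env override is respected, the default applies otherwise.
os.environ["XDG_DATA_HOME"] = "/tmp/xdg-demo"
overridden = get_user_data_dir()
del os.environ["XDG_DATA_HOME"]
default = get_user_data_dir()
```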

pass


def _entry(run_id: str, pid: int, blueprint: str = "test", **kwargs) -> RunEntry:
Contributor

This should be a fixture so you can clean it up after it's done.

- move system imports to top of run(), logger.info before heavy imports
- remove hardcoded MCP port line from daemon output
- add n_workers/n_modules properties to ModuleCoordinator (public API)
- single-instance model: remove --all/--pid from stop, simplify status
- use _get_user_data_dir() for XDG-compliant registry/log paths
- remove mcp_port from RunEntry (should be in GlobalConfig)
- inline _shutdown_handler as closure in install_signal_handlers
- add SIGKILL wait poll (2s) to avoid zombie race with port conflict check
- replace handler._open() private API with new FileHandler construction
- fix e2e _clean_registry to monkeypatch REGISTRY_DIR
- reset logging_config module globals in test fixture
- move test imports to module level
Comment on lines +57 to +61
try:
    p.kill()
    p.wait(timeout=2)
except Exception:
    pass
Contributor

It's probably better to do p.kill() and p.wait in separate try-excepts or at least check if the process is open. At least one of the tests already killed the process.

"""set_run_log_dir() configures per-run logging."""

def test_creates_directory(self, tmp_path):
    from dimos.utils.logging_config import set_run_log_dir
Contributor

Should be moved at the top because it's repeated in every test.

structlog FileHandler writes to main.jsonl — daemon.log only ever
captured signal shutdown messages. no useful content.
(meaning the underlying process exited). It also fails if there are
zero workers (nothing to monitor).
"""
client = coordinator._client
Contributor

This function accesses _client (WorkerManager), which is an indicator that this function actually belongs as a method on WorkerManager.

But I don't think we need the polling. .build() works synchronously, so every module should be started after it finished. I think we should only check once that all the workers are alive.

…port

fixes mypy name-defined error from import reorganization
- simplify health_check to single alive check (build is synchronous)
- remove --health-timeout flag (no longer polling)
- add workers property to ModuleCoordinator (public API)
- separate try-excepts for kill/wait in sleeper fixture cleanup
- move repeated imports to top in test_per_run_logs
"""Remove registry entries for dead processes. Returns count removed."""
REGISTRY_DIR.mkdir(parents=True, exist_ok=True)
removed = 0
for f in REGISTRY_DIR.glob("*.json"):
Contributor

You need sorted() or list() like in list_runs because glob() is a generator and mutating while iterating leads to issues.
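A minimal illustration of the suggested fix — materialise the glob before unlinking (the directory and file names here are made up for the demo):

```python
import tempfile
from pathlib import Path

registry = Path(tempfile.mkdtemp())
for name in ("a", "b", "c"):
    (registry / f"{name}.json").write_text("{}")

# Path.glob() is a lazy scandir-backed generator; deleting entries from
# a directory that is still being scanned is platform-dependent (entries
# can be skipped). sorted() snapshots the listing first.
removed = 0
for f in sorted(registry.glob("*.json")):
    f.unlink()
    removed += 1
```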

_stop_entry(entry, force=force)


def _stop_entry(entry, force: bool = False) -> None: # type: ignore[no-untyped-def]
Contributor

I think this belongs in dimos.core.run_registry as well.

Also, please don't add # type: ignore.

# Process should be dead
assert proc.poll() is not None

def test_stop_sigterm_kills_process(self, sleeper):
Contributor

This takes 13 seconds. It should be marked with @pytest.mark.slow.

Comment on lines +177 to +178
time.sleep(6)
assert proc.poll() is not None
Contributor

Instead of sleeping a fixed amount, you should poll for the condition to be true. If the condition is still false after the timeout, fail the test.
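A hypothetical polling helper for this (and for the sleep(0.3) case below) could look like:

```python
import subprocess
import sys
import time


def wait_for(predicate, timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll until predicate() holds or the timeout expires; returns
    whether the condition was met (illustrative test helper)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()


# Demo: instead of `time.sleep(6); assert proc.poll() is not None`,
# return as soon as the short-lived process actually exits.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.3)"])
exited = wait_for(lambda: proc.poll() is not None, timeout=5.0)
```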

result = CliRunner().invoke(main, ["stop"])
assert result.exit_code == 1

def test_stop_default_most_recent(self, sleeper):
Contributor

This test takes 7 seconds. It should be marked with @pytest.mark.slow.

assert result.exit_code == 0
assert "Stopping" in result.output
assert "stop-default" in result.output
time.sleep(0.3)
Contributor

I assume this is because the path isn't cleaned up right away. You should also poll here for it to be true. These fixed sleep commands often lead to flaky tests, especially in CI where they're slower.

# See the License for the specific language governing permissions and
# limitations under the License.

"""Tests for DIM-681: daemon mode, run registry, and health checks."""
Contributor

Suggested change
"""Tests for DIM-681: daemon mode, run registry, and health checks."""

conflict = check_port_conflicts(grpc_port=9877)
assert conflict is not None

def test_port_conflict_no_false_positive(self, tmp_registry: Path):
Contributor

Isn't this the same test as above?

return tmp_path


def _make_entry(
Contributor

This should be a fixture so it can clean up afterwards regardless of success status.

import pytest

# Skip sysctl interactive prompt in CI / headless
os.environ["CI"] = "1"
Contributor

ಠ_ಠ

This makes all tests run as if in CI because it overrides the env at import time.
