
SkypilotExecutor cannot track status or logs for API-server-managed Kubernetes jobs #482

@jesintharnold

Description

Summary

When launching a job with run.SkypilotExecutor against a remote SkyPilot API server on Kubernetes, the pod comes up and the workload runs, but NeMo Run loses track of the job after submission. Local status/log calls report "Cluster(s) not found", nemo experiment status stays at SUBMITTED, and nemo experiment logs does not return logs for the running job.

Environment

  • nemo-run version: please fill exact version from pip show nemo-run
  • skypilot version: please fill exact version from pip show skypilot
  • Python: 3.11.9
  • Backend: SkyPilot API server + Kubernetes
  • SKYPILOT_API_SERVER_ENDPOINT: <skypilot api endpoint>
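To fill in the version placeholders above, a minimal stdlib-only sketch (the package names are assumed to match the pip distribution names):

```python
from importlib.metadata import PackageNotFoundError, version


def pkg_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"


for pkg in ("nemo-run", "skypilot"):
    print(f"{pkg}: {pkg_version(pkg)}")
```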

Minimal Reproducer

import os
os.environ["SKYPILOT_API_SERVER_ENDPOINT"] = "<SKY-PILOT-API-SERVER-URL>"

import nemo_run as run
from nemo.collections import llm
import nemo.lightning as nl
from lightning.pytorch.loggers import MLFlowLogger
from nemo.collections.llm.peft.lora import LoRA
from lightning.pytorch.callbacks import EarlyStopping

NEMO_MODEL_PATH = "/mnt/models/nemo-models/Mistral-7B-v0.3"
DATASET_ROOT = "/mnt/datasets/kyc-edd-v1"
experiment_name = "kyc-edd-mistral-lora-ft-exp4"
run_name = "kyc-edd-mistral-lora-ft-exp4-lora-r32a64-100steps"
OUTPUT_DIR = "/mnt/experiments/kyc-edd-mistral-lora-ft-exp4"
LOG_DIR = "/mnt/experiments/kyc_edd_mistral_lora_ft_logs-exp4"
MLFLOW_TRACKING_URI = "<ML-FLOW-URL>"

def configure_recipe(nodes: int = 1, gpus_per_node: int = 4):
    recipe = llm.mistral_7b.finetune_recipe(
        dir=OUTPUT_DIR,
        name="mistral_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme="lora",
    )

    recipe.resume = run.Config(
        nl.AutoResume,
        restore_config=run.Config(
            nl.RestoreConfig,
            path=NEMO_MODEL_PATH,
        ),
        resume_if_exists=True,
    )

    recipe.data = run.Config(
        llm.FineTuningDataModule,
        dataset_root=DATASET_ROOT,
        seq_length=4096,
        micro_batch_size=1,
        global_batch_size=64,
    )


    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=100,
        save_weights_only=False,
        always_save_context=True,        # save context/ on each checkpoint
        save_context_on_train_end=True,  # also save on final checkpoint
    )

    recipe.log = run.Config(
        nl.NeMoLogger,
        name="mistral-lora-ft",
        log_dir=LOG_DIR,
        use_datetime_version=False,
        ckpt=ckpt,
        explicit_log_dir=LOG_DIR,
        extra_loggers=[
            run.Config(
                MLFlowLogger,
                experiment_name=experiment_name,
                run_name=run_name,
                tracking_uri=MLFLOW_TRACKING_URI,
                log_model=False,
            )
        ],
    )
    recipe.peft = run.Config(
        LoRA,
        target_modules=[
            "linear_qkv",
            "linear_proj",
            "linear_fc1",
            "linear_fc2",
        ],
        exclude_modules=[],
        dim=32,
        alpha=64,
        dropout=0.05,
        dropout_position="pre",
        lora_A_init_method="xavier",
        lora_B_init_method="zero",
        a2a_experimental=False,
        lora_dtype=None,
        dropout_recompute=False,
    )

    early_stop = run.Config(
        EarlyStopping,
        monitor="val_loss",   # must be a metric that is actually logged
        mode="min",           # lower val_loss is better
        patience=3,           # 3 validation checks without improvement
        min_delta=0.0,
        strict=True,
        verbose=True,
    )
    recipe.trainer.max_steps = 1000
    recipe.trainer.num_sanity_val_steps = 0
    recipe.trainer.val_check_interval = 5
    recipe.trainer.strategy.ckpt_async_save = False
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.strategy.ddp = "megatron"
    #recipe.log.ckpt.save_context_on_train_end=True
    if recipe.trainer.callbacks is None:
        recipe.trainer.callbacks = []

    recipe.trainer.callbacks.append(early_stop)
    return recipe


def skypilot_executor(nodes: int = 1, gpus_per_node: int = 4) -> run.SkypilotExecutor:
    return run.SkypilotExecutor(
        gpus="H100",
        gpus_per_node=gpus_per_node,
        num_nodes=nodes,
        cloud="kubernetes",
        container_image="nvcr.io/nvidia/nemo:25.07",
        cluster_name="kyc-edd-mistral-finetune",
        setup="pip install 'mlflow>=1.0.0'",  # quote the specifier so '>' is not parsed as shell redirection
        env_vars={
            "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
            "NCCL_NVLS_ENABLE": "0",
            "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
            "NVTE_ASYNC_AMAX_REDUCTION": "1",
            "CUDA_DEVICE_MAX_CONNECTIONS": "1",
            "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
            "MLFLOW_TRACKING_URI": MLFLOW_TRACKING_URI,
        },
    )


def finetune_mistral():
    nodes = 1
    gpus_per_node = 4

    recipe = configure_recipe(nodes=nodes, gpus_per_node=gpus_per_node)
    executor = skypilot_executor(nodes=nodes, gpus_per_node=gpus_per_node)

    with run.Experiment("kyc-edd-mistral-7b-peft-finetuning-exp4") as exp:
        exp.add(recipe, executor=executor, name="kyc_edd_mistral_peft_finetuning-exp4")
        exp.run(sequential=True, tail_logs=False)


if __name__ == "__main__":
    finetune_mistral()

Observed Output

/nemo-run/mistral-7B-PEFT# python ./mistral-mlflow-finetune.py
[NeMo W 2026-04-06 05:16:36 nemo_logging:364] /root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'Could not load this library: /root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
      warn(

INFO:numexpr.utils:NumExpr defaulting to 12 threads.
INFO:megatron.core.msc_utils:The multistorageclient package is available.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.mixed_precision:Using Megatron-FSDP without Transformer Engine.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.param_and_grad_buffer:Detected Megatron Core, using Megatron-FSDP with Megatron.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.megatron_fsdp:Detected Megatron Core, using Megatron-FSDP with Megatron.
WARNING:nv_one_logger.api.config:OneLogger: Setting error_handling_strategy to DISABLE_QUIETLY_AND_REPORT_METRIC_ERROR for rank (rank=0) with OneLogger disabled. To override: explicitly set error_handling_strategy parameter.
INFO:nv_one_logger.exporter.export_config_manager:Final configuration contains 0 exporter(s)
WARNING:nv_one_logger.training_telemetry.api.training_telemetry_provider:No exporters were provided. This means that no telemetry data will be collected.
WARNING:nemo.collections.llm.gpt.model.megatron.hyena.hyena_mixer:WARNING: transformer_engine not installed. Using default recipe.
[NeMo W 2026-04-06 05:17:15 __init__:442] The deploy module could not be imported: cannot import name 'deploy' from 'nemo.collections.llm.api' (/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/nemo/collections/llm/api.py)
[NeMo W 2026-04-06 05:17:15 __init__:449] The evaluate module could not be imported: cannot import name 'evaluate' from 'nemo.collections.llm.api' (/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/nemo/collections/llm/api.py)
────── Entering Experiment kyc-edd-mistral-7b-peft-finetuning-exp4 with id: kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 ──────
[05:17:15] Launching job kyc_edd_mistral_peft_finetuning-exp4 for experiment                                        experiment.py:798
           kyc-edd-mistral-7b-peft-finetuning-exp4
Running on cluster: kyc-edd-mistral-finetune
⚙︎ Uploading files to API server
✓ Files uploaded  View logs: ~/sky_logs/file_uploads/sky-2026-04-06-05-18-35-880552-6f6b193d.log
Considered resources (1 node):
-------------------------------------------------------------------------------------
 INFRA                     INSTANCE   vCPUs   Mem(GB)   GPUS     COST ($)   CHOSEN
-------------------------------------------------------------------------------------
 Kubernetes (in-cluster)   -          16      64        H100:4   0.00          ✔
-------------------------------------------------------------------------------------
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: kyc-edd-mistral-finetune.  View logs: sky logs --provision kyc-edd-mistral-finetune
⚙︎ Syncing files.
  Syncing (to 1 node): /root/.sky/api_server/clients/f04b916e/file_mounts/root/.nemo_run/experiments/kyc-edd-mistral-7b-peft-finetuning-exp4/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635/kyc_edd_mistral_peft_finetuning-exp4 -> ~/.sky/file_mounts/nemo_run
✓ Synced file_mounts.  View logs: sky api logs -l sky-2026-04-06-05-18-39-436481/file_mounts.log
✓ Setup detached.
⚙︎ Job submitted, ID: 1
Cluster(s) not found: kyc-edd-mistral-finetune.
[05:21:57] INFO     Launched app:                                                                                     launcher.py:116
                    skypilot://nemo_run/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune
                    ___kyc_edd_mistral_peft_finetuning-exp4___1
──────────────────────── Waiting for Experiment kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 to finish ────────────────────────
Cluster(s) not found: kyc-edd-mistral-finetune.
Cluster(s) not found: kyc-edd-mistral-finetune.

Experiment Status for kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635

Task 0: kyc_edd_mistral_peft_finetuning-exp4
- Status: SUBMITTED
- Executor: SkypilotExecutor
- Job id: kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune___kyc_edd_mistral_peft_finetuning-exp4___1
- Local Directory: /root/.nemo_run/experiments/kyc-edd-mistral-7b-peft-finetuning-exp4/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635/kyc_edd_mistral_peft_finetuning-exp4

[05:21:58] INFO     Waiting for job                                                                                   launcher.py:136
                    kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune___kyc_edd_mistral_p
                    eft_finetuning-exp4___1 to finish [log=False]...
╭─────────────────────────────────────── kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 ────────────────────────────────────────╮
│                                                                                                                                   │
│   kyc_edd_mistral_peft_finetuning-exp4 UNKNOWN   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:10   │
│                                                                                                                                   │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# The experiment was run with the following tasks: ['kyc_edd_mistral_peft_finetuning-exp4']
# You can inspect and reconstruct this experiment at a later point in time using:
experiment = run.Experiment.from_id("kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635")
experiment.status() # Gets the overall status
experiment.logs("kyc_edd_mistral_peft_finetuning-exp4") # Gets the log for the provided task
experiment.cancel("kyc_edd_mistral_peft_finetuning-exp4") # Cancels the provided task if still running


# You can inspect this experiment at a later point in time using the CLI as well:
nemo experiment status kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635
nemo experiment logs kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 0
nemo experiment cancel kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 0
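As a side note on the handle format: judging from the output above, the Skypilot job id appears to pack the experiment id, cluster name, task name, and SkyPilot job id into one ___-separated string. A hedged sketch (delimiter and field order are inferred from this log, not from NeMo Run's source, and may differ across versions) that recovers the pieces, e.g. to run sky or kubectl commands against the cluster manually:

```python
from typing import NamedTuple


class SkypilotHandle(NamedTuple):
    experiment_id: str
    cluster_name: str
    task_name: str
    job_id: str


def parse_handle(handle: str) -> SkypilotHandle:
    """Split a job id of the (inferred) form
    <experiment_id>___<cluster>___<task>___<job_id>."""
    parts = handle.split("___")
    if len(parts) != 4:
        raise ValueError(f"unexpected handle format: {handle!r}")
    return SkypilotHandle(*parts)


h = parse_handle(
    "kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635"
    "___kyc-edd-mistral-finetune"
    "___kyc_edd_mistral_peft_finetuning-exp4"
    "___1"
)
print(h.cluster_name, h.job_id)  # kyc-edd-mistral-finetune 1
```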

At the same time, the Kubernetes pod is running and the training job proceeds normally when checked via kubectl / k9s. A screenshot of the running pods on AKS is attached for reference.

(screenshot: running pods in AKS)

Expected Behavior

  • NeMo Run should be able to track the job after submission.
  • nemo experiment status should reflect the real runtime state.
  • nemo experiment logs should stream or retrieve logs for the running job.

Actual Behavior

  • The workload runs on Kubernetes, but NeMo Run prints Cluster(s) not found.
  • Experiment state remains SUBMITTED instead of transitioning to RUNNING / terminal states.
  • Logs are not available via NeMo Run and are only visible via kubectl logs or the Kubernetes UI.
  • The local sky status / sky logs workflow also does not reflect these API-server-managed jobs.

Request

Please confirm whether SkypilotExecutor is expected to support status/log tracking when jobs are launched through a remote SkyPilot API server on Kubernetes.

If yes, this looks like a NeMo Run integration bug.

If no, the limitation should be documented clearly, including the recommended way to monitor logs and status for this mode.

Additional Question

Is there any extra SkyPilot configuration required for Kubernetes-based jobs to show logs in the SkyPilot dashboard, or is dashboard log visibility currently unsupported for this execution path?
