
SkypilotExecutor cannot track status or logs for API-server-managed Kubernetes jobs #482

@jesintharnold

Description

Summary

When launching a job with run.SkypilotExecutor against a remote SkyPilot API server on Kubernetes, the pod comes up and the workload runs, but NeMo Run loses track of the job after submission. Local status/log calls report "Cluster(s) not found", nemo experiment status stays at SUBMITTED, and nemo experiment logs does not return logs for the running job.

Environment

  • nemo-run version: please fill exact version from pip show nemo-run
  • skypilot version: please fill exact version from pip show skypilot
  • Python: 3.11.9
  • Backend: SkyPilot API server + Kubernetes
  • SKYPILOT_API_SERVER_ENDPOINT: <skypilot api endpoint>
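To fill in the version placeholders above, a minimal stdlib-only sketch (the package names are assumed to match the pip distribution names):

```python
from importlib.metadata import PackageNotFoundError, version


def pkg_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"


for pkg in ("nemo-run", "skypilot"):
    print(f"{pkg}: {pkg_version(pkg)}")
```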

Minimal Reproducer

import os
os.environ["SKYPILOT_API_SERVER_ENDPOINT"] = "<SKY-PILOT-API-SERVER-URL>"

import nemo_run as run
from nemo.collections import llm
import nemo.lightning as nl
from lightning.pytorch.loggers import MLFlowLogger
from nemo.collections.llm.peft.lora import LoRA
from lightning.pytorch.callbacks import EarlyStopping

NEMO_MODEL_PATH = "/mnt/models/nemo-models/Mistral-7B-v0.3"
DATASET_ROOT = "/mnt/datasets/kyc-edd-v1"
experiment_name = "kyc-edd-mistral-lora-ft-exp4"
run_name = "kyc-edd-mistral-lora-ft-exp4-lora-r32a64-100steps"
OUTPUT_DIR = "/mnt/experiments/kyc-edd-mistral-lora-ft-exp4"
LOG_DIR = "/mnt/experiments/kyc_edd_mistral_lora_ft_logs-exp4"
MLFLOW_TRACKING_URI = "<ML-FLOW-URL>"

def configure_recipe(nodes: int = 1, gpus_per_node: int = 4):
    recipe = llm.mistral_7b.finetune_recipe(
        dir=OUTPUT_DIR,
        name="mistral_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme="lora",
    )

    recipe.resume = run.Config(
        nl.AutoResume,
        restore_config=run.Config(
            nl.RestoreConfig,
            path=NEMO_MODEL_PATH,
        ),
        resume_if_exists=True,
    )

    recipe.data = run.Config(
        llm.FineTuningDataModule,
        dataset_root=DATASET_ROOT,
        seq_length=4096,
        micro_batch_size=1,
        global_batch_size=64,
    )


    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=100,
        save_weights_only=False,
        always_save_context=True,        # save context/ on each checkpoint
        save_context_on_train_end=True,  # also save on final checkpoint
    )

    recipe.log = run.Config(
        nl.NeMoLogger,
        name="mistral-lora-ft",
        log_dir=LOG_DIR,
        use_datetime_version=False,
        ckpt=ckpt,
        explicit_log_dir=LOG_DIR,
        extra_loggers=[
            run.Config(
                MLFlowLogger,
                experiment_name=experiment_name,
                run_name=run_name,
                tracking_uri=MLFLOW_TRACKING_URI,
                log_model=False,
            )
        ],
    )
    recipe.peft = run.Config(
        LoRA,
        target_modules=[
            "linear_qkv",
            "linear_proj",
            "linear_fc1",
            "linear_fc2",
        ],
        exclude_modules=[],
        dim=32,
        alpha=64,
        dropout=0.05,
        dropout_position="pre",
        lora_A_init_method="xavier",
        lora_B_init_method="zero",
        a2a_experimental=False,
        lora_dtype=None,
        dropout_recompute=False,
    )

    early_stop = run.Config(
        EarlyStopping,
        monitor="val_loss",   # must be a metric that is actually logged
        mode="min",           # lower val_loss is better
        patience=3,           # 3 validation checks without improvement
        min_delta=0.0,
        strict=True,
        verbose=True,
    )
    recipe.trainer.max_steps = 1000
    recipe.trainer.num_sanity_val_steps = 0
    recipe.trainer.val_check_interval = 5
    recipe.trainer.strategy.ckpt_async_save = False
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.strategy.ddp = "megatron"
    #recipe.log.ckpt.save_context_on_train_end=True
    if recipe.trainer.callbacks is None:
        recipe.trainer.callbacks = []

    recipe.trainer.callbacks.append(early_stop)
    return recipe


def skypilot_executor(nodes: int = 1, gpus_per_node: int = 4) -> run.SkypilotExecutor:
    return run.SkypilotExecutor(
        gpus="H100",
        gpus_per_node=gpus_per_node,
        num_nodes=nodes,
        cloud="kubernetes",
        container_image="nvcr.io/nvidia/nemo:25.07",
        cluster_name="kyc-edd-mistral-finetune",
        setup="pip install 'mlflow>=1.0.0'",  # quote the specifier so '>' is not parsed as shell redirection
        env_vars={
            "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
            "NCCL_NVLS_ENABLE": "0",
            "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
            "NVTE_ASYNC_AMAX_REDUCTION": "1",
            "CUDA_DEVICE_MAX_CONNECTIONS": "1",
            "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
            "MLFLOW_TRACKING_URI": MLFLOW_TRACKING_URI,
        },
    )


def finetune_mistral():
    nodes = 1
    gpus_per_node = 4

    recipe = configure_recipe(nodes=nodes, gpus_per_node=gpus_per_node)
    executor = skypilot_executor(nodes=nodes, gpus_per_node=gpus_per_node)

    with run.Experiment("kyc-edd-mistral-7b-peft-finetuning-exp4") as exp:
        exp.add(recipe, executor=executor, name="kyc_edd_mistral_peft_finetuning-exp4")
        exp.run(sequential=True, tail_logs=False)


if __name__ == "__main__":
    finetune_mistral()

Observed Output

/nemo-run/mistral-7B-PEFT# python ./mistral-mlflow-finetune.py
[NeMo W 2026-04-06 05:16:36 nemo_logging:364] /root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'Could not load this library: /root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
      warn(

INFO:numexpr.utils:NumExpr defaulting to 12 threads.
INFO:megatron.core.msc_utils:The multistorageclient package is available.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.mixed_precision:Using Megatron-FSDP without Transformer Engine.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.param_and_grad_buffer:Detected Megatron Core, using Megatron-FSDP with Megatron.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.megatron_fsdp:Detected Megatron Core, using Megatron-FSDP with Megatron.
WARNING:nv_one_logger.api.config:OneLogger: Setting error_handling_strategy to DISABLE_QUIETLY_AND_REPORT_METRIC_ERROR for rank (rank=0) with OneLogger disabled. To override: explicitly set error_handling_strategy parameter.
INFO:nv_one_logger.exporter.export_config_manager:Final configuration contains 0 exporter(s)
WARNING:nv_one_logger.training_telemetry.api.training_telemetry_provider:No exporters were provided. This means that no telemetry data will be collected.
WARNING:nemo.collections.llm.gpt.model.megatron.hyena.hyena_mixer:WARNING: transformer_engine not installed. Using default recipe.
[NeMo W 2026-04-06 05:17:15 __init__:442] The deploy module could not be imported: cannot import name 'deploy' from 'nemo.collections.llm.api' (/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/nemo/collections/llm/api.py)
[NeMo W 2026-04-06 05:17:15 __init__:449] The evaluate module could not be imported: cannot import name 'evaluate' from 'nemo.collections.llm.api' (/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/nemo/collections/llm/api.py)
────── Entering Experiment kyc-edd-mistral-7b-peft-finetuning-exp4 with id: kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 ──────
[05:17:15] Launching job kyc_edd_mistral_peft_finetuning-exp4 for experiment                                        experiment.py:798
           kyc-edd-mistral-7b-peft-finetuning-exp4
Running on cluster: kyc-edd-mistral-finetune
⚙︎ Uploading files to API server
✓ Files uploaded  View logs: ~/sky_logs/file_uploads/sky-2026-04-06-05-18-35-880552-6f6b193d.log
Considered resources (1 node):
-------------------------------------------------------------------------------------
 INFRA                     INSTANCE   vCPUs   Mem(GB)   GPUS     COST ($)   CHOSEN
-------------------------------------------------------------------------------------
 Kubernetes (in-cluster)   -          16      64        H100:4   0.00          ✔
-------------------------------------------------------------------------------------
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: kyc-edd-mistral-finetune.  View logs: sky logs --provision kyc-edd-mistral-finetune
⚙︎ Syncing files.
  Syncing (to 1 node): /root/.sky/api_server/clients/f04b916e/file_mounts/root/.nemo_run/experiments/kyc-edd-mistral-7b-peft-finetuning-exp4/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635/kyc_edd_mistral_peft_finetuning-exp4 -> ~/.sky/file_mounts/nemo_run
✓ Synced file_mounts.  View logs: sky api logs -l sky-2026-04-06-05-18-39-436481/file_mounts.log
✓ Setup detached.
⚙︎ Job submitted, ID: 1
Cluster(s) not found: kyc-edd-mistral-finetune.
[05:21:57] INFO     Launched app:                                                                                     launcher.py:116
                    skypilot://nemo_run/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune
                    ___kyc_edd_mistral_peft_finetuning-exp4___1
──────────────────────── Waiting for Experiment kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 to finish ────────────────────────
Cluster(s) not found: kyc-edd-mistral-finetune.
Cluster(s) not found: kyc-edd-mistral-finetune.

Experiment Status for kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635

Task 0: kyc_edd_mistral_peft_finetuning-exp4
- Status: SUBMITTED
- Executor: SkypilotExecutor
- Job id: kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune___kyc_edd_mistral_peft_finetuning-exp4___1
- Local Directory: /root/.nemo_run/experiments/kyc-edd-mistral-7b-peft-finetuning-exp4/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635/kyc_edd_mistral_peft_finetuning-exp4

[05:21:58] INFO     Waiting for job                                                                                   launcher.py:136
                    kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune___kyc_edd_mistral_p
                    eft_finetuning-exp4___1 to finish [log=False]...
╭─────────────────────────────────────── kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 ────────────────────────────────────────╮
│                                                                                                                                   │
│   kyc_edd_mistral_peft_finetuning-exp4 UNKNOWN   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:10   │
│                                                                                                                                   │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# The experiment was run with the following tasks: ['kyc_edd_mistral_peft_finetuning-exp4']
# You can inspect and reconstruct this experiment at a later point in time using:
experiment = run.Experiment.from_id("kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635")
experiment.status() # Gets the overall status
experiment.logs("kyc_edd_mistral_peft_finetuning-exp4") # Gets the log for the provided task
experiment.cancel("kyc_edd_mistral_peft_finetuning-exp4") # Cancels the provided task if still running


# You can inspect this experiment at a later point in time using the CLI as well:
nemo experiment status kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635
nemo experiment logs kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 0
nemo experiment cancel kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 0
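As a side note on the handle format: judging from the output above, the Skypilot job id appears to pack the experiment id, cluster name, task name, and SkyPilot job id into one ___-separated string. A hedged sketch (delimiter and field order are inferred from this log, not from NeMo Run's source, and may differ across versions) that recovers the pieces, e.g. to run sky or kubectl commands against the cluster manually:

```python
from typing import NamedTuple


class SkypilotHandle(NamedTuple):
    experiment_id: str
    cluster_name: str
    task_name: str
    job_id: str


def parse_handle(handle: str) -> SkypilotHandle:
    """Split a job id of the (inferred) form
    <experiment_id>___<cluster>___<task>___<job_id>."""
    parts = handle.split("___")
    if len(parts) != 4:
        raise ValueError(f"unexpected handle format: {handle!r}")
    return SkypilotHandle(*parts)


h = parse_handle(
    "kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635"
    "___kyc-edd-mistral-finetune"
    "___kyc_edd_mistral_peft_finetuning-exp4"
    "___1"
)
print(h.cluster_name, h.job_id)  # kyc-edd-mistral-finetune 1
```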

At the same time, the Kubernetes pod is running and the training job proceeds normally when checked via kubectl / k9s. A screenshot of the running pods on AKS is attached for reference.

(screenshot: running pods in AKS)

Expected Behavior

  • NeMo Run should be able to track the job after submission.
  • nemo experiment status should reflect the real runtime state.
  • nemo experiment logs should stream or retrieve logs for the running job.

Actual Behavior

  • The workload runs on Kubernetes, but NeMo Run prints Cluster(s) not found.
  • Experiment state remains SUBMITTED instead of transitioning to RUNNING / terminal states.
  • Logs are not available via NeMo Run and are only visible via kubectl logs or the Kubernetes UI.
  • The local sky status / sky logs workflow also does not reflect these API-server-managed jobs.

Request

Please confirm whether SkypilotExecutor is expected to support status/log tracking when jobs are launched through a remote SkyPilot API server on Kubernetes.

If yes, this looks like a NeMo Run integration bug.

If no, the limitation should be documented clearly, including the recommended way to monitor logs and status for this mode.

Additional Question

Is there any extra SkyPilot configuration required for Kubernetes-based jobs to show logs in the SkyPilot dashboard, or is dashboard log visibility currently unsupported for this execution path?
