SkypilotExecutor cannot track status or logs for API-server-managed Kubernetes jobs #482
Description
Summary
When launching a job with run.SkypilotExecutor against a remote SkyPilot API server on Kubernetes, the pod comes up and the workload runs, but NeMo Run cannot continue tracking it. After launch, local status/log calls report Cluster(s) not found, nemo experiment status stays at SUBMITTED, and nemo experiment logs does not provide the running job logs.
Environment
- nemo-run version: (please fill exact version from pip show nemo-run)
- skypilot version: (please fill exact version from pip show skypilot)
- Python: 3.11.9
- Backend: SkyPilot API server + Kubernetes
- SKYPILOT_API_SERVER_ENDPOINT: <skypilot api endpoint>
Minimal Reproducer
import os

# Must be set before nemo_run / skypilot are imported.
os.environ["SKYPILOT_API_SERVER_ENDPOINT"] = "<SKY-PILOT-API-SERVER-URL>"

import nemo_run as run
from nemo.collections import llm
import nemo.lightning as nl
from lightning.pytorch.loggers import MLFlowLogger
from nemo.collections.llm.peft.lora import LoRA
from lightning.pytorch.callbacks import EarlyStopping

NEMO_MODEL_PATH = "/mnt/models/nemo-models/Mistral-7B-v0.3"
DATASET_ROOT = "/mnt/datasets/kyc-edd-v1"
experiment_name = "kyc-edd-mistral-lora-ft-exp4"
run_name = "kyc-edd-mistral-lora-ft-exp4-lora-r32a64-100steps"
OUTPUT_DIR = "/mnt/experiments/kyc-edd-mistral-lora-ft-exp4"
LOG_DIR = "/mnt/experiments/kyc_edd_mistral_lora_ft_logs-exp4"
MLFLOW_TRACKING_URI = "<ML-FLOW-URL>"


def configure_recipe(nodes: int = 1, gpus_per_node: int = 4):
    recipe = llm.mistral_7b.finetune_recipe(
        dir=OUTPUT_DIR,
        name="mistral_lora",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme="lora",
    )
    recipe.resume = run.Config(
        nl.AutoResume,
        restore_config=run.Config(
            nl.RestoreConfig,
            path=NEMO_MODEL_PATH,
        ),
        resume_if_exists=True,
    )
    recipe.data = run.Config(
        llm.FineTuningDataModule,
        dataset_root=DATASET_ROOT,
        seq_length=4096,
        micro_batch_size=1,
        global_batch_size=64,
    )
    ckpt = run.Config(
        nl.ModelCheckpoint,
        save_last=True,
        every_n_train_steps=100,
        save_weights_only=False,
        always_save_context=True,  # save context/ on each checkpoint
        save_context_on_train_end=True,  # also save on final checkpoint
    )
    recipe.log = run.Config(
        nl.NeMoLogger,
        name="mistral-lora-ft",
        log_dir=LOG_DIR,
        use_datetime_version=False,
        ckpt=ckpt,
        explicit_log_dir=LOG_DIR,
        extra_loggers=[
            run.Config(
                MLFlowLogger,
                experiment_name=experiment_name,
                run_name=run_name,
                tracking_uri=MLFLOW_TRACKING_URI,
                log_model=False,
            )
        ],
    )
    recipe.peft = run.Config(
        LoRA,
        target_modules=["linear_qkv", "linear_proj", "linear_fc1", "linear_fc2"],
        exclude_modules=[],
        dim=32,
        alpha=64,
        dropout=0.05,
        dropout_position="pre",
        lora_A_init_method="xavier",
        lora_B_init_method="zero",
        a2a_experimental=False,
        lora_dtype=None,
        dropout_recompute=False,
    )
    early_stop = run.Config(
        EarlyStopping,
        monitor="val_loss",  # must be a metric that is actually logged
        mode="min",  # lower val_loss is better
        patience=3,  # 3 validation checks without improvement
        min_delta=0.0,
        strict=True,
        verbose=True,
    )
    recipe.trainer.max_steps = 1000
    recipe.trainer.num_sanity_val_steps = 0
    recipe.trainer.val_check_interval = 5
    recipe.trainer.strategy.ckpt_async_save = False
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.strategy.ddp = "megatron"
    # recipe.log.ckpt.save_context_on_train_end = True
    if recipe.trainer.callbacks is None:
        recipe.trainer.callbacks = []
    recipe.trainer.callbacks.append(early_stop)
    return recipe


def skypilot_executor(nodes: int = 1, gpus_per_node: int = 4) -> run.SkypilotExecutor:
    return run.SkypilotExecutor(
        gpus="H100",
        gpus_per_node=gpus_per_node,
        num_nodes=nodes,
        cloud="kubernetes",
        container_image="nvcr.io/nvidia/nemo:25.07",
        cluster_name="kyc-edd-mistral-finetune",
        setup='pip install "mlflow>=1.0.0"',  # quoted so the shell does not treat >= as a redirect
        env_vars={
            "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
            "NCCL_NVLS_ENABLE": "0",
            "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
            "NVTE_ASYNC_AMAX_REDUCTION": "1",
            "CUDA_DEVICE_MAX_CONNECTIONS": "1",
            "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
            "MLFLOW_TRACKING_URI": MLFLOW_TRACKING_URI,
        },
    )


def finetune_mistral():
    nodes = 1
    gpus_per_node = 4
    recipe = configure_recipe(nodes=nodes, gpus_per_node=gpus_per_node)
    executor = skypilot_executor(nodes=nodes, gpus_per_node=gpus_per_node)
    with run.Experiment("kyc-edd-mistral-7b-peft-finetuning-exp4") as exp:
        exp.add(recipe, executor=executor, name="kyc_edd_mistral_peft_finetuning-exp4")
        exp.run(sequential=True, tail_logs=False)


if __name__ == "__main__":
    finetune_mistral()

Observed Output
/nemo-run/mistral-7B-PEFT# python ./mistral-mlflow-finetune.py
[NeMo W 2026-04-06 05:16:36 nemo_logging:364] /root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'Could not load this library: /root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchvision/image.so'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
INFO:numexpr.utils:NumExpr defaulting to 12 threads.
INFO:megatron.core.msc_utils:The multistorageclient package is available.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.mixed_precision:Using Megatron-FSDP without Transformer Engine.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.param_and_grad_buffer:Detected Megatron Core, using Megatron-FSDP with Megatron.
INFO:megatron.core.distributed.fsdp.src.megatron_fsdp.megatron_fsdp:Detected Megatron Core, using Megatron-FSDP with Megatron.
WARNING:nv_one_logger.api.config:OneLogger: Setting error_handling_strategy to DISABLE_QUIETLY_AND_REPORT_METRIC_ERROR for rank (rank=0) with OneLogger disabled. To override: explicitly set error_handling_strategy parameter.
INFO:nv_one_logger.exporter.export_config_manager:Final configuration contains 0 exporter(s)
WARNING:nv_one_logger.training_telemetry.api.training_telemetry_provider:No exporters were provided. This means that no telemetry data will be collected.
WARNING:nemo.collections.llm.gpt.model.megatron.hyena.hyena_mixer:WARNING: transformer_engine not installed. Using default recipe.
[NeMo W 2026-04-06 05:17:15 __init__:442] The deploy module could not be imported: cannot import name 'deploy' from 'nemo.collections.llm.api' (/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/nemo/collections/llm/api.py)
[NeMo W 2026-04-06 05:17:15 __init__:449] The evaluate module could not be imported: cannot import name 'evaluate' from 'nemo.collections.llm.api' (/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/nemo/collections/llm/api.py)
────── Entering Experiment kyc-edd-mistral-7b-peft-finetuning-exp4 with id: kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 ──────
[05:17:15] Launching job kyc_edd_mistral_peft_finetuning-exp4 for experiment experiment.py:798
kyc-edd-mistral-7b-peft-finetuning-exp4
Running on cluster: kyc-edd-mistral-finetune
⚙︎ Uploading files to API server
✓ Files uploaded View logs: ~/sky_logs/file_uploads/sky-2026-04-06-05-18-35-880552-6f6b193d.log
Considered resources (1 node):
-------------------------------------------------------------------------------------
INFRA INSTANCE vCPUs Mem(GB) GPUS COST ($) CHOSEN
-------------------------------------------------------------------------------------
Kubernetes (in-cluster) - 16 64 H100:4 0.00 ✔
-------------------------------------------------------------------------------------
⚙︎ Launching on Kubernetes.
└── Pod is up.
✓ Cluster launched: kyc-edd-mistral-finetune. View logs: sky logs --provision kyc-edd-mistral-finetune
⚙︎ Syncing files.
Syncing (to 1 node): /root/.sky/api_server/clients/f04b916e/file_mounts/root/.nemo_run/experiments/kyc-edd-mistral-7b-peft-finetuning-exp4/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635/kyc_edd_mistral_peft_finetuning-exp4 -> ~/.sky/file_mounts/nemo_run
✓ Synced file_mounts. View logs: sky api logs -l sky-2026-04-06-05-18-39-436481/file_mounts.log
✓ Setup detached.
⚙︎ Job submitted, ID: 1
Cluster(s) not found: kyc-edd-mistral-finetune.
[05:21:57] INFO Launched app: launcher.py:116
skypilot://nemo_run/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune
___kyc_edd_mistral_peft_finetuning-exp4___1
──────────────────────── Waiting for Experiment kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 to finish ────────────────────────
Cluster(s) not found: kyc-edd-mistral-finetune.
Cluster(s) not found: kyc-edd-mistral-finetune.
Experiment Status for kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635
Task 0: kyc_edd_mistral_peft_finetuning-exp4
- Status: SUBMITTED
- Executor: SkypilotExecutor
- Job id: kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune___kyc_edd_mistral_peft_finetuning-exp4___1
- Local Directory: /root/.nemo_run/experiments/kyc-edd-mistral-7b-peft-finetuning-exp4/kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635/kyc_edd_mistral_peft_finetuning-exp4
[05:21:58] INFO Waiting for job launcher.py:136
kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635___kyc-edd-mistral-finetune___kyc_edd_mistral_p
eft_finetuning-exp4___1 to finish [log=False]...
╭─────────────────────────────────────── kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 ────────────────────────────────────────╮
│ │
│ kyc_edd_mistral_peft_finetuning-exp4 UNKNOWN ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:10 │
│ │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
# The experiment was run with the following tasks: ['kyc_edd_mistral_peft_finetuning-exp4']
# You can inspect and reconstruct this experiment at a later point in time using:
experiment = run.Experiment.from_id("kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635")
experiment.status() # Gets the overall status
experiment.logs("kyc_edd_mistral_peft_finetuning-exp4") # Gets the log for the provided task
experiment.cancel("kyc_edd_mistral_peft_finetuning-exp4") # Cancels the provided task if still running
# You can inspect this experiment at a later point in time using the CLI as well:
nemo experiment status kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635
nemo experiment logs kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 0
nemo experiment cancel kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635 0
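For reference, the composite job id in the status output appears to pack the experiment id, cluster name, task name, and SkyPilot job id together with a "___" separator; splitting it recovers the pieces needed for a manual sky logs call. (This delimiter convention is inferred from the observed output above, not from documented NeMo Run API.)

```python
# Split the NeMo Run Skypilot job handle printed above into its parts.
# The "___" delimiter is an assumption inferred from the observed output.
handle = (
    "kyc-edd-mistral-7b-peft-finetuning-exp4_1775452635"
    "___kyc-edd-mistral-finetune"
    "___kyc_edd_mistral_peft_finetuning-exp4"
    "___1"
)
experiment_id, cluster_name, task_name, job_id = handle.split("___")
print(cluster_name)  # kyc-edd-mistral-finetune
print(job_id)        # 1
```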
At the same time, the Kubernetes pod is actually running, and the training job proceeds normally when checked via kubectl / k9s. A screenshot of the running pods in AKS is given below for reference.
Expected Behavior
- NeMo Run should be able to track the job after submission.
- nemo experiment status should reflect the real runtime state.
- nemo experiment logs should stream or retrieve logs for the running job.
Actual Behavior
- The workload runs on Kubernetes, but NeMo Run prints Cluster(s) not found.
- Experiment state remains SUBMITTED instead of transitioning to RUNNING / terminal states.
- Logs are not available via NeMo Run and are only visible via kubectl logs or the Kubernetes UI.
- The local sky status / sky logs workflow also does not reflect these API-server-managed jobs.
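As an interim manual check, the standard SkyPilot CLI commands for cluster status, job queue, and job logs could be tried directly against the API server once SKYPILOT_API_SERVER_ENDPOINT is exported. The sketch below only assembles and prints the candidate commands (cluster name and job ID are taken from this report; whether these commands work against a remote API server in this setup is exactly what is in question):

```python
# Build the manual SkyPilot CLI commands for this report's cluster and job.
# Assumption: cluster "kyc-edd-mistral-finetune" and job ID 1 as printed above.
CLUSTER = "kyc-edd-mistral-finetune"
JOB_ID = "1"

commands = [
    ["sky", "status"],                 # clusters known to the API server
    ["sky", "queue", CLUSTER],         # jobs submitted to the cluster
    ["sky", "logs", CLUSTER, JOB_ID],  # tail the job's logs
]
for cmd in commands:
    print(" ".join(cmd))
```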
Request
Please confirm whether SkypilotExecutor is expected to support status/log tracking when jobs are launched through a remote SkyPilot API server on Kubernetes.
If yes, this looks like a NeMo Run integration bug.
If no, the limitation should be documented clearly, including the recommended way to monitor logs and status for this mode.
Additional Question
Is there any extra SkyPilot configuration required for Kubernetes-based jobs to show logs in the SkyPilot dashboard, or is dashboard log visibility currently unsupported for this execution path?