vm-disk from golden image stays unbound forever when target StorageClass differs from the image's #2908

@lexfrei

Description

Summary

Creating a VM disk from a golden image into a StorageClass different from the one the golden images live in never completes: the temporary clone PVC stays unbound forever and no error is surfaced to the user. Reported in the community chat on v1.4.2 with LINSTOR storage (golden images in a replicated StorageClass, target disk in a node-local one); the relevant code is unchanged on main.

Symptom

  1. Golden images (vm-default-images-* DataVolumes in cozy-public) live in a replicated StorageClass.
  2. The user creates a vm-disk with source.image and a local target StorageClass.
  3. CDI attempts a CSI clone: it creates a temporary PVC tmp-pvc-… in cozy-public with the target StorageClass (WaitForFirstConsumer) plus a prep-… pod.
  4. The PVC never binds. The prep pod stays stuck for hours with "Unable to attach or mount volumes: PVC cozy-public/tmp-pvc-… is not bound", and the LINSTOR CSI logs show "context deadline exceeded".
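
Because nothing is surfaced to the user, the symptom has to be spotted by looking for old Pending temporary PVCs. A minimal sketch of such a check, assuming the input is the parsed JSON of a PVC list for cozy-public (for example from kubectl get pvc -n cozy-public -o json); the function name and the 30-minute threshold are illustrative assumptions, not part of Cozystack or CDI:

```python
from datetime import datetime, timedelta, timezone


def stuck_clone_pvcs(pvc_list, now, max_age=timedelta(minutes=30)):
    """Return names of tmp-pvc-* claims that have been Pending longer than max_age.

    pvc_list is a parsed Kubernetes PersistentVolumeClaimList (a dict).
    The 30-minute default is an arbitrary threshold chosen for illustration.
    """
    stuck = []
    for item in pvc_list.get("items", []):
        meta = item["metadata"]
        # Only the temporary clone PVCs described in the symptom are of interest.
        if not meta["name"].startswith("tmp-pvc-"):
            continue
        if item.get("status", {}).get("phase") != "Pending":
            continue
        # creationTimestamp is serialized as an RFC 3339 UTC time, e.g. 2024-01-01T09:00:00Z.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if now - created > max_age:
            stuck.append(meta["name"])
    return stuck
```

A short-lived Pending tmp-pvc-… is normal while a clone is in progress, so only claims older than the threshold are reported.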

Mechanism

  1. packages/apps/vm-disk/templates/dv.yaml renders a DataVolume whose source.pvc points at cozy-public/vm-default-images-* while storage.storageClassName is the user-selected target class, so picking any class other than the images' one produces a cross-StorageClass clone request.
  2. packages/system/kubevirt-cdi/templates/cdi-cr.yaml sets a global cloneStrategyOverride: csi-clone on the CDI CR.
  3. CDI's snapshot-based smart clone has a "source and target PVCs must be in the same Storage Class" guard with an automatic fallback to host-assisted copy, but the csi-clone path has no such guard. Kubernetes allows cross-StorageClass PVC cloning at the API level and delegates satisfiability to the CSI driver, so the forced csi-clone request is issued as-is.
  4. A local StorageClass (placement count 1, no remote volume access) can only materialize a clone on a node that already holds the source data. With WaitForFirstConsumer the prep pod is scheduled without regard to source-data placement, so LINSTOR CreateVolume cannot satisfy the request and times out; the exact LINSTOR-side failure mode (cross-resource-group clone vs node placement) is worth confirming from CSI logs, but both are unsatisfiable for a local target here.
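
The interaction between the global override and the same-StorageClass guard can be sketched as a toy model. This is not CDI's code: the function and class names are hypothetical, and the per-StorageProfile default of snapshot is an assumption that still needs to be verified for LINSTOR. Only the strategy values (copy, snapshot, csi-clone) mirror the CDI setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CloneRequest:
    source_class: str  # StorageClass of the golden image PVC
    target_class: str  # StorageClass requested for the new disk


def effective_strategy(
    req: CloneRequest,
    global_override: Optional[str],
    profile_strategy: str = "snapshot",  # assumed default, not confirmed for LINSTOR
) -> str:
    """Simplified model of which clone path a request ends up on."""
    # A global override, when set, wins over the per-StorageProfile choice.
    strategy = global_override or profile_strategy
    if strategy == "snapshot" and req.source_class != req.target_class:
        # Same-StorageClass guard: fall back to host-assisted copy.
        return "copy"
    # No guard on the csi-clone path: the request goes to the CSI driver as-is.
    return strategy


cross_class = CloneRequest(source_class="replicated", target_class="local")
print(effective_strategy(cross_class, global_override="csi-clone"))  # csi-clone
print(effective_strategy(cross_class, global_override=None))         # copy
```

In this model the cross-class request only reaches the fallback when no global override is set, which matches the behavior described in the steps above.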

Trigger vs root cause

Trigger: picking a target StorageClass different from the golden images' class.

Root cause: the global cloneStrategyOverride: csi-clone (introduced with the golden-disks feature in d38c8aa, switched from snapshot to csi-clone in 42778cf) bypasses the same-StorageClass guard and host-assisted fallback that CDI's default per-StorageProfile strategy selection would apply.

Possible fix directions

  1. Drop the global override and configure the clone strategy per StorageProfile, so the same-StorageClass guard fires and cross-class clones fall back to the slower but correct host-assisted copy. Whether the LINSTOR StorageProfile default resolves to the snapshot strategy should be verified while implementing.
  2. Additionally, fail fast or warn in vm-disk / the dashboard when the requested StorageClass differs from the golden images' one, since host-assisted copy of large images is significantly slower and the user should know what they are opting into.
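
The fail-fast or warn check from the second direction could look roughly like the sketch below. It is written in Python purely for illustration; the real check would live in the vm-disk chart or the dashboard, and the function name and message wording are hypothetical.

```python
from typing import Optional


def check_target_class(image_class: str, target_class: str) -> Optional[str]:
    """Return a warning when a disk's StorageClass differs from the golden images' one.

    Returns None when the classes match and no warning is needed.
    """
    if target_class == image_class:
        return None
    return (
        f"StorageClass {target_class!r} differs from the golden images' class "
        f"{image_class!r}: the disk will be created by host-assisted copy, "
        "which is significantly slower for large images."
    )


print(check_target_class("replicated", "replicated"))  # None
print(check_target_class("replicated", "local"))
```

Whether the result is shown as a warning or turned into a hard failure is a policy choice; the sketch only detects the mismatch.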

Workaround

Create the disk in the same StorageClass the golden images live in (replicated in the default layout).

Metadata

Labels

area/storage: Issues or PRs related to storage (linstor, seaweedfs, bucket, velero, harbor)
area/virtualization: Issues or PRs related to virtualization (kubevirt, cdi, vmi, vm-import)
kind/bug: Categorizes issue or PR as related to a bug
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release
triage/accepted: Indicates an issue is ready to be actively worked on
