Skip to content

🐛 Fix CRD validation to only validate CRs against storage version schema#3798

Open
jianzhangbjz wants to merge 1 commit intooperator-framework:masterfrom
jianzhangbjz:OCPBUGS-78710
Open

🐛 Fix CRD validation to only validate CRs against storage version schema#3798
jianzhangbjz wants to merge 1 commit intooperator-framework:masterfrom
jianzhangbjz:OCPBUGS-78710

Conversation

@jianzhangbjz
Copy link
Member

Description of the change:

Before:

// Validate against ALL served versions
for _, newVersion := range newCRD.Spec.Versions {
    if oldVersionSet.Has(newVersion.Name) && newVersion.Served {
        validationsMap[newVersion.Name] = newVersion.Schema
    }
}

After:

  // Find the storage version
  var oldStorageVersion string
  for _, v := range oldCRD.Spec.Versions {
      if v.Storage {
          oldStorageVersion = v.Name
          break
      }
  }

  // Only validate against storage version's schema
  for _, newVersion := range newCRD.Spec.Versions {
      if newVersion.Name == oldStorageVersion && newVersion.Served {
          validationsMap[newVersion.Name] = newVersion.Schema
          break
      }
  }

Why This Is Safe

  1. Storage version semantics: CRs are always stored in the storage version format, so validating against the storage version's schema is the correct behavior.
  2. Existing safety checks: OLM already has SafeStorageVersionUpgrade check that prevents unsafe storage version changes.
  3. Webhook conversion handling: The existing code already handles conversion.strategy: Webhook by allowing upgrades to proceed with a warning, trusting the webhook to convert data correctly.
  4. User-created CRs with non-storage versions: When users create CRs using a non-storage version (e.g., v1), Kubernetes validates them against that version's schema at creation time. The data will include all required fields, so validation against the storage version's schema (which has fewer required fields) will pass.

Motivation for the change:
When upgrading a CRD that has multiple versions with conversion.strategy: None, OLM incorrectly validates existing CRs against all served versions' schemas. This causes upgrade failures when different versions have different required fields.
More: https://redhat.atlassian.net/browse/OCPBUGS-78710

Architectural changes:

Testing remarks:

  • Updated existing unit tests to use storage version CRs, which accurately reflects how Kubernetes stores CRs
  • All TestValidateV1CRDCompatibility tests pass
  • All TestValidateV1Beta1CRDCompatibility tests pass

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

Assisted-By: Claude-Code

@openshift-ci
Copy link

openshift-ci bot commented Mar 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign grokspawn for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jianzhangbjz jianzhangbjz changed the title Fix CRD validation to only validate CRs against storage version schema 🐛 Fix CRD validation to only validate CRs against storage version schema Mar 18, 2026
Copy link
Member

@kevinrizza kevinrizza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to see an enhancement or design document for how we want to introduce this change as well as a comprehensive test plan in order to approve this pr.

// When a CRD has multiple versions with conversion.strategy=None, CRs are stored in the
// storage version format. We should only validate CRs against the storage version's schema,
// not against all served versions, because without a conversion webhook the CR data won't
// be converted to match other versions' schemas.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is oddly phrased. A conversion webhook existing or not doesn't change whether CRs should be validated. Yes, a conversion webhook can allow a client to translate a resource from one version to another, but the lack of a conversion webhook doesn't mean that a client that is trying to access a resource at one version doesn't have to worry about version mismatch if a webhook doesn't exist.

}

if oldStorageVersion == "" {
return fmt.Errorf("no storage version found in old CRD %s", oldCRD.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a possible case. There must always be exactly one version in the CRD spec where storage is true. If not, this CSV should fail admission.

// 1. All existing CRs are stored in the old storage version format
// 2. When conversion.strategy is "None", Kubernetes doesn't convert CR data between versions
// 3. Validating CRs against other versions' schemas would incorrectly fail when those
// versions have different required fields
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to push back on this premise a little bit. The whole purpose of this schema validation is to ensure that new versions being introduced do not ratchet validation against existing data on the cluster. But why would a user ever want to introduce a new served version if it cannot actually be served? It is true that this will technically not break anything, but the whole point of this conversion validation is to make sure that data is being migrated to a new valid format in the previous version before introducing the new version.

If we are going to change this behavior, I think we need to write down exactly what test cases we are expecting this to help with in a way that is verifiable and include them as e2e tests that prove that we aren't accidentally regressing in other areas where this validation is still what we want to do.

I would also argue that what probably should happen here is that this should be an opt in for operator developers. The whole point of OLM is to restrict what upgrades we do and don't support to ensure that users that are not experts in CRD lifecycle don't accidentally shoot themselves in the foot. If a user wants to intentionally introduce a CRD version that cannot actually be served because there is no way to convert the data automatically, then they are breaking clients and they should know that they need to tell their users as much.

@grokspawn
Copy link
Contributor

The goal of the stored+serving validation is that we avoid conditions which can make the resulting resources unmanageable. A listed resource via the newer CRD API version could not be applied against the newer API, breaking any new client.
Normally, our expectation would be that there is at least one release between offering the optional fields and requiring them, and during that breaking change timeline the operator team would have guidance to admins that they must ensure that they provide the optional fields before upgrading to the next version.
And people who don't ... well, they get an error when they attempt it, telling them that they need to $doThing.

I do agree with Kevin that any attempt to change the current scheme needs to be researched, supported, and evangelized broadly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants