🐛 Fix CRD validation to only validate CRs against storage version schema#3798
🐛 Fix CRD validation to only validate CRs against storage version schema#3798jianzhangbjz wants to merge 1 commit intooperator-framework:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
kevinrizza
left a comment
There was a problem hiding this comment.
I think I need to see an enhancement or design document for how we want to introduce this change as well as a comprehensive test plan in order to approve this pr.
| // When a CRD has multiple versions with conversion.strategy=None, CRs are stored in the | ||
| // storage version format. We should only validate CRs against the storage version's schema, | ||
| // not against all served versions, because without a conversion webhook the CR data won't | ||
| // be converted to match other versions' schemas. |
There was a problem hiding this comment.
This comment is oddly phrased. A conversion webhook existing or not doesn't change whether CRs should be validated. Yes, a conversion webhook can allow a client to translate a resource from one version to another, but the lack of a conversion webhook doesn't mean that a client that is trying to access a resource at one version doesn't have to worry about version mismatch if a webhook doesn't exist.
| } | ||
|
|
||
| if oldStorageVersion == "" { | ||
| return fmt.Errorf("no storage version found in old CRD %s", oldCRD.Name) |
There was a problem hiding this comment.
This isn't a possible case. There must always be exactly one version in the CRD spec where storage is true. If not, this CSV should fail admission.
| // 1. All existing CRs are stored in the old storage version format | ||
| // 2. When conversion.strategy is "None", Kubernetes doesn't convert CR data between versions | ||
| // 3. Validating CRs against other versions' schemas would incorrectly fail when those | ||
| // versions have different required fields |
There was a problem hiding this comment.
I'm going to push back on this premise a little bit. The whole purpose of this schema validation is to ensure that new versions being introduced do not ratchet validation against existing data on the cluster. But why would a user ever want to introduce a new served version if it cannot actually be served? It is true that this will technically not break anything, but the whole point of this conversion validation is to make sure that data is being migrated to a new valid format in the previous version before introducing the new version.
If we are going to change this behavior, I think we need to write down exactly what test cases we are expecting this to help with in a way that is verifiable and include them as e2e tests that prove that we aren't accidentally regressing in other areas where this validation is still what we want to do.
I would also argue that what probably should happen here is that this should be an opt in for operator developers. The whole point of OLM is to restrict what upgrades we do and don't support to ensure that users that are not experts in CRD lifecycle don't accidentally shoot themselves in the foot. If a user wants to intentionally introduce a CRD version that cannot actually be served because there is no way to convert the data automatically, then they are breaking clients and they should know that they need to tell their users as much.
|
The goal of the stored+serving validation is that we avoid conditions which can make the resulting resources unmanageable. A listed resource via the newer CRD API version could not be applied against the newer API, breaking any new client. I do agree with Kevin that any attempt to change the current scheme needs to be researched, supported, and evangelized broadly. |
Description of the change:
Before:
After:
Why This Is Safe
SafeStorageVersionUpgradecheck that prevents unsafe storage version changes.conversion.strategy: Webhookby allowing upgrades to proceed with a warning, trusting the webhook to convert data correctly.Motivation for the change:
When upgrading a CRD that has multiple versions with
conversion.strategy: None, OLM incorrectly validates existing CRs against all served versions' schemas. This causes upgrade failures when different versions have different required fields.More: https://redhat.atlassian.net/browse/OCPBUGS-78710
Architectural changes:
Testing remarks:
Reviewer Checklist
/doc[FLAKE]are truly flaky and have an issueAssisted-By: Claude-Code