Fix KubernetesExecutor scheduler crash from unpicklable pod_override#68831
Fix KubernetesExecutor scheduler crash from unpicklable pod_override#68831ephraimbuddy wants to merge 2 commits into
Conversation
phanikumv
left a comment
There was a problem hiding this comment.
The same ApiClient() deserialize code still lives in the below two paths.
- providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py — L381–L382 (api_client = ApiClient() → _ApiClient__deserialize_model(job, k8s.V1Job))
- providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py — L39–L40 (api_client = ApiClient() → _ApiClient__deserialize_model(obj, new_class))
Neither is on the path where a deserialized kubernetes model gets pickle-serialized, however worth fixing them as a follow up in another PR..?
Yeah, we can do a follow up since those are not on the pickle path |
736cbee to
f029434
Compare
f029434 to
902fbf3
Compare
eladkal
left a comment
There was a problem hiding this comment.
Can we split this PR to core and providers?
Splitting |
PR: #68848 |
153aedd to
5490346
Compare
phanikumv
left a comment
There was a problem hiding this comment.
Both the PRs are currently missing the airflow-core scheduler fix. Please restore yesterday's serialized_objects.py + sqlalchemy.py changes
amoghrajesh
left a comment
There was a problem hiding this comment.
The fix is fine for pods deserialized from yaml / dicst but misses directly instantiated V1Pod objects via the executor_config.
When a user constructs a V1Pod directly in their dag, it never goes through `ApiClient.__deserialize_model. Requesting changes to avoid accidental merge.
5490346 to
f8c5f27
Compare
Yeah, there was a mix up. I have pushed the actual change. |
There was a problem hiding this comment.
Pull request overview
This PR prevents scheduler crashes in KubernetesExecutor when a task sets pod_override to a kubernetes.client.models.V1Pod that becomes unpicklable in-cluster (due to kubernetes-client v36 attaching the process-global in-cluster Configuration with an unpicklable refresh_api_key_hook closure). It centralizes pod dict deserialization through an ApiClient constructed with a fresh Configuration, ensuring deserialized pods (and nested models) remain picklable across core serialization and the “repair after unpickling” path.
Changes:
- Add
deserialize_pod_dict()that deserializes pod dicts viaApiClient(configuration=Configuration())to avoid capturing the in-cluster globalConfiguration. - Update both
BaseSerialization.deserialize()(POD branch) andensure_pod_is_valid_after_unpickling()to use the new safe deserialization helper. - Add regression tests covering both the serialization/deserialization path and the unpickling repair path for picklability in-cluster.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
airflow-core/src/airflow/utils/sqlalchemy.py |
Introduces safe pod dict deserialization with a fresh Kubernetes Configuration and reuses it in the unpickling repair path. |
airflow-core/src/airflow/serialization/serialized_objects.py |
Switches POD deserialization to the safe helper to keep deserialized pods picklable. |
airflow-core/tests/unit/utils/test_sqlalchemy.py |
Adds a test ensuring the unpickling repair path produces a picklable pod even with a dirty in-cluster default Configuration. |
airflow-core/tests/unit/serialization/test_serialized_objects.py |
Adds a test ensuring serialized/deserialized pods do not capture an unpicklable in-cluster default Configuration. |
phanikumv
left a comment
There was a problem hiding this comment.
LGTM now after the last commit
amoghrajesh
left a comment
There was a problem hiding this comment.
Some nits, otherwise LGTM!
| the pod it must round-trip through a fresh ``Configuration`` so it stays picklable onto the | ||
| KubernetesExecutor queue. | ||
| """ | ||
| from kubernetes.client import Configuration |
There was a problem hiding this comment.
Same here (but this one can stay actually)
| assert decoded.local_vars_configuration.refresh_api_key_hook is None | ||
| assert decoded.spec.containers[0].local_vars_configuration.refresh_api_key_hook is None | ||
|
|
||
| _has_kubernetes.cache_clear() |
There was a problem hiding this comment.
Is this really needed? I guess its bleeding from other tests but yeah just flagging
When the scheduler runs in-cluster, the kubernetes client installs a process-global default Configuration whose refresh_api_key_hook is a local closure. Under kubernetes-client v36, deserializing a V1Pod copies that Configuration onto the pod and every nested model. A task that sets a V1Pod pod_override therefore produces a pod that pickle cannot serialize, and the KubernetesExecutor pickles each task onto a multiprocessing queue synchronously in the scheduler loop, so the scheduler crashes in a loop and no task can be launched. Deserialize pods through an ApiClient built with a fresh Configuration so that neither the pod nor any nested model captures the in-cluster global. This is applied where the config gets attached -- airflow-core's serialization and unpickling repair paths -- so every consumer of a deserialized pod stays picklable.
f8c5f27 to
fb76af2
Compare
Move the deserialize_pod_dict import to module top-level, hoist pickle and Configuration to top-level imports in the serialization test, and drop the redundant _has_kubernetes cache_clear calls that were copied from a sibling test.
fb76af2 to
154a011
Compare
When the scheduler runs in-cluster, the kubernetes client installs a process-global default Configuration whose refresh_api_key_hook is a local closure. Under kubernetes-client v36, deserializing a V1Pod copies that Configuration onto the pod and every nested model. A task that sets a V1Pod pod_override therefore produces a pod that pickle cannot serialize, and the KubernetesExecutor pickles each task onto a multiprocessing queue synchronously in the scheduler loop, so the scheduler crashes in a loop and no task can be launched.
Deserialize pods through an ApiClient built with a fresh Configuration so that neither the pod nor any nested model captures the in-cluster global. This is applied at the point the config gets attached, so every consumer of a deserialized pod stays picklable: airflow-core's serialization and unpickling repair paths, and the cncf provider's worker-side reconstruction.
Provider counter part of this PR: #68848
Was generative AI tooling used to co-author this PR?
Claude Opus 4.8