Make docker registry independent of deployed environment#775
Open
lukasewecker wants to merge 7 commits intomainfrom
Open
Make docker registry independent of deployed environment#775lukasewecker wants to merge 7 commits intomainfrom
lukasewecker wants to merge 7 commits intomainfrom
Conversation
model-engine/model_engine_server/infra/repositories/generic_docker_repository.py
Outdated
Show resolved
Hide resolved
model-engine/model_engine_server/infra/repositories/generic_docker_repository.py
Outdated
Show resolved
Hide resolved
dmchoiboi
reviewed
Mar 11, 2026
dmchoiboi
approved these changes
Mar 12, 2026
Collaborator
dmchoiboi
left a comment
There was a problem hiding this comment.
lgtm; maybe address greptile comments
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request Summary
Decouple Docker registry from cloud provider
Currently, the Docker registry used for pulling inference framework images (vLLM, TGI, etc.) is hardcoded to the cloud
provider: ECR on AWS, ACR on Azure, GAR on GCP. This prevents using alternative registries like public ECR, Docker Hub, or
GHCR regardless of where the service is deployed.
Changes
Add
GenericDockerRepository— a registry-agnostic implementation that uses the standard Docker Registry V2 HTTP APIto validate image existence. Handles anonymous access and
Www-Authenticatebearer token negotiation per the OCIdistribution spec.
Add
docker_registry_typeconfig field — optional override ("ecr","acr","gar","generic","onprem"). Whenunset, the registry type is inferred from
docker_repo_prefix:*.dkr.ecr.*.amazonaws.com→ ECR*.azurecr.io→ ACR*-docker.pkg.dev→ GARUnify registry selection in both
api/dependencies.pyandentrypoints/k8s_cache.pyto use the same inference logic.Remove redundant on-prem bypass in
check_docker_image_exists_for_image_tag—OnPremDockerRepository.image_exists()already returns
True, so thecloud_provider == "onprem"guard was unnecessary.Backward compatibility
Existing deployments require no config changes. Private ECR, ACR, and GAR prefixes are auto-detected and routed to their
existing native implementations. On-prem deployments that previously relied on the
cloud_provider == "onprem"check shouldset
docker_registry_type: "onprem"explicitly.Usage
To use public ECR images on any cloud provider:
Test Plan and Usage Guide
Ran tests, and added
model-engine/tests/unit/infra/repositories/test_generic_docker_repository.py.Greptile Summary
This PR decouples the Docker registry implementation from the cloud provider by introducing a
GenericDockerRepositorythat uses the OCI Distribution V2 HTTP API, and a newdocker_registry_typeconfig field with auto-inference from thedocker_repo_prefix. The changes unify registry selection logic across bothapi/dependencies.pyandentrypoints/k8s_cache.py.Key changes:
GenericDockerRepository— correctly splitsdocker_repo_prefixinto registry host + path prefix, handles anonymous and Bearer token auth flows, and correctly builds manifest URLs for registries likepublic.ecr.awswith path components.infer_registry_typehelper — auto-detects ECR, ACR (using substring match, consistent with ECR/GAR), and GAR from the prefix; falls back to"generic"for all other registries.k8s_cache.pyunified — GAR branch added, now mirrorsdependencies.pyexactly.cloud_provider == "onprem"early-return removed fromcheck_docker_image_exists_for_image_tag; on-prem deployments must now explicitly setdocker_registry_type: "onprem"or they will silently fall through toGenericDockerRepository, which makes live HTTP requests and will fail in airgapped environments.Confidence Score: 3/5
cloud_provider: onpremwithout settingdocker_registry_type: onprem.cloud_provider == "onprem"guard with no backward-compatible inference fallback, which can silently break on-prem image validation checks on upgrade.model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.pyandmodel-engine/model_engine_server/core/config.py— the interaction between removing the on-prem guard and the lack of on-prem inference ininfer_registry_typeis the primary risk area.Important Files Changed
?token URL issue. Minor: synchronousrequestscalls in an async context match the existing pattern used by other repositories.docker_registry_typeoptional field to_InfraConfigand theinfer_registry_typehelper; ECR, ACR (usingin), and GAR detection are correct, but there is no inference path for"onprem"which creates a silent regression for on-prem deployments.GenericDockerRepositoryimport added.GARDockerRepositoryandGenericDockerRepositoryimported, registry logic unified withdependencies.py. Minor:infer_registry_typeis imported in a separate statement instead of being combined with the existinginfra_configimport.cloud_provider == "onprem"guard is a silent breaking change: on-prem deployments that have not addeddocker_registry_type: onpremwill fall through toGenericDockerRepositoryand fail image-existence checks.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Registry Selection] --> B{docker_registry_type\nexplicitly set?} B -- Yes --> C[Use configured type] B -- No --> D[infer_registry_type\ndocker_repo_prefix] D --> E{prefix contains\n.dkr.ecr. + .amazonaws.com?} E -- Yes --> F[ecr] E -- No --> G{prefix contains\n.azurecr.io?} G -- Yes --> H[acr] G -- No --> I{prefix contains\n-docker.pkg.dev?} I -- Yes --> J[gar] I -- No --> K[generic] C --> L{registry_type value} F --> L H --> L J --> L K --> L L -- ecr --> M[ECRDockerRepository] L -- acr --> N[ACRDockerRepository] L -- gar --> O[GARDockerRepository] L -- onprem --> P[OnPremDockerRepository\nalways returns True] L -- generic/other --> Q[GenericDockerRepository\nOCI V2 HTTP API] Q --> R{HEAD /v2/repo/manifests/tag} R -- 200 --> S[image exists: True] R -- 401 --> T[Fetch Bearer token\nvia Www-Authenticate realm] T --> U{Token obtained?} U -- Yes --> V[Retry HEAD with\nAuthorization header] V -- 200 --> S V -- other --> W[image exists: False] U -- No --> W R -- other --> WComments Outside Diff (1)
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py, line 387-397 (link)Silent breaking change for on-prem deployments
The
cloud_provider == "onprem"early-return guard has been removed, andinfer_registry_typehas no way to infer"onprem"from a prefix (it only detects ECR/ACR/GAR patterns, otherwise returning"generic"). Any on-prem deployment whose config hascloud_provider: onprembut has not yet addeddocker_registry_type: onpremwill now silently:GenericDockerRepository(instead ofOnPremDockerRepository).image_existscheck.RequestException, causing the check to returnFalseand raiseDockerImageNotFoundException— preventing new LLM endpoints from being created.OnPremDockerRepository.image_exists()always returnsTrue, so the old behaviour was intentional. The PR description acknowledges the migration requirement, but this is effectively a breaking change for any on-prem user who upgrades without updating their config.Consider adding a transitional safety net: if
docker_registry_typeis unset andcloud_provider == "onprem", fall back to"onprem"rather than"generic":Or at minimum ensure the operator-facing upgrade notes are surfaced as a warning log at startup when
cloud_provider == "onprem"anddocker_registry_typeis unset.Prompt To Fix All With AI
Last reviewed commit: ed9d942