Skip to content

Make docker registry independent of deployed environment#775

Open
lukasewecker wants to merge 7 commits intomainfrom
lukas/docker-registry
Open

Make docker registry independent of deployed environment#775
lukasewecker wants to merge 7 commits intomainfrom
lukas/docker-registry

Conversation

@lukasewecker
Copy link
Collaborator

@lukasewecker lukasewecker commented Mar 10, 2026

Pull Request Summary

Decouple Docker registry from cloud provider

Currently, the Docker registry used for pulling inference framework images (vLLM, TGI, etc.) is hardcoded to the cloud
provider: ECR on AWS, ACR on Azure, GAR on GCP. This prevents using alternative registries like public ECR, Docker Hub, or
GHCR regardless of where the service is deployed.

Changes

  • Add GenericDockerRepository — a registry-agnostic implementation that uses the standard Docker Registry V2 HTTP API
    to validate image existence. Handles anonymous access and Www-Authenticate bearer token negotiation per the OCI
    distribution spec.

  • Add docker_registry_type config field — optional override ("ecr", "acr", "gar", "generic", "onprem"). When
    unset, the registry type is inferred from docker_repo_prefix:

    • *.dkr.ecr.*.amazonaws.com → ECR
    • *.azurecr.io → ACR
    • *-docker.pkg.dev → GAR
    • Everything else (public ECR, Docker Hub, GHCR, etc.) → generic
  • Unify registry selection in both api/dependencies.py and entrypoints/k8s_cache.py to use the same inference logic.

  • Remove redundant on-prem bypass in check_docker_image_exists_for_image_tagOnPremDockerRepository.image_exists()
    already returns True, so the cloud_provider == "onprem" guard was unnecessary.

Backward compatibility

Existing deployments require no config changes. Private ECR, ACR, and GAR prefixes are auto-detected and routed to their
existing native implementations. On-prem deployments that previously relied on the cloud_provider == "onprem" check should
set docker_registry_type: "onprem" explicitly.

Usage

To use public ECR images on any cloud provider:

config:
  values:
    infra:
      docker_repo_prefix: "public.ecr.aws/your-alias"
      # docker_registry_type is inferred as "generic" automatically

Test Plan and Usage Guide

Ran tests, and added model-engine/tests/unit/infra/repositories/test_generic_docker_repository.py.

Greptile Summary

This PR decouples the Docker registry implementation from the cloud provider by introducing a GenericDockerRepository that uses the OCI Distribution V2 HTTP API, and a new docker_registry_type config field with auto-inference from the docker_repo_prefix. The changes unify registry selection logic across both api/dependencies.py and entrypoints/k8s_cache.py.

Key changes:

  • New GenericDockerRepository — correctly splits docker_repo_prefix into registry host + path prefix, handles anonymous and Bearer token auth flows, and correctly builds manifest URLs for registries like public.ecr.aws with path components.
  • infer_registry_type helper — auto-detects ECR, ACR (using substring match, consistent with ECR/GAR), and GAR from the prefix; falls back to "generic" for all other registries.
  • k8s_cache.py unified — GAR branch added, now mirrors dependencies.py exactly.
  • On-prem guard removedcloud_provider == "onprem" early-return removed from check_docker_image_exists_for_image_tag; on-prem deployments must now explicitly set docker_registry_type: "onprem" or they will silently fall through to GenericDockerRepository, which makes live HTTP requests and will fail in airgapped environments.
  • Minor cosmetic cleanups (tuple unpacking, comment formatting in unrelated files).

Confidence Score: 3/5

  • Safe for new deployments but has a silent breaking change for existing on-prem deployments that rely on cloud_provider: onprem without setting docker_registry_type: onprem.
  • The core logic (URL construction, token auth, registry detection) is sound and well-tested. The main risk is the removal of the cloud_provider == "onprem" guard with no backward-compatible inference fallback, which can silently break on-prem image validation checks on upgrade.
  • model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py and model-engine/model_engine_server/core/config.py — the interaction between removing the on-prem guard and the lack of on-prem inference in infer_registry_type is the primary risk area.

Important Files Changed

Filename Overview
model-engine/model_engine_server/infra/repositories/generic_docker_repository.py New registry-agnostic implementation using the OCI Distribution V2 API; correctly splits the prefix into registry host + path prefix, handles anonymous and Bearer token auth, and fixes the double-? token URL issue. Minor: synchronous requests calls in an async context match the existing pattern used by other repositories.
model-engine/model_engine_server/core/config.py Adds docker_registry_type optional field to _InfraConfig and the infer_registry_type helper; ECR, ACR (using in), and GAR detection are correct, but there is no inference path for "onprem" which creates a silent regression for on-prem deployments.
model-engine/model_engine_server/api/dependencies.py Registry selection switched from cloud_provider-based to registry_type-based; all five branches (ECR, ACR, GAR, onprem, generic) are present and correctly ordered. GenericDockerRepository import added.
model-engine/model_engine_server/entrypoints/k8s_cache.py GAR branch added, GARDockerRepository and GenericDockerRepository imported, registry logic unified with dependencies.py. Minor: infer_registry_type is imported in a separate statement instead of being combined with the existing infra_config import.
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py Removal of the cloud_provider == "onprem" guard is a silent breaking change: on-prem deployments that have not added docker_registry_type: onprem will fall through to GenericDockerRepository and fail image-existence checks.
model-engine/tests/unit/infra/repositories/test_generic_docker_repository.py Good test coverage for the new repository: 200/404/401 flows, token auth negotiation, connection errors, and URL construction. Tests also validate the corrected manifest URL format with path-prefixed registries.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Registry Selection] --> B{docker_registry_type\nexplicitly set?}
    B -- Yes --> C[Use configured type]
    B -- No --> D[infer_registry_type\ndocker_repo_prefix]
    D --> E{prefix contains\n.dkr.ecr. + .amazonaws.com?}
    E -- Yes --> F[ecr]
    E -- No --> G{prefix contains\n.azurecr.io?}
    G -- Yes --> H[acr]
    G -- No --> I{prefix contains\n-docker.pkg.dev?}
    I -- Yes --> J[gar]
    I -- No --> K[generic]
    C --> L{registry_type value}
    F --> L
    H --> L
    J --> L
    K --> L
    L -- ecr --> M[ECRDockerRepository]
    L -- acr --> N[ACRDockerRepository]
    L -- gar --> O[GARDockerRepository]
    L -- onprem --> P[OnPremDockerRepository\nalways returns True]
    L -- generic/other --> Q[GenericDockerRepository\nOCI V2 HTTP API]
    Q --> R{HEAD /v2/repo/manifests/tag}
    R -- 200 --> S[image exists: True]
    R -- 401 --> T[Fetch Bearer token\nvia Www-Authenticate realm]
    T --> U{Token obtained?}
    U -- Yes --> V[Retry HEAD with\nAuthorization header]
    V -- 200 --> S
    V -- other --> W[image exists: False]
    U -- No --> W
    R -- other --> W
Loading

Comments Outside Diff (1)

  1. model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py, line 387-397 (link)

    Silent breaking change for on-prem deployments

    The cloud_provider == "onprem" early-return guard has been removed, and infer_registry_type has no way to infer "onprem" from a prefix (it only detects ECR/ACR/GAR patterns, otherwise returning "generic"). Any on-prem deployment whose config has cloud_provider: onprem but has not yet added docker_registry_type: onprem will now silently:

    1. Fall through to GenericDockerRepository (instead of OnPremDockerRepository).
    2. Make live HTTP requests to an internal or airgapped registry for every image_exists check.
    3. Likely fail with a RequestException, causing the check to return False and raise DockerImageNotFoundException — preventing new LLM endpoints from being created.

    OnPremDockerRepository.image_exists() always returns True, so the old behaviour was intentional. The PR description acknowledges the migration requirement, but this is effectively a breaking change for any on-prem user who upgrades without updating their config.

    Consider adding a transitional safety net: if docker_registry_type is unset and cloud_provider == "onprem", fall back to "onprem" rather than "generic":

    def infer_registry_type(prefix: str, cloud_provider: Optional[str] = None) -> str:
        if cloud_provider == "onprem":
            return "onprem"
        if ".dkr.ecr." in prefix and ".amazonaws.com" in prefix:
            return "ecr"
        ...

    Or at minimum ensure the operator-facing upgrade notes are surfaced as a warning log at startup when cloud_provider == "onprem" and docker_registry_type is unset.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: model-engine/model_engine_server/entrypoints/k8s_cache.py
Line: 43

Comment:
**Split import from the same module**

`infer_registry_type` is imported as a separate statement at line 43, but `infra_config` is already imported from the same module at line 17. Python style (PEP 8) and most linters expect a single `from module import a, b` when pulling multiple names from the same module. Having two separate import blocks from `model_engine_server.core.config` is inconsistent.

```suggestion
from model_engine_server.core.config import infer_registry_type, infra_config
```
(And remove the duplicate `from model_engine_server.core.config import infra_config` from line 17.)

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
Line: 387-397

Comment:
**Silent breaking change for on-prem deployments**

The `cloud_provider == "onprem"` early-return guard has been removed, and `infer_registry_type` has no way to infer `"onprem"` from a prefix (it only detects ECR/ACR/GAR patterns, otherwise returning `"generic"`). Any on-prem deployment whose config has `cloud_provider: onprem` but has not yet added `docker_registry_type: onprem` will now silently:

1. Fall through to `GenericDockerRepository` (instead of `OnPremDockerRepository`).
2. Make live HTTP requests to an internal or airgapped registry for every `image_exists` check.
3. Likely fail with a `RequestException`, causing the check to return `False` and raise `DockerImageNotFoundException` — preventing new LLM endpoints from being created.

`OnPremDockerRepository.image_exists()` always returns `True`, so the old behaviour was intentional. The PR description acknowledges the migration requirement, but this is effectively a breaking change for any on-prem user who upgrades without updating their config.

Consider adding a transitional safety net: if `docker_registry_type` is unset *and* `cloud_provider == "onprem"`, fall back to `"onprem"` rather than `"generic"`:

```python
def infer_registry_type(prefix: str, cloud_provider: Optional[str] = None) -> str:
    if cloud_provider == "onprem":
        return "onprem"
    if ".dkr.ecr." in prefix and ".amazonaws.com" in prefix:
        return "ecr"
    ...
```

Or at minimum ensure the operator-facing upgrade notes are surfaced as a warning log at startup when `cloud_provider == "onprem"` and `docker_registry_type` is unset.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: ed9d942

Copy link
Collaborator

@dmchoiboi dmchoiboi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm; maybe address greptile comments

@lukasewecker lukasewecker enabled auto-merge (squash) March 12, 2026 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants