Declarative checkpoint config conversion #508

Open

jlamypoirier wants to merge 20 commits into main from jlp_simplify_conversion

Declarative checkpoint config conversion#508
jlamypoirier wants to merge 20 commits into
mainfrom
jlp_simplify_conversion

Conversation


@jlamypoirier jlamypoirier commented May 5, 2026

Summary

Replaces the imperative import_config / export_config methods on checkpoint converter classes with declarative ConfigConverter primitives keyed off FieldHint.architecture. Two recursive coverage walkers — one over Fast-LLM architecture fields, one over the HF dict — guarantee that every architecture-significant field and every HF dict key is accounted for on import and export. Unsupported feature detection is driven from a single source instead of scattered per-converter asserts.

What's in

  • Primitives (fast_llm/engine/checkpoint/external.py): Rename, ConstantImport, ConstantExport, Optional, Default, Ignored, ImportOnly, Custom, plus the recursive Nested / Dispatch / TypedDictContainer. Each declares the Fast-LLM paths and HF paths it consumes so the walkers can verify coverage on both sides.
  • ConfigSectionConverter ABC: cached _create_config_converters per class (subclasses extend the parent's dict by re-declaring keys), _validate_export hook for format-specific cross-field invariants, and recursive check_architecture_coverage / check_hf_coverage.
  • Architecture-hint reclassification (eight fields): attention dense_layer and softmax_scale_power, MLP activation, MoE router, Llama3 / Yarn rotary scaling parameters, StochasticMixer.main_mixer_name, vision patch_height / patch_width, and PatternBlockSequenceConfig.blocks. These are what the coverage walkers now require to be consumed.
  • GPT-side migration: llama, mistral, qwen2, mtp_llama, mixtral, diffusion_dream, diffusion_llama, apriel2 (text), and the apriel hybrid-SSM mixers (Mamba, GatedDeltaNet, KimiDeltaAttention).
  • Multimodal partial migration: Pixtral normalization and patch embeddings.
  • Static test (tests/models/test_converters.py): walks every registered handler's converter tree, runs check_architecture_coverage on each section node, and validates that every OptionalConfigConverter sentinel matches the field's resolved default (so an exported sentinel-equal value can't silently drift on re-import).
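To make the shape concrete, here is a deliberately simplified sketch of the declarative pattern. The class names and signatures below are modeled loosely on the primitives listed above; they are not the actual fast_llm API, which carries richer signatures (hf_paths, recursion flags, coverage bookkeeping).

```python
# Hypothetical, simplified sketch of the declarative converter shape.
import dataclasses


@dataclasses.dataclass
class Rename:
    """1:1 path rename between a Fast-LLM field and an HF config key."""
    fast_llm_path: str
    hf_path: str

    def export(self, fast_llm_config: dict, hf_config: dict) -> None:
        hf_config[self.hf_path] = fast_llm_config[self.fast_llm_path]

    def import_(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_path] = hf_config[self.hf_path]


@dataclasses.dataclass
class ConstantImport:
    """Assert the value on export, inject it on import; no HF write."""
    fast_llm_path: str
    value: object

    def export(self, fast_llm_config: dict, hf_config: dict) -> None:
        assert fast_llm_config[self.fast_llm_path] == self.value

    def import_(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_path] = self.value


# A converter table keyed by Fast-LLM field name (illustrative fields only).
CONVERTERS = {
    "heads": Rename("heads", "num_attention_heads"),
    "add_linear_biases": ConstantImport("add_linear_biases", False),
}


def export_config(fast_llm_config: dict) -> dict:
    hf_config: dict = {}
    for converter in CONVERTERS.values():
        converter.export(fast_llm_config, hf_config)
    return hf_config


def import_config(hf_config: dict) -> dict:
    fast_llm_config: dict = {}
    for converter in CONVERTERS.values():
        converter.import_(hf_config, fast_llm_config)
    return fast_llm_config
```

The point of the shape: because each declaration names the paths it consumes, a walker can check the table for coverage statically instead of trusting each imperative `export_config` body to remember every field.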

Notable shape decisions

  • Coverage check is type-strict (type(config) is cls.fast_llm_config_class). Strict subclasses defer to a more specific converter, which kept partially migrated callers working through super() during the multi-step migration.
  • HF-side coverage walker consumes prefixes registered by every declaration, descending through Nested / Dispatch sub-converters under their hf_path. IgnoredConfigConverter(hf_paths=...) is the explicit opt-out for HF-only fields with no Fast-LLM counterpart (Mixtral router-runtime toggles, Qwen2 sliding-window, …).
  • _create_config_converters is cached with functools.cache. Subclasses must return a fresh dict ({**super()._create_config_converters(), ...}) — mutating the parent's returned dict would corrupt its cache entry. Documented on the base method.
  • IgnoredConfigConverter is recursive (recurses=True). Used for sub-configs with no architecture leaves (ParameterConfig, Mixtral's router sub-config) and for HF-only fields. Non-architecture fields (lr_scale, apply_peft, initialization sub-config) are by design not part of the HF round-trip; Fast-LLM keeps them on the in-memory config independently.
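The cache-mutation hazard from the bullets above can be demonstrated in isolation (generic names; the real base class carries more machinery than this):

```python
import functools


class Base:
    @classmethod
    @functools.cache  # keyed by cls: one cached dict per class
    def _create_config_converters(cls) -> dict:
        return {"heads": "rename"}


class Child(Base):
    @classmethod
    @functools.cache
    def _create_config_converters(cls) -> dict:
        # Correct: return a fresh dict that spreads the parent's entries.
        # The tempting `out = super()._create_config_converters();
        # out["window_size"] = ...` pattern would mutate the very object
        # stored in the parent's cache, since functools.cache hands back
        # the same dict on every subsequent call.
        return {**super()._create_config_converters(), "window_size": "rename"}
```

With the spread pattern, the parent's cached dict stays untouched no matter how many subclasses extend it.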

What's deferred

  • Multimodal Llava + multimodal Apriel2 imperative overrides (Apriel2VisionAttention/Block/Encoder, LlavaVisionModel*). They inherit ConfigSectionConverter and pass the static check via inherited declarations, but runtime conversion is hand-rolled — the proper fix is the deferred multimodal migration step.
  • Weight-converter declarative refactor. The weight side still uses today's WeightConverter subclasses (SplitWeightConverter, MLPLayer2Converter, KeyValueWeightConverter, …).
  • A handful of structural follow-ups recorded for future review rounds: IgnoredConfigConverter as a _ignored_fields ClassVar, get_converters signature uniformity, LlamaDecoderConverter as a ConfigSectionConverter, IgnoredConfigConverter default-round-trip maintenance test.

Test plan

  • pytest -v -n 6 tests/models/test_checkpoint.py tests/models/test_hf_roundtrip.py tests/models/test_converters.py: 177 passed, 93 skipped.
  • pytest -v -n 6 fast_llm_external_models/tests/: 371 passed, 1780 skipped (separate invocation per CLAUDE.md).
  • Manual smoke: fast-llm convert --input.format <fmt> --input.path <ref> --output.format <fmt> --output.path <tmp>; reload both and compare configs.

🤖 Generated with Claude Code

jlamypoirier and others added 3 commits May 5, 2026 18:33
Eight config fields whose values directly affect model architecture were
previously tagged feature/core/(none); they are now FieldHint.architecture.
They drive the upcoming declarative-converter coverage check, which uses
FieldHint.architecture as the source of truth for "must be handled by every
checkpoint format".

- AttentionConfig.dense_layer (output projection presence)
- AttentionConfig.softmax_scale_power (attention scaling)
- MLPConfig.activation (forward-pass activation type)
- MoEMLPConfig.router (routing weights drive token assignment)
- Llama3RotaryConfig: scale_factor, low_frequency_factor,
  high_frequency_factor, original_context_length
- YarnRotaryConfig: scale_factor, attention_factor, beta_fast, beta_slow,
  original_context_length
- StochasticMixerConfig.main_mixer_name (selects inference mixer)
- PatchEmbeddingsConfig.patch_height/patch_width (input tokenization)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reintroduces the declarative config-conversion shape that pre-dated PR #362,
applied within the post-#362 modular per-section structure. Replaces the
imperative import_config/export_config bodies with a small set of named
primitives and a recursive walker driven by per-section declarations.

Primitives in fast_llm.engine.checkpoint.external:
- RenameConfigConverter — 1:1 path rename
- ConstantExportConfigConverter — write constant on export, assert on import
- ConstantImportConfigConverter — assert on export, inject on import
- DefaultConfigConverter — rename with HF-side fallback
- OptionalConfigConverter — emit/import only when non-sentinel
- IgnoredConfigConverter — declare a field as intentionally not converted
- CustomConfigConverter — escape hatch for cross-field transforms
- NestedConfigConverter — recurse into a fixed-typed sub-config; flat-merges
  HF output into the parent (transformer side is assumed flat)
- DispatchConfigConverter — runtime type dispatch for polymorphic sub-configs
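A toy rendition of the Nested primitive's flat-merge behavior (all names here are hypothetical stand-ins, not the fast_llm classes):

```python
class SubConverter:
    """Stand-in for a sub-config's section converter (hypothetical)."""

    @staticmethod
    def export_config(sub_config: dict) -> dict:
        return {"rms_norm_eps": sub_config["epsilon"]}


class Nested:
    """Recurse into a fixed-typed sub-config. With hf_path=None the
    sub-converter's HF output is flat-merged into the parent dict
    (the transformer side is assumed flat); with hf_path set it is
    placed under that nested key instead."""

    def __init__(self, fast_llm_path, sub_converter, hf_path=None):
        self.fast_llm_path = fast_llm_path
        self.sub_converter = sub_converter
        self.hf_path = hf_path

    def export(self, fast_llm_config: dict, hf_config: dict) -> None:
        sub_hf = self.sub_converter.export_config(fast_llm_config[self.fast_llm_path])
        if self.hf_path is None:
            hf_config.update(sub_hf)  # flat merge into the parent
        else:
            hf_config[self.hf_path] = sub_hf  # nested placement
```

Flat merge matches Llama-style HF configs (`rms_norm_eps` lives at the root); the `hf_path` variant matches nested schemas like Apriel2's.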

ConfigSectionConverter is the per-Fast-LLM-class converter base. Subclasses
declare their conversion via _create_config_converters() and inherit
import_config/export_config concretely. The architecture-coverage check fires
only when type(config) exactly matches the converter's declared
fast_llm_config_class — strict subclass types defer to a more specific
converter, allowing yet-to-be-migrated subclasses (e.g., Mixtral on Llama)
to call super().export_config() without tripping the parent's check on
fields the parent doesn't know about.

The walker is implicit: NestedConfigConverter / DispatchConfigConverter
call the public import_config/export_config on the sub-converter class so
subclass overrides participate, rather than a private path that bypasses
them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pilot of the new ConfigSectionConverter framework. Each Llama section
converter (Normalization/MLP/Attention/Block/Embeddings/Head/BaseModel)
now declares its conversion via _create_config_converters() instead of
imperative import_config/export_config bodies. Weight side is unchanged.

Notable shape decisions:
- LlamaDecoderConverter stays as a regular (imperative) class because
  Fixed/Pattern block-sequence dispatch doesn't lend itself to the
  declarative shape. LlamaBaseModelConverter wires it in via a small
  CustomConfigConverter; subclasses (Mistral, Qwen2, MTP-Llama, ...)
  continue to plug in different block converters via block_converter_class.
- _check_config is retained as an overridable classmethod and called from
  the linear_layers CustomConfigConverter, so Qwen2 can keep its
  asymmetric Q/K/V bias rule without re-implementing the export.
- IgnoredConfigConverter is used for ParameterConfig sub-fields with no
  architecture-significant content (weight, output_weight, word_embeddings),
  and for prediction_heads (which Llama HF doesn't expose; subclass
  MTP-Llama adds it imperatively).
- peft uses CustomConfigConverter to assert NoPeftConfig on export. Llama
  HF format cannot represent PEFT, so a configured LoRA now fails loudly
  rather than being silently dropped.
- Rotary remains in CustomConfigConverter — the v4/v5 transformers split
  (rope_theta/rope_scaling vs. rope_parameters) and three rope_type
  variants don't fit pure rename primitives.

Verified with live round-trips of Llama-3, Qwen2, Mistral, Mixtral, and
MTP-Llama HF configs, plus tests/models/test_checkpoint.py for all GPT
formats (139 passed, 0 failed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jlamypoirier jlamypoirier force-pushed the jlp_simplify_conversion branch from 5567a71 to 0c406db on May 5, 2026 22:33
jlamypoirier and others added 16 commits May 6, 2026 07:14
Adds `_validate_export(cls, config)` classmethod hook on `ConfigSectionConverter`,
called automatically from `export_config` after the architecture-coverage check.
Replaces five `CustomConfigConverter`-as-validator blocks (`linear_layers`/`layers`
in attention and MLP, `position_embeddings` in embeddings, `peft` in base model,
plus the `_check_config` chain on attention) with `IgnoredConfigConverter` for
field-claiming + small `_validate_export` overrides. Mistral and Qwen2 rename
their `_check_config` overrides accordingly; Pixtral's imperative export updates
its `cls._check_config(config)` call site.

Also addresses several reviewer-flagged correctness/cleanup items:

- Drop the half-removed `parent_context` parameter from every primitive's
  `import_to` signature (and from `CustomConfigConverter`'s `import_fn`). It was
  unreachable through the walker.
- `_check_architecture_coverage` now reads `cls.fast_llm_config_class` directly
  instead of `getattr(..., None)`, surfacing missing class-attribute declarations
  as `AttributeError` rather than silently disabling the safety net.
- Drop the unused `hf_paths` parameter from `CustomConfigConverter.__init__`. There
  is no symmetric HF-side coverage check yet, so the field was cosmetic.
- Add a TODO note in `_check_architecture_coverage` documenting that the
  `MoEMLPConfig`/`MambaConfig`/etc. safety net is gated on later migrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dict of named per-block configs is unambiguously architecture
metadata; without an explicit hint it defaulted to `unknown`, hiding
it from the architecture-coverage check used by declarative checkpoint
converters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two additions, both required by Apriel2's nested HF schema:

- `NestedConfigConverter` gains an optional `hf_path` kwarg. When set,
  the sub-converter's output is placed under that nested key instead
  of being flat-merged. Existing flat-merge behavior is unchanged when
  `hf_path` is omitted.
- New `TypedDictContainerConfigConverter` for `dict[str, Config]`
  fields where each entry is round-tripped through a per-class
  section converter. Polymorphic dispatch via the entry's runtime
  type on export and the HF discriminator on import. A homogeneous
  mode (single registered class with `hf_type_name = None`) skips
  the discriminator entirely.

Both `DispatchConfigConverter` and `TypedDictContainerConfigConverter`
now also inject the Fast-LLM `dynamic_type_name` discriminator into
the imported sub-dict so the parent's `from_dict` dispatches to the
right `Config` subclass without a separate ConstantImport.
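The discriminator injection can be sketched roughly as follows (simplified, assumed names and shapes):

```python
class AttentionSubConverter:
    """Stand-in for a registered per-class section converter (hypothetical)."""

    @staticmethod
    def import_config(hf_sub_config: dict) -> dict:
        return {"heads": hf_sub_config["num_heads"]}


class Dispatch:
    """Runtime type dispatch for a polymorphic sub-config (sketch).

    registry maps an HF discriminator string to a pair of
    (sub-converter, Fast-LLM dynamic_type_name)."""

    def __init__(self, registry: dict):
        self.registry = registry

    def import_(self, hf_sub_config: dict) -> dict:
        sub_converter, dynamic_type_name = self.registry[hf_sub_config["type"]]
        imported = sub_converter.import_config(hf_sub_config)
        # Inject the Fast-LLM discriminator so the parent's from_dict
        # dispatches to the right Config subclass without each leaf
        # converter needing a separate ConstantImport declaration.
        imported["type"] = dynamic_type_name
        return imported
```

The same injection in the typed-dict container means a heterogeneous `mixers.{name}` dict round-trips through `from_dict` with no per-entry boilerplate.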

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stress-tests the framework's polymorphic dispatch and typed-dict
support: Apriel2's HF schema is nested (`decoder.block.mixer.{...}`,
`head.normalization`, `mixers.{name}`) and the mixer field is
heterogeneously polymorphic (Attention/Mamba/StochasticMixer/GDN/KDA).

Migrated converters: per-mixer (Attention/Mamba/GDN/KDA), the
StochasticMixer container (driven by TypedDictContainer over a
leaf-mixer registry), per-normalization (RMS/LayerNorm/NoNorm), MLP,
Block, Fixed/Pattern decoder variants (selected by Dispatch on
runtime BlockSequenceConfig type), Head, and BaseModel.

The imperative weight-side `get_converters` methods are preserved
unchanged so the multimodal Apriel2 converter (which inherits from
the text-only one) keeps working without modification.

PatternDecoder's `blocks` dict uses the homogeneous mode of
TypedDictContainer (single-class registry, no discriminator). The
attention rotary-type translation (default ↔ mistral_1d) and Mamba's
auxiliary HF fields (d_conv, conv_bias, dt_proj_bias derived from
linear-config bias flags) remain on `CustomConfigConverter` since
they're shape-changing transforms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…primitives

Each format inherits Llama's `_create_config_converters` and replaces only the
fields that diverge:
  * Mistral: ConstantImportConfigConverter pinning `add_linear_biases=False` for
    attention and MLP (HF format has no `attention_bias`/`mlp_bias`); rename
    `window_size` <-> `sliding_window`.
  * Qwen2: ConstantImportConfigConverter for `add_linear_biases`; CustomConfigConverter
    for `head_size` (no `head_dim` HF field, derive on import); CustomConfigConverter
    for per-layer biases (always Q/K/V=True, dense=False); the head_dim relationship
    `heads * head_size == hidden_size` moves to `_validate_export` on the base-model
    converter; the use_mrope guard moves to `import_config`.
  * MTP-Llama: RenameConfigConverter for `prediction_heads` (Llama blanket-ignores it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`MixtralMLPConverter` switches its `fast_llm_config_class` to `MoEMLPConfig` so the
architecture-coverage check sees MoE-specific fields. The config-side overrides:
  * `add_linear_biases` -> ConstantImportConfigConverter (Mixtral has no `mlp_bias`).
  * `experts` <-> `num_local_experts` and `experts_per_token` <-> `num_experts_per_tok`
    via RenameConfigConverter.
  * `shared_experts=0` and `routing=topk` pinned via ConstantImportConfigConverter so
    they round-trip cleanly without an HF representation.
  * `router` covered by IgnoredConfigConverter (Mixtral's gate is a default `LinearConfig`).
The Fast-LLM dynamic-type discriminator (`type: "moe"`) is injected via an `import_config`
override since the MLP is wrapped via `NestedConfigConverter` rather than `DispatchConfigConverter`.

Diffusion-Dream and Diffusion-Llama need no migration: they only override `architecture`,
`get_transformers_configuration_class`, and `_export_config` (auto_map). They inherit the
declarative converters from their parents (Qwen2 and Llama).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itives

`AprielMambaConverter`, `GatedDeltaNetConverter`, and `KimiDeltaAttentionConverter` become
`ConfigSectionConverter` subclasses with their HF-side fields nested under the appropriate
HF subkey (`ssm_cfg` for Mamba, `linear_attn_config` for GDN/KDA).

Mamba's three sibling-default fields (`d_inner`, `d_xb`, `dt_rank`) read the HF root's
`hidden_size` directly via `DefaultConfigConverter.hf_default_fn` / `CustomConfigConverter`,
removing the need for an explicit `parent_context` plumbing through the framework. The
per-layer convolution and dt biases use `CustomConfigConverter` to pick up the mixer-wide
`add_linear_biases` fallback when unset; the existing `_check_config` per-layer assertions
move to `_validate_export`.
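The sibling-default idea — an HF-side fallback computed from another HF field rather than threaded through a parent context — might be sketched like this (hypothetical names; the `2 * hidden_size` rule is purely illustrative):

```python
class Default:
    """Rename with an HF-side fallback: on import, when the HF key is
    absent, derive the value from the HF root via hf_default_fn (sketch)."""

    def __init__(self, fast_llm_path, hf_path, hf_default_fn):
        self.fast_llm_path = fast_llm_path
        self.hf_path = hf_path
        self.hf_default_fn = hf_default_fn

    def import_(self, hf_root: dict, hf_sub: dict, fast_llm_config: dict) -> None:
        if self.hf_path in hf_sub:
            fast_llm_config[self.fast_llm_path] = hf_sub[self.hf_path]
        else:
            # Fallback reads the HF root directly, so no parent_context
            # needs to be plumbed through the walker.
            fast_llm_config[self.fast_llm_path] = self.hf_default_fn(hf_root)


# Mamba-flavored example: d_inner defaults to twice the root hidden_size.
d_inner = Default("d_inner", "d_inner", lambda root: 2 * root["hidden_size"])
```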

`AprielBlockConverter` (the per-block dispatcher) and `AprielDecoderConverter` (the
`hybrid_block_layout` driver) stay imperative because Apriel's HF format encodes the
mixer type in a parent-level list rather than a per-block discriminator, which
`DispatchConfigConverter` doesn't model. The `type: "mamba"`/`"gdn"`/`"kda"` Fast-LLM
discriminator is injected via a one-line `import_config` override on each leaf converter
(same pattern Mixtral uses).

The HF format has no test coverage in `tests/models/test_checkpoint.py` or
`tests/models/test_hf_roundtrip.py`, so verification was a synthesized live round-trip
covering each mixer leaf plus a hybrid attention+Mamba pattern decoder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…larative primitives

`PixtralNormalizationConverter` collapses to a single `_create_config_converters` override
that pins `epsilon=1e-5` via `ConstantImportConfigConverter` (asserts on export, injects
on import; no HF write). `PixtralEmbeddingsConverter` becomes a `ConfigSectionConverter`
with declarations for `patch_height` (rename to `patch_size`), `patch_width` (mirror
`patch_size` on import), `num_channels` (export-only constant 3), nested `normalization`,
and an `IgnoredConfigConverter` for `patch_embeddings`. The `patch_height == patch_width`
and `patch_embeddings.bias.enabled in (None, False)` checks move to `_validate_export`.

The remaining Llava and Apriel2 multimodal converters stay imperative: they're cross-section
aggregators (vision_config + text_config + top-level merge) whose shape doesn't fit a single
ConfigSectionConverter, often with parent-context dependencies (e.g., the adapter's
intermediate_size derives from the text model's hidden_size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`CopyWeightConverter` was defined in `external.py` but never instantiated; deleted.
`QueryWeightConverter` was a no-op identity (its `export_weight`/`import_weight` just
unwrap and rewrap); replaced with the default `WeightConverter` at all three call
sites (Llama, Qwen2, Apriel2 attention) and removed the redundant `config` arg.

The broader weight-side refactor (declarative `WeightConverter` primitives, walker-driven
`drop_on_export` removal) is deferred — out of scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix asymmetric round-trip in `Apriel2MambaConverter`: the `aux` declaration's import_fn
  now reads `d_conv` / `conv_bias` / `dt_proj_bias` back into `convolution_layer.kernel_size`,
  `convolution_layer.bias.enabled`, and `dt_layer.bias.enabled`. Previously these HF fields
  were dropped on import, which silently masked HF conv1d/dt_proj bias weights when they
  diverged from the mixer-wide `add_linear_biases` flag (parallel to the apriel.py mamba
  migration earlier in this PR).
- Drop the stale TODO from `_check_architecture_coverage`'s docstring (the migrations it
  referred to have all landed in this PR); reword the surrounding comment to describe
  the current strict-subtype handling.
- Combine adjacent f-strings in `DispatchConfigConverter`'s import-error message.
- Hoist `StochasticMixerSamplingStrategy` to the module-level import in `apriel2.py`;
  it was being re-imported on every `_create_config_converters` call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Recursive architecture-coverage walker (item 1): the section-level check now
  collects every architecture-hint path under the active config tree and matches
  each against the declarations. Recursive primitives
  (Nested/Dispatch/TypedDictContainer/Ignored, plus Custom/ImportOnly when the
  author opts in) cover whole subtrees by prefix; non-recursive ones must list
  every leaf they consume. Fixes the silent-drop class of bug previously masked
  for any sub-config field claimed by a flat CustomConfigConverter.

- Apriel2 rotary export bug fix (motivating leak for item 1): the export now
  emits the Llama3/Yarn scale parameters that round-trip via the pass-through
  import, instead of silently dropping them.

- Pixtral attention migrated to declarative form (item 3): _create_config_converters
  overrides instead of an imperative export_config that bypassed the coverage
  check.

- Apriel2 weight side cleanup (items 5, 6, 12): Apriel2MLPConverter owns its
  weight converters and the block delegates; the imperative Apriel2DecoderConverter
  is gone, replaced by per-shape get_converters on
  Apriel2FixedDecoderConverter / Apriel2PatternDecoderConverter dispatched via
  APRIEL2_DECODER_REGISTRY.

- ImportOnlyConfigConverter primitive (item 11) collapses three asymmetric
  CustomConfigConverter sites in qwen2.py and llava.py.

- Helper consolidation: drop external.py's _get_nested/_has_nested in favour of
  fast_llm.config.get_nested_dict_value (item 7); share assert_no_peft between
  Llama and Apriel2 base-model converters (item 10).
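In spirit, the recursive coverage walk collects every architecture-hint path under the active config tree and matches each against the declarations; here is a toy version (not the fast_llm implementation, which walks real Config objects and FieldHints rather than nested dicts):

```python
def collect_architecture_paths(config: dict, prefix: tuple = ()) -> set:
    """Toy stand-in: treat every leaf of a nested dict as an
    architecture-significant field path."""
    paths = set()
    for key, value in config.items():
        if isinstance(value, dict):
            paths |= collect_architecture_paths(value, prefix + (key,))
        else:
            paths.add(prefix + (key,))
    return paths


def check_coverage(config: dict, leaf_claims: set, subtree_claims: set) -> set:
    """Return the architecture paths no declaration accounts for.

    Non-recursive converters must list every leaf they consume
    (leaf_claims); recursive ones (Nested/Dispatch/TypedDictContainer/
    Ignored) cover whole subtrees by prefix (subtree_claims)."""
    missing = set()
    for path in collect_architecture_paths(config):
        covered = path in leaf_claims or any(
            path[: len(prefix)] == prefix for prefix in subtree_claims
        )
        if not covered:
            missing.add(path)
    return missing
```

A flat Custom converter that claims only `("rotary", "theta")` no longer silently hides a `("rotary", "scale_factor")` leaf — that is exactly the silent-drop class of bug the recursive walker closes.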

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Subtree drops are now visible at the declaration site (review item 1).
  Five Custom converters previously claimed a sub-config subtree via
  ``recurses=True`` while only round-tripping a fraction of its architecture
  leaves; each is now non-recursive (lists every leaf it actually round-trips)
  with sibling ``IgnoredConfigConverter`` entries for the leaves the format
  drops on purpose. Sites: Apriel mamba ``convolution_layer`` and ``dt_layer``,
  Apriel2 GDN ``convolution_layer``, Apriel2 KDA ``convolution_layer`` and
  ``normalization``.

- Architecture-coverage walker now descends into ``dict[str, Config]`` and
  list/tuple-of-Config fields (item 2). Previously masked by
  ``TypedDictContainerConfigConverter.recurses=True``; the walker now matches
  what the docstring claims.

- Coverage error gains a hint when missing paths share a top-level prefix that
  is claimed non-recursively (item 3 — message half only): suggests
  Nested/Dispatch or ``recurses=True`` on Custom/ImportOnly. No new ``recurses``
  kwarg on the base primitives.

- Single ``effective_bias(layer_config, default)`` helper in llama.py replaces
  three near-duplicates (item 4): ``_resolve_bias_enabled`` in apriel.py,
  ``_get_effective_bias`` in apriel2.py, and the inline ternary in
  ``Apriel2MLPConverter``.

- Apriel2 decoder dispatch lookup lifted into module-level
  ``get_apriel2_decoder_converter(decoder)`` (item 6); used by both the text
  and multimodal base-model converters.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Apriel2 decoder converters expose `block_converter_class` ClassVar so subclasses can swap the
  block converter, mirroring the LlamaDecoderConverter polymorphism pattern.
* `_create_config_converters` is memoized via `functools.cache` (keyed by cls), so per-class
  declarations are built once. Convert two `out = super(); out[k] = v` mutation patterns
  (qwen2, llava) to spread+new-dict so the cached parent dict is never mutated.
* `NestedConfigConverter` auto-injects the HF `type` discriminator from the target
  converter's `hf_type_name`, mirroring `DispatchConfigConverter`/`TypedDictContainer`.
  Drops a manual `ConstantExportConfigConverter` from `Apriel2MLPConverter`.
* Move architecture-coverage check to `tests/models/test_converters.py`, parametrized
  per-format. Walks each `HuggingfaceStateDictCheckpointHandler.base_model_converter_class`
  through the modular converter tree (Nested/Dispatch/TypedDict + `*_converter_class` ClassVars)
  and runs `check_architecture_coverage` on each `ConfigSectionConverter` node. The
  per-export runtime invocation is removed.
* Same test verifies `OptionalConfigConverter` sentinels match the resolved field default —
  catches silent round-trip drift if a Fast-LLM default changes.
* Two latent bugs surfaced and fixed by the new test:
  * `apriel.py` GDN/KDA converters were missing `convolution_layer` architecture claims.
  * `Apriel2MambaConverter.d_xb`/`dt_rank` misused `OptionalConfigConverter`
    (sentinel=None on a non-Optional int) - converted to `RenameConfigConverter`.

Deferred to follow-up commit: HF-side coverage check on every import (item 10) - needs
`hf_paths` audit across ~20 Custom/ImportOnly call sites and a flat-merge-aware walker.
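A minimal sketch of the sentinel-vs-default invariant the test enforces (hypothetical converter shape and field table; the real test resolves defaults from the Fast-LLM config classes):

```python
import dataclasses


@dataclasses.dataclass
class OptionalConverter:
    """Emit/import only when the value differs from the sentinel (sketch)."""
    fast_llm_path: str
    hf_path: str
    sentinel: object


# Illustrative stand-in for the resolved Fast-LLM field defaults.
FIELD_DEFAULTS = {"rope_theta": 10000.0}


def check_sentinel_matches_default(converter: OptionalConverter) -> None:
    # If the sentinel drifts from the field's resolved default, a value
    # equal to the sentinel is dropped on export and comes back as the
    # (now different) default on re-import — silent round-trip drift.
    default = FIELD_DEFAULTS[converter.fast_llm_path]
    if converter.sentinel != default:
        raise AssertionError(
            f"{converter.fast_llm_path}: sentinel {converter.sentinel!r} "
            f"!= resolved default {default!r}"
        )
```

Running this statically over every registered converter is what turns a future default change from a silent data bug into a test failure.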

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Symmetric counterpart to the architecture-coverage check (already a test). Walks the HF
config dict at the import boundary and raises on any key not consumed by some declaration
in the converter tree. Catches transformers-version drift, manual edits, and corrupted
configs at the point of import rather than as cryptic downstream failures.

* ``ConfigConverter`` primitives gain a recursive ``_consumed_hf_paths`` walker. Nested/
  Dispatch/TypedDictContainer with a fixed ``hf_path`` claim it as a subtree prefix; their
  flat-merge variants (``hf_path=None``) pull the sub-converter's claims up to the current
  level so a parent's check sees them.
* ``CustomConfigConverter`` / ``ImportOnlyConfigConverter`` gain an ``hf_paths`` kwarg;
  every existing call site is audited and populated. ``IgnoredConfigConverter`` gains an
  ``hf_paths`` kwarg used for HF-only fields Fast-LLM intentionally does not consume
  (Mixtral router toggles, Qwen2 sliding-window machinery, Apriel2's default-injected
  ``embeddings`` subdict from ``Apriel2TextConfig``).
* ``HuggingfaceStateDictCheckpointHandler`` runs the check from ``_import_config`` against
  the base-model converter. A class-level allowlist covers transformers' generic
  ``PretrainedConfig`` fields and inference-only metadata that's always permitted. The
  ``Apriel2`` text handler's override is updated to call the shared ``_check_hf_coverage``
  helper. Non-``ConfigSectionConverter`` base-model converters (Llava aggregators) skip the
  check transparently.
* ``LlamaBaseModelConverter``'s decoder Custom - which wraps the imperative
  ``LlamaDecoderConverter`` - auto-extends its ``hf_paths`` from the block converter's
  ``_consumed_hf_paths``, so Mistral/Mixtral/Qwen2/MTPLlama/Apriel inherit correct
  coverage. ``AprielBlockConverter`` (per-block-type dispatcher, also imperative) gets its
  own ``_consumed_hf_paths`` that unions across registered per-mixer block converters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Apriel2BlockConverter._validate_export asserts type(config.mlp) is MLPConfig,
  restoring the pre-PR rejection of MoEMLPConfig that NestedConfigConverter
  would otherwise silently descend through (dropping experts/routing/router).
- _consumed_hf_paths now expands a nested sub-converter's claims under its
  hf_path prefix (NestedConfigConverter/DispatchConfigConverter with hf_path
  set) so check_hf_coverage descends and flags unknown keys deep inside
  apriel2's head/decoder, llava's vision_config, etc.
- Pin prediction_heads to 1 in Llama and Apriel2 head converters via
  ConstantImportConfigConverter so non-default values fail on export instead
  of silently dropping (MTP-Llama overrides the entry with Rename).
- Document the cache-mutation hazard on _create_config_converters: subclasses
  must spread the parent's dict, never mutate it in place.
- Narrow Apriel2BaseModelConverter's HF embeddings Ignored to the single
  injected leaf so future transformers fields in the same subdict trip the
  coverage check.
- Tighten Mixtral router Ignored comment to record the structural rationale
  (router.weight has no architecture sub-fields, so the blanket claim is
  equivalent to the narrowest possible non-recursive claim).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surface cleanups from the fine-pass review: rename ``cur`` → ``current`` in
``_get_attr_path``, merge an unintentionally split f-string in the
``DispatchConfigConverter`` error path, switch bare ``return`` to ``pass`` in
empty ``-> None`` converter bodies, type-annotate ``_per_layer_bias_export``
and ``get_apriel2_decoder_converter`` (dropping a redundant forward-ref
quote), and replace ``<->`` with ``↔`` in the remaining converter docstrings
for consistency across the migration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jlamypoirier jlamypoirier changed the title Declarative checkpoint config conversion (Llama pilot) Declarative checkpoint config conversion May 12, 2026
Round 6 picks up one latent correctness bug, consolidates duplicated
declarations into framework primitives, and tidies several surface items.

* ``Apriel2HeadConverter._validate_export`` now asserts ``RMSNormalizationConfig``:
  the config side dispatches normalization through ``APRIEL2_NORM_REGISTRY`` while
  the weight side hardcoded RMS, so a LayerNorm/NoNorm head would have silently
  dropped its bias on convert.
* ``ConfigSectionConverter.import_config`` injects ``{"type": <dynamic_type_name>}``
  from ``fast_llm_config_class`` automatically, removing the redundant injection
  from ``NestedConfigConverter`` / ``TypedDictContainerConfigConverter`` and
  collapsing four hand-rolled overrides (Apriel mamba/gdn/kda + Mixtral moe).
* Deleted ``MTPLlamaDecoderConverter`` — its overrides were byte-identical to the
  parent's after the migration, with the only diff being a Pattern restriction
  that the parent now handles correctly through the multi-block-equality branch.
* Extracted ``_per_layer_bias_converter`` and ``_apriel2_conv_kernel_converter``
  helpers in apriel2.py to collapse pairs of byte-identical CustomConfigConverter
  declarations.
* ``AprielBlockConverter._consumed_hf_paths`` gets ``@functools.cache`` for parity
  with the base ``ConfigSectionConverter._consumed_hf_paths``.
* ``effective_bias`` typed as ``AffineLinearConfig``; ``NoPeftConfig`` import
  moved to top of llama.py (the module is not a config module subject to the
  heavy-import rule); stale ``# TODO: Peft?`` removed.
* CLAUDE.md naming convention clarified: single underscore covers non-public
  (private or protected), matching the project's actual usage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>