[draft / not ready for review] Add prefill/decode multifunction support in ET #16552
Summary:
This diff adds multifunction export support for static Llama models on CoreML. Multifunction models export separate prefill and decode graphs with weight sharing, enabling more efficient autoregressive generation compared to the single-method approach.
### Key Changes
**CoreML Backend Compiler (`coreml_preprocess.py`)**
- Added `MULTIMETHOD_WEIGHT_SHARING_STRATEGY` enum with `NONE` and `POSITIONAL` strategies
- Added `generate_multimethod_weight_sharing_strategy_compile_spec()` to enable weight sharing across methods
- Implemented multifunction CoreML model compilation using `ct.utils.MultiFunctionDescriptor`
- When weight sharing is enabled, weights from the first method are shared positionally with subsequent methods
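The positional sharing described above can be illustrated with a small sketch. This is not the actual `coreml_preprocess.py` code (the function name and data shapes here are hypothetical); it only shows the idea: the i-th weight of each subsequent method is replaced by the i-th weight of the first method, so each shared weight is stored once.

```python
# Illustrative sketch of POSITIONAL weight sharing (hypothetical helper,
# not the real coreml_preprocess.py implementation).

def share_weights_positionally(method_weights):
    """method_weights: {method_name: [weight, ...]} in export order.
    Methods after the first reuse the first method's weights by position."""
    methods = list(method_weights)
    if not methods:
        return {}
    first = method_weights[methods[0]]
    shared = {methods[0]: list(first)}
    for name in methods[1:]:
        weights = method_weights[name]
        if len(weights) != len(first):
            raise ValueError(f"{name}: weight count mismatch")
        # Reuse the same weight objects positionally instead of duplicating.
        shared[name] = list(first)
    return shared

prefill_w = [object(), object()]
decode_w = [object(), object()]
shared = share_weights_positionally({"prefill": prefill_w, "decode": decode_w})
# Every position in "decode" now aliases the corresponding "prefill" weight.
assert all(a is b for a, b in zip(shared["prefill"], shared["decode"]))
```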
**Model Metadata (`model_metadata.h`, `serde_json.mm`)**
- Added `MethodMetadata` struct to store per-method input/output names for multifunction models
- Extended `ModelMetadata` with `methods` map and `default_method` field
- Added `is_multifunction()` helper to detect multifunction models
- Updated JSON serialization to handle the new multifunction metadata format
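To make the metadata change concrete, here is a hypothetical Python mirror of the extended format. The field names (`methods`, `default_method`) follow the diff description, but the exact on-disk JSON schema and the per-method keys are assumptions; the helper mirrors the intent of the C++ `is_multifunction()` check.

```python
# Hypothetical sketch of the extended multifunction metadata; the exact
# JSON schema is an assumption based on the field names described above.
import json

metadata_json = json.dumps({
    "methods": {
        "prefill": {"inputs": ["tokens", "input_pos"], "outputs": ["logits"]},
        "decode":  {"inputs": ["tokens", "input_pos"], "outputs": ["logits"]},
    },
    "default_method": "prefill",
})

def is_multifunction(meta: dict) -> bool:
    # Mirrors the C++ helper's intent: a model is multifunction when a
    # per-method metadata map is present; legacy models have none.
    return bool(meta.get("methods"))

meta = json.loads(metadata_json)
assert is_multifunction(meta)
assert not is_multifunction({})  # legacy single-function metadata
```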
**Runtime Changes (`ETCoreMLModelManager.mm`, `backend_delegate.mm`, `coreml_backend_delegate.mm`)**
- Updated `ETCoreMLModelManager` to set `functionName` on `MLModelConfiguration` only for multifunction models (based on `metadata.is_multifunction()`)
- Legacy single-function models continue to work with `functionName=nil`
- Added method name propagation through the delegate initialization path
- Updated model loading to use per-method input/output names when available
**Export Script (`export_static_llm_coreml.py`)**
- Added `--multifunction` flag to export models with separate prefill (seqlen=input_len) and decode (seqlen=1) methods
- Multifunction mode uses `generate_full_logits=False` for efficiency (only outputs last token logits)
- Single method mode (default) retains `generate_full_logits=True` for lookahead decoding support
- Generates combined metadata with method-specific prefixes (e.g., `decode_input_len`, `prefill_input_len`)
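The prefixing scheme above can be sketched as follows. The helper name and metadata keys are illustrative, not the export script's actual API; the point is only how per-method entries are flattened into one combined map.

```python
# Sketch of combining per-method export metadata under method-name
# prefixes (e.g. prefill_input_len, decode_input_len). Illustrative only.

def combine_method_metadata(per_method: dict) -> dict:
    combined = {}
    for method, meta in per_method.items():
        for key, value in meta.items():
            combined[f"{method}_{key}"] = value
    return combined

combined = combine_method_metadata({
    "prefill": {"input_len": 64, "max_context_len": 1024},
    "decode":  {"input_len": 1,  "max_context_len": 1024},
})
assert combined["prefill_input_len"] == 64
assert combined["decode_input_len"] == 1
```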
**New Runner (`run_static_llm_multifunction.py`)**
- Added dedicated runner for multifunction models
- Handles separate prefill and decode method execution
- Manages cache state transfer between prefill and decode phases
- Supports both 2D (generate_full_logits=False) and 3D (generate_full_logits=True) logits output
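The runner's control flow can be sketched with stub models in place of the ExecuTorch module. All names here are illustrative, not the runner's actual API; the KV cache is assumed to live in the model's mutable state, so only token ids cross the boundary, and the logits handling mirrors the 2D (last-token) vs 3D (full-sequence) cases.

```python
# Toy sketch of the multifunction prefill/decode loop (hypothetical names).

def last_token_logits(logits):
    """logits as nested lists: [batch][vocab] (2D) or [batch][seq][vocab] (3D)."""
    row = logits[0]
    if isinstance(row[0], list):  # 3D: take the final sequence position
        return row[-1]
    return row                    # 2D: already the last token's logits

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def generate(prefill, decode, prompt, max_new_tokens):
    # Prefill once over the prompt, then greedily decode one token at a time.
    logits = prefill(prompt)
    tokens = [argmax(last_token_logits(logits))]
    for _ in range(max_new_tokens - 1):
        logits = decode(tokens[-1])
        tokens.append(argmax(last_token_logits(logits)))
    return tokens

# Stub models over a 5-token vocab that always predict (last token + 1) % 5;
# prefill returns 2D logits, decode returns 3D, to exercise both paths.
prefill_stub = lambda toks: [[1.0 if v == (toks[-1] + 1) % 5 else 0.0
                              for v in range(5)]]
decode_stub = lambda tok: [[[1.0 if v == (tok + 1) % 5 else 0.0
                             for v in range(5)]]]
assert generate(prefill_stub, decode_stub, [0, 1, 2], 4) == [3, 4, 0, 1]
```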
**Build System (`CMakeLists.txt`)**
- Fixed installation of CoreML backend headers
**Utilities (`extract_coreml_models.py`)**
- Updated model extraction script to handle multifunction models
**Documentation (`README.md`)**
- Added documentation for both export modes (single method and multifunction)
- Added comprehensive export options reference table
- Added usage examples for both modes
### Usage Examples
**Single Method Export (for lookahead decoding):**
```bash
python examples/apple/coreml/llama/export_static_llm_coreml.py \
--checkpoint $HOME/models/llama1b/llama1b.pth \
--params $HOME/models/llama1b/params.json \
--output static_llm_coreml_model.pte \
--input_len 32 \
--max_context_len 1024
```
**Multifunction Export (separate prefill/decode):**
```bash
python examples/apple/coreml/llama/export_static_llm_coreml.py \
--checkpoint $HOME/models/llama1b/llama1b.pth \
--params $HOME/models/llama1b/params.json \
--output static_llm_coreml_multifunction.pte \
--input_len 64 \
--max_context_len 1024 \
--multifunction
```
**Run Single Method Model (with lookahead):**
```bash
python examples/apple/coreml/llama/run_static_llm.py \
--model static_llm_coreml_model.pte \
--params $HOME/models/llama1b/params.json \
--tokenizer $HOME/models/llama1b/tokenizer.model \
--prompt "Once upon a time" \
--max_new_tokens 100 \
--lookahead
```
**Run Multifunction Model:**
```bash
python examples/apple/coreml/llama/run_static_llm_multifunction.py \
--model static_llm_coreml_multifunction.pte \
--params $HOME/models/llama1b/params.json \
--tokenizer $HOME/models/llama1b/tokenizer.model \
--prompt "Once upon a time" \
--max_new_tokens 100 \
--input_len 64 \
--max_context_len 1024
```
### Mode Comparison
| Feature | Single Method | Multifunction |
|---------|---------------|---------------|
| Sequence length | Fixed (input_len for both prefill & decode) | Separate (input_len for prefill, 1 for decode) |
| Logits output | Full (all tokens) | Last token only |
| Lookahead decoding | ✅ Supported | ❌ Not supported |
| Weight sharing | N/A | ✅ Enabled |
| Generation efficiency | Good with lookahead | Optimized decode step |
Test Plan:
Added a new unit test. Also tested both export modes on Llama 1B:
1. Exported single method model with `--input_len 32 --max_context_len 1024`
2. Exported multifunction model with `--input_len 64 --max_context_len 1024 --multifunction`
3. Ran single method model with `--lookahead` flag
4. Ran multifunction model with matching input_len and max_context_len
5. Verified text generation produces coherent output for both modes
Differential Revision: D90716824
Pulled By: metascroy