-
Notifications
You must be signed in to change notification settings - Fork 5
Update claude md #78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Update claude md #78
Changes from all commits
a9e5c15
db0d2b5
7eba15b
a9c8b00
a627f88
6cb87fb
893c4a0
053760c
c8f23d8
10b1e50
00bbea5
dc04aea
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,76 @@ | ||
| # Architecture | ||
|
|
||
| ## Granite Switch Model | ||
|
|
||
| The Granite Switch extends the base Granite model with: | ||
|
|
||
| ### 1. Embedded LoRA Adapters (frozen during inference) | ||
|
|
||
| Multiple task/domain-specific adapters are embedded in the same checkpoint. Each adapter has | ||
| LoRA weights (`lora_A`, `lora_B`) stacked in tensors and is activated via special control tokens | ||
| or router-selected indices. | ||
|
|
||
| ### 2. Control Tokens | ||
|
|
||
| Each adapter has a control token `<|adapter|>` that fires the switch. KV hiding uses | ||
| group-based control dimensions (`K=finfo.min`, `Q=per-adapter policy`). Control tokens are | ||
| KV-hidden to prevent cross-request interference. | ||
|
|
||
| ### 3. Chat Template Integration | ||
|
|
||
| The tokenizer chat template maps adapter names to control tokens and places them automatically | ||
| based on adapter type: | ||
|
|
||
| - **ALORA adapters**: token placed either in the user message (by matching the invocation | ||
| sequence) or right before the generation prompt | ||
| - **LORA adapters**: token placed at sequence beginning | ||
|
|
||
| ### 4. Optional Trainable Router (SingleSwitch) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we adapt this section to better reflect how the switch really works? |
||
|
|
||
| SingleSwitch is a single attention head that uses a one-hot dim-0 pattern to compute per-token | ||
| adapter indices via attention-based cumsum. It has no decoder layers and no projection head — | ||
| only a vocab-size lookup table, so parameter cost is negligible relative to the full model. | ||
|
|
||
| --- | ||
|
|
||
| ## Two Backends | ||
|
|
||
| Both backends share the same checkpoint format (`save_pretrained` / `from_pretrained`). | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
|
||
| ### HuggingFace Backend (`granite_switch.hf`) | ||
|
|
||
| Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`). Used for training and | ||
| debugging. Uses fused QKV and gate-up projections, which changes floating-point reduction order | ||
| relative to the upstream `GraniteMoeHybridForCausalLM` (see Common Gotchas #9 in `CLAUDE.md`). | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Enumeration is no longer correct |
||
|
|
||
| ### vLLM Backend (`granite_switch.vllm`) | ||
|
|
||
| Production inference backend (10-20x speedup). Uses Punica kernels for optimized LoRA | ||
| computation, PagedAttention for efficient KV cache, and supports continuous batching and | ||
| tensor/pipeline parallelism. Registered as a vLLM plugin via the `granite_switch.vllm` entry point. | ||
|
|
||
| --- | ||
|
|
||
| ## Key Configuration Fields | ||
|
|
||
| These fields are specific to Granite Switch and not present in base Granite: | ||
|
|
||
| | Field | Description | | ||
| |---|---| | ||
| | `num_adapters` | Number of embedded LoRA adapters | | ||
| | `adapter_token_ids` | Token IDs for each adapter's control token | | ||
| | `adapter_names` | Human-readable names for each adapter | | ||
| | `hiding_groups` | Named groups of adapters for KV hiding | | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not up-to date. Are you referring to the object args or the model config? |
||
| | `hiding_policy` | Per-adapter KV hiding rules | | ||
| | `lora_rank` | LoRA rank (same for all adapters) | | ||
| | `lora_alpha` | LoRA alpha scaling factor | | ||
| | `control_dims` | Number of KV dimensions reserved for control | | ||
|
|
||
| ### Granite-Specific Parameters (inherited from base model) | ||
|
|
||
| - **`attention_multiplier`**: Attention score scaling (replaces `1/sqrt(head_dim)`) | ||
| - **`logits_scaling`**: Applied to final logits | ||
| - **`residual_multiplier`**: Applied to residual connections | ||
| - **`embedding_multiplier`**: Applied to input embeddings | ||
|
|
||
| Always load these from config — never hardcode. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -16,16 +16,13 @@ automatically from the HuggingFace `config.model_type` field. | |
| Any Granite model whose HuggingFace config has `model_type: granite` can be used | ||
| as a base model. The table below lists representative examples. | ||
|
|
||
| **Note:** Granite Switch currently supports single-GPU inference only. Models | ||
| that do not fit in a single GPU's memory are not yet supported. | ||
|
|
||
| #### Granite 4.x (`granite`) | ||
|
|
||
| | Model Tag | Size | Variant | | ||
| |---|---|---| | ||
| | `ibm-granite/granite-4.1-3b` | 3B | Dense, instruct | | ||
| | `ibm-granite/granite-4.1-8b` | 8B | Dense, instruct | | ||
| | `ibm-granite/granite-4.0-micro` | 3B | Dense, instruct | | ||
| | `ibm-granite/granite-4.1-30b` | 30B | Dense, instruct | | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. granite-4.0-micro is still supported |
||
|
|
||
| Base variants (`granite-4.1-3b-base`, `granite-4.1-8b-base`) are also supported. | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| # CLAUDE.md — composer/ | ||
|
|
||
| Compose system: builds Granite Switch checkpoints from a base model + LoRA adapters. Loaded | ||
| automatically when reading any file under `src/granite_switch/composer/`. | ||
|
|
||
| ## End-to-End Tests Must Use Compose Infrastructure | ||
|
|
||
| No test should manually assemble `GraniteSwitchConfig` or call `transfer_base_weights` directly. | ||
| All model construction must go through `GraniteSwitchComposer` so that the compose pipeline | ||
| itself is what's being tested. If the composer can't handle a use case (e.g., zero-adapter | ||
| skinning), extend the composer — don't work around it in tests. | ||
|
|
||
| ## Composing Models | ||
|
|
||
| ```bash | ||
| python -m granite_switch.composer.compose_granite_switch \ | ||
| --adapters ibm-granite/granitelib-rag-r1.0 | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| # CLAUDE.md — hf/ | ||
|
|
||
| HuggingFace backend for training and debugging. Loaded automatically when reading any file under `src/granite_switch/hf/`. | ||
|
|
||
| ## HF Attention Backends and Causal Masking | ||
|
|
||
| The eager backend does NOT handle `attention_mask=None` as causal — it treats `None` as no mask | ||
| (full attention). SDPA and FlashAttention handle `attention_mask=None` correctly via `is_causal` | ||
| attribute on the module. | ||
|
|
||
| The HF stress tests (`tests/hf/test_single_switch.py`) auto-detect which attention backends work | ||
| on the current platform by probing each with a k=-inf GQA call at import time. Unavailable | ||
| backends are skipped. | ||
|
|
||
| ## Fused Projections (Not Bit-Exact with Upstream HF) | ||
|
|
||
| The GraniteSwitch HF backend uses fused QKV and gate-up projections, symmetric with the vLLM | ||
| backend architecture. Upstream HuggingFace `GraniteMoeHybridForCausalLM` uses separate | ||
| projections. Fused projections change the floating-point reduction order, so bit-exact skinning | ||
| equivalence with the upstream HF model is not achievable. The vLLM skinning equivalence tests | ||
| are the authoritative check — both the upstream and skinned models use the same fused-projection | ||
| architecture there. The HF skinning tests in `tests/composer/test_skinning_equivalence.py` are | ||
| skipped for this reason. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| # CLAUDE.md — vllm/ | ||
|
|
||
| vLLM backend for production inference. Loaded automatically when reading any file under `src/granite_switch/vllm/`. | ||
|
|
||
| ## Adapter Index Convention (vLLM-specific) | ||
|
|
||
| Punica kernels use `-1` = no adapter. Internal conversion from the shared convention: | ||
| `adapter_indices - 1` (so the shared `0` = no adapter becomes `-1` for Punica). | ||
|
|
||
| ## Known Limitation: TP Row-Parallel Bias Doubling | ||
|
|
||
| `SwitchedLoRALinear`'s row-parallel bypass path passes bias to all TP ranks instead of | ||
| suppressing it for rank > 0. After all-reduce this doubles the bias. Not affected: all Granite | ||
| architectures (4.0, 4.1) use `attention_bias=False` and `mlp_bias=False`. | ||
|
|
||
| ## Deployment | ||
|
|
||
| ```bash | ||
| # Verify plugin registration | ||
| python -c "from vllm.plugins import load_general_plugins; \ | ||
| from vllm import ModelRegistry; \ | ||
| load_general_plugins(); \ | ||
| print('OK' if 'GraniteSwitchForCausalLM' in ModelRegistry.get_supported_archs() else 'FAIL')" | ||
|
|
||
| # Start API server | ||
| python -m vllm.entrypoints.openai.api_server \ | ||
| --model ./granite-with-all-aloras \ | ||
| --port 8000 | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| # CLAUDE.md — tutorials/ | ||
|
|
||
| This file provides guidance when working on notebooks and guides in this directory. | ||
| Claude loads it automatically when reading any file under `tutorials/`. | ||
|
|
||
| ## Notebook Cell Ordering | ||
|
|
||
| Every notebook follows this cell order: | ||
|
|
||
| 1. `%pip install ...` — dependencies | ||
| 2. HF login cell (see below) | ||
| 3. Imports | ||
| 4. Configuration (model path, ports, constants) | ||
| 5. Long-running steps (corpus build, model load, vLLM launch) | ||
|
|
||
| ## HF Login Cell | ||
|
|
||
| Every notebook that downloads gated HF models (`ibm-granite/`) must have a dedicated cell | ||
| immediately after pip install: | ||
|
|
||
| ```python | ||
| from huggingface_hub import notebook_login | ||
| notebook_login() # needed to pull ibm-granite models from the Hub | ||
| ``` | ||
|
|
||
| Use cell id `hf-login-call` for consistency. | ||
|
|
||
| ## Duration Comments | ||
|
|
||
| Add `# Estimated duration: ~2 min on A100, ~7 min on T4` to cells that download models or | ||
| launch vLLM. Put these in **notebook cells only** — not in code files under `src/`. | ||
|
|
||
| ## Utility Modules | ||
|
|
||
| These live in `src/granite_switch/tutorials/` and are imported by notebooks: | ||
|
|
||
| - `vllm_server.py` — `launch_vllm()`, `wait_for_server()` (reads the vLLM log and prints | ||
| stage-based progress), `kill_stale_vllm_processes()` | ||
| - `chroma_loader.py` — `load_or_build_chroma()`: builds corpus on GPU, frees GPU memory with | ||
| `torch.cuda.empty_cache()`, then switches to CPU for queries so vLLM can use the full GPU |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add the fact that those are activated using the
adapter_namearg in apply_chat_template?