generative-computing · freunda · May 27, 2026
@@ -1,8 +1,10 @@
 # OS files
 .DS_Store
 
-# Claude Code
+# Claude Code — ignore local settings but track shared skills
 .claude/
+!.claude/skills/
+!.claude/skills/**
 
 # Python cache files
 *.pyc

@@ -0,0 +1,76 @@
+# Architecture
+
+## Granite Switch Model
+
+The Granite Switch extends the base Granite model with:
+
+### 1. Embedded LoRA Adapters (frozen during inference)
+
+Multiple task/domain-specific adapters are embedded in the same checkpoint. Each adapter has
+LoRA weights (`lora_A`, `lora_B`) stacked in tensors and is activated via special control tokens
+or router-selected indices.
+
+### 2. Control Tokens
+
+Each adapter has a control token `<|adapter|>` that fires the switch. KV hiding uses
+group-based control dimensions (`K=finfo.min`, `Q=per-adapter policy`). Control tokens are
+KV-hidden to prevent cross-request interference.
+
+### 3. Chat Template Integration
+
+The tokenizer chat template maps adapter names to control tokens and places them automatically
+based on adapter type:
+
+- **ALORA adapters**: token placed either in the user message (by matching the invocation
+  sequence) or right before the generation prompt
+- **LORA adapters**: token placed at sequence beginning
+
+### 4. Optional Trainable Router (SingleSwitch)
+
+SingleSwitch is a single attention head that uses a one-hot dim-0 pattern to compute per-token
+adapter indices via attention-based cumsum. It has no decoder layers and no projection head —
+only a vocab-size lookup table, so parameter cost is negligible relative to the full model.
+
+---
+
+## Two Backends
+
+Both backends share the same checkpoint format (`save_pretrained` / `from_pretrained`).
+
+### HuggingFace Backend (`granite_switch.hf`)
+
+Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`). Used for training and
+debugging. Uses fused QKV and gate-up projections, which changes floating-point reduction order
+relative to the upstream `GraniteMoeHybridForCausalLM` (see Common Gotchas #9 in `CLAUDE.md`).
+
+### vLLM Backend (`granite_switch.vllm`)
+
+Production inference backend (10-20x speedup). Uses Punica kernels for optimized LoRA
+computation, PagedAttention for efficient KV cache, and supports continuous batching and
+tensor/pipeline parallelism. Registered as a vLLM plugin via the `granite_switch.vllm` entry point.
+
+---
+
+## Key Configuration Fields
+
+These fields are specific to Granite Switch and not present in base Granite:
+
+| Field | Description |
+|---|---|
+| `num_adapters` | Number of embedded LoRA adapters |
+| `adapter_token_ids` | Token IDs for each adapter's control token |
+| `adapter_names` | Human-readable names for each adapter |
+| `hiding_groups` | Named groups of adapters for KV hiding |
+| `hiding_policy` | Per-adapter KV hiding rules |
+| `lora_rank` | LoRA rank (same for all adapters) |
+| `lora_alpha` | LoRA alpha scaling factor |
+| `control_dims` | Number of KV dimensions reserved for control |
+
+### Granite-Specific Parameters (inherited from base model)
+
+- **`attention_multiplier`**: Attention score scaling (replaces `1/sqrt(head_dim)`)
+- **`logits_scaling`**: Applied to final logits
+- **`residual_multiplier`**: Applied to residual connections
+- **`embedding_multiplier`**: Applied to input embeddings
+
+Always load these from config — never hardcode.
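To illustrate the rule above, the sketch below reads the attention scaling from a config object instead of assuming `1/sqrt(head_dim)`. The `SimpleNamespace` config, its numeric values, and the helper name `scale_attention_score` are hypothetical stand-ins for this example only; in practice the values would come from the loaded model config.

```python
import math
from types import SimpleNamespace

# Hypothetical stand-in for a loaded model config; values are illustrative only.
config = SimpleNamespace(attention_multiplier=0.0078125, head_dim=128)


def scale_attention_score(raw_score, config):
    """Scale a raw q.k score with the configured multiplier."""
    return raw_score * config.attention_multiplier


# The conventional default that the config value replaces.
hardcoded_scale = 1.0 / math.sqrt(config.head_dim)

scaled = scale_attention_score(64.0, config)
print(scaled, hardcoded_scale)
```

With these illustrative numbers the hardcoded default and the configured multiplier differ by more than a factor of ten, which is the kind of silent mismatch the rule guards against.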
@@ -16,16 +16,13 @@ automatically from the HuggingFace `config.model_type` field.
 Any Granite model whose HuggingFace config has `model_type: granite` can be used
 as a base model. The table below lists representative examples.
 
-**Note:** Granite Switch currently supports single-GPU inference only. Models
-that do not fit in a single GPU's memory are not yet supported.
-
 #### Granite 4.x (`granite`)
 
 | Model Tag | Size | Variant |
 |---|---|---|
 | `ibm-granite/granite-4.1-3b` | 3B | Dense, instruct |
 | `ibm-granite/granite-4.1-8b` | 8B | Dense, instruct |
-| `ibm-granite/granite-4.0-micro` | 3B | Dense, instruct |
+| `ibm-granite/granite-4.1-30b` | 30B | Dense, instruct |
 
 Base variants (`granite-4.1-3b-base`, `granite-4.1-8b-base`) are also supported.
 

@@ -0,0 +1,18 @@
+# CLAUDE.md — composer/
+
+Compose system: builds Granite Switch checkpoints from a base model + LoRA adapters. Loaded
+automatically when reading any file under `src/granite_switch/composer/`.
+
+## End-to-End Tests Must Use Compose Infrastructure
+
+No test should manually assemble `GraniteSwitchConfig` or call `transfer_base_weights` directly.
+All model construction must go through `GraniteSwitchComposer` so that the compose pipeline
+itself is what's being tested. If the composer can't handle a use case (e.g., zero-adapter
+skinning), extend the composer — don't work around it in tests.
+
+## Composing Models
+
+```bash
+python -m granite_switch.composer.compose_granite_switch \
+  --adapters ibm-granite/granitelib-rag-r1.0
+```
@@ -0,0 +1,23 @@
+# CLAUDE.md — hf/
+
+HuggingFace backend for training and debugging. Loaded automatically when reading any file under `src/granite_switch/hf/`.
+
+## HF Attention Backends and Causal Masking
+
+The eager backend does NOT handle `attention_mask=None` as causal — it treats `None` as no mask
+(full attention). SDPA and FlashAttention handle `attention_mask=None` correctly via the
+`is_causal` attribute on the module.
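The difference between "no mask" and "causal" can be seen in a toy, dependency-free softmax. This is not the HuggingFace code path; `attention_weights` is a hypothetical helper written only to show what full attention does to the first position compared with causal attention.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over raw scores, optionally masking future positions."""
    weights = []
    for i, row in enumerate(scores):
        # With causal=True, position i may only attend to positions j <= i.
        masked = [
            s if (not causal or j <= i) else float("-inf")
            for j, s in enumerate(row)
        ]
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform raw scores, 3 tokens
full = attention_weights(scores, causal=False)
causal = attention_weights(scores, causal=True)

print(full[0])    # first token attends to all three positions
print(causal[0])  # first token attends only to itself
```

Under full attention the first token spreads its weight over later tokens, so a backend that reads `None` as "no mask" leaks future positions into earlier ones.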
+
+The HF stress tests (`tests/hf/test_single_switch.py`) auto-detect which attention backends work
+on the current platform by probing each with a k=-inf GQA call at import time. Unavailable
+backends are skipped.
+
+## Fused Projections (Not Bit-Exact with Upstream HF)
+
+The GraniteSwitch HF backend uses fused QKV and gate-up projections, symmetric with the vLLM
+backend architecture. Upstream HuggingFace `GraniteMoeHybridForCausalLM` uses separate
+projections. Fused projections change the floating-point reduction order, so bit-exact skinning
+equivalence with the upstream HF model is not achievable. The vLLM skinning equivalence tests
+are the authoritative check — both the upstream and skinned models use the same fused-projection
+architecture there. The HF skinning tests in `tests/composer/test_skinning_equivalence.py` are
+skipped for this reason.
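The reduction-order effect mentioned above is a general property of floating-point arithmetic, and plain Python floats are enough to show it. The snippet is only an illustration of non-associative addition; it does not reproduce the fused kernels themselves.

```python
# Floating-point addition is not associative: grouping the same terms
# differently can change the last bits of the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c  # one reduction order
right_to_left = a + (b + c)  # same terms, different grouping

print(left_to_right == right_to_left)  # prints False
print(abs(left_to_right - right_to_left))
```

The two sums agree to within rounding error but are not bit-identical, which is why a bit-exact comparison between fused and separate projections cannot be expected to pass.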
@@ -0,0 +1,29 @@
+# CLAUDE.md — vllm/
+
+vLLM backend for production inference. Loaded automatically when reading any file under `src/granite_switch/vllm/`.
+
+## Adapter Index Convention (vLLM-specific)
+
+Punica kernels use `-1` to mean "no adapter". The backend converts from the shared convention
+internally with `adapter_indices - 1`, so the shared `0` (no adapter) becomes `-1` for Punica.
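The conversion can be sketched with plain lists. The function name `to_punica_indices` and the sample indices are hypothetical, and the real backend presumably operates on tensors rather than Python lists.

```python
NO_ADAPTER_SHARED = 0   # shared convention: 0 means "no adapter"
NO_ADAPTER_PUNICA = -1  # Punica convention: -1 means "no adapter"


def to_punica_indices(adapter_indices):
    """Shift shared-convention indices down by one for Punica."""
    return [i - 1 for i in adapter_indices]


shared = [0, 1, 3, 0]  # per-token adapter indices in the shared convention
punica = to_punica_indices(shared)
print(punica)  # prints [-1, 0, 2, -1]
```

Shared adapter `1` maps to Punica slot `0`, so any code that mixes the two conventions without the shift selects the wrong adapter or treats base-model tokens as adapter `0`.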
+
+## Known Limitation: TP Row-Parallel Bias Doubling
+
+`SwitchedLoRALinear`'s row-parallel bypass path passes bias to all TP ranks instead of
+suppressing it for rank > 0, so the all-reduce sums it once per rank (doubling it at TP=2).
+Not affected: all Granite architectures (4.0, 4.1) use `attention_bias=False` and `mlp_bias=False`.
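A scalar simulation shows why adding the bias on every rank inflates the result. The helper names and numbers below are hypothetical, and the all-reduce is modelled as a plain sum over per-rank outputs rather than a real collective.

```python
def all_reduce_sum(per_rank_outputs):
    """Model the TP all-reduce as a sum over ranks."""
    return sum(per_rank_outputs)


def row_parallel_output(partials, bias, bias_on_all_ranks):
    """Combine per-rank partial results, adding bias on rank 0 or on all ranks."""
    per_rank = []
    for rank, partial in enumerate(partials):
        add_bias = bias_on_all_ranks or rank == 0
        per_rank.append(partial + (bias if add_bias else 0.0))
    return all_reduce_sum(per_rank)


partials = [1.0, 2.0]  # partial matmul results from two TP ranks
bias = 0.5

correct = row_parallel_output(partials, bias, bias_on_all_ranks=False)
buggy = row_parallel_output(partials, bias, bias_on_all_ranks=True)
print(correct, buggy)  # prints 3.5 4.0
```

With `bias = 0.0`, which corresponds to `attention_bias=False` and `mlp_bias=False`, both paths return the same value.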
+
+## Deployment
+
+```bash
+# Verify plugin registration
+python -c "from vllm.plugins import load_general_plugins; \
+           from vllm import ModelRegistry; \
+           load_general_plugins(); \
+           print('OK' if 'GraniteSwitchForCausalLM' in ModelRegistry.get_supported_archs() else 'FAIL')"
+
+# Start API server
+python -m vllm.entrypoints.openai.api_server \
+  --model ./granite-with-all-aloras \
+  --port 8000
+```
@@ -0,0 +1,40 @@
+# CLAUDE.md — tutorials/
+
+This file provides guidance when working on notebooks and guides in this directory.
+Claude loads it automatically when reading any file under `tutorials/`.
+
+## Notebook Cell Ordering
+
+Every notebook follows this cell order:
+
+1. `%pip install ...` — dependencies
+2. HF login cell (see below)
+3. Imports
+4. Configuration (model path, ports, constants)
+5. Long-running steps (corpus build, model load, vLLM launch)
+
+## HF Login Cell
+
+Every notebook that downloads gated HF models (`ibm-granite/`) must have a dedicated cell
+immediately after pip install:
+
+```python
+from huggingface_hub import notebook_login
+notebook_login()  # needed to pull ibm-granite models from the Hub
+```
+
+Use cell id `hf-login-call` for consistency.
+
+## Duration Comments
+
+Add `# Estimated duration: ~2 min on A100, ~7 min on T4` to cells that download models or
+launch vLLM. Put these in **notebook cells only** — not in code files under `src/`.
+
+## Utility Modules
+
+These live in `src/granite_switch/tutorials/` and are imported by notebooks:
+
+- `vllm_server.py` — `launch_vllm()`, `wait_for_server()` (reads the vLLM log and prints
+  stage-based progress), `kill_stale_vllm_processes()`
+- `chroma_loader.py` — `load_or_build_chroma()`: builds corpus on GPU, frees GPU memory with
+  `torch.cuda.empty_cache()`, then switches to CPU for queries so vLLM can use the full GPU
@@ -15,7 +15,7 @@ Setup requirements for running Granite Switch tutorials.
 
 ### Python Version
 
-Python 3.10+ is required.
+Python 3.11–3.13 is required.
 
 ### Base Installation
 

@@ -183,7 +183,7 @@ The base model's tokenizer and generation assets (`generation_config.json`, `mer
 
 ## Step 4: Use the Composed Model
 
-> **Note:** Custom (BYOA) adapters are not supported by [Mellea](https://github.com/generative-computing/mellea). Mellea only supports the official IBM Granite Library adapters. To invoke your custom adapters, use the chat template directly as shown below.
+> **Note:** The high-level Mellea wrappers (`guardian_check`, `rag.rewrite_question`, etc.) are built for the official IBM Granite Library adapters. Custom adapters can be invoked through Mellea's lower-level `Intrinsic` API — see [Bring Your Own Adapter with Mellea](mellea_build_your_own_adapter.md). To invoke adapters without Mellea at all, use the chat template directly as shown below.
 
 ### With HuggingFace