diff --git a/.gitignore b/.gitignore
index a5dcceb..9417ad2 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,8 +1,10 @@
 # OS files
 .DS_Store
 
-# Claude Code
+# Claude Code — ignore local settings but track shared skills
-.claude/
+.claude/*
+!.claude/skills/
+!.claude/skills/**
 
 # Python cache files
 *.pyc
diff --git a/CLAUDE.md b/CLAUDE.md
index 404065e..1326482 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,116 +4,21 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 ## Repository Overview
 
-**granite-switch** implements **Granite Switch**, a system for building and deploying Granite models with embedded LoRA adapters. The system is a single unified Python package (`granite_switch`) with optional extras for different backends.
-
-1. **Building models with embedded adapters** - Combine a base Granite model with multiple LoRA adapters into a single checkpoint
-2. **Automatic adapter control** - Activate adapters via special control tokens or chat templates
-3. **Fast inference** - Deploy with vLLM for speedup over standard HuggingFace inference
-4. **Optional trainable switching** - Train a router to automatically select adapters per-token
+**granite-switch** is a single Python package (`granite_switch`) for building and deploying Granite models with embedded LoRA adapters. Two backends share the same weight format: `granite_switch.hf` (HuggingFace, training) and `granite_switch.vllm` (production inference, 10-20x speedup via Punica kernels + PagedAttention).
 
 ## Project Structure
 
-```
-granite-switch/
-├── pyproject.toml                       # Single package definition with optional extras
-├── src/
-│   └── granite_switch/                  # Unified package
-│       ├── __init__.py                  # Core exports (GraniteSwitchConfig, __version__)
-│       ├── config.py                    # Unified GraniteSwitchConfig
-│       │
-│       ├── composer/                    # Compose system (requires [compose] extra)
-│       │   ├── __init__.py
-│       │   ├── adapter_discovery.py     # Adapter discovery and resolution
-│       │   ├── adapter_loader.py        # Adapter weight loading
-│       │   ├── arch.py                  # Architecture definitions
-│       │   ├── compose_granite_switch.py  # Main compose script (CLI entry point)
-│       │   ├── compose_utils.py           # GraniteSwitchComposer class
-│       │   ├── tokenizer_setup.py       # Tokenizer configuration for control tokens
-│       │   ├── validator.py             # Compose validation checks
-│       │   ├── weight_remapper.py       # Adapter name remapping (AdapterRemapper)
-│       │   ├── weight_transfer.py       # Base model weight transfer
-│       │   └── reporting/               # Compose reporting utilities
-│       │       ├── __init__.py
-│       │       ├── adapter_analysis.py
-│       │       ├── compose_report.py
-│       │       ├── hiding_constant_report.py
-│       │       ├── model_card.py
-│       │       └── population_table.py
-│       │
-│       ├── hf/                          # HuggingFace backend (requires [hf] extra)
-│       │   ├── __init__.py              # Registers with transformers AutoConfig/AutoModel
-│       │   ├── modeling_granite_switch.py
-│       │   ├── core/
-│       │   │   ├── __init__.py
-│       │   │   └── lora.py              # SwitchedLoRALinear, MergedSwitchedLoRALinear
-│       │   └── switch/
-│       │       ├── __init__.py
-│       │       └── single.py            # SingleSwitch (HF attention backends)
-│       │
-│       └── vllm/                        # vLLM backend (requires [vllm] extra)
-│           ├── __init__.py              # register() for vLLM plugin system
-│           ├── granite_switch_model.py
-│           ├── core/
-│           │   ├── __init__.py
-│           │   ├── lora.py              # SwitchedLoRALinear (Punica kernels)
-│           │   ├── lora_kernel_meta.py
-│           │   └── decoder.py           # Decoder layers
-│           └── switch/
-│               ├── __init__.py
-│               └── single.py            # SingleSwitch (vLLM Attention)
-│
-├── tests/                               # All tests
-│   ├── unit/                            # Unit tests (fastest, CPU)
-│   ├── hf/                              # HuggingFace-specific tests
-│   ├── vllm/                            # vLLM-specific tests
-│   ├── composer/                        # Compose system tests
-│   ├── integration/                     # Cross-backend integration tests
-│   ├── regression/                      # Regression tests (hf/, vllm/, integration/, shared/, tools/)
-│   └── shared/                          # Shared test utilities and parametrized cases
-│
-├── scratch/                             # Throwaway debug/diagnostic scripts (gitignored)
-├── docs/                                # Documentation
-├── tutorials/                           # Tutorials and how-to guides
-├── CLAUDE.md                            # This file
-└── README.md
-```
+Key layout rules — full tree via `find src/` or `find tests/`:
+
+- `src/granite_switch/` — unified package; `composer/`, `hf/`, `vllm/` match the optional extras
+- `tests/` — official test suite only; subdirs: `unit/`, `hf/`, `vllm/`, `composer/`, `integration/`, `regression/`, `shared/`
+- `tutorials/` — notebooks and guides; see `tutorials/CLAUDE.md` for conventions
 
 ## Installation (local/dev)
 
 ```bash
-# Core package only (config)
-pip install -e .
-
-# With HuggingFace backend
-pip install -e ".[hf]"
-
-# With vLLM backend
-pip install -e ".[vllm]"
-
-# With compose tools
-pip install -e ".[compose]"
-
-# Everything (development)
-pip install -e ".[dev]"
-```
-
-## Import Paths
-
-```python
-# Config (shared by all backends)
-from granite_switch import GraniteSwitchConfig
-from granite_switch.config import GraniteSwitchConfig  # equivalent
-
-# HuggingFace backend
-from granite_switch.hf import GraniteSwitchForCausalLM
-from granite_switch.hf.core.lora import SwitchedLoRALinear
-from granite_switch.hf.switch.single import SingleSwitch
-
-# vLLM backend (auto-registered via plugin entry point)
-from granite_switch.vllm import register
-
-# Compose system
-from granite_switch.composer import GraniteSwitchComposer
+pip install -e ".[dev]"         # everything (recommended for development)
+pip install -e ".[hf,compose]"  # HF + composer only (no vLLM)
 ```
 
 ## File Organization Convention
@@ -129,50 +34,20 @@ from granite_switch.composer import GraniteSwitchComposer
 
 ### Test Files (Python)
 
-**All `test_*.py` test files MUST go in a `tests/` directory:**
-
-- **`tests/unit/`**: Unit tests (fastest, CPU-only)
-- **`tests/hf/`**: HuggingFace implementation tests
-- **`tests/vllm/`**: vLLM implementation tests
-- **`tests/composer/`**: Compose system tests
-- **`tests/integration/`**: Cross-implementation and end-to-end integration tests
-- **`tests/regression/`**: Regression tests (hf/, vllm/, integration/, shared/, tools/)
-- **`tests/shared/`**: Shared test utilities and parametrized cases
-
-**IMPORTANT: `tests/` is for official regression tests ONLY.** Do NOT place throwaway diagnostic,
-debugging, or exploratory scripts in `tests/`. Use `scratch/` instead (it is gitignored). Running
-`pytest tests/` should only execute curated, maintained tests — never one-off investigations.
+**`tests/` is for official regression tests ONLY.** Do NOT place throwaway diagnostic,
+debugging, or exploratory scripts in `tests/` — `pytest tests/` should only execute
+curated, maintained tests, never one-off investigations. Subdirectories are listed in
+Project Structure above.
 
-### Naming Conventions
+### Documentation Naming
 
-- **Test files**: `test_*.py`
-- **Documentation**: `UPPER_CASE.md`
-- **Scripts**: `snake_case.py`
+Name files under `docs/` in `UPPER_CASE.md` style.
 
 ## Development Commands
 
-### Composing Models
-
-```bash
-# Compose with HuggingFace adapters
-python -m granite_switch.composer.compose_granite_switch \
-  --adapters ibm-granite/granitelib-rag-r1.0
-
-# Multiple adapters
-python -m granite_switch.composer.compose_granite_switch \
-  --adapters ibm-granite/granitelib-rag-r1.0 your-org/extra-adapter
-
-# Custom output directory
-python -m granite_switch.composer.compose_granite_switch \
-  --adapters ibm-granite/granitelib-rag-r1.0 --output ./my-custom-model
-```
-
 ### Testing
 
-**Always use `-v -s --tb=short`** when running tests. `-v` (verbose) prints each test name as
-it starts, giving real-time progress visibility. `-s` disables output capture so `print()`
-statements inside tests appear immediately instead of being swallowed. Without these, long-running
-test files produce no output until they finish. `-x` (fail fast) stops on the first failure —
+**Always use `-v -s --tb=short`** when running tests, plus `-x` (fail fast) to stop on the first failure —
 no point running 200 more tests after something breaks.
 
 **Check GPU availability first** — the underlying hardware can change between sessions:
@@ -181,9 +56,6 @@ no point running 200 more tests after something breaks.
 python -c "import torch; print('GPU' if torch.cuda.is_available() else 'CPU only')"
 ```
 
-This determines which tests can run. vLLM and integration tests require a GPU; unit and HF tests
-run on CPU.
-
 **Run tests incrementally by directory**, in order of speed — don't run the full suite as a
 single command:
 
@@ -201,118 +73,23 @@ pytest tests/vllm/test_model_forward.py -v -s --tb=short -x
 
 # 4. Integration tests last (slowest, GPU required)
 pytest tests/integration/ -v -s --tb=short -x
-
-# Run a specific test pattern when debugging
-pytest tests/ -k "pattern" -v -s --tb=short -x
-```
-
-### vLLM Deployment
-
-```bash
-# Verify plugin registration
-python -c "from vllm.plugins import load_general_plugins; \
-           from vllm import ModelRegistry; \
-           load_general_plugins(); \
-           print('OK' if 'GraniteSwitchForCausalLM' in ModelRegistry.get_supported_archs() else 'FAIL')"
-
-# Start API server
-python -m vllm.entrypoints.openai.api_server \
-  --model ./granite-with-all-aloras \
-  --port 8000
-```
-
-## Architecture
-
-### Granite Switch Model
-
-The Granite Switch extends the base Granite model with:
-
-1. **Embedded LoRA Adapters** (frozen during inference)
-   - Multiple task/domain-specific adapters embedded in the same checkpoint
-   - Each adapter has LoRA weights (lora_A, lora_B) stacked in tensors
-   - Controlled via special tokens or router-selected indices
-
-2. **Control Tokens**
-   - Each adapter has a control token `<|adapter|>` that fires the switch
-   - KV hiding uses group-based control dimensions (K=finfo.min, Q=per-adapter policy)
-   - Control tokens are KV-hidden to prevent cross-request interference
-
-3. **Chat Template Integration**
-   - Maps adapter names to control tokens
-   - Automatic token placement based on adapter type (ALORA vs LORA)
-
-4. **Optional Trainable Router** (SingleSwitch)
-   - N transformer layers that compute adapter indices per-token
-   - Linear projection head to num_adapters dimensions
-   - ~1-2% of total model parameters
-
-### Two Backends
-
-#### HuggingFace Backend (`granite_switch.hf`)
-
-**Purpose**: Model building and optional router training
-
-- Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`)
-- Training with `Trainer` API
-- Standard PyTorch operations
-
-#### vLLM Backend (`granite_switch.vllm`)
-
-**Purpose**: Fast production inference (10-20x speedup)
-
-- Punica kernels for optimized LoRA computation
-- PagedAttention for efficient KV cache
-- Continuous batching, tensor/pipeline parallelism
-- OpenAI-compatible API server
-
-### Weight Compatibility
-
-Both backends share the same weight format:
-
-```python
-# Built/trained with HuggingFace
-model_hf.save_pretrained("./checkpoint")
-
-# Loaded directly with vLLM
-llm = LLM(model="./checkpoint")
 ```
 
 ## Key Configuration Parameters
 
-### Granite-Specific Parameters
-
 - **`attention_multiplier`**: Attention score scaling (instead of `1/sqrt(head_dim)`)
-- **`logits_scaling`**: Applied to final logits (main architectural difference with Llama)
+- **`logits_scaling`**: Applied to final logits
 - **`residual_multiplier`**: Applied to residual connections
 - **`embedding_multiplier`**: Applied to input embeddings
 
-Always use config values - never hardcode these parameters.
-
-### Switch Configuration
-
-```json
-{
-  "model_type": "granite_switch",
-  "architectures": ["GraniteSwitchForCausalLM"],
-  "num_adapters": 4,
-  "adapter_token_ids": [100, 101, 102, 103],
-  "adapter_names": ["adapter_0", "adapter_1", "adapter_2", "adapter_3"],
-  "hiding_groups": {"all_controls": ["adapter_0", "adapter_1", "adapter_2", "adapter_3"]},
-  "hiding_policy": {"base": ["all_controls"], "adapter_0": ["all_controls"], "...": "..."},
-  "lora_rank": 8,
-  "lora_alpha": 8.0,
-  "switch_head_dim": 32,
-  "control_dims": 32
-}
-```
+Always use config values — never hardcode these parameters.
 
 ## Common Gotchas
 
 ### 1. Adapter Index Convention
 
-**Control tokens**: `0` = no adapter, `1+` = adapter indices
-
-**vLLM Punica kernels**: `-1` = no adapter (internal conversion: `adapter_indices - 1`)
+`0` = no adapter, `1+` = adapter index. (vLLM Punica kernels use a shifted convention internally —
+see `src/granite_switch/vllm/CLAUDE.md`.)
 
 ### 2. Control Token Generatability
 
@@ -324,69 +101,30 @@ model can produce any control token during generation.
 - **ALORA adapters**: Token placed either in user message by matching invocation sequence or right before generation prompt
 - **LORA adapters**: Token placed at sequence beginning
 
-### 4. Granite vs Llama Differences
-
-- Granite uses `logits_scaling` (typically 8.0)
-- Custom attention scaling via `attention_multiplier`
-- Different residual and embedding multipliers
-
-Always load from config, never hardcode.
-
-### 5. End-to-End Tests Must Use Compose Infrastructure
-
-No test should manually assemble `GraniteSwitchConfig` or call `transfer_base_weights`
-directly.  All model construction must go through `GraniteSwitchComposer` so that the
-compose pipeline itself is what's being tested.  If the composer can't handle a use case
-(e.g., zero-adapter skinning), extend the composer — don't work around it in tests.
-
-### 6. HF Attention Backends and Causal Masking
-
-The eager backend does NOT handle `attention_mask=None` as causal — it treats `None` as no mask
-(full attention). SDPA and FlashAttention handle `attention_mask=None` correctly via `is_causal`
-attribute on the module.
-
-The HF stress tests (`tests/hf/test_single_switch.py`) auto-detect which attention backends work on the
-current platform by probing each with a k=-inf GQA call at import time. Unavailable backends are skipped.
-
-### 7. Known Limitation: Hidden Count Offset When Position 0 is in a Hiding Group
+### 4. Hidden Count Offset When Position 0 is in a Hiding Group
 
 When position 0 is a control token in a hiding group (e.g., a LoRA prefix token with
 `add_bos_token=False`), `hidden_count` is off by 1, causing a 1-position RoPE offset. This is
 acceptable because adapter detection is exact and RoPE is robust to small positional shifts.
 
-### 8. Known Limitation: TP Row-Parallel Bias Doubling
-
-`SwitchedLoRALinear`'s row-parallel bypass path passes bias to all TP ranks instead of
-suppressing it for rank > 0. After all-reduce this doubles the bias. Not affected: all Granite
-architectures (4.0, 4.1) use `attention_bias=False` and `mlp_bias=False`.
+### Backend- and module-specific gotchas
 
-### 9. HF Backend Uses Fused Projections (Not Bit-Exact with Upstream HF)
+Loaded on demand from child CLAUDE.md files when you touch those modules:
 
-The GraniteSwitch HF backend uses fused QKV and gate-up projections, symmetric with the vLLM
-backend architecture. Upstream HuggingFace `GraniteMoeHybridForCausalLM` uses separate projections.
-Fused projections change the floating-point reduction order, so bit-exact skinning equivalence
-with the upstream HF model is not achievable. The vLLM skinning equivalence tests are the
-authoritative check — both the upstream and skinned models use the same fused-projection
-architecture there. The HF skinning tests in `tests/composer/test_skinning_equivalence.py` are
-skipped for this reason.
+- `src/granite_switch/hf/CLAUDE.md` — HF attention backends, fused projections vs upstream HF
+- `src/granite_switch/vllm/CLAUDE.md` — Punica `-1` index, TP row-parallel bias, deployment commands
+- `src/granite_switch/composer/CLAUDE.md` — compose-infra rule for e2e tests, compose CLI
 
 ## Documentation
 
+- `docs/ARCHITECTURE.md` - Architecture overview (control tokens, backends, SingleSwitch)
 - `docs/GIT_WORKFLOW.md` - Git branching strategy and commit guidelines
 - `docs/SUPPORTED_MODELS.md` - Model compatibility
 
 ## Git Workflow
 
-**See [docs/GIT_WORKFLOW.md](docs/GIT_WORKFLOW.md) for complete git workflow guidelines.**
-
-**Quick reference:**
-
-- **Branch naming**: `feature/ticket-ID-description` or `bugfix/ticket-ID-description`
-- **Workflow**: Branch from `main` → develop → rebase → PR → merge → delete branch
-- **Critical**: Always verify comments match code before committing (see GIT_WORKFLOW.md)
-- **Commit format**: Clear summary + explanation of WHAT changed and WHY
-
-When committing, **never sign as Claude** (per project instructions)
+See [docs/GIT_WORKFLOW.md](docs/GIT_WORKFLOW.md) for branch naming, commit format, and
+PR workflow. **When committing, never sign as Claude** (per project instructions).
 
 ## License
 
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 0000000..7da3add
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,76 @@
+# Architecture
+
+## Granite Switch Model
+
+The Granite Switch extends the base Granite model with:
+
+### 1. Embedded LoRA Adapters (frozen during inference)
+
+Multiple task/domain-specific adapters are embedded in the same checkpoint. Each adapter has
+LoRA weights (`lora_A`, `lora_B`) stacked in tensors and is activated via special control tokens
+or router-selected indices.
+
+### 2. Control Tokens
+
+Each adapter has a control token `<|adapter|>` that fires the switch. KV hiding uses
+group-based control dimensions (`K=finfo.min`, `Q=per-adapter policy`). Control tokens are
+KV-hidden to prevent cross-request interference.
+
+### 3. Chat Template Integration
+
+The tokenizer chat template maps adapter names to control tokens and places them automatically
+based on adapter type:
+
+- **ALORA adapters**: token placed either in the user message (by matching the invocation
+  sequence) or right before the generation prompt
+- **LORA adapters**: token placed at sequence beginning
+
+### 4. Optional Trainable Router (SingleSwitch)
+
+SingleSwitch is a single attention head that uses a one-hot dim-0 pattern to compute per-token
+adapter indices via attention-based cumsum. It has no decoder layers and no projection head —
+only a vocab-size lookup table, so parameter cost is negligible relative to the full model.
+
+---
+
+## Two Backends
+
+Both backends share the same checkpoint format (`save_pretrained` / `from_pretrained`).
+
+### HuggingFace Backend (`granite_switch.hf`)
+
+Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`). Used for training and
+debugging. Uses fused QKV and gate-up projections, which changes floating-point reduction order
+relative to the upstream `GraniteMoeHybridForCausalLM` (see `src/granite_switch/hf/CLAUDE.md`).
+
+### vLLM Backend (`granite_switch.vllm`)
+
+Production inference backend (10-20x speedup). Uses Punica kernels for optimized LoRA
+computation, PagedAttention for efficient KV cache, and supports continuous batching and
+tensor/pipeline parallelism. Registered as a vLLM plugin via the `granite_switch.vllm` entry point.
+
+---
+
+## Key Configuration Fields
+
+These fields are specific to Granite Switch and not present in base Granite:
+
+| Field | Description |
+|---|---|
+| `num_adapters` | Number of embedded LoRA adapters |
+| `adapter_token_ids` | Token IDs for each adapter's control token |
+| `adapter_names` | Human-readable names for each adapter |
+| `hiding_groups` | Named groups of adapters for KV hiding |
+| `hiding_policy` | Per-adapter KV hiding rules |
+| `lora_rank` | LoRA rank (same for all adapters) |
+| `lora_alpha` | LoRA alpha scaling factor |
+| `control_dims` | Number of KV dimensions reserved for control |
+
+### Granite-Specific Parameters (inherited from base model)
+
+- **`attention_multiplier`**: Attention score scaling (replaces `1/sqrt(head_dim)`)
+- **`logits_scaling`**: Applied to final logits
+- **`residual_multiplier`**: Applied to residual connections
+- **`embedding_multiplier`**: Applied to input embeddings
+
+Always load these from config — never hardcode.
diff --git a/docs/SUPPORTED_MODELS.md b/docs/SUPPORTED_MODELS.md
index 0094911..41f6c66 100644
--- a/docs/SUPPORTED_MODELS.md
+++ b/docs/SUPPORTED_MODELS.md
@@ -16,16 +16,13 @@ automatically from the HuggingFace `config.model_type` field.
 Any Granite model whose HuggingFace config has `model_type: granite` can be used
 as a base model. The table below lists representative examples.
 
-**Note:** Granite Switch currently supports single-GPU inference only. Models
-that do not fit in a single GPU's memory are not yet supported.
-
 #### Granite 4.x (`granite`)
 
 | Model Tag | Size | Variant |
 |---|---|---|
 | `ibm-granite/granite-4.1-3b` | 3B | Dense, instruct |
 | `ibm-granite/granite-4.1-8b` | 8B | Dense, instruct |
-| `ibm-granite/granite-4.0-micro` | 3B | Dense, instruct |
+| `ibm-granite/granite-4.1-30b` | 30B | Dense, instruct |
 
 Base variants (`granite-4.1-3b-base`, `granite-4.1-8b-base`) are also supported.
 
diff --git a/src/granite_switch/composer/CLAUDE.md b/src/granite_switch/composer/CLAUDE.md
new file mode 100644
index 0000000..7a7fbc9
--- /dev/null
+++ b/src/granite_switch/composer/CLAUDE.md
@@ -0,0 +1,18 @@
+# CLAUDE.md — composer/
+
+Compose system: builds Granite Switch checkpoints from a base model + LoRA adapters. Loaded
+automatically when reading any file under `src/granite_switch/composer/`.
+
+## End-to-End Tests Must Use Compose Infrastructure
+
+No test should manually assemble `GraniteSwitchConfig` or call `transfer_base_weights` directly.
+All model construction must go through `GraniteSwitchComposer` so that the compose pipeline
+itself is what's being tested. If the composer can't handle a use case (e.g., zero-adapter
+skinning), extend the composer — don't work around it in tests.
+
+## Composing Models
+
+```bash
+python -m granite_switch.composer.compose_granite_switch \
+  --adapters ibm-granite/granitelib-rag-r1.0
+```
diff --git a/src/granite_switch/hf/CLAUDE.md b/src/granite_switch/hf/CLAUDE.md
new file mode 100644
index 0000000..0e8e935
--- /dev/null
+++ b/src/granite_switch/hf/CLAUDE.md
@@ -0,0 +1,23 @@
+# CLAUDE.md — hf/
+
+HuggingFace backend for training and debugging. Loaded automatically when reading any file under `src/granite_switch/hf/`.
+
+## HF Attention Backends and Causal Masking
+
+The eager backend does NOT handle `attention_mask=None` as causal — it treats `None` as no mask
+(full attention). SDPA and FlashAttention handle `attention_mask=None` correctly via the `is_causal`
+attribute on the module.
+
+The HF stress tests (`tests/hf/test_single_switch.py`) auto-detect which attention backends work
+on the current platform by probing each with a k=-inf GQA call at import time. Unavailable
+backends are skipped.
+
+## Fused Projections (Not Bit-Exact with Upstream HF)
+
+The GraniteSwitch HF backend uses fused QKV and gate-up projections, symmetric with the vLLM
+backend architecture. Upstream HuggingFace `GraniteMoeHybridForCausalLM` uses separate
+projections. Fused projections change the floating-point reduction order, so bit-exact skinning
+equivalence with the upstream HF model is not achievable. The vLLM skinning equivalence tests
+are the authoritative check — both the upstream and skinned models use the same fused-projection
+architecture there. The HF skinning tests in `tests/composer/test_skinning_equivalence.py` are
+skipped for this reason.
diff --git a/src/granite_switch/vllm/CLAUDE.md b/src/granite_switch/vllm/CLAUDE.md
new file mode 100644
index 0000000..b1c6018
--- /dev/null
+++ b/src/granite_switch/vllm/CLAUDE.md
@@ -0,0 +1,29 @@
+# CLAUDE.md — vllm/
+
+vLLM backend for production inference. Loaded automatically when reading any file under `src/granite_switch/vllm/`.
+
+## Adapter Index Convention (vLLM-specific)
+
+Punica kernels use `-1` = no adapter. Internal conversion from the shared convention:
+`adapter_indices - 1` (so the shared `0` = no adapter becomes `-1` for Punica).
+
+## Known Limitation: TP Row-Parallel Bias Doubling
+
+`SwitchedLoRALinear`'s row-parallel bypass path passes bias to all TP ranks instead of
+suppressing it for rank > 0. After all-reduce this doubles the bias. Not affected: all Granite
+architectures (4.0, 4.1) use `attention_bias=False` and `mlp_bias=False`.
+
+## Deployment
+
+```bash
+# Verify plugin registration
+python -c "from vllm.plugins import load_general_plugins; \
+           from vllm import ModelRegistry; \
+           load_general_plugins(); \
+           print('OK' if 'GraniteSwitchForCausalLM' in ModelRegistry.get_supported_archs() else 'FAIL')"
+
+# Start API server
+python -m vllm.entrypoints.openai.api_server \
+  --model ./granite-with-all-aloras \
+  --port 8000
+```
diff --git a/tutorials/CLAUDE.md b/tutorials/CLAUDE.md
new file mode 100644
index 0000000..7306b5c
--- /dev/null
+++ b/tutorials/CLAUDE.md
@@ -0,0 +1,40 @@
+# CLAUDE.md — tutorials/
+
+This file provides guidance when working on notebooks and guides in this directory.
+Claude loads it automatically when reading any file under `tutorials/`.
+
+## Notebook Cell Ordering
+
+Every notebook follows this cell order:
+
+1. `%pip install ...` — dependencies
+2. HF login cell (see below)
+3. Imports
+4. Configuration (model path, ports, constants)
+5. Long-running steps (corpus build, model load, vLLM launch)
+
+## HF Login Cell
+
+Every notebook that downloads gated HF models (`ibm-granite/`) must have a dedicated cell
+immediately after pip install:
+
+```python
+from huggingface_hub import notebook_login
+notebook_login()  # needed to pull ibm-granite models from the Hub
+```
+
+Use cell id `hf-login-call` for consistency.
+
+## Duration Comments
+
+Add `# Estimated duration: ~2 min on A100, ~7 min on T4` to cells that download models or
+launch vLLM. Put these in **notebook cells only** — not in code files under `src/`.
+
+## Utility Modules
+
+These live in `src/granite_switch/tutorials/` and are imported by notebooks:
+
+- `vllm_server.py` — `launch_vllm()`, `wait_for_server()` (reads the vLLM log and prints
+  stage-based progress), `kill_stale_vllm_processes()`
+- `chroma_loader.py` — `load_or_build_chroma()`: builds corpus on GPU, frees GPU memory with
+  `torch.cuda.empty_cache()`, then switches to CPU for queries so vLLM can use the full GPU
diff --git a/tutorials/PREREQUISITES.md b/tutorials/PREREQUISITES.md
index 9203aff..aa75d0e 100644
--- a/tutorials/PREREQUISITES.md
+++ b/tutorials/PREREQUISITES.md
@@ -15,7 +15,7 @@ Setup requirements for running Granite Switch tutorials.
 
 ### Python Version
 
-Python 3.10+ is required.
+Python 3.11–3.13 is required.
 
 ### Base Installation
 
diff --git a/tutorials/guides/build_your_own_adapter.md b/tutorials/guides/build_your_own_adapter.md
index 7af4d92..9a894ce 100644
--- a/tutorials/guides/build_your_own_adapter.md
+++ b/tutorials/guides/build_your_own_adapter.md
@@ -183,7 +183,7 @@ The base model's tokenizer and generation assets (`generation_config.json`, `mer
 
 ## Step 4: Use the Composed Model
 
-> **Note:** Custom (BYOA) adapters are not supported by [Mellea](https://github.com/generative-computing/mellea). Mellea only supports the official IBM Granite Library adapters. To invoke your custom adapters, use the chat template directly as shown below.
+> **Note:** The high-level Mellea wrappers (`guardian_check`, `rag.rewrite_question`, etc.) are built for the official IBM Granite Library adapters. Custom adapters can be invoked through Mellea's lower-level `Intrinsic` API — see [Bring Your Own Adapter with Mellea](mellea_build_your_own_adapter.md). To invoke adapters without Mellea at all, use the chat template directly as shown below.
 
 ### With HuggingFace