4 changes: 3 additions & 1 deletion .gitignore
@@ -1,8 +1,10 @@
# OS files
.DS_Store

-# Claude Code
+# Claude Code — ignore local settings but track shared skills
 .claude/
+!.claude/skills/
+!.claude/skills/**

# Python cache files
*.pyc
318 changes: 28 additions & 290 deletions CLAUDE.md

Large diffs are not rendered by default.

76 changes: 76 additions & 0 deletions docs/ARCHITECTURE.md
@@ -0,0 +1,76 @@
# Architecture

## Granite Switch Model

The Granite Switch extends the base Granite model with:

### 1. Embedded LoRA Adapters (frozen during inference)

Multiple task/domain-specific adapters are embedded in the same checkpoint. Each adapter has
LoRA weights (`lora_A`, `lora_B`) stacked in tensors and is activated via special control tokens
or router-selected indices.

### 2. Control Tokens

Each adapter has a control token `<|adapter|>` that fires the switch. KV hiding uses
group-based control dimensions (`K=finfo.min`, `Q=per-adapter policy`). Control tokens are
KV-hidden to prevent cross-request interference.
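A minimal sketch of why a `finfo.min`-sized logit hides a position, assuming (this is an interpretation, not confirmed by the text above) that the control dimensions drive the attention logit of a hidden key to the most negative finite float32 value. The `softmax` helper and the example logits are illustrative only and are not part of Granite Switch:

```python
import math

# Same value as torch.finfo(torch.float32).min, written out so the sketch
# runs without PyTorch installed.
FLOAT32_MIN = -3.4028234663852886e38


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two visible positions plus one position whose logit was pushed to finfo.min.
weights = softmax([1.0, 2.0, FLOAT32_MIN])
print(weights)  # the last weight underflows to exactly 0.0
```

Because the hidden position receives zero attention weight, its value vector contributes nothing to the output, which is the effect "KV-hidden" describes.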

### 3. Chat Template Integration
Collaborator review comment:
Should we add the fact that those are activated using the adapter_name arg in apply_chat_template?


The tokenizer chat template maps adapter names to control tokens and places them automatically
based on adapter type:

- **ALORA adapters**: token placed either in the user message (by matching the invocation
sequence) or right before the generation prompt
- **LORA adapters**: token placed at sequence beginning

### 4. Optional Trainable Router (SingleSwitch)
Collaborator review comment:
Can we adapt this section to better reflect how the switch really works?


SingleSwitch is a single attention head that uses a one-hot dim-0 pattern to compute per-token
adapter indices via attention-based cumsum. It has no decoder layers and no projection head —
only a vocab-size lookup table, so parameter cost is negligible relative to the full model.

---

## Two Backends

Both backends share the same checkpoint format (`save_pretrained` / `from_pretrained`).
Collaborator review comment:
`from_pretrained` is the HF API. The checkpoint format is called Safetensors.


### HuggingFace Backend (`granite_switch.hf`)

Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`). Used for training and
debugging. Uses fused QKV and gate-up projections, which changes floating-point reduction order
relative to the upstream `GraniteMoeHybridForCausalLM` (see Common Gotchas #9 in `CLAUDE.md`).
Collaborator review comment:
Enumeration is no longer correct


### vLLM Backend (`granite_switch.vllm`)

Production inference backend (10-20x speedup). Uses Punica kernels for optimized LoRA
computation, PagedAttention for efficient KV cache, and supports continuous batching and
tensor/pipeline parallelism. Registered as a vLLM plugin via the `granite_switch.vllm` entry point.

---

## Key Configuration Fields

These fields are specific to Granite Switch and not present in base Granite:

| Field | Description |
|---|---|
| `num_adapters` | Number of embedded LoRA adapters |
| `adapter_token_ids` | Token IDs for each adapter's control token |
| `adapter_names` | Human-readable names for each adapter |
| `hiding_groups` | Named groups of adapters for KV hiding |
Collaborator review comment:
Not up to date. Are you referring to the object args or the model config?

| `hiding_policy` | Per-adapter KV hiding rules |
| `lora_rank` | LoRA rank (same for all adapters) |
| `lora_alpha` | LoRA alpha scaling factor |
| `control_dims` | Number of KV dimensions reserved for control |

### Granite-Specific Parameters (inherited from base model)

- **`attention_multiplier`**: Attention score scaling (replaces `1/sqrt(head_dim)`)
- **`logits_scaling`**: Applied to final logits
- **`residual_multiplier`**: Applied to residual connections
- **`embedding_multiplier`**: Applied to input embeddings

Always load these from config — never hardcode.
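
A minimal sketch of the "load from config" rule, assuming the four fields are stored as plain numeric entries in the checkpoint's `config.json`. The helper name `read_granite_scaling`, the `head_dim` key, and every numeric value below are illustrative assumptions, not values taken from a real Granite checkpoint:

```python
import json
import math

GRANITE_SCALING_FIELDS = (
    "attention_multiplier",
    "logits_scaling",
    "residual_multiplier",
    "embedding_multiplier",
)


def read_granite_scaling(config: dict) -> dict:
    """Return the four Granite scaling values, failing loudly if any is missing."""
    missing = [name for name in GRANITE_SCALING_FIELDS if name not in config]
    if missing:
        raise KeyError(f"config is missing Granite scaling fields: {missing}")
    return {name: float(config[name]) for name in GRANITE_SCALING_FIELDS}


# Illustrative values only; real values come from the checkpoint's config.json.
example_config = json.loads(
    '{"attention_multiplier": 0.0078125, "logits_scaling": 8.0,'
    ' "residual_multiplier": 0.22, "embedding_multiplier": 12.0, "head_dim": 128}'
)
scaling = read_granite_scaling(example_config)

# The configured multiplier need not equal the textbook 1/sqrt(head_dim),
# which is why hardcoding the textbook value is unsafe.
default_scale = 1.0 / math.sqrt(example_config["head_dim"])
print(scaling["attention_multiplier"], default_scale)
```

Raising on a missing field, rather than falling back to a default, keeps a misconfigured checkpoint from silently running with the wrong scaling.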
5 changes: 1 addition & 4 deletions docs/SUPPORTED_MODELS.md
@@ -16,16 +16,13 @@ automatically from the HuggingFace `config.model_type` field.
Any Granite model whose HuggingFace config has `model_type: granite` can be used
as a base model. The table below lists representative examples.

-**Note:** Granite Switch currently supports single-GPU inference only. Models
-that do not fit in a single GPU's memory are not yet supported.
-
#### Granite 4.x (`granite`)

| Model Tag | Size | Variant |
|---|---|---|
| `ibm-granite/granite-4.1-3b` | 3B | Dense, instruct |
| `ibm-granite/granite-4.1-8b` | 8B | Dense, instruct |
-| `ibm-granite/granite-4.0-micro` | 3B | Dense, instruct |
+| `ibm-granite/granite-4.1-30b` | 30B | Dense, instruct |
Collaborator review comment:
granite-4.0-micro is still supported


Base variants (`granite-4.1-3b-base`, `granite-4.1-8b-base`) are also supported.

18 changes: 18 additions & 0 deletions src/granite_switch/composer/CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# CLAUDE.md — composer/

Compose system: builds Granite Switch checkpoints from a base model + LoRA adapters. Loaded
automatically when reading any file under `src/granite_switch/composer/`.

## End-to-End Tests Must Use Compose Infrastructure

No test should manually assemble `GraniteSwitchConfig` or call `transfer_base_weights` directly.
All model construction must go through `GraniteSwitchComposer` so that the compose pipeline
itself is what's being tested. If the composer can't handle a use case (e.g., zero-adapter
skinning), extend the composer — don't work around it in tests.

## Composing Models

```bash
python -m granite_switch.composer.compose_granite_switch \
--adapters ibm-granite/granitelib-rag-r1.0
```
23 changes: 23 additions & 0 deletions src/granite_switch/hf/CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
# CLAUDE.md — hf/

HuggingFace backend for training and debugging. Loaded automatically when reading any file under `src/granite_switch/hf/`.

## HF Attention Backends and Causal Masking

The eager backend does NOT treat `attention_mask=None` as causal; it treats `None` as no mask
(full attention). SDPA and FlashAttention handle `attention_mask=None` correctly via the
`is_causal` attribute on the module.
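
A minimal sketch of the lower-triangular pattern an explicit causal mask encodes, for cases where the eager backend cannot be left with `attention_mask=None`. The helper `causal_mask` is hypothetical, and converting this pattern into the tensor format a given model expects is deliberately left out:

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # Print each row as 1/0 so the triangular shape is easy to see.
    print("".join("1" if allowed else "0" for allowed in row))
```

With no mask at all, every entry would be `True`, which is the full-attention behaviour described above for the eager backend.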

The HF stress tests (`tests/hf/test_single_switch.py`) auto-detect which attention backends work
on the current platform by probing each with a k=-inf GQA call at import time. Unavailable
backends are skipped.

## Fused Projections (Not Bit-Exact with Upstream HF)

The GraniteSwitch HF backend uses fused QKV and gate-up projections, symmetric with the vLLM
backend architecture. Upstream HuggingFace `GraniteMoeHybridForCausalLM` uses separate
projections. Fused projections change the floating-point reduction order, so bit-exact skinning
equivalence with the upstream HF model is not achievable. The vLLM skinning equivalence tests
are the authoritative check — both the upstream and skinned models use the same fused-projection
architecture there. The HF skinning tests in `tests/composer/test_skinning_equivalence.py` are
skipped for this reason.
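
A small, generic illustration of why a different reduction order breaks bit-exactness. It is not taken from the Granite Switch code base; it only shows that floating-point addition is not associative, so regrouping the same terms can change the last bits of the result:

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one grouping of the same three terms
right = a + (b + c)  # another grouping

print(left == right)              # the two sums differ in the last bit
print(math.isclose(left, right))  # but agree within normal tolerance
```

This is why a tolerance-based comparison can pass where a bit-exact comparison fails.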
29 changes: 29 additions & 0 deletions src/granite_switch/vllm/CLAUDE.md
@@ -0,0 +1,29 @@
# CLAUDE.md — vllm/

vLLM backend for production inference. Loaded automatically when reading any file under `src/granite_switch/vllm/`.

## Adapter Index Convention (vLLM-specific)

Punica kernels use `-1` = no adapter. Internal conversion from the shared convention:
`adapter_indices - 1` (so the shared `0` = no adapter becomes `-1` for Punica).
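
A minimal sketch of the conversion described above, using plain Python lists instead of tensors. The function name `to_punica_indices` is hypothetical:

```python
def to_punica_indices(adapter_indices: list[int]) -> list[int]:
    """Shift shared-convention indices (0 = no adapter) to Punica's (-1 = no adapter)."""
    return [index - 1 for index in adapter_indices]


# Shared convention: 0 means "no adapter", 1 is the first adapter, and so on.
shared = [0, 1, 3, 0]
punica = to_punica_indices(shared)
print(punica)
```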

## Known Limitation: TP Row-Parallel Bias Doubling

`SwitchedLoRALinear`'s row-parallel bypass path passes bias to all TP ranks instead of
suppressing it for rank > 0. After all-reduce this doubles the bias. Not affected: all Granite
architectures (4.0, 4.1) use `attention_bias=False` and `mlp_bias=False`.
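
A toy sketch of the effect, assuming a row-parallel layer in which each rank computes a partial dot product over its shard and the partials are then summed by the all-reduce. The function `row_parallel_linear` and its scalar output are simplifications for illustration and do not mirror `SwitchedLoRALinear`:

```python
def row_parallel_linear(x_shards, w_shards, bias, bias_on_all_ranks):
    """Simulate a row-parallel linear layer producing a single scalar output."""
    partials = []
    for rank, (x, w) in enumerate(zip(x_shards, w_shards)):
        partial = sum(xi * wi for xi, wi in zip(x, w))
        # Correct behaviour adds the bias on rank 0 only.
        if bias_on_all_ranks or rank == 0:
            partial += bias
        partials.append(partial)
    return sum(partials)  # stands in for the all-reduce


x_shards = [[1.0, 2.0], [3.0, 4.0]]  # input features split across 2 ranks
w_shards = [[0.5, 0.5], [0.5, 0.5]]  # matching weight shards

correct = row_parallel_linear(x_shards, w_shards, bias=1.0, bias_on_all_ranks=False)
doubled = row_parallel_linear(x_shards, w_shards, bias=1.0, bias_on_all_ranks=True)
no_bias = row_parallel_linear(x_shards, w_shards, bias=0.0, bias_on_all_ranks=True)
print(correct, doubled, no_bias)
```

With a zero bias both paths give the same result, which matches the statement that models without attention or MLP bias are unaffected.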

## Deployment

```bash
# Verify plugin registration
python -c "from vllm.plugins import load_general_plugins; \
from vllm import ModelRegistry; \
load_general_plugins(); \
print('OK' if 'GraniteSwitchForCausalLM' in ModelRegistry.get_supported_archs() else 'FAIL')"

# Start API server
python -m vllm.entrypoints.openai.api_server \
--model ./granite-with-all-aloras \
--port 8000
```
40 changes: 40 additions & 0 deletions tutorials/CLAUDE.md
@@ -0,0 +1,40 @@
# CLAUDE.md — tutorials/

This file provides guidance when working on notebooks and guides in this directory.
Claude loads it automatically when reading any file under `tutorials/`.

## Notebook Cell Ordering

Every notebook follows this cell order:

1. `%pip install ...` — dependencies
2. HF login cell (see below)
3. Imports
4. Configuration (model path, ports, constants)
5. Long-running steps (corpus build, model load, vLLM launch)

## HF Login Cell

Every notebook that downloads gated HF models (`ibm-granite/`) must have a dedicated cell
immediately after pip install:

```python
from huggingface_hub import notebook_login
notebook_login() # needed to pull ibm-granite models from the Hub
```

Use cell id `hf-login-call` for consistency.

## Duration Comments

Add `# Estimated duration: ~2 min on A100, ~7 min on T4` to cells that download models or
launch vLLM. Put these in **notebook cells only** — not in code files under `src/`.

## Utility Modules

These live in `src/granite_switch/tutorials/` and are imported by notebooks:

- `vllm_server.py` — `launch_vllm()`, `wait_for_server()` (reads the vLLM log and prints
stage-based progress), `kill_stale_vllm_processes()`
- `chroma_loader.py` — `load_or_build_chroma()`: builds corpus on GPU, frees GPU memory with
`torch.cuda.empty_cache()`, then switches to CPU for queries so vLLM can use the full GPU
2 changes: 1 addition & 1 deletion tutorials/PREREQUISITES.md
@@ -15,7 +15,7 @@ Setup requirements for running Granite Switch tutorials.

### Python Version

-Python 3.10+ is required.
+Python 3.11–3.13 is required.
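
A minimal sketch of a check for the 3.11 to 3.13 range, for example at the top of a notebook. The helper `is_supported_python` and the two constants are hypothetical and are not part of the Granite Switch package:

```python
import sys

MIN_SUPPORTED = (3, 11)
MAX_SUPPORTED = (3, 13)


def is_supported_python(version=None) -> bool:
    """Return True when the (major, minor) version lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if not is_supported_python():
    print(f"Unsupported Python {sys.version_info.major}.{sys.version_info.minor}")
```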

### Base Installation

2 changes: 1 addition & 1 deletion tutorials/guides/build_your_own_adapter.md
@@ -183,7 +183,7 @@ The base model's tokenizer and generation assets (`generation_config.json`, `mer

## Step 4: Use the Composed Model

-> **Note:** Custom (BYOA) adapters are not supported by [Mellea](https://github.com/generative-computing/mellea). Mellea only supports the official IBM Granite Library adapters. To invoke your custom adapters, use the chat template directly as shown below.
+> **Note:** The high-level Mellea wrappers (`guardian_check`, `rag.rewrite_question`, etc.) are built for the official IBM Granite Library adapters. Custom adapters can be invoked through Mellea's lower-level `Intrinsic` API — see [Bring Your Own Adapter with Mellea](mellea_build_your_own_adapter.md). To invoke adapters without Mellea at all, use the chat template directly as shown below.

### With HuggingFace
