diff --git a/.gitignore b/.gitignore index a5dcceb..9417ad2 100644 --- a/.gitignore +++ b/.gitignore @@ -1,8 +1,10 @@ # OS files .DS_Store -# Claude Code +# Claude Code — ignore local settings but track shared skills .claude/ +!.claude/skills/ +!.claude/skills/** # Python cache files *.pyc diff --git a/CLAUDE.md b/CLAUDE.md index 404065e..1326482 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,116 +4,21 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co ## Repository Overview -**granite-switch** implements **Granite Switch**, a system for building and deploying Granite models with embedded LoRA adapters. The system is a single unified Python package (`granite_switch`) with optional extras for different backends. - -1. **Building models with embedded adapters** - Combine a base Granite model with multiple LoRA adapters into a single checkpoint -2. **Automatic adapter control** - Activate adapters via special control tokens or chat templates -3. **Fast inference** - Deploy with vLLM for speedup over standard HuggingFace inference -4. **Optional trainable switching** - Train a router to automatically select adapters per-token +**granite-switch** is a single Python package (`granite_switch`) for building and deploying Granite models with embedded LoRA adapters. Two backends share the same weight format: `granite_switch.hf` (HuggingFace, training) and `granite_switch.vllm` (production inference, 10-20x speedup via Punica kernels + PagedAttention). 
## Project Structure -``` -granite-switch/ -├── pyproject.toml # Single package definition with optional extras -├── src/ -│ └── granite_switch/ # Unified package -│ ├── __init__.py # Core exports (GraniteSwitchConfig, __version__) -│ ├── config.py # Unified GraniteSwitchConfig -│ │ -│ ├── composer/ # Compose system (requires [compose] extra) -│ │ ├── __init__.py -│ │ ├── adapter_discovery.py # Adapter discovery and resolution -│ │ ├── adapter_loader.py # Adapter weight loading -│ │ ├── arch.py # Architecture definitions -│ │ ├── compose_granite_switch.py # Main compose script (CLI entry point) -│ │ ├── compose_utils.py # GraniteSwitchComposer class -│ │ ├── tokenizer_setup.py # Tokenizer configuration for control tokens -│ │ ├── validator.py # Compose validation checks -│ │ ├── weight_remapper.py # Adapter name remapping (AdapterRemapper) -│ │ ├── weight_transfer.py # Base model weight transfer -│ │ └── reporting/ # Compose reporting utilities -│ │ ├── __init__.py -│ │ ├── adapter_analysis.py -│ │ ├── compose_report.py -│ │ ├── hiding_constant_report.py -│ │ ├── model_card.py -│ │ └── population_table.py -│ │ -│ ├── hf/ # HuggingFace backend (requires [hf] extra) -│ │ ├── __init__.py # Registers with transformers AutoConfig/AutoModel -│ │ ├── modeling_granite_switch.py -│ │ ├── core/ -│ │ │ ├── __init__.py -│ │ │ └── lora.py # SwitchedLoRALinear, MergedSwitchedLoRALinear -│ │ └── switch/ -│ │ ├── __init__.py -│ │ └── single.py # SingleSwitch (HF attention backends) -│ │ -│ └── vllm/ # vLLM backend (requires [vllm] extra) -│ ├── __init__.py # register() for vLLM plugin system -│ ├── granite_switch_model.py -│ ├── core/ -│ │ ├── __init__.py -│ │ ├── lora.py # SwitchedLoRALinear (Punica kernels) -│ │ ├── lora_kernel_meta.py -│ │ └── decoder.py # Decoder layers -│ └── switch/ -│ ├── __init__.py -│ └── single.py # SingleSwitch (vLLM Attention) -│ -├── tests/ # All tests -│ ├── unit/ # Unit tests (fastest, CPU) -│ ├── hf/ # HuggingFace-specific tests -│ ├── vllm/ # 
vLLM-specific tests -│ ├── composer/ # Compose system tests -│ ├── integration/ # Cross-backend integration tests -│ ├── regression/ # Regression tests (hf/, vllm/, integration/, shared/, tools/) -│ └── shared/ # Shared test utilities and parametrized cases -│ -├── scratch/ # Throwaway debug/diagnostic scripts (gitignored) -├── docs/ # Documentation -├── tutorials/ # Tutorials and how-to guides -├── CLAUDE.md # This file -└── README.md -``` +Key layout rules — full tree via `find src/` or `find tests/`: + +- `src/granite_switch/` — unified package; `composer/`, `hf/`, `vllm/` match the optional extras +- `tests/` — official test suite only; subdirs: `unit/`, `hf/`, `vllm/`, `composer/`, `integration/`, `regression/`, `shared/` +- `tutorials/` — notebooks and guides; see `tutorials/CLAUDE.md` for conventions ## Installation (local/dev) ```bash -# Core package only (config) -pip install -e . - -# With HuggingFace backend -pip install -e ".[hf]" - -# With vLLM backend -pip install -e ".[vllm]" - -# With compose tools -pip install -e ".[compose]" - -# Everything (development) -pip install -e ".[dev]" -``` - -## Import Paths - -```python -# Config (shared by all backends) -from granite_switch import GraniteSwitchConfig -from granite_switch.config import GraniteSwitchConfig # equivalent - -# HuggingFace backend -from granite_switch.hf import GraniteSwitchForCausalLM -from granite_switch.hf.core.lora import SwitchedLoRALinear -from granite_switch.hf.switch.single import SingleSwitch - -# vLLM backend (auto-registered via plugin entry point) -from granite_switch.vllm import register - -# Compose system -from granite_switch.composer import GraniteSwitchComposer +pip install -e ".[dev]" # everything (recommended for development) +pip install -e ".[hf,compose]" # HF + composer only (no vLLM) ``` ## File Organization Convention @@ -129,50 +34,20 @@ from granite_switch.composer import GraniteSwitchComposer ### Test Files (Python) -**All `test_*.py` test files MUST go in a 
`tests/` directory:** - -- **`tests/unit/`**: Unit tests (fastest, CPU-only) -- **`tests/hf/`**: HuggingFace implementation tests -- **`tests/vllm/`**: vLLM implementation tests -- **`tests/composer/`**: Compose system tests -- **`tests/integration/`**: Cross-implementation and end-to-end integration tests -- **`tests/regression/`**: Regression tests (hf/, vllm/, integration/, shared/, tools/) -- **`tests/shared/`**: Shared test utilities and parametrized cases - -**IMPORTANT: `tests/` is for official regression tests ONLY.** Do NOT place throwaway diagnostic, -debugging, or exploratory scripts in `tests/`. Use `scratch/` instead (it is gitignored). Running -`pytest tests/` should only execute curated, maintained tests — never one-off investigations. +**`tests/` is for official regression tests ONLY.** Do NOT place throwaway diagnostic, +debugging, or exploratory scripts in `tests/` — `pytest tests/` should only execute +curated, maintained tests, never one-off investigations. Subdirectories are listed in +Project Structure above. -### Naming Conventions +### Documentation Naming -- **Test files**: `test_*.py` -- **Documentation**: `UPPER_CASE.md` -- **Scripts**: `snake_case.py` +`UPPER_CASE.md` for docs under `docs/`. ## Development Commands -### Composing Models - -```bash -# Compose with HuggingFace adapters -python -m granite_switch.composer.compose_granite_switch \ - --adapters ibm-granite/granitelib-rag-r1.0 - -# Multiple adapters -python -m granite_switch.composer.compose_granite_switch \ - --adapters ibm-granite/granitelib-rag-r1.0 your-org/extra-adapter - -# Custom output directory -python -m granite_switch.composer.compose_granite_switch \ - --adapters ibm-granite/granitelib-rag-r1.0 --output ./my-custom-model -``` - ### Testing -**Always use `-v -s --tb=short`** when running tests. `-v` (verbose) prints each test name as -it starts, giving real-time progress visibility. 
`-s` disables output capture so `print()` -statements inside tests appear immediately instead of being swallowed. Without these, long-running -test files produce no output until they finish. `-x` (fail fast) stops on the first failure — +**Always use `-v -s --tb=short`** when running tests. `-x` (fail fast) stops on the first failure — no point running 200 more tests after something breaks. **Check GPU availability first** — the underlying hardware can change between sessions: @@ -181,9 +56,6 @@ no point running 200 more tests after something breaks. python -c "import torch; print('GPU' if torch.cuda.is_available() else 'CPU only')" ``` -This determines which tests can run. vLLM and integration tests require a GPU; unit and HF tests -run on CPU. - **Run tests incrementally by directory**, in order of speed — don't run the full suite as a single command: @@ -201,118 +73,23 @@ pytest tests/vllm/test_model_forward.py -v -s --tb=short -x # 4. Integration tests last (slowest, GPU required) pytest tests/integration/ -v -s --tb=short -x - -# Run a specific test pattern when debugging -pytest tests/ -k "pattern" -v -s --tb=short -x -``` - -### vLLM Deployment - -```bash -# Verify plugin registration -python -c "from vllm.plugins import load_general_plugins; \ - from vllm import ModelRegistry; \ - load_general_plugins(); \ - print('OK' if 'GraniteSwitchForCausalLM' in ModelRegistry.get_supported_archs() else 'FAIL')" - -# Start API server -python -m vllm.entrypoints.openai.api_server \ - --model ./granite-with-all-aloras \ - --port 8000 -``` - -## Architecture - -### Granite Switch Model - -The Granite Switch extends the base Granite model with: - -1. **Embedded LoRA Adapters** (frozen during inference) - - Multiple task/domain-specific adapters embedded in the same checkpoint - - Each adapter has LoRA weights (lora_A, lora_B) stacked in tensors - - Controlled via special tokens or router-selected indices - -2. 
**Control Tokens** - - Each adapter has a control token `<|adapter|>` that fires the switch - - KV hiding uses group-based control dimensions (K=finfo.min, Q=per-adapter policy) - - Control tokens are KV-hidden to prevent cross-request interference - -3. **Chat Template Integration** - - Maps adapter names to control tokens - - Automatic token placement based on adapter type (ALORA vs LORA) - -4. **Optional Trainable Router** (SingleSwitch) - - N transformer layers that compute adapter indices per-token - - Linear projection head to num_adapters dimensions - - ~1-2% of total model parameters - -### Two Backends - -#### HuggingFace Backend (`granite_switch.hf`) - -**Purpose**: Model building and optional router training - -- Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`) -- Training with `Trainer` API -- Standard PyTorch operations - -#### vLLM Backend (`granite_switch.vllm`) - -**Purpose**: Fast production inference (10-20x speedup) - -- Punica kernels for optimized LoRA computation -- PagedAttention for efficient KV cache -- Continuous batching, tensor/pipeline parallelism -- OpenAI-compatible API server - -### Weight Compatibility - -Both backends share the same weight format: - -```python -# Built/trained with HuggingFace -model_hf.save_pretrained("./checkpoint") - -# Loaded directly with vLLM -llm = LLM(model="./checkpoint") ``` ## Key Configuration Parameters -### Granite-Specific Parameters - - **`attention_multiplier`**: Attention score scaling (instead of `1/sqrt(head_dim)`) -- **`logits_scaling`**: Applied to final logits (main architectural difference with Llama) +- **`logits_scaling`**: Applied to final logits - **`residual_multiplier`**: Applied to residual connections - **`embedding_multiplier`**: Applied to input embeddings -Always use config values - never hardcode these parameters. 
- -### Switch Configuration - -```json -{ - "model_type": "granite_switch", - "architectures": ["GraniteSwitchForCausalLM"], - "num_adapters": 4, - "adapter_token_ids": [100, 101, 102, 103], - "adapter_names": ["adapter_0", "adapter_1", "adapter_2", "adapter_3"], - "hiding_groups": {"all_controls": ["adapter_0", "adapter_1", "adapter_2", "adapter_3"]}, - "hiding_policy": {"base": ["all_controls"], "adapter_0": ["all_controls"], "...": "..."}, - "lora_rank": 8, - "lora_alpha": 8.0, - "switch_head_dim": 32, - "control_dims": 32 -} -``` +Always use config values — never hardcode these parameters. ## Common Gotchas ### 1. Adapter Index Convention -**Control tokens**: `0` = no adapter, `1+` = adapter indices - -**vLLM Punica kernels**: `-1` = no adapter (internal conversion: `adapter_indices - 1`) +`0` = no adapter, `1+` = adapter index. (vLLM Punica kernels use a shifted convention internally — +see `src/granite_switch/vllm/CLAUDE.md`.) ### 2. Control Token Generatability @@ -324,69 +101,30 @@ model can produce any control token during generation. - **ALORA adapters**: Token placed either in user message by matching invocation sequence or right before generation prompt - **LORA adapters**: Token placed at sequence beginning -### 4. Granite vs Llama Differences - -- Granite uses `logits_scaling` (typically 8.0) -- Custom attention scaling via `attention_multiplier` -- Different residual and embedding multipliers - -Always load from config, never hardcode. - -### 5. End-to-End Tests Must Use Compose Infrastructure - -No test should manually assemble `GraniteSwitchConfig` or call `transfer_base_weights` -directly. All model construction must go through `GraniteSwitchComposer` so that the -compose pipeline itself is what's being tested. If the composer can't handle a use case -(e.g., zero-adapter skinning), extend the composer — don't work around it in tests. - -### 6. 
HF Attention Backends and Causal Masking - -The eager backend does NOT handle `attention_mask=None` as causal — it treats `None` as no mask -(full attention). SDPA and FlashAttention handle `attention_mask=None` correctly via `is_causal` -attribute on the module. - -The HF stress tests (`tests/hf/test_single_switch.py`) auto-detect which attention backends work on the -current platform by probing each with a k=-inf GQA call at import time. Unavailable backends are skipped. - -### 7. Known Limitation: Hidden Count Offset When Position 0 is in a Hiding Group +### 4. Hidden Count Offset When Position 0 is in a Hiding Group When position 0 is a control token in a hiding group (e.g., a LoRA prefix token with `add_bos_token=False`), `hidden_count` is off by 1, causing a 1-position RoPE offset. This is acceptable because adapter detection is exact and RoPE is robust to small positional shifts. -### 8. Known Limitation: TP Row-Parallel Bias Doubling - -`SwitchedLoRALinear`'s row-parallel bypass path passes bias to all TP ranks instead of -suppressing it for rank > 0. After all-reduce this doubles the bias. Not affected: all Granite -architectures (4.0, 4.1) use `attention_bias=False` and `mlp_bias=False`. +### Backend- and module-specific gotchas -### 9. HF Backend Uses Fused Projections (Not Bit-Exact with Upstream HF) +Loaded on demand from child CLAUDE.md files when you touch those modules: -The GraniteSwitch HF backend uses fused QKV and gate-up projections, symmetric with the vLLM -backend architecture. Upstream HuggingFace `GraniteMoeHybridForCausalLM` uses separate projections. -Fused projections change the floating-point reduction order, so bit-exact skinning equivalence -with the upstream HF model is not achievable. The vLLM skinning equivalence tests are the -authoritative check — both the upstream and skinned models use the same fused-projection -architecture there. 
The HF skinning tests in `tests/composer/test_skinning_equivalence.py` are -skipped for this reason. +- `src/granite_switch/hf/CLAUDE.md` — HF attention backends, fused projections vs upstream HF +- `src/granite_switch/vllm/CLAUDE.md` — Punica `-1` index, TP row-parallel bias, deployment commands +- `src/granite_switch/composer/CLAUDE.md` — compose-infra rule for e2e tests, compose CLI ## Documentation +- `docs/ARCHITECTURE.md` - Architecture overview (control tokens, backends, SingleSwitch) - `docs/GIT_WORKFLOW.md` - Git branching strategy and commit guidelines - `docs/SUPPORTED_MODELS.md` - Model compatibility ## Git Workflow -**See [docs/GIT_WORKFLOW.md](docs/GIT_WORKFLOW.md) for complete git workflow guidelines.** - -**Quick reference:** - -- **Branch naming**: `feature/ticket-ID-description` or `bugfix/ticket-ID-description` -- **Workflow**: Branch from `main` → develop → rebase → PR → merge → delete branch -- **Critical**: Always verify comments match code before committing (see GIT_WORKFLOW.md) -- **Commit format**: Clear summary + explanation of WHAT changed and WHY - -When committing, **never sign as Claude** (per project instructions) +See [docs/GIT_WORKFLOW.md](docs/GIT_WORKFLOW.md) for branch naming, commit format, and +PR workflow. **When committing, never sign as Claude** (per project instructions). ## License diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 0000000..7da3add --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,76 @@ +# Architecture + +## Granite Switch Model + +The Granite Switch extends the base Granite model with: + +### 1. Embedded LoRA Adapters (frozen during inference) + +Multiple task/domain-specific adapters are embedded in the same checkpoint. Each adapter has +LoRA weights (`lora_A`, `lora_B`) stacked in tensors and is activated via special control tokens +or router-selected indices. + +### 2. Control Tokens + +Each adapter has a control token `<|adapter|>` that fires the switch. 
KV hiding uses +group-based control dimensions (`K=finfo.min`, `Q=per-adapter policy`). Control tokens are +KV-hidden to prevent cross-request interference. + +### 3. Chat Template Integration + +The tokenizer chat template maps adapter names to control tokens and places them automatically +based on adapter type: + +- **ALORA adapters**: token placed either in the user message (by matching the invocation + sequence) or right before the generation prompt +- **LORA adapters**: token placed at sequence beginning + +### 4. Optional Trainable Router (SingleSwitch) + +SingleSwitch is a single attention head that uses a one-hot dim-0 pattern to compute per-token +adapter indices via attention-based cumsum. It has no decoder layers and no projection head — +only a vocab-size lookup table, so parameter cost is negligible relative to the full model. + +--- + +## Two Backends + +Both backends share the same checkpoint format (`save_pretrained` / `from_pretrained`). + +### HuggingFace Backend (`granite_switch.hf`) + +Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`). Used for training and +debugging. Uses fused QKV and gate-up projections, which changes floating-point reduction order +relative to the upstream `GraniteMoeHybridForCausalLM` (see "Fused Projections" in `src/granite_switch/hf/CLAUDE.md`). + +### vLLM Backend (`granite_switch.vllm`) + +Production inference backend (10-20x speedup). Uses Punica kernels for optimized LoRA +computation, PagedAttention for efficient KV cache, and supports continuous batching and +tensor/pipeline parallelism. Registered as a vLLM plugin via the `granite_switch.vllm` entry point. 
+ +--- + +## Key Configuration Fields + +These fields are specific to Granite Switch and not present in base Granite: + +| Field | Description | +|---|---| +| `num_adapters` | Number of embedded LoRA adapters | +| `adapter_token_ids` | Token IDs for each adapter's control token | +| `adapter_names` | Human-readable names for each adapter | +| `hiding_groups` | Named groups of adapters for KV hiding | +| `hiding_policy` | Per-adapter KV hiding rules | +| `lora_rank` | LoRA rank (same for all adapters) | +| `lora_alpha` | LoRA alpha scaling factor | +| `control_dims` | Number of KV dimensions reserved for control | + +### Granite-Specific Parameters (inherited from base model) + +- **`attention_multiplier`**: Attention score scaling (replaces `1/sqrt(head_dim)`) +- **`logits_scaling`**: Applied to final logits +- **`residual_multiplier`**: Applied to residual connections +- **`embedding_multiplier`**: Applied to input embeddings + +Always load these from config — never hardcode. diff --git a/docs/SUPPORTED_MODELS.md b/docs/SUPPORTED_MODELS.md index 0094911..41f6c66 100644 --- a/docs/SUPPORTED_MODELS.md +++ b/docs/SUPPORTED_MODELS.md @@ -16,16 +16,13 @@ automatically from the HuggingFace `config.model_type` field. Any Granite model whose HuggingFace config has `model_type: granite` can be used as a base model. The table below lists representative examples. -**Note:** Granite Switch currently supports single-GPU inference only. Models -that do not fit in a single GPU's memory are not yet supported. - #### Granite 4.x (`granite`) | Model Tag | Size | Variant | |---|---|---| | `ibm-granite/granite-4.1-3b` | 3B | Dense, instruct | | `ibm-granite/granite-4.1-8b` | 8B | Dense, instruct | -| `ibm-granite/granite-4.0-micro` | 3B | Dense, instruct | +| `ibm-granite/granite-4.1-30b` | 30B | Dense, instruct | Base variants (`granite-4.1-3b-base`, `granite-4.1-8b-base`) are also supported. 
diff --git a/src/granite_switch/composer/CLAUDE.md b/src/granite_switch/composer/CLAUDE.md new file mode 100644 index 0000000..7a7fbc9 --- /dev/null +++ b/src/granite_switch/composer/CLAUDE.md @@ -0,0 +1,18 @@ +# CLAUDE.md — composer/ + +Compose system: builds Granite Switch checkpoints from a base model + LoRA adapters. Loaded +automatically when reading any file under `src/granite_switch/composer/`. + +## End-to-End Tests Must Use Compose Infrastructure + +No test should manually assemble `GraniteSwitchConfig` or call `transfer_base_weights` directly. +All model construction must go through `GraniteSwitchComposer` so that the compose pipeline +itself is what's being tested. If the composer can't handle a use case (e.g., zero-adapter +skinning), extend the composer — don't work around it in tests. + +## Composing Models + +```bash +python -m granite_switch.composer.compose_granite_switch \ + --adapters ibm-granite/granitelib-rag-r1.0 +``` diff --git a/src/granite_switch/hf/CLAUDE.md b/src/granite_switch/hf/CLAUDE.md new file mode 100644 index 0000000..0e8e935 --- /dev/null +++ b/src/granite_switch/hf/CLAUDE.md @@ -0,0 +1,23 @@ +# CLAUDE.md — hf/ + +HuggingFace backend for training and debugging. Loaded automatically when reading any file under `src/granite_switch/hf/`. + +## HF Attention Backends and Causal Masking + +The eager backend does NOT handle `attention_mask=None` as causal — it treats `None` as no mask +(full attention). SDPA and FlashAttention handle `attention_mask=None` correctly via `is_causal` +attribute on the module. + +The HF stress tests (`tests/hf/test_single_switch.py`) auto-detect which attention backends work +on the current platform by probing each with a k=-inf GQA call at import time. Unavailable +backends are skipped. + +## Fused Projections (Not Bit-Exact with Upstream HF) + +The GraniteSwitch HF backend uses fused QKV and gate-up projections, symmetric with the vLLM +backend architecture. 
Upstream HuggingFace `GraniteMoeHybridForCausalLM` uses separate +projections. Fused projections change the floating-point reduction order, so bit-exact skinning +equivalence with the upstream HF model is not achievable. The vLLM skinning equivalence tests +are the authoritative check — both the upstream and skinned models use the same fused-projection +architecture there. The HF skinning tests in `tests/composer/test_skinning_equivalence.py` are +skipped for this reason. diff --git a/src/granite_switch/vllm/CLAUDE.md b/src/granite_switch/vllm/CLAUDE.md new file mode 100644 index 0000000..b1c6018 --- /dev/null +++ b/src/granite_switch/vllm/CLAUDE.md @@ -0,0 +1,29 @@ +# CLAUDE.md — vllm/ + +vLLM backend for production inference. Loaded automatically when reading any file under `src/granite_switch/vllm/`. + +## Adapter Index Convention (vLLM-specific) + +Punica kernels use `-1` = no adapter. Internal conversion from the shared convention: +`adapter_indices - 1` (so the shared `0` = no adapter becomes `-1` for Punica). + +## Known Limitation: TP Row-Parallel Bias Doubling + +`SwitchedLoRALinear`'s row-parallel bypass path passes bias to all TP ranks instead of +suppressing it for rank > 0. After all-reduce this doubles the bias. Not affected: all Granite +architectures (4.0, 4.1) use `attention_bias=False` and `mlp_bias=False`. 
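Reviewer sketch (not part of the patch): the row-parallel bias doubling described above can be illustrated with a toy simulation. This is a minimal pure-Python sketch under stated assumptions, not the `SwitchedLoRALinear` code. The function `row_parallel_linear` and the flag `bias_on_all_ranks` are hypothetical names introduced only for illustration, and the all-reduce is modelled as a plain sum over two simulated ranks.

```python
# Hypothetical toy model of a row-parallel linear layer; it does not use vLLM.
def row_parallel_linear(x, weight, bias, tp_size, bias_on_all_ranks):
    """Simulate y = W x + b with the input dimension sharded across TP ranks."""
    shard = len(x) // tp_size
    out_dim = len(weight)
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each rank multiplies its slice of the input by the matching weight columns.
        y = [sum(weight[o][i] * x[i] for i in range(lo, hi)) for o in range(out_dim)]
        # Expected behaviour: only rank 0 adds the bias before the all-reduce.
        if bias_on_all_ranks or rank == 0:
            y = [y[o] + bias[o] for o in range(out_dim)]
        partials.append(y)
    # All-reduce modelled as an element-wise sum over ranks.
    return [sum(p[o] for p in partials) for o in range(out_dim)]


x = [1.0, 2.0, 3.0, 4.0]
weight = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
bias = [0.5, -0.5]

correct = row_parallel_linear(x, weight, bias, tp_size=2, bias_on_all_ranks=False)
buggy = row_parallel_linear(x, weight, bias, tp_size=2, bias_on_all_ranks=True)
print(correct)  # bias counted once
print(buggy)    # bias counted once per rank, i.e. doubled for tp_size=2
```

With `bias = [0.0, 0.0]` the two results coincide, which matches the note that configurations with `attention_bias=False` and `mlp_bias=False` are unaffected.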
+ +## Deployment + +```bash +# Verify plugin registration +python -c "from vllm.plugins import load_general_plugins; \ + from vllm import ModelRegistry; \ + load_general_plugins(); \ + print('OK' if 'GraniteSwitchForCausalLM' in ModelRegistry.get_supported_archs() else 'FAIL')" + +# Start API server +python -m vllm.entrypoints.openai.api_server \ + --model ./granite-with-all-aloras \ + --port 8000 +``` diff --git a/tutorials/CLAUDE.md b/tutorials/CLAUDE.md new file mode 100644 index 0000000..7306b5c --- /dev/null +++ b/tutorials/CLAUDE.md @@ -0,0 +1,40 @@ +# CLAUDE.md — tutorials/ + +This file provides guidance when working on notebooks and guides in this directory. +Claude loads it automatically when reading any file under `tutorials/`. + +## Notebook Cell Ordering + +Every notebook follows this cell order: + +1. `%pip install ...` — dependencies +2. HF login cell (see below) +3. Imports +4. Configuration (model path, ports, constants) +5. Long-running steps (corpus build, model load, vLLM launch) + +## HF Login Cell + +Every notebook that downloads gated HF models (`ibm-granite/`) must have a dedicated cell +immediately after pip install: + +```python +from huggingface_hub import notebook_login +notebook_login() # needed to pull ibm-granite models from the Hub +``` + +Use cell id `hf-login-call` for consistency. + +## Duration Comments + +Add `# Estimated duration: ~2 min on A100, ~7 min on T4` to cells that download models or +launch vLLM. Put these in **notebook cells only** — not in code files under `src/`. 
+ +## Utility Modules + +These live in `src/granite_switch/tutorials/` and are imported by notebooks: + +- `vllm_server.py` — `launch_vllm()`, `wait_for_server()` (reads the vLLM log and prints + stage-based progress), `kill_stale_vllm_processes()` +- `chroma_loader.py` — `load_or_build_chroma()`: builds corpus on GPU, frees GPU memory with + `torch.cuda.empty_cache()`, then switches to CPU for queries so vLLM can use the full GPU diff --git a/tutorials/PREREQUISITES.md b/tutorials/PREREQUISITES.md index 9203aff..aa75d0e 100644 --- a/tutorials/PREREQUISITES.md +++ b/tutorials/PREREQUISITES.md @@ -15,7 +15,7 @@ Setup requirements for running Granite Switch tutorials. ### Python Version -Python 3.10+ is required. +Python 3.11–3.13 is required. ### Base Installation diff --git a/tutorials/guides/build_your_own_adapter.md b/tutorials/guides/build_your_own_adapter.md index 7af4d92..9a894ce 100644 --- a/tutorials/guides/build_your_own_adapter.md +++ b/tutorials/guides/build_your_own_adapter.md @@ -183,7 +183,7 @@ The base model's tokenizer and generation assets (`generation_config.json`, `mer ## Step 4: Use the Composed Model -> **Note:** Custom (BYOA) adapters are not supported by [Mellea](https://github.com/generative-computing/mellea). Mellea only supports the official IBM Granite Library adapters. To invoke your custom adapters, use the chat template directly as shown below. +> **Note:** The high-level Mellea wrappers (`guardian_check`, `rag.rewrite_question`, etc.) are built for the official IBM Granite Library adapters. Custom adapters can be invoked through Mellea's lower-level `Intrinsic` API — see [Bring Your Own Adapter with Mellea](mellea_build_your_own_adapter.md). To invoke adapters without Mellea at all, use the chat template directly as shown below. ### With HuggingFace