Cover from file (src_audio), docs, README strength clarification

qxip · qxip · commit e860c79d45ce · 2026-02-28T23:47:00.000+01:00
- src_audio: load WAV/MP3, VAE encode, FSQ nearest-codeword encode to codes
  (fsq-detok.h: codeword table + latent_frames_to_codes; dit-vae: wire path)
- reference_audio + cover (audio_codes/src_audio) fully supported without Python
- MODES.md: cover and reference_audio marked supported; request table updated
- README: clarify audio_cover_strength vs guidance_scale vs reference_audio
  (audio_cover_strength = cover blend; reference_audio = no strength knob;
   guidance_scale = DiT CFG, separate)

Made-with: Cursor
diff --git a/README.md b/README.md
@@ -214,7 +214,12 @@ All fields with defaults. Only `caption` is required. Built-in modes (text2music
 Key fields: `seed` -1 means random (resolved once, then +1 per batch
 element). `audio_codes` is generated by ace-qwen3 and consumed by
 dit-vae (comma separated FSQ token IDs). When present, the LLM is
-skipped entirely (cover-style generation). `reference_audio`: path to a **WAV or MP3** file for global timbre/style. `src_audio`: not yet implemented (see docs/MODES.md).
+skipped entirely (cover-style generation). `reference_audio`: path to a **WAV or MP3** file for global timbre/style (MP3 is decoded in memory; the audio is encoded by the built-in VAE encoder, which requires a VAE GGUF with encoder weights). `src_audio`: path to a **WAV or MP3** file used as the cover source; dit-vae encodes it to codes internally (VAE encode + FSQ nearest-codeword), no Python required (see docs/MODES.md).
+
+**Reference and cover strength (not the same as guidance_scale):**
+- **`audio_cover_strength`** (0.0–1.0): Controls how strongly the **cover/source** (from `audio_codes` or `src_audio`) influences the DiT context. The context is blended with silence: `(1 - audio_cover_strength)*silence + audio_cover_strength*decoded`. Use 1.0 for full cover influence, lower values to soften it. Only applies when cover context is present.
+- **`reference_audio`**: Timbre from the reference file is applied at full strength; there is no separate strength parameter for reference timbre.
+- **`guidance_scale`**: This is **DiT classifier-free guidance** (conditioned vs unconditioned prediction), not reference or cover strength. Turbo models ignore it (forced to 1.0).
 
 Turbo preset: `inference_steps=8, shift=3.0` (no guidance_scale, turbo models don't use CFG).
 SFT preset: `inference_steps=50, guidance_scale=4.0, shift=6.0`.
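
The `audio_cover_strength` blend described in the README hunk above can be sketched as follows. This is a minimal illustration, not the dit-vae implementation: `blend_cover_context` is a hypothetical helper, the flat `std::vector<float>` layout is an assumption, and clamping the strength to [0, 1] is an assumption based on the documented range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of dit-vae): blend decoded cover latents with
// the silence latent, element by element:
//   out = (1 - strength) * silence + strength * decoded
// Assumption: strength is clamped to the documented range [0, 1].
static std::vector<float> blend_cover_context(const std::vector<float> & silence,
                                              const std::vector<float> & decoded,
                                              float strength) {
    assert(silence.size() == decoded.size());
    const float s = std::max(0.0f, std::min(1.0f, strength));
    std::vector<float> out(decoded.size());
    for (size_t i = 0; i < decoded.size(); i++) {
        out[i] = (1.0f - s) * silence[i] + s * decoded[i];
    }
    return out;
}
```

With `strength = 1.0` the result equals the decoded cover latents; with `strength = 0.0` it equals silence, which is the same context text2music uses.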
diff --git a/docs/MODES.md b/docs/MODES.md
@@ -7,7 +7,7 @@ This document maps the [ACE-Step 1.5 Tutorial](https://github.com/ace-step/ACE-S
 | `task_type`   | Description | Turbo/SFT | Base only | C++ status |
 |---------------|-------------|-----------|-----------|------------|
 | **text2music** | Generate from caption/lyrics (and optional reference) | ✅ | — | ✅ **Supported** |
-| **cover**      | Re-synthesize with structure from source; optional timbre from reference | ✅ | — | ⚠️ **Partial** (see below) |
+| **cover**      | Re-synthesize with structure from source; optional timbre from reference | ✅ | — | ✅ **Supported** (audio_codes or src_audio WAV/MP3) |
 | **repaint**    | Local edit in time range using source as context | ✅ | — | ❌ Not implemented |
 | **lego**       | Add new tracks to existing audio | — | ✅ | ❌ Base model only |
 | **extract**    | Extract single track from mix | — | ✅ | ❌ Base model only |
@@ -22,30 +22,16 @@ We only ship Turbo and SFT DiT weights; **lego**, **extract**, **complete** requ
 ### text2music (default)
 - **Input**: `caption`, optional `lyrics`, metadata (bpm, duration, keyscale, …).
 - **Flow**: LM (optional) → CoT + audio codes → DiT (context = silence) → VAE → WAV.
-- **Timbre**: Always uses built-in silence latent from the DiT GGUF (no user reference yet).
-
-### cover (when `audio_codes` are provided)
-- **Input**: Same as text2music, plus **precomputed** `audio_codes` (e.g. from a previous run or from Python).
-- **Flow**: Skip LM; decode `audio_codes` to latents → DiT context = decoded + silence padding → DiT → VAE → WAV.
-- **Limitation**: We do **not** convert a WAV file into `audio_codes`. So “cover from a file” is only possible if you already have codes (e.g. from Python or from a prior `ace-qwen3` run). The request fields `reference_audio` and `src_audio` are accepted in JSON but **not yet used** in the pipeline.
+- **Timbre**: Optional **reference_audio** (WAV/MP3) → VAE encode → CondEncoder timbre; else built-in silence.
 
+### cover (when `audio_codes` or `src_audio` is provided)
+- **Input**: Same as text2music, plus either **precomputed** `audio_codes` or **`src_audio`** (WAV/MP3 path). Optional **reference_audio** for timbre.
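+- **Flow**: If `src_audio` is set and `audio_codes` is empty: load WAV/MP3 → VAE encode → FSQ nearest-codeword encode → codes. Then, for either input: decode codes to latents → DiT context (blended with silence per `audio_cover_strength`) → DiT → VAE → WAV. No Python required.
+- **reference_audio** and **audio_cover_strength**: Both implemented. `reference_audio` sets the timbre; `audio_cover_strength` blends the decoded context with silence.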
 ---
 
 ## What’s not implemented yet
 
-### reference_audio (global timbre/style)
-- **Tutorial**: Load WAV → stereo 48 kHz, pad/repeat to ≥30 s → **VAE encode** → latents → feed as timbre condition into DiT.
-- **C++**: Implemented. Set `reference_audio` to a **WAV or MP3 file path**. dit-vae loads the file (WAV: any sample rate resampled to 48 kHz; MP3: decoded in memory via header-only minimp3, no temp files, then resampled to 48 kHz if needed), runs the **VAE encoder** (Oobleck, in C++ in `vae.h`), and feeds the 64-d latents to the CondEncoder timbre path. No Python, no external deps. Requires a **full VAE GGUF** that includes `encoder.*` tensors (decoder-only GGUFs will print a clear error).
-- **audio_cover_strength** (0.0–1.0): Implemented. When `audio_codes` are present, context latents are blended with silence: `(1 - strength)*silence + strength*decoded`.
-
-### src_audio (Cover from file)
-- **Tutorial**: Source audio is converted to **semantic codes** (melody, rhythm, chords, etc.); then DiT uses those as in cover mode.
-- **C++**: That implies **audio → codes**. Likely path: WAV → VAE encode → **FSQ tokenizer** (latents → 5 Hz codes). We have the **FSQ detokenizer** (codes → latents); the tokenizer (encode) side would need to be added. Then: `src_audio` path → load WAV → VAE encode → FSQ encode → `audio_codes` → existing cover path.
-
-### audio_cover_strength
-- **Tutorial**: 0.0–1.0, how strongly generation follows reference/codes.
-- **C++**: Field is in the request and parsed; no blending logic in the DiT/context path yet.
-
 ### repaint
 - **Tutorial**: Specify `repainting_start` / `repainting_end` (seconds); model uses source audio as context and only generates in that interval (3–90 s).
 - **C++**: Would require **masked diffusion**: context carries “given” frames; ODE only updates the repaint region. DiT’s context has a 64-channel “mask” that we currently set to 1.0; repaint would set mask per frame and the generation loop would only update unmasked frames. Not implemented.
@@ -60,9 +46,9 @@ All of these are in `AceRequest` and parsed from / written to JSON. Backend beha
 |-------|------|--------|
 | `task_type` | string | `"text2music"` \| `"cover"` \| `"repaint"` \| … |
 | `reference_audio` | string | Path to WAV or MP3 for timbre (implemented) |
-| `src_audio` | string | Path to WAV for cover/repaint source (not used yet) |
-| `audio_codes` | string | Comma-separated FSQ codes; non-empty ⇒ cover path |
-| `audio_cover_strength` | float | 0.0–1.0 (parsed, not used yet) |
+| `src_audio` | string | Path to WAV or MP3 for cover source; encoded to codes internally (implemented) |
+| `audio_codes` | string | Comma-separated FSQ codes; non-empty ⇒ cover path (derived from `src_audio` when empty) |
+| `audio_cover_strength` | float | 0.0–1.0 blend of decoded context with silence (implemented) |
 | `repainting_start` | float | Start time (s) for repaint (not used yet) |
 | `repainting_end` | float | End time (s) for repaint (not used yet) |
 
@@ -72,8 +58,6 @@ See `request.h` and the README “Request JSON reference” for the full list.
 
 ## Summary
 
-- **Fully supported**: text2music; cover when you supply **precomputed** `audio_codes`.
-- **Schema only** (no backend): `task_type`, `reference_audio`, `src_audio`, `audio_cover_strength`, `repainting_start`/`repainting_end`.
-- **To support reference_audio**: add VAE encoder, then feed its output into the existing CondEncoder timbre path.
-- **To support cover from file**: add VAE encoder + FSQ tokenizer (or equivalent audio→codes), then reuse existing cover path.
+- **Fully supported**: text2music (with optional `reference_audio` for timbre); cover from **precomputed** `audio_codes` or from **WAV/MP3** via `src_audio` (VAE encode + FSQ nearest-codeword encode), with `audio_cover_strength` controlling the blend with silence.
+- **Schema only** (no backend): `repainting_start`/`repainting_end`.
 - **To support repaint**: implement masked DiT generation (context mask + ODE only on repaint interval).
diff --git a/src/fsq-detok.h b/src/fsq-detok.h
@@ -10,10 +10,13 @@
 
 #pragma once
 #include "qwen3-enc.h"
+#include <vector>
 
 // FSQ constants
 static const int FSQ_NDIMS = 6;
 static const int FSQ_LEVELS[6] = {8, 8, 8, 5, 5, 5};
+static const int FSQ_N_CODES = 8 * 8 * 8 * 5 * 5 * 5;  // 8000
+static const int FSQ_FRAMES_PER_CODE = 5;
 
 // FSQ decode: integer index -> 6 normalized float values
 // Each dimension: level_idx / ((L-1)/2) - 1.0  (maps to [-1, 1])
@@ -214,6 +217,48 @@ static int detok_ggml_decode(DetokGGML * m, const int * codes, int T_5Hz,
     return T_25Hz;
 }
 
+// Build codeword table for latent->code (cover from file): for each code 0..FSQ_N_CODES-1,
+// decode to 5*64 floats. table_out must be at least FSQ_N_CODES * FSQ_FRAMES_PER_CODE * 64 floats.
+static void detok_ggml_build_codeword_table(DetokGGML * m, float * table_out) {
+    const int chunk = FSQ_FRAMES_PER_CODE * 64;
+    for (int i = 0; i < FSQ_N_CODES; i++) {
+        int n = detok_ggml_decode(m, &i, 1, table_out + (size_t)i * chunk);
+        (void)n;
+    }
+}
+
+// Encode latent frames to 5Hz codes by nearest codeword. T_latent = number of 25Hz frames (64-d each).
+// Groups frames in chunks of 5; for each chunk finds the code whose codeword minimizes L2 distance.
+// codeword_table from detok_ggml_build_codeword_table (FSQ_N_CODES * 5 * 64 floats).
+// Trailing frames that do not fill a complete chunk of 5 are dropped (no padding).
+static void latent_frames_to_codes(int T_latent, const float * latent_64d,
+                                   const float * codeword_table,
+                                   std::vector<int> * out_codes) {
+    out_codes->clear();
+    const int chunk_frames = FSQ_FRAMES_PER_CODE;
+    const int chunk_size = chunk_frames * 64;
+    int n_chunks = T_latent / chunk_frames;
+    if (n_chunks <= 0) return;
+    for (int g = 0; g < n_chunks; g++) {
+        const float * chunk = latent_64d + (size_t)g * chunk_size;
+        int best = 0;
+        float best_d2 = 1e30f;
+        for (int i = 0; i < FSQ_N_CODES; i++) {
+            const float * cw = codeword_table + (size_t)i * chunk_size;
+            float d2 = 0.0f;
+            for (int j = 0; j < chunk_size; j++) {
+                float d = chunk[j] - cw[j];
+                d2 += d * d;
+            }
+            if (d2 < best_d2) {
+                best_d2 = d2;
+                best = i;
+            }
+        }
+        out_codes->push_back(best);
+    }
+}
+
 // Free
 static void detok_ggml_free(DetokGGML * m) {
     if (m->sched) ggml_backend_sched_free(m->sched);
diff --git a/tools/dit-vae.cpp b/tools/dit-vae.cpp
@@ -277,8 +277,39 @@ int main(int argc, char ** argv) {
         fprintf(stderr, "[Pipeline] seed=%lld, steps=%d, guidance=%.1f, shift=%.1f, duration=%.1fs\n",
                 seed, num_steps, guidance_scale, shift, duration);
 
-        // Parse audio codes from request
+        // Parse audio codes from request (or produce from src_audio WAV/MP3)
         std::vector<int> codes_vec = parse_codes_string(req.audio_codes);
+        if (codes_vec.empty() && !req.src_audio.empty() && have_vae) {
+            const std::string & src_path = req.src_audio;
+            std::vector<float> wav_stereo;
+            int n_samples = load_audio_48k_stereo(src_path.c_str(), &wav_stereo);
+            if (n_samples > 0) {
+                int T_audio = n_samples;
+                if (T_audio >= 1920) {
+                    VAEEncoderGGML enc = {};
+                    if (vae_encoder_load(&enc, vae_gguf)) {
+                        size_t max_lat = (size_t)(T_audio / 1920) + 1;  // 48 kHz / 25 Hz = 1920 samples per latent frame
+                        std::vector<float> enc_out(max_lat * 64);
+                        int T_lat = vae_encoder_forward(&enc, wav_stereo.data(), T_audio, enc_out.data());
+                        vae_encoder_free(&enc);
+                        if (T_lat >= FSQ_FRAMES_PER_CODE) {
+                            DetokGGML detok = {};
+                            if (detok_ggml_load(&detok, dit_gguf, model.backend, model.cpu_backend)) {
+                                std::vector<float> codeword_table((size_t)FSQ_N_CODES * FSQ_FRAMES_PER_CODE * 64);
+                                fprintf(stderr, "[Cover] building FSQ codeword table (8000 codes)...\n");
+                                detok_ggml_build_codeword_table(&detok, codeword_table.data());
+                                latent_frames_to_codes(T_lat, enc_out.data(), codeword_table.data(), &codes_vec);
+                                fprintf(stderr, "[Cover] encoded %s -> %zu codes (%.1fs @ 5Hz)\n",
+                                        src_path.c_str(), codes_vec.size(), (float)codes_vec.size() / 5.0f);
+                                detok_ggml_free(&detok);
+                            }
+                        }
+                    }
+                }
+            } else {
+                fprintf(stderr, "[Cover] WARNING: cannot load src_audio %s (use .wav or .mp3), skipping cover-from-file\n", src_path.c_str());
+            }
+        }
         if (!codes_vec.empty())
             fprintf(stderr, "[Pipeline] %zu audio codes (%.1fs @ 5Hz)\n",
                     codes_vec.size(), (float)codes_vec.size() / 5.0f);
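
The sizes involved in the cover-from-file path can be worked through with a short sketch. It assumes 48 kHz input and 25 Hz latent frames (so 1920 samples per frame), as the docs in this patch describe, and 5 latent frames per code. `CoverSizes` and `cover_sizes` are hypothetical names, `n_samples` is assumed to be a per-channel sample count, and the real `vae_encoder_forward` may round the frame count differently.

```cpp
#include <cstddef>

// Assumptions: 48 kHz audio, 25 Hz latent frames, 5 latent frames per FSQ code.
static const int SAMPLE_RATE       = 48000;
static const int LATENT_RATE_HZ    = 25;
static const int SAMPLES_PER_FRAME = SAMPLE_RATE / LATENT_RATE_HZ;  // 1920
static const int FRAMES_PER_CODE   = 5;  // mirrors FSQ_FRAMES_PER_CODE

// Hypothetical summary of the sizes at each stage of audio -> codes.
struct CoverSizes {
    size_t latent_frames;   // complete 25 Hz latent frames
    size_t codes;           // complete 5 Hz codes
    size_t dropped_frames;  // trailing frames that do not fill a chunk of 5
};

static CoverSizes cover_sizes(size_t n_samples) {
    CoverSizes s;
    s.latent_frames  = n_samples / SAMPLES_PER_FRAME;
    s.codes          = s.latent_frames / FRAMES_PER_CODE;
    s.dropped_frames = s.latent_frames % FRAMES_PER_CODE;
    return s;
}
```

Under these assumptions, 10 s of audio (480000 samples) gives 250 latent frames and 50 codes, while 10.5 s (504000 samples) gives 262 frames and 52 codes, with 2 trailing frames dropped because `latent_frames_to_codes` uses integer division.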