Commit e860c79

author
qxip
committed
Cover from file (src_audio), docs, README strength clarification
- src_audio: load WAV/MP3, VAE encode, FSQ nearest-codeword encode to codes (fsq-detok.h: codeword table + latent_frames_to_codes; dit-vae: wire path)
- reference_audio + cover (audio_codes/src_audio) fully supported without Python
- MODES.md: cover and reference_audio marked supported; request table updated
- README: clarify audio_cover_strength vs guidance_scale vs reference_audio (audio_cover_strength = cover blend; reference_audio = no strength knob; guidance_scale = DiT CFG, separate)

Made-with: Cursor
1 parent 5a1e7a3 commit e860c79

4 files changed

Lines changed: 94 additions & 29 deletions

File tree

README.md

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -214,7 +214,12 @@ All fields with defaults. Only `caption` is required. Built-in modes (text2music
214214
Key fields: `seed` -1 means random (resolved once, then +1 per batch
215215
element). `audio_codes` is generated by ace-qwen3 and consumed by
216216
dit-vae (comma separated FSQ token IDs). When present, the LLM is
217-
skipped entirely (cover-style generation). `reference_audio`: path to a **WAV or MP3** file for global timbre/style. `src_audio`: not yet implemented (see docs/MODES.md).
217+
skipped entirely (cover-style generation). `reference_audio`: path to a **WAV or MP3** file for global timbre/style (MP3 decoded in memory; encoded via built-in VAE encoder; requires VAE GGUF with encoder weights). `src_audio`: path to a **WAV or MP3** for cover source; dit-vae encodes it (VAE + FSQ nearest-codeword) to codes internally, no Python required (see docs/MODES.md).
218+
219+
**Reference and cover strength (not the same as guidance_scale):**
220+
- **`audio_cover_strength`** (0.0–1.0): Controls how strongly the **cover/source** (from `audio_codes` or `src_audio`) influences the DiT context. The context is blended with silence: `(1 - audio_cover_strength)*silence + audio_cover_strength*decoded`. Use 1.0 for full cover influence, lower values to soften it. Only applies when cover context is present.
221+
- **`reference_audio`**: Timbre from the reference file is applied at full strength; there is no separate strength parameter for reference timbre.
222+
- **`guidance_scale`**: This is **DiT classifier-free guidance** (conditioned vs unconditioned prediction), not reference or cover strength. Turbo models ignore it (forced to 1.0).
218223

219224
Turbo preset: `inference_steps=8, shift=3.0` (no guidance_scale, turbo models don't use CFG).
220225
SFT preset: `inference_steps=50, guidance_scale=4.0, shift=6.0`.
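
Reviewer note: the blend described for `audio_cover_strength` can be sketched as a per-element linear interpolation. This is a minimal illustration, not the code in dit-vae. The function name `blend_cover_context` and the clamping of `strength` to [0, 1] are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the repository): blends decoded cover latents
// with the silence latent, element by element:
//   out[i] = (1 - strength) * silence[i] + strength * decoded[i]
// Clamping strength to [0, 1] is an assumption about the desired behavior.
static std::vector<float> blend_cover_context(const std::vector<float> & silence,
                                              const std::vector<float> & decoded,
                                              float strength) {
    assert(silence.size() == decoded.size());
    if (strength < 0.0f) strength = 0.0f;
    if (strength > 1.0f) strength = 1.0f;
    std::vector<float> out(decoded.size());
    for (size_t i = 0; i < decoded.size(); i++) {
        out[i] = (1.0f - strength) * silence[i] + strength * decoded[i];
    }
    return out;
}
```

With `strength = 1.0` the output equals the decoded cover latents, and with `strength = 0.0` it equals the silence latent, which matches the README wording.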

docs/MODES.md

Lines changed: 11 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ This document maps the [ACE-Step 1.5 Tutorial](https://github.com/ace-step/ACE-S
77
| `task_type` | Description | Turbo/SFT | Base only | C++ status |
88
|---------------|-------------|-----------|-----------|------------|
99
| **text2music** | Generate from caption/lyrics (and optional reference) |||**Supported** |
10-
| **cover** | Re-synthesize with structure from source; optional timbre from reference ||| ⚠️ **Partial** (see below) |
10+
| **cover** | Re-synthesize with structure from source; optional timbre from reference ||| **Supported** (audio_codes or src_audio WAV/MP3) |
1111
| **repaint** | Local edit in time range using source as context ||| ❌ Not implemented |
1212
| **lego** | Add new tracks to existing audio ||| ❌ Base model only |
1313
| **extract** | Extract single track from mix ||| ❌ Base model only |
@@ -22,30 +22,16 @@ We only ship Turbo and SFT DiT weights; **lego**, **extract**, **complete** requ
2222
### text2music (default)
2323
- **Input**: `caption`, optional `lyrics`, metadata (bpm, duration, keyscale, …).
2424
- **Flow**: LM (optional) → CoT + audio codes → DiT (context = silence) → VAE → WAV.
25-
- **Timbre**: Always uses built-in silence latent from the DiT GGUF (no user reference yet).
26-
27-
### cover (when `audio_codes` are provided)
28-
- **Input**: Same as text2music, plus **precomputed** `audio_codes` (e.g. from a previous run or from Python).
29-
- **Flow**: Skip LM; decode `audio_codes` to latents → DiT context = decoded + silence padding → DiT → VAE → WAV.
30-
- **Limitation**: We do **not** convert a WAV file into `audio_codes`. So “cover from a file” is only possible if you already have codes (e.g. from Python or from a prior `ace-qwen3` run). The request fields `reference_audio` and `src_audio` are accepted in JSON but **not yet used** in the pipeline.
25+
- **Timbre**: Optional **reference_audio** (WAV/MP3) → VAE encode → CondEncoder timbre; else built-in silence.
3126

27+
### cover (when `audio_codes` or `src_audio` are provided)
28+
- **Input**: Same as text2music, plus either **precomputed** `audio_codes` or **`src_audio`** (WAV/MP3 path). Optional **reference_audio** for timbre.
29+
- **Flow**: If `src_audio` set and no `audio_codes`: load WAV/MP3 → VAE encode → FSQ nearest-codeword encode → codes. Then decode codes to latents → DiT context (blend with silence) → DiT → VAE → WAV. No Python.
30+
- **reference_audio** (timbre) and **audio_cover_strength** (blend of decoded context with silence) are both implemented.
3231
---
3332

3433
## What’s not implemented yet
3534

36-
### reference_audio (global timbre/style)
37-
- **Tutorial**: Load WAV → stereo 48 kHz, pad/repeat to ≥30 s → **VAE encode** → latents → feed as timbre condition into DiT.
38-
- **C++**: Implemented. Set `reference_audio` to a **WAV or MP3 file path**. dit-vae loads the file (WAV: any sample rate resampled to 48 kHz; MP3: decoded in memory via header-only minimp3, no temp files, then resampled to 48 kHz if needed), runs the **VAE encoder** (Oobleck, in C++ in `vae.h`), and feeds the 64-d latents to the CondEncoder timbre path. No Python, no external deps. Requires a **full VAE GGUF** that includes `encoder.*` tensors (decoder-only GGUFs will print a clear error).
39-
- **audio_cover_strength** (0.0–1.0): Implemented. When `audio_codes` are present, context latents are blended with silence: `(1 - strength)*silence + strength*decoded`.
40-
41-
### src_audio (Cover from file)
42-
- **Tutorial**: Source audio is converted to **semantic codes** (melody, rhythm, chords, etc.); then DiT uses those as in cover mode.
43-
- **C++**: That implies **audio → codes**. Likely path: WAV → VAE encode → **FSQ tokenizer** (latents → 5 Hz codes). We have the **FSQ detokenizer** (codes → latents); the tokenizer (encode) side would need to be added. Then: `src_audio` path → load WAV → VAE encode → FSQ encode → `audio_codes` → existing cover path.
44-
45-
### audio_cover_strength
46-
- **Tutorial**: 0.0–1.0, how strongly generation follows reference/codes.
47-
- **C++**: Field is in the request and parsed; no blending logic in the DiT/context path yet.
48-
4935
### repaint
5036
- **Tutorial**: Specify `repainting_start` / `repainting_end` (seconds); model uses source audio as context and only generates in that interval (3–90 s).
5137
- **C++**: Would require **masked diffusion**: context carries “given” frames; ODE only updates the repaint region. DiT’s context has a 64-channel “mask” that we currently set to 1.0; repaint would set mask per frame and the generation loop would only update unmasked frames. Not implemented.
@@ -60,9 +46,9 @@ All of these are in `AceRequest` and parsed from / written to JSON. Backend beha
6046
|-------|------|--------|
6147
| `task_type` | string | `"text2music"` \| `"cover"` \| `"repaint"` \||
6248
| `reference_audio` | string | Path to WAV or MP3 for timbre (implemented) |
63-
| `src_audio` | string | Path to WAV for cover/repaint source (not used yet) |
64-
| `audio_codes` | string | Comma-separated FSQ codes; non-empty ⇒ cover path |
65-
| `audio_cover_strength` | float | 0.0–1.0 (parsed, not used yet) |
49+
| `src_audio` | string | Path to WAV or MP3 for cover source; encoded to codes internally (implemented) |
50+
| `audio_codes` | string | Comma-separated FSQ codes; non-empty ⇒ cover path (or from `src_audio`) |
51+
| `audio_cover_strength` | float | 0.0–1.0 blend of decoded context with silence (implemented) |
6652
| `repainting_start` | float | Start time (s) for repaint (not used yet) |
6753
| `repainting_end` | float | End time (s) for repaint (not used yet) |
6854

@@ -72,8 +58,6 @@ See `request.h` and the README “Request JSON reference” for the full list.
7258

7359
## Summary
7460

75-
- **Fully supported**: text2music; cover when you supply **precomputed** `audio_codes`.
76-
- **Schema only** (no backend): `task_type`, `reference_audio`, `src_audio`, `audio_cover_strength`, `repainting_start`/`repainting_end`.
77-
- **To support reference_audio**: add VAE encoder, then feed its output into the existing CondEncoder timbre path.
78-
- **To support cover from file**: add VAE encoder + FSQ tokenizer (or equivalent audio→codes), then reuse existing cover path.
61+
- **Fully supported**: text2music (with optional reference_audio for timbre); cover from **precomputed** `audio_codes` or from **WAV/MP3** via `src_audio` (VAE encode + FSQ nearest-codeword encode); reference_audio (timbre); audio_cover_strength (blend).
62+
- **Schema only** (no backend): `repainting_start`/`repainting_end`.
7963
- **To support repaint**: implement masked DiT generation (context mask + ODE only on repaint interval).
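
Reviewer note: the precedence between `audio_codes`, `src_audio`, and plain text2music described in MODES.md can be sketched as below. `RequestSketch`, `ContextSource`, and both functions are hypothetical names for illustration. The real request type is `AceRequest` in `request.h`, whose layout is not shown in this commit, and the real code additionally requires a loaded VAE before it encodes `src_audio`.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the three request fields discussed in MODES.md.
struct RequestSketch {
    std::string audio_codes;      // comma-separated FSQ codes, may be empty
    std::string src_audio;        // path to WAV/MP3 cover source, may be empty
    std::string reference_audio;  // path to WAV/MP3 for timbre, may be empty
};

enum class ContextSource { Silence, PrecomputedCodes, EncodedFromFile };

// Precomputed codes win; src_audio is only encoded when no codes are given.
static ContextSource pick_context_source(const RequestSketch & req) {
    if (!req.audio_codes.empty()) return ContextSource::PrecomputedCodes;
    if (!req.src_audio.empty()) return ContextSource::EncodedFromFile;
    return ContextSource::Silence;
}

// Timbre is independent of the context source: any mode may set reference_audio.
static bool uses_reference_timbre(const RequestSketch & req) {
    return !req.reference_audio.empty();
}
```

Keeping the two decisions separate mirrors the documentation: the cover source decides the DiT context, while `reference_audio` only affects the timbre path.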

src/fsq-detok.h

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,10 +10,13 @@
1010

1111
#pragma once
1212
#include "qwen3-enc.h"
13+
#include <vector>
1314

1415
// FSQ constants
1516
static const int FSQ_NDIMS = 6;
1617
static const int FSQ_LEVELS[6] = {8, 8, 8, 5, 5, 5};
18+
static const int FSQ_N_CODES = 8 * 8 * 8 * 5 * 5 * 5;  // 64000 (512 * 125)
19+
static const int FSQ_FRAMES_PER_CODE = 5;
1720

1821
// FSQ decode: integer index -> 6 normalized float values
1922
// Each dimension: level_idx / ((L-1)/2) - 1.0 (maps to [-1, 1])
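
Reviewer note: the index-to-values mapping described in the comment above can be sketched as a mixed-radix split followed by the stated normalization. This is an illustration only. The digit order (dimension 0 as the least significant digit) is an assumption, since the decode body is not part of this hunk, and `fsq_index_to_values` and `kLevels` are hypothetical names.

```cpp
#include <array>
#include <cassert>

static const int kLevels[6] = {8, 8, 8, 5, 5, 5};  // same values as FSQ_LEVELS

// Hypothetical sketch: split a flat code index into 6 per-dimension level
// indices (mixed radix, dimension 0 least significant: an assumption), then
// normalize each with level_idx / ((L-1)/2) - 1.0 so it lands in [-1, 1].
static std::array<float, 6> fsq_index_to_values(int index) {
    assert(index >= 0 && index < 8 * 8 * 8 * 5 * 5 * 5);
    std::array<float, 6> out{};
    for (int d = 0; d < 6; d++) {
        const int L = kLevels[d];
        const int level_idx = index % L;  // digit for this dimension
        index /= L;
        const float half = (float)(L - 1) / 2.0f;
        out[d] = (float)level_idx / half - 1.0f;
    }
    return out;
}
```

The index range follows from the level counts: 8 * 8 * 8 * 5 * 5 * 5 = 64000 distinct codes.
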
@@ -214,6 +217,48 @@ static int detok_ggml_decode(DetokGGML * m, const int * codes, int T_5Hz,
214217
return T_25Hz;
215218
}
216219

220+
// Build codeword table for latent->code (cover from file): for each code 0..FSQ_N_CODES-1,
221+
// decode to 5*64 floats. table_out must be at least FSQ_N_CODES * FSQ_FRAMES_PER_CODE * 64 floats.
222+
static void detok_ggml_build_codeword_table(DetokGGML * m, float * table_out) {
223+
const int chunk = FSQ_FRAMES_PER_CODE * 64;
224+
for (int i = 0; i < FSQ_N_CODES; i++) {
225+
int n = detok_ggml_decode(m, &i, 1, table_out + (size_t)i * chunk);
226+
(void)n;
227+
}
228+
}
229+
230+
// Encode latent frames to 5Hz codes by nearest codeword. T_latent = number of 25Hz frames (64-d each).
231+
// Groups frames in chunks of 5; for each chunk finds the code whose codeword minimizes L2 distance.
232+
// codeword_table from detok_ggml_build_codeword_table (FSQ_N_CODES * 5 * 64 floats).
233+
// Trailing frames that do not fill a complete chunk of 5 are dropped (n_chunks = T_latent / 5); no padding is applied.
234+
static void latent_frames_to_codes(int T_latent, const float * latent_64d,
235+
const float * codeword_table,
236+
std::vector<int> * out_codes) {
237+
out_codes->clear();
238+
const int chunk_frames = FSQ_FRAMES_PER_CODE;
239+
const int chunk_size = chunk_frames * 64;
240+
int n_chunks = T_latent / chunk_frames;
241+
if (n_chunks <= 0) return;
242+
for (int g = 0; g < n_chunks; g++) {
243+
const float * chunk = latent_64d + (size_t)g * chunk_size;
244+
int best = 0;
245+
float best_d2 = 1e30f;
246+
for (int i = 0; i < FSQ_N_CODES; i++) {
247+
const float * cw = codeword_table + (size_t)i * chunk_size;
248+
float d2 = 0.0f;
249+
for (int j = 0; j < chunk_size; j++) {
250+
float d = chunk[j] - cw[j];
251+
d2 += d * d;
252+
}
253+
if (d2 < best_d2) {
254+
best_d2 = d2;
255+
best = i;
256+
}
257+
}
258+
out_codes->push_back(best);
259+
}
260+
}
261+
217262
// Free
218263
static void detok_ggml_free(DetokGGML * m) {
219264
if (m->sched) ggml_backend_sched_free(m->sched);
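
Reviewer note: the nearest-codeword search added in this hunk can be illustrated at a smaller scale. The sketch below is not the repository code. `nearest_codes` is a hypothetical name, the table layout (one codeword of `dim` floats after another) is assumed to mirror `codeword_table`, and the early exit from the distance loop is an optional optimization that the diff does not use.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical, scaled-down nearest-codeword search. `table` holds n_codes
// codewords of `dim` floats each; one code is returned per complete chunk.
static std::vector<int> nearest_codes(const std::vector<float> & frames,
                                      const std::vector<float> & table,
                                      size_t dim) {
    assert(dim > 0 && table.size() >= dim && table.size() % dim == 0);
    const size_t n_codes = table.size() / dim;
    const size_t n_chunks = frames.size() / dim;  // incomplete tail is dropped
    std::vector<int> codes;
    codes.reserve(n_chunks);
    for (size_t g = 0; g < n_chunks; g++) {
        const float * chunk = frames.data() + g * dim;
        int best = 0;
        float best_d2 = std::numeric_limits<float>::max();
        for (size_t i = 0; i < n_codes; i++) {
            const float * cw = table.data() + i * dim;
            float d2 = 0.0f;
            // Stop accumulating once this candidate can no longer win.
            for (size_t j = 0; j < dim && d2 < best_d2; j++) {
                const float d = chunk[j] - cw[j];
                d2 += d * d;
            }
            if (d2 < best_d2) {
                best_d2 = d2;
                best = (int)i;
            }
        }
        codes.push_back(best);
    }
    return codes;
}
```

At the real sizes (64000 codewords of 5 * 64 floats) the table is about 82 MB of floats and each code costs one pass over the whole table, so the early exit is one inexpensive way to reduce work without changing the result.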

tools/dit-vae.cpp

Lines changed: 32 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -277,8 +277,39 @@ int main(int argc, char ** argv) {
277277
fprintf(stderr, "[Pipeline] seed=%lld, steps=%d, guidance=%.1f, shift=%.1f, duration=%.1fs\n",
278278
seed, num_steps, guidance_scale, shift, duration);
279279

280-
// Parse audio codes from request
280+
// Parse audio codes from request (or produce from src_audio WAV/MP3)
281281
std::vector<int> codes_vec = parse_codes_string(req.audio_codes);
282+
if (codes_vec.empty() && !req.src_audio.empty() && have_vae) {
283+
const std::string & src_path = req.src_audio;
284+
std::vector<float> wav_stereo;
285+
int n_samples = load_audio_48k_stereo(src_path.c_str(), &wav_stereo);
286+
if (n_samples > 0) {
287+
int T_audio = n_samples;
288+
if (T_audio >= 1920) {
289+
VAEEncoderGGML enc = {};
290+
if (vae_encoder_load(&enc, vae_gguf)) {
291+
size_t max_lat = (size_t)(T_audio / 2048) + 1;
292+
std::vector<float> enc_out(max_lat * 64);
293+
int T_lat = vae_encoder_forward(&enc, wav_stereo.data(), T_audio, enc_out.data());
294+
vae_encoder_free(&enc);
295+
if (T_lat >= FSQ_FRAMES_PER_CODE) {
296+
DetokGGML detok = {};
297+
if (detok_ggml_load(&detok, dit_gguf, model.backend, model.cpu_backend)) {
298+
std::vector<float> codeword_table((size_t)FSQ_N_CODES * FSQ_FRAMES_PER_CODE * 64);
299+
fprintf(stderr, "[Cover] building FSQ codeword table (%d codes)...\n", FSQ_N_CODES);
300+
detok_ggml_build_codeword_table(&detok, codeword_table.data());
301+
latent_frames_to_codes(T_lat, enc_out.data(), codeword_table.data(), &codes_vec);
302+
fprintf(stderr, "[Cover] encoded %s -> %zu codes (%.1fs @ 5Hz)\n",
303+
src_path.c_str(), codes_vec.size(), (float)codes_vec.size() / 5.0f);
304+
detok_ggml_free(&detok);
305+
}
306+
}
307+
}
308+
}
309+
} else {
310+
fprintf(stderr, "[Cover] WARNING: cannot load src_audio %s (use .wav or .mp3), skipping cover-from-file\n", src_path.c_str());
311+
}
312+
}
282313
if (!codes_vec.empty())
283314
fprintf(stderr, "[Pipeline] %zu audio codes (%.1fs @ 5Hz)\n",
284315
codes_vec.size(), (float)codes_vec.size() / 5.0f);
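
Reviewer note: the sizing arithmetic behind the encoder output buffer and the code count can be sketched as follows. The constants are assumptions derived from figures quoted in this commit (48 kHz audio, 25 Hz latents, 5 latent frames per code). The actual encoder hop size lives in `vae.h`, which is not part of this diff, and all names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed rates, taken from the documentation in this commit rather than vae.h.
static const size_t kSampleRate = 48000;
static const size_t kLatentHz = 25;
static const size_t kSamplesPerFrame = kSampleRate / kLatentHz;  // 1920
static const size_t kFramesPerCode = 5;

// Upper bound on latent frames for n_samples of audio (ceiling division),
// suitable for sizing an output buffer of max_latent_frames(n) * 64 floats.
static size_t max_latent_frames(size_t n_samples) {
    return (n_samples + kSamplesPerFrame - 1) / kSamplesPerFrame;
}

// Number of complete 5 Hz codes produced from n_frames latent frames.
static size_t codes_from_frames(size_t n_frames) {
    return n_frames / kFramesPerCode;
}
```

Under these assumptions, 30 s of audio is 1,440,000 samples, which gives 750 latent frames and 150 codes.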
