README.md
All fields with defaults. Only `caption` is required. Built-in modes (text2music …).

Key fields: `seed` -1 means random (resolved once, then +1 per batch element). `audio_codes` is generated by ace-qwen3 and consumed by dit-vae (comma-separated FSQ token IDs). When present, the LLM is skipped entirely (cover-style generation). `reference_audio`: path to a **WAV or MP3** file for global timbre/style (MP3 decoded in memory; encoded via the built-in VAE encoder; requires a VAE GGUF with encoder weights). `src_audio`: path to a **WAV or MP3** for the cover source; dit-vae encodes it (VAE + FSQ nearest-codeword) to codes internally, no Python required (see docs/MODES.md).
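The seed semantics above can be sketched as a small helper. This is a hedged illustration only; the names `resolve_seed` and `seed_for_element` are hypothetical and not taken from the codebase:

```cpp
#include <cstdint>
#include <random>

// Hypothetical sketch: a seed of -1 is resolved to a random value once,
// then each batch element derives its seed by adding its index (+1 per element).
int64_t resolve_seed(int64_t requested) {
    if (requested != -1) return requested;  // explicit seed: use as-is
    std::random_device rd;                  // -1: pick a random seed once
    return static_cast<int64_t>(rd()) & 0x7fffffff;
}

int64_t seed_for_element(int64_t base_seed, int batch_index) {
    return base_seed + batch_index;         // element i uses base_seed + i
}
```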
**Reference and cover strength (not the same as `guidance_scale`):**

- **`audio_cover_strength`** (0.0–1.0): Controls how strongly the **cover/source** (from `audio_codes` or `src_audio`) influences the DiT context. The context is blended with silence: `(1 - audio_cover_strength)*silence + audio_cover_strength*decoded`. Use 1.0 for full cover influence, lower values to soften it. Only applies when cover context is present.
- **`reference_audio`**: Timbre from the reference file is applied at full strength; there is no separate strength parameter for reference timbre.
- **`guidance_scale`**: This is **DiT classifier-free guidance** (conditioned vs. unconditioned prediction), not reference or cover strength. Turbo models ignore it (forced to 1.0).

Turbo preset: `inference_steps=8, shift=3.0` (no `guidance_scale`; turbo models don't use CFG).
### cover (when `audio_codes` or `src_audio` are provided)

- **Input**: Same as text2music, plus either **precomputed** `audio_codes` or **`src_audio`** (WAV/MP3 path). Optional **`reference_audio`** for timbre.
- **Flow**: If `src_audio` is set and no `audio_codes` are given: load WAV/MP3 → VAE encode → FSQ nearest-codeword encode → codes. Then decode the codes to latents → DiT context (blended with silence) → DiT → VAE → WAV. No Python required.
- **`reference_audio`** and **`audio_cover_strength`**: Implemented (timbre conditioning; silence blend).
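The FSQ nearest-codeword step in the flow above can be sketched generically. Everything below is an assumption for illustration: the actual per-dimension level counts and token-packing scheme used by dit-vae are not stated in this README, and the real implementation may differ:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative FSQ "nearest codeword" encode (the inverse of the existing
// codes -> latents detokenizer). Assumed scheme: each latent dimension is
// clamped to [-1, 1], snapped to the nearest of `levels[d]` uniform steps,
// and the per-dimension digits are packed into one token ID in mixed radix.
int64_t fsq_encode(const std::vector<float>& latent,
                   const std::vector<int>& levels) {
    int64_t code = 0;
    for (size_t d = 0; d < latent.size(); ++d) {
        const int L = levels[d];
        const float x = std::fmax(-1.0f, std::fmin(1.0f, latent[d]));
        // map [-1, 1] -> {0, ..., L-1}, rounding to the nearest level
        const int digit =
            static_cast<int>(std::lround((x + 1.0f) * 0.5f * (L - 1)));
        code = code * L + digit;  // mixed-radix packing of the digits
    }
    return code;
}
```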
---

## What’s not implemented yet
### repaint
- **Tutorial**: Specify `repainting_start` / `repainting_end` (seconds); the model uses the source audio as context and generates only in that interval (3–90 s).
- **C++**: Would require **masked diffusion**: the context carries the “given” frames, and the ODE updates only the repaint region. The DiT context has a 64-channel mask that is currently set to 1.0 everywhere; repaint would set the mask per frame, and the generation loop would update only the unmasked frames. Not implemented.
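The per-frame mask described above could look roughly like this. The frame rate, parameter names, and mask polarity are all assumptions (the project today only sets the mask to a constant 1.0, which this sketch treats as "generate everything"):

```cpp
#include <vector>

// Hypothetical repaint mask: 1.0 = frame is generated by the ODE loop
// (matching today's all-1.0 full generation), 0.0 = frame is kept from the
// given source context. Polarity and frame rate are assumptions.
std::vector<float> repaint_mask(int n_frames, float frames_per_sec,
                                float repaint_start_s, float repaint_end_s) {
    std::vector<float> mask(n_frames, 0.0f);  // default: keep given context
    for (int f = 0; f < n_frames; ++f) {
        const float t = static_cast<float>(f) / frames_per_sec;
        if (t >= repaint_start_s && t < repaint_end_s)
            mask[f] = 1.0f;                   // inside repaint interval
    }
    return mask;
}
```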
All of these are in `AceRequest` and parsed from / written to JSON.