From af8ffdbcfb1f3c2ed49e51233904414f4c8e0517 Mon Sep 17 00:00:00 2001
From: ZaynJarvis
Date: Mon, 18 May 2026 19:41:20 +0800
Subject: [PATCH] RFC: Seedance audio stability for video skills

---
 .../2026-05-18-seedance-audio-stability.md | 274 ++++++++++++++++++
 1 file changed, 274 insertions(+)
 create mode 100644 docs/superpowers/rfcs/2026-05-18-seedance-audio-stability.md

diff --git a/docs/superpowers/rfcs/2026-05-18-seedance-audio-stability.md b/docs/superpowers/rfcs/2026-05-18-seedance-audio-stability.md
new file mode 100644
index 0000000..3bac3af
--- /dev/null
+++ b/docs/superpowers/rfcs/2026-05-18-seedance-audio-stability.md
@@ -0,0 +1,274 @@
+# RFC: Seedance Audio Stability For AIGC Video Skills
+
+Status: Draft
+Author: luna
+Date: 2026-05-18
+
+## Summary
+
+Agentara's current video-related skills cover planning, character design,
+storyboarding, and poster design. They do not yet define a production workflow
+for stable generated voices across multiple AI-video segments.
+
+This RFC proposes an audio-stability workflow for Seedance-style video
+generation, based on the Dodo/Wangwang 45-second production run.
+
+## Current Agentara Skill Coverage
+
+Relevant skills inspected:
+
+- `user-home/.claude/skills/video-plan/SKILL.md`
+  - Covers duration, scenes, dialogue/voiceover suggestions, and editing notes.
+  - Does not describe voice identity references, speaker mapping, or audio
+    reference assets.
+- `user-home/.claude/skills/video-storyboard/SKILL.md`
+  - Covers storyboard images and per-scene video prompts, including sound
+    design.
+  - Does not define native generated-audio preservation, audio-reference
+    preflight, or multi-segment voice consistency rules.
+- `user-home/.claude/skills/video-character-design/SKILL.md`
+  - Covers visual character identity.
+  - Does not model character voice identity.
+- No dedicated Agentara default skill was found for Seedance video generation
+  with reference audio.
+
+Conclusion: Agentara currently has good visual pre-production skills, but no
+explicit audio-stability skill layer for recurring speaking characters.
+
+## Seedance Constraints Observed
+
+These are empirical constraints and workflow lessons from the Dodo/Wangwang
+production. Treat them as conservative production rules until better provider
+documentation or tests supersede them.
+
+### 1. Prompt-only voice locking is insufficient
+
+Stable voice identity requires a reference audio input:
+
+- send each voice as an audio URL with `role: "reference_audio"`;
+- set `generate_audio: true`;
+- bind each reference to a named character in the prompt.
+
+The prompt is still required, but its job is binding, not voice identity by
+itself. Example:
+
+```text
+Audio reference 1 is Dodo's voice. Dodo must always use audio reference 1.
+Audio reference 2 is Wangwang's voice. Wangwang must always use audio reference 2.
+Only these named speakers talk. Use the exact dialogue below.
+```
+
+### 2. Multi-reference audio has a total duration limit
+
+A validation request with two 15-second references failed. The error required
+the combined reference-audio duration to be about `<=15.2s`.
+
+Production rule:
+
+- keep a full 15-second or longer voice record for review and archive;
+- cut a clean short reference clip for Seedance generation;
+- for two speakers, use about 7 seconds per speaker;
+- verify total reference-audio duration before submission.
+
+### 3. Audio-reference mode should not mix with strict first/last frames
+
+In the observed Studio/Seedance path, reference audio is a multimodal reference
+input. It should not be combined with strict `first_frame` / `last_frame` image
+control in the same task.
+
+Production rule:
+
+- for voice-stable native audio, use `reference_image` + `reference_audio`;
+- do not submit strict first/last-frame inputs in the same task;
+- if transition control is required, run a separate visual-control pass or use
+  returned/generated last frames after the native audio pass.
+
+### 4. Preserve native audio during assembly
+
+For a native voice-consistency test, final assembly must preserve each
+Seedance-generated segment's original audio.
+
+Do not replace generated segment audio with external TTS/SFX/music in the same
+deliverable. A post-produced audio version can exist as a separate fallback, but
+it should be labeled as such.
+
+### 5. Reference assets must be durable remote URLs
+
+Seedance/Ark must be able to fetch every reference URL at submission time. In
+the Dodo/Wangwang run, two 7-second reference-audio URLs temporarily returned
+404, and every segment failed immediately with an `audio_url resource not
+found` error.
+
+Production rule:
+
+- store reference audio in the media repo, not in temporary local files;
+- verify direct URL reachability before submitting;
+- keep local source clips so the same hash/object can be re-uploaded if needed;
+- record URLs and hashes in a manifest.
+
+## Proposed Agentara Workflow
+
+Add a voice-stability workflow to Agentara's AIGC video skills.
+
+### Phase 1: Voice Asset Selection
+
+For each recurring speaking character:
+
+1. Generate or select a clean full voice record.
+2. Store it in the media repo as the canonical voice record.
+3. Cut a short Seedance reference clip.
+4. Store the short clip in the media repo.
+5. Record both URLs in a project voice registry.
+
+Example registry shape:
+
+```json
+{
+  "characters": {
+    "dodo": {
+      "full_audio_url": "https://image.zaynjarvis.com/i/uploads/studio/sha256/...",
+      "reference_audio_url": "https://image.zaynjarvis.com/i/uploads/studio/sha256/...",
+      "reference_duration_seconds": 7.0,
+      "prompt_voice": "cute young female Mandarin voice, bright and soft"
+    },
+    "wangwang": {
+      "full_audio_url": "https://image.zaynjarvis.com/i/uploads/studio/sha256/...",
+      "reference_audio_url": "https://image.zaynjarvis.com/i/uploads/studio/sha256/...",
+      "reference_duration_seconds": 7.0,
+      "prompt_voice": "natural young Mandarin boy voice, warm and earnest"
+    }
+  }
+}
+```
+
+Recommended location:
+
+```text
+workspace/projects//voices.json
+```
+
+### Phase 2: Video Plan Integration
+
+Extend `video-plan` output with optional audio fields when dialogue or recurring
+voices matter:
+
+- character voice registry path;
+- spoken language;
+- per-scene speaker list;
+- line-level dialogue;
+- whether background music is deferred;
+- whether the pass is `native_audio_validation` or `post_audio_fallback`.
+
+### Phase 3: Storyboard / Prompt Integration
+
+Extend `video-storyboard` prompt scripts with an optional "Voice References"
+section:
+
+```markdown
+## Voice References
+
+- Audio reference 1: Dodo, cute young female Mandarin voice.
+- Audio reference 2: Wangwang, natural young Mandarin boy voice.
+
+Rules:
+- Dodo only uses audio reference 1.
+- Wangwang only uses audio reference 2.
+- No narrator.
+- No background music in the native-audio pass.
+```
+
+Each dialogue line should have an explicit speaker label.
+
+### Phase 4: Generation Preflight
+
+A future Seedance generation skill or helper should preflight:
+
+- reference-audio URLs return HTTP 200;
+- MIME type is audio;
+- max reference-audio count is 3;
+- known total reference-audio duration is about `<=15.2s`;
+- no local-only reference paths are passed to provider APIs;
+- reference audio is not mixed with strict first/last-frame controls;
+- each segment prompt binds every supplied audio reference to a speaker.
+
+### Phase 5: Assembly Verification
+
+For multi-segment native-audio output:
+
+1. Download every successful segment immediately.
+2. Probe every segment with `ffprobe`.
+3. Require an audio stream in each segment.
+4. Concatenate original generated segment audio.
+5. Probe final duration, dimensions, and audio stream.
+6. Create a review copy and contact sheet.
+
+## Dodo/Wangwang Reference Case
+
+Canonical full voices selected by the user:
+
+- Dodo:
+  `https://image.zaynjarvis.com/i/uploads/studio/sha256/094ff97848cd56f67291e43361ce53818f11c10c96b72b1241911f390a495005`
+- Wangwang:
+  `https://image.zaynjarvis.com/i/uploads/studio/sha256/e6f687af057d3e7b1942d54462a9b2a6dbe8653400d36d665e09c0acc58ad631`
+
+Short Seedance references:
+
+- Dodo 7s:
+  `https://image.zaynjarvis.com/i/uploads/studio/sha256/8f5974daa14a8425a003b4a7c9b17533b8fbfb4572b39bf483245d7a034ab963`
+- Wangwang 7s:
+  `https://image.zaynjarvis.com/i/uploads/studio/sha256/191a880ba717e103afa4a798f0dc3d422db8529c534a85fb95bb1968f5d3451d`
+
+Successful 45-second native-audio video:
+
+`https://image.zaynjarvis.com/i/uploads/studio/sha256/de4eaacd66f3df3c15c2c1ac0d3eef41d2c4600ffb1e9b86f3c9c04ab5e4dd10`
+
+## Proposed Changes
+
+### Documentation
+
+- Add this RFC under `docs/superpowers/rfcs/`.
+- Add short audio-stability notes to `video-plan` and `video-storyboard`.
+- Add a new dedicated skill later if Seedance generation is installed into
+  Agentara default skills.
+
+### Skill Behavior
+
+For any future AIGC video generation skill:
+
+- recognize "fixed voice", "same voice", "voice consistency", "reference
+  audio", "native audio", and "Seedance voice" as triggers;
+- require or infer a voice registry for recurring characters;
+- use a "native-audio validation" pass before accepting external audio
+  replacement;
+- keep post-produced TTS/music as a separate optional pass.
+
+### CLI / Helper Behavior
+
+If Agentara ships a Seedance helper script, add:
+
+```bash
+--reference-audio URL          repeatable
+--reference-audio-label NAME   repeatable
+--voice-registry FILE
+--voice-stable                 enables stricter preflight
+--dry-run                      prints payload without submitting
+```
+
+## Validation Plan
+
+1. Unit-test prompt/registry parsing if helper code is added.
+2. Preflight-test unreachable audio URL, overlong reference duration, too many
+   references, and mixed first/last-frame + audio inputs.
+3. Run a short two-speaker Seedance validation clip.
+4. Run a multi-segment 45-second native-audio production.
+5. Verify final assembly keeps original generated audio.
+
+## Open Questions
+
+- Is the `<=15.2s` limit always total across all reference audio clips, or does
+  it vary by model/account? Treat it as total until proven otherwise.
+- Should Agentara store voice registries under `workspace/projects//`
+  or in `memory/` for reuse across projects?
+- Should the first implementation be a dedicated `seedance-video-generation`
+  skill, or should existing `video-plan` and `video-storyboard` carry the
+  workflow until a provider-specific generator exists?
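
The offline part of the Phase 4 preflight list in the RFC above could be sketched roughly as follows. This is a minimal illustration, not code from the patch: `ReferenceAudio`, `preflight`, and the `uses_first_last_frame` flag are hypothetical names, and the limits (3 references, 15.2 s total) are the empirical values the RFC reports rather than documented provider limits. The HTTP 200 and MIME-type checks are left out because they need network access. The prompt-binding check is only a substring heuristic.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Empirical values taken from the RFC text; assumptions, not provider documentation.
MAX_REFERENCE_AUDIO = 3
MAX_TOTAL_SECONDS = 15.2


@dataclass
class ReferenceAudio:
    """Hypothetical record for one reference clip."""
    label: str               # speaker name the prompt must bind, e.g. "Dodo"
    url: str                 # must be a remote http(s) URL
    duration_seconds: float  # measured length of the short reference clip


def preflight(refs, prompt, uses_first_last_frame=False):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(refs) > MAX_REFERENCE_AUDIO:
        problems.append(f"too many reference clips: {len(refs)} > {MAX_REFERENCE_AUDIO}")

    total = sum(r.duration_seconds for r in refs)
    if total > MAX_TOTAL_SECONDS:
        problems.append(f"total reference duration {total:.1f}s exceeds {MAX_TOTAL_SECONDS}s")

    prompt_lower = prompt.lower()
    for index, ref in enumerate(refs, start=1):
        parsed = urlparse(ref.url)
        # Local paths have no http(s) scheme or no host part.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{ref.label}: not a remote URL: {ref.url}")
        # Heuristic binding check: the prompt names the slot and the speaker.
        if f"audio reference {index}" not in prompt_lower or ref.label.lower() not in prompt_lower:
            problems.append(f"{ref.label}: prompt does not bind audio reference {index}")

    if refs and uses_first_last_frame:
        problems.append("reference audio mixed with strict first/last-frame control")
    return problems


if __name__ == "__main__":
    demo_prompt = (
        "Audio reference 1 is Dodo's voice. Dodo must always use audio reference 1.\n"
        "Audio reference 2 is Wangwang's voice. Wangwang must always use audio reference 2."
    )
    demo_refs = [
        ReferenceAudio("Dodo", "https://example.com/dodo-7s.wav", 7.0),
        ReferenceAudio("Wangwang", "https://example.com/wangwang-7s.wav", 7.0),
    ]
    print(preflight(demo_refs, demo_prompt))  # prints []
```

Returning a list of problems instead of raising on the first failure lets a `--dry-run` style helper report every issue in one pass. The `example.com` URLs are placeholders, not real assets.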