Taking a stab at the first audio models with CLAP #297
Merged
…rch support

Add Audio as a first-class input type across the full stack: proto, Rust, Go, and Python SDKs. Introduce ClapAudio and ClapText model variants backed by laion/larger_clap_music_and_speech ONNX weights, embedding audio and text into a shared 512-dim space for audio+text multimodal similarity search.

- protos: add audio to StoreInput and MetadataValue oneofs; add CLAP_AUDIO, CLAP_TEXT to AIModel enum and AUDIO to AIStoreInputType
- types: wire Audio through MetadataValue serde (aud: prefix), StoreInput conversion, and AiStoreInputType mapping
- ai: add ModelType::Audio, ORTModality::Audio, ModelInput::Audios, and ORTAudioPreprocessor (symphonia decode → mono mix → rubato resample to 48kHz → pad/truncate to 10s → batched ndarray)
- ai: add batch_inference_audio to SingleStageModel with L2 normalisation
- dsl: add audio rule (/a<hex> syntax) to grammar and metadata parser
- sdks: regenerate Go (buf) and Python (betterproto) from updated protos
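The preprocessing chain and the L2 normalisation named in the commit above can be sketched with the standard library alone. This is a minimal illustration, not the PR's ORTAudioPreprocessor: decoding (symphonia) and resampling (rubato) are left out, the input is assumed to be interleaved f32 samples already at 48 kHz, and the function names `mix_to_mono`, `pad_or_truncate`, and `l2_normalise` are hypothetical.

```rust
// Target sample rate and clip length as stated in the commit message.
const TARGET_SAMPLE_RATE: usize = 48_000;
const CLIP_SECONDS: usize = 10;
const TARGET_LEN: usize = TARGET_SAMPLE_RATE * CLIP_SECONDS;

/// Average interleaved channels down to a single mono channel.
/// Assumes the input is already decoded to f32 samples (hypothetical helper).
fn mix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Zero-pad or truncate so that every clip has exactly `target_len` samples.
fn pad_or_truncate(mut samples: Vec<f32>, target_len: usize) -> Vec<f32> {
    // `resize` truncates when the vector is too long and pads with 0.0 when too short.
    samples.resize(target_len, 0.0);
    samples
}

/// Scale a vector to unit L2 norm in place; all-zero vectors are left untouched.
fn l2_normalise(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}
```

Calling `pad_or_truncate(mix_to_mono(&samples, 2), TARGET_LEN)` on a stereo buffer would yield a fixed-length mono clip of 480,000 samples. Normalising the resulting embeddings to unit length means a plain dot product between two of them equals their cosine similarity.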
…sor wiring

- Fix log-Mel spectrogram: rand_trunc path (1 view, shape B×1×1000×64), Slaney-normalised filterbank, f_min=50 Hz, nb_max_frames=1000
- Add AudioInput struct replacing raw waveform array
- Fix batch_inference_text to omit attention_mask when the model doesn't require it (ClapText ONNX only takes input_ids)
- Handle 2D model outputs in postprocess_text_output (CLAP/CLIP projection encoders)
- Add ORTPostprocessor::Audio variant; fix ORTTextPostprocessor to support ClapText
- Add cross-modal retrieval integration test with real audio files
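For readers unfamiliar with the Slaney-normalised filterbank mentioned above, here is a rough sketch of the Slaney-style mel scale (linear below 1 kHz, logarithmic above) and the per-filter area normalisation. It is not the PR's code. The 64 mel bins and f_min=50 Hz come from the commit message, while the upper frequency passed in the usage note is an assumption, and all function names are hypothetical.

```rust
// Slaney-style mel scale constants.
const F_SP: f64 = 200.0 / 3.0; // Hz per mel in the linear region
const MIN_LOG_HZ: f64 = 1000.0; // frequency where the log region starts
const MIN_LOG_MEL: f64 = MIN_LOG_HZ / F_SP; // 15 mels

/// Step size of the logarithmic region.
fn logstep() -> f64 {
    (6.4_f64).ln() / 27.0
}

/// Convert a frequency in Hz to mels on the Slaney scale.
fn hz_to_mel(hz: f64) -> f64 {
    if hz >= MIN_LOG_HZ {
        MIN_LOG_MEL + (hz / MIN_LOG_HZ).ln() / logstep()
    } else {
        hz / F_SP
    }
}

/// Convert mels on the Slaney scale back to Hz.
fn mel_to_hz(mel: f64) -> f64 {
    if mel >= MIN_LOG_MEL {
        MIN_LOG_HZ * (logstep() * (mel - MIN_LOG_MEL)).exp()
    } else {
        mel * F_SP
    }
}

/// Return `n_mels + 2` band edges in Hz, equally spaced on the mel scale.
/// Filter i is the triangle spanning edges[i]..edges[i + 2].
fn mel_band_edges(n_mels: usize, f_min: f64, f_max: f64) -> Vec<f64> {
    let (lo, hi) = (hz_to_mel(f_min), hz_to_mel(f_max));
    (0..n_mels + 2)
        .map(|i| mel_to_hz(lo + (hi - lo) * i as f64 / (n_mels + 1) as f64))
        .collect()
}

/// Slaney area normalisation: each triangular filter is scaled by
/// 2 / (right edge - left edge), so wide high-frequency filters are attenuated.
fn slaney_norms(edges: &[f64]) -> Vec<f64> {
    edges.windows(3).map(|w| 2.0 / (w[2] - w[0])).collect()
}
```

For example, `mel_band_edges(64, 50.0, 14_000.0)` gives the 66 edges for a 64-bin filterbank starting at 50 Hz; the 14 kHz upper limit here is only a placeholder and should be replaced with whatever the model's feature extractor actually uses.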
…Text max tokens

Add two tests alongside the existing cross-modal one: audio queried against audio (identity retrieval) and text queried against text (identity retrieval). Refactor shared boilerplate into helpers.

Also fix ClapText max_input_tokens from 77 to 512: it uses a RoBERTa tokenizer, not CLIP's 77-token window.
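The idea behind an identity-retrieval check can be illustrated without the server or the models: when an item's own embedding is used as the query, that item should come back as the top match. The sketch below uses tiny hand-written vectors in place of real 512-dim CLAP embeddings, and the helper names `normalised`, `dot`, and `top_match` are hypothetical; it does not reproduce the PR's test helpers.

```rust
/// Return a unit-length copy of `v` (zero vectors are returned unchanged).
fn normalised(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product; equals cosine similarity when both inputs are unit length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Index of the stored embedding most similar to the query, or None if the store is empty.
fn top_match(query: &[f32], store: &[Vec<f32>]) -> Option<usize> {
    let q = normalised(query);
    store
        .iter()
        .map(|e| dot(&q, &normalised(e)))
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

Looping over a store and asserting `top_match(entry, &store) == Some(index)` for every entry mirrors the identity-retrieval property described in the commit, for audio against audio and text against text alike.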
Iamdavidonuh approved these changes Feb 18, 2026
# Conflicts:
#	ahnlich/ai/src/tests/buffalo_l_test.rs
56c3ab4 to 5bc63a4
- Resolve model numbering: SFACE_YUNET=8, CLAP_AUDIO=9, CLAP_TEXT=10
- Integrate model_params infrastructure from main
- Combine Buffalo_L and SFaceYunet face recognition features
- Regenerate and format all SDKs
Test Results
270 tests, 270 ✅, 11m 52s ⏱️
Results for commit 63c101a. ♻️ This comment has been updated with latest results.
c427492 to d79c446
Benchmark Results
01afeee to f11adcf
f11adcf to 63c101a