Taking a stab at the first audio models with CLAP #297

Merged
deven96 merged 19 commits into main from feat/clap-audio-embeddings
Feb 21, 2026

@deven96 deven96 commented Feb 18, 2026

No description provided.

…rch support

Add Audio as a first-class input type across the full stack: proto, Rust, Go,
and Python SDKs. Introduce ClapAudio and ClapText model variants backed by
laion/larger_clap_music_and_speech ONNX weights, embedding audio and text into
a shared 512-dim space for audio+text multimodal similarity search.

- protos: add audio to StoreInput and MetadataValue oneofs; add CLAP_AUDIO,
  CLAP_TEXT to AIModel enum and AUDIO to AIStoreInputType
- types: wire Audio through MetadataValue serde (aud: prefix), StoreInput
  conversion, and AiStoreInputType mapping
- ai: add ModelType::Audio, ORTModality::Audio, ModelInput::Audios, and
  ORTAudioPreprocessor (symphonia decode → mono mix → rubato resample to
  48kHz → pad/truncate to 10s → batched ndarray)
- ai: add batch_inference_audio to SingleStageModel with L2 normalisation
- dsl: add audio rule (/a<hex> syntax) to grammar and metadata parser
- sdks: regenerate Go (buf) and Python (betterproto) from updated protos
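
The waveform steps in the list above (mono mix, pad/truncate to 10 s at 48 kHz) and the L2 normalisation of the output embedding can be sketched with the standard library alone. This is a minimal illustration, not the PR's ORTAudioPreprocessor: the function names are hypothetical, decoding (symphonia) and resampling (rubato) are left out, and zero-padding at the end of the clip is an assumption, since the commit message only says "pad/truncate".

```rust
/// Target sample rate and clip length quoted in the commit message.
const SAMPLE_RATE: usize = 48_000;
const CLIP_SECONDS: usize = 10;
const CLIP_SAMPLES: usize = SAMPLE_RATE * CLIP_SECONDS; // 480_000 samples

/// Average interleaved channels into one mono channel.
/// (Hypothetical helper; a trailing partial frame is dropped.)
fn mix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Truncate, or zero-pad at the end (an assumption), to exactly `len` samples.
fn pad_or_truncate(mut samples: Vec<f32>, len: usize) -> Vec<f32> {
    samples.resize(len, 0.0);
    samples
}

/// Scale a vector to unit L2 norm; a zero vector is returned unchanged.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}
```

Calling `pad_or_truncate(mix_to_mono(&samples, 2), CLIP_SAMPLES)` on stereo input that is already at 48 kHz yields a fixed-length clip; `l2_normalize` would be applied to the model's output embedding rather than to the waveform.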
…sor wiring

- Fix log-Mel spectrogram: rand_trunc path (1 view, shape B×1×1000×64),
  Slaney-normalised filterbank, f_min=50 Hz, nb_max_frames=1000
- Add AudioInput struct replacing raw waveform array
- Fix batch_inference_text to omit attention_mask when model doesn't require it
  (ClapText ONNX only takes input_ids)
- Handle 2D model outputs in postprocess_text_output (CLAP/CLIP projection encoders)
- Add ORTPostprocessor::Audio variant; fix ORTTextPostprocessor to support ClapText
- Add cross-modal retrieval integration test with real audio files
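
As a rough illustration of the "Slaney-normalised filterbank" item above, the sketch below computes mel band edges and the per-filter area-normalisation factor 2 / (f_right - f_left). It assumes the filterbank also uses the Slaney mel scale (linear below 1 kHz, logarithmic above), which the commit message does not state. The function names are hypothetical, and `f_max` is a parameter because the PR does not give it.

```rust
/// Slaney mel scale constants: linear below 1 kHz, logarithmic above.
const F_SP: f64 = 200.0 / 3.0; // Hz per mel in the linear region
const MIN_LOG_HZ: f64 = 1000.0; // start of the logarithmic region
const MIN_LOG_MEL: f64 = MIN_LOG_HZ / F_SP; // 15 mel

/// Step size of the logarithmic region.
fn log_step() -> f64 {
    6.4_f64.ln() / 27.0
}

fn hz_to_mel(hz: f64) -> f64 {
    if hz >= MIN_LOG_HZ {
        MIN_LOG_MEL + (hz / MIN_LOG_HZ).ln() / log_step()
    } else {
        hz / F_SP
    }
}

fn mel_to_hz(mel: f64) -> f64 {
    if mel >= MIN_LOG_MEL {
        MIN_LOG_HZ * (log_step() * (mel - MIN_LOG_MEL)).exp()
    } else {
        mel * F_SP
    }
}

/// `n_mels + 2` band edges in Hz, evenly spaced on the mel scale.
fn band_edges_hz(f_min: f64, f_max: f64, n_mels: usize) -> Vec<f64> {
    let (lo, hi) = (hz_to_mel(f_min), hz_to_mel(f_max));
    (0..n_mels + 2)
        .map(|i| mel_to_hz(lo + (hi - lo) * i as f64 / (n_mels + 1) as f64))
        .collect()
}

/// Slaney area normalisation: one scale factor per triangular filter,
/// computed from the filter's left and right edges.
fn slaney_norms(edges_hz: &[f64]) -> Vec<f64> {
    edges_hz.windows(3).map(|w| 2.0 / (w[2] - w[0])).collect()
}
```

With `f_min = 50.0` and `n_mels = 64` as in the commit message, `band_edges_hz` returns 66 edges and `slaney_norms` returns 64 factors, one per mel bin in the last dimension of the B×1×1000×64 input.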
…Text max tokens

Add two tests alongside the existing cross-modal one: audio queried against audio
(identity retrieval) and text queried against text (identity retrieval). Refactor
shared boilerplate into helpers. Also fix ClapText max_input_tokens from 77 to 512:
it uses a RoBERTa tokenizer, not CLIP's 77-token window.
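
The retrieval tests described above rely on audio and text embeddings sharing one space. Once embeddings are L2-normalised, cosine similarity reduces to a dot product, so a top-1 lookup can be sketched as below. The helper names are hypothetical and the brute-force scan is only an illustration; it is not the store's actual search path.

```rust
/// Dot product; equals cosine similarity when both inputs are unit-length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Index of the stored embedding most similar to `query`,
/// or `None` when the store is empty. Ties resolve to the later index.
fn nearest(query: &[f32], store: &[Vec<f32>]) -> Option<usize> {
    store
        .iter()
        .enumerate()
        .map(|(i, emb)| (i, dot(query, emb)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

In an identity-retrieval check, each stored embedding is used as the query, and `nearest` is expected to return that embedding's own index.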
@deven96 deven96 force-pushed the feat/buffalo-l-face-recognition branch 8 times, most recently from 56c3ab4 to 5bc63a4 on February 20, 2026 14:32
Base automatically changed from feat/buffalo-l-face-recognition to main February 20, 2026 14:45
- Resolve model numbering: SFACE_YUNET=8, CLAP_AUDIO=9, CLAP_TEXT=10
- Integrate model_params infrastructure from main
- Combine Buffalo_L and SFaceYunet face recognition features
- Regenerate and format all SDKs

github-actions Bot commented Feb 20, 2026

Test Results

270 tests   270 ✅  11m 52s ⏱️
 35 suites    0 💤
  4 files      0 ❌

Results for commit 63c101a.

♻️ This comment has been updated with latest results.

@deven96 deven96 force-pushed the feat/clap-audio-embeddings branch 2 times, most recently from c427492 to d79c446 on February 20, 2026 15:46

github-actions Bot commented Feb 20, 2026

Benchmark Results

group                                                        main                                   pr
-----                                                        ----                                   --
predicate_query_with_index/size_100                          1.00      3.1±0.00µs        ? ?/sec    1.02      3.2±0.00µs        ? ?/sec
predicate_query_with_index/size_1000                         1.10     33.1±0.03µs        ? ?/sec    1.00     30.0±0.03µs        ? ?/sec
predicate_query_with_index/size_10000                        1.00    392.3±0.36µs        ? ?/sec    1.01    397.9±0.45µs        ? ?/sec
predicate_query_with_index/size_100000                       1.00      5.4±0.14ms        ? ?/sec    1.03      5.5±0.19ms        ? ?/sec
predicate_query_without_index/size_100                       1.01      7.1±0.01µs        ? ?/sec    1.00      7.0±0.00µs        ? ?/sec
predicate_query_without_index/size_1000                      1.00     95.9±0.38µs        ? ?/sec    1.03     98.8±0.07µs        ? ?/sec
predicate_query_without_index/size_10000                     1.00    810.1±3.09µs        ? ?/sec    1.01    818.4±2.80µs        ? ?/sec
predicate_query_without_index/size_100000                    1.17     16.9±0.62ms        ? ?/sec    1.00     14.5±0.37ms        ? ?/sec
store_batch_insertion_without_predicates/size_100            1.05    237.7±7.11µs        ? ?/sec    1.00    227.3±1.95µs        ? ?/sec
store_batch_insertion_without_predicates/size_1000           1.02  1230.3±25.75µs        ? ?/sec    1.00   1209.3±8.76µs        ? ?/sec
store_batch_insertion_without_predicates/size_10000          1.01     13.3±0.12ms        ? ?/sec    1.00     13.2±0.13ms        ? ?/sec
store_batch_insertion_without_predicates/size_100000         1.00    130.4±0.66ms        ? ?/sec    1.00    130.6±0.74ms        ? ?/sec
store_retrieval_no_condition/size_100                        1.00    111.5±0.75µs        ? ?/sec    1.01    112.3±0.79µs        ? ?/sec
store_retrieval_no_condition/size_1000                       1.00   774.4±11.08µs        ? ?/sec    1.01    779.6±6.70µs        ? ?/sec
store_retrieval_no_condition/size_10000                      1.00      7.2±0.03ms        ? ?/sec    1.00      7.2±0.04ms        ? ?/sec
store_retrieval_no_condition/size_100000                     1.00     78.8±0.21ms        ? ?/sec    1.00     78.9±0.22ms        ? ?/sec
store_retrieval_non_linear_kdtree/size_100                   1.00    189.9±0.62µs        ? ?/sec    1.01    192.7±0.79µs        ? ?/sec
store_retrieval_non_linear_kdtree/size_1000                  1.00   1142.1±1.98µs        ? ?/sec    1.00   1145.9±1.76µs        ? ?/sec
store_retrieval_non_linear_kdtree/size_10000                 1.00     12.1±0.07ms        ? ?/sec    1.01     12.1±0.24ms        ? ?/sec
store_retrieval_non_linear_kdtree/size_100000                1.00    138.5±1.14ms        ? ?/sec    1.00    138.1±1.11ms        ? ?/sec
store_sequential_insertion_without_predicates/size_100       1.00    273.1±0.61µs        ? ?/sec    1.01    275.3±1.00µs        ? ?/sec
store_sequential_insertion_without_predicates/size_1000      1.00      2.7±0.00ms        ? ?/sec    1.00      2.7±0.00ms        ? ?/sec
store_sequential_insertion_without_predicates/size_10000     1.00     26.9±0.13ms        ? ?/sec    1.01     27.1±0.01ms        ? ?/sec
store_sequential_insertion_without_predicates/size_100000    1.00    268.0±0.45ms        ? ?/sec    1.01    271.0±0.70ms        ? ?/sec

@deven96 deven96 force-pushed the feat/clap-audio-embeddings branch 5 times, most recently from 01afeee to f11adcf on February 21, 2026 03:31
@deven96 deven96 force-pushed the feat/clap-audio-embeddings branch from f11adcf to 63c101a on February 21, 2026 03:34
@deven96 deven96 merged commit 115767f into main Feb 21, 2026
7 checks passed
@deven96 deven96 deleted the feat/clap-audio-embeddings branch February 21, 2026 03:51