This repository was archived by the owner on Apr 28, 2026. It is now read-only.
This works but the thresholds and scoring strategy were hand-tuned. There is no benchmark to measure how well speaker ID actually performs, and no systematic way to improve it.
Approach: Autoresearch loop (Evo-style)
Use evo or its approach — an autoresearch loop that discovers a benchmark, runs baseline, then spawns parallel agents to beat it:
Tree search instead of greedy hill-climbing: multiple forks can branch from any committed improvement
N parallel agents in git worktrees — each tries a different hypothesis
Shared failure traces — agents don't repeat each other's mistakes
Regression gates — changes that break existing correct matches get discarded
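The regression-gate idea above can be sketched as a small pure function. This is only an illustration under stated assumptions: `BenchmarkCase`, `Predictions`, `GateResult`, and `regressionGate` are hypothetical names, not part of evo or of this repository, and the acceptance rule (no regressions plus a strict gain in correct matches) is one possible policy.

```typescript
// Hypothetical shapes for illustration; none of these names come from the repository or from evo.
interface BenchmarkCase {
  id: string;                // one labeled speaker in one benchmark meeting
  expectedProfileId: string; // ground-truth profile for that speaker
}

// Case id -> predicted profile id (null means "no match returned").
type Predictions = Map<string, string | null>;

interface GateResult {
  accepted: boolean;
  regressions: string[]; // cases the baseline got right and the candidate got wrong
  baselineCorrect: number;
  candidateCorrect: number;
}

// Assumed policy: discard a candidate change if it breaks any previously
// correct match, and accept it only if it also fixes at least one case.
function regressionGate(
  cases: BenchmarkCase[],
  baseline: Predictions,
  candidate: Predictions,
): GateResult {
  const regressions: string[] = [];
  let baselineCorrect = 0;
  let candidateCorrect = 0;

  for (const c of cases) {
    const wasCorrect = baseline.get(c.id) === c.expectedProfileId;
    const isCorrect = candidate.get(c.id) === c.expectedProfileId;
    if (wasCorrect) baselineCorrect++;
    if (isCorrect) candidateCorrect++;
    if (wasCorrect && !isCorrect) regressions.push(c.id);
  }

  const accepted = regressions.length === 0 && candidateCorrect > baselineCorrect;
  return { accepted, regressions, baselineCorrect, candidateCorrect };
}
```

Returning the list of regressed case ids, rather than a bare boolean, gives the loop something concrete to write into the shared failure traces.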
What to measure
Build a benchmark dataset of meetings with ground-truth speaker labels. Metrics:
Accuracy — % of speakers correctly identified against known profiles
Precision/Recall — false matches vs missed matches
Confidence calibration — does a 0.85 confidence actually mean 85% correct?
Threshold sensitivity — how much do results change with threshold tweaks?
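As a rough sketch of how these four metrics might be computed, the block below works on a flat list of per-speaker predictions. Everything here is an assumption made for illustration: the `Prediction` shape, the function names, the use of expected calibration error (ECE) with equal-width bins as the calibration measure, and the convention that a wrong match counts as both a false match and a missed match.

```typescript
// Hypothetical record for one benchmark speaker; not a type from the repository.
interface Prediction {
  truth: string | null;     // ground-truth profile id, null if the speaker has no profile
  predicted: string | null; // top-ranked profile id, null if no match was returned
  confidence: number;       // score of the top-ranked profile, assumed to lie in [0, 1]
}

interface Metrics {
  accuracy: number;
  precision: number;
  recall: number;
}

const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);

function computeMetrics(preds: Prediction[]): Metrics {
  let correct = 0;
  let truePos = 0;
  let falsePos = 0;
  let falseNeg = 0;
  for (const p of preds) {
    if (p.predicted === p.truth) correct++; // includes correct "no match" outcomes
    if (p.predicted !== null) {
      if (p.predicted === p.truth) truePos++;
      else falsePos++; // false match
    }
    if (p.truth !== null && p.predicted !== p.truth) falseNeg++; // missed match
  }
  return {
    accuracy: ratio(correct, preds.length),
    precision: ratio(truePos, truePos + falsePos),
    recall: ratio(truePos, truePos + falseNeg),
  };
}

// Expected calibration error over equal-width confidence bins, computed on
// returned matches only: the weighted mean of |bin accuracy - bin mean confidence|.
function expectedCalibrationError(preds: Prediction[], binCount = 10): number {
  const matched = preds.filter((p) => p.predicted !== null);
  if (matched.length === 0) return 0;
  const bins = Array.from({ length: binCount }, () => ({ n: 0, confSum: 0, correct: 0 }));
  for (const p of matched) {
    const idx = Math.min(binCount - 1, Math.max(0, Math.floor(p.confidence * binCount)));
    bins[idx].n++;
    bins[idx].confSum += p.confidence;
    if (p.predicted === p.truth) bins[idx].correct++;
  }
  let ece = 0;
  for (const b of bins) {
    if (b.n === 0) continue;
    ece += (b.n / matched.length) * Math.abs(b.correct / b.n - b.confSum / b.n);
  }
  return ece;
}

// Threshold sensitivity: re-apply each threshold to the top-ranked candidates
// and recompute the metrics, so the curves can be compared side by side.
function sweepThresholds(
  preds: Prediction[],
  thresholds: number[],
): Array<{ threshold: number } & Metrics> {
  return thresholds.map((t) => ({
    threshold: t,
    ...computeMetrics(
      preds.map((p) => ({ ...p, predicted: p.confidence >= t ? p.predicted : null })),
    ),
  }));
}
```

For the sweep to be meaningful, `predicted` should hold the top-ranked candidate before any threshold is applied; otherwise thresholds below the one already used cannot recover rejected matches.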
What to explore
The autoresearch loop should explore improvements across the full stack:
Scoring strategy (store.ts)
Weighted combination of centroid + sample scores instead of simple Math.max
Top-K sample averaging instead of single best sample
Score normalization across profiles (relative ranking vs absolute threshold)
Adaptive thresholds based on profile quality (sample count, embedding variance)
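To make the scoring alternatives above concrete, here is a minimal sketch that puts the `Math.max` combination next to a weighted centroid plus top-K variant and a sample-count-aware threshold. It is not the code in store.ts: the `Profile` shape, the local `cosine` helper, all function names, and the default weights and penalties are assumptions chosen for illustration, and the tuned values are exactly what the autoresearch loop would be expected to search over.

```typescript
// Hypothetical profile shape; the real StoredSpeakerProfile may differ.
interface Profile {
  id: string;
  centroid: number[];
  samples: number[][];
}

// Local stand-in for a cosine similarity helper.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Baseline as described in the issue: max of best sample score and centroid score.
function baselineScore(query: number[], profile: Profile): number {
  const centroidScore = cosine(query, profile.centroid);
  const sampleScores = profile.samples.map((s) => cosine(query, s));
  const bestSampleScore = sampleScores.length > 0 ? Math.max(...sampleScores) : centroidScore;
  return Math.max(bestSampleScore, centroidScore);
}

// Mean of the K highest scores, so one outlier sample cannot decide the match alone.
function topKMean(scores: number[], k: number): number {
  const top = [...scores].sort((x, y) => y - x).slice(0, Math.max(1, k));
  return top.reduce((sum, v) => sum + v, 0) / top.length;
}

// Weighted combination of centroid score and top-K sample average.
// centroidWeight and k are placeholder defaults, not tuned values.
function weightedScore(query: number[], profile: Profile, centroidWeight = 0.5, k = 3): number {
  const centroidScore = cosine(query, profile.centroid);
  if (profile.samples.length === 0) return centroidScore;
  const sampleScore = topKMean(profile.samples.map((s) => cosine(query, s)), k);
  return centroidWeight * centroidScore + (1 - centroidWeight) * sampleScore;
}

// Adaptive threshold: demand a higher score from profiles with few samples.
// penalty and minSamples are illustrative knobs.
function adaptiveThreshold(
  baseThreshold: number,
  sampleCount: number,
  penalty = 0.05,
  minSamples = 5,
): number {
  const shortfall = Math.max(0, minSamples - sampleCount) / minSamples;
  return Math.min(1, baseThreshold + penalty * shortfall);
}
```

Keeping each strategy as a function with the same `(query, profile)` inputs would let the benchmark swap them without touching the matching logic around them.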
Embedding quality (Swift layer)
Segment selection strategy — which diarization segments to embed (currently all)
Minimum segment duration filtering
Embedding aggregation — mean vs weighted mean vs attention-pooled centroids
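Two of the ideas above, minimum segment duration filtering and a weighted-mean centroid, can be sketched together. The issue places this logic in the Swift layer; the block below uses TypeScript only to show the arithmetic. `SegmentEmbedding`, `durationWeightedCentroid`, the choice of duration as the weight, and the 1.0 second default are all assumptions for illustration.

```typescript
// Hypothetical shape for one diarization segment's embedding.
interface SegmentEmbedding {
  durationSec: number;
  embedding: number[];
}

function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Drops segments shorter than minDurationSec, then builds a unit-length
// centroid in which each remaining segment is weighted by its duration.
// Returns null when no segment survives the filter.
function durationWeightedCentroid(
  segments: SegmentEmbedding[],
  minDurationSec = 1.0,
): number[] | null {
  const kept = segments.filter((s) => s.durationSec >= minDurationSec);
  if (kept.length === 0) return null;

  const dim = kept[0].embedding.length;
  const sum = new Array<number>(dim).fill(0);
  for (const seg of kept) {
    // Normalize first so weight comes from duration, not embedding magnitude.
    const unit = l2Normalize(seg.embedding);
    for (let i = 0; i < dim; i++) {
      sum[i] += seg.durationSec * unit[i];
    }
  }
  // Dividing by the total weight is unnecessary: the final normalization
  // removes any constant scale factor.
  return l2Normalize(sum);
}
```

Setting every `durationSec` to the same value reduces this to a plain mean centroid, which makes the two aggregation options easy to compare on the same benchmark.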
Problem
Current speaker identification relies on:
scoreSpeakerProfile (in store.ts)
Math.max(bestSampleScore, centroidScore)
This works but the thresholds and scoring strategy were hand-tuned. There is no benchmark to measure how well speaker ID actually performs, and no systematic way to improve it.
Further areas to explore:
Profile management
Matching logic
Current architecture reference
Key files:
src/store.ts: cosineSimilarity(), scoreSpeakerProfile(), recommendSpeakerProfile(), normalizedEmbeddingCentroid()
src-tauri/swift-permissions/src/speech_bridge.swift: embedding extraction, diarization segment processing, centroid computation
src-tauri/src/lib.rs: StoredSpeakerProfile, analyze_speaker_embeddings command
src-tauri/src/asr.rs: SpeakerEmbeddingPayload, FileSpeakerEmbeddingPayload
Implementation plan