Improve speaker identification using autoresearch/evo approach

## Problem

Current speaker identification relies on:
1. **Cosine similarity** between speaker embeddings and stored profiles (`scoreSpeakerProfile` in `store.ts`)
2. **Hard-coded thresholds** — 0.7 (≥3 samples) or 0.74 (<3 samples) minimum score, plus a 0.04 margin gate when score < 0.82
3. **Simple centroid + best-sample scoring** — `Math.max(bestSampleScore, centroidScore)`

This works, but the thresholds and scoring strategy were hand-tuned. There is no benchmark that measures how well speaker ID actually performs, and no systematic way to improve it.
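
For reference when building the baseline, the rule described above could be reconstructed roughly as follows. This is a sketch written from the description in this issue, not the code in `store.ts`: `Profile`, `scoreProfile`, and `recommend` are hypothetical names, and it assumes the 0.04 margin is measured against the runner-up profile and that the sample-count rule looks at the top-scoring profile.

```typescript
type Embedding = number[];

interface Profile {
  id: string;
  samples: Embedding[]; // one stored embedding per enrolled sample
}

function cosine(a: Embedding, b: Embedding): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Plain mean; cosine similarity ignores scale, so no normalization is needed here.
function centroid(samples: Embedding[]): Embedding {
  const dim = samples[0].length;
  const mean: number[] = [];
  for (let i = 0; i < dim; i++) {
    let sum = 0;
    for (const s of samples) sum += s[i];
    mean.push(sum / samples.length);
  }
  return mean;
}

// Score = max(best single-sample similarity, centroid similarity).
function scoreProfile(query: Embedding, profile: Profile): number {
  if (profile.samples.length === 0) return 0;
  let bestSampleScore = -1;
  for (const s of profile.samples) {
    bestSampleScore = Math.max(bestSampleScore, cosine(query, s));
  }
  const centroidScore = cosine(query, centroid(profile.samples));
  return Math.max(bestSampleScore, centroidScore);
}

function recommend(query: Embedding, profiles: Profile[]): string | null {
  const scored = profiles
    .map((p) => ({ profile: p, score: scoreProfile(query, p) }))
    .sort((a, b) => b.score - a.score);
  if (scored.length === 0) return null;

  const top = scored[0];
  // Minimum score: 0.7 with at least 3 samples, otherwise 0.74.
  const minScore = top.profile.samples.length >= 3 ? 0.7 : 0.74;
  if (top.score < minScore) return null;

  // Margin gate (assumed): below 0.82 the top score must lead the runner-up by 0.04.
  const runnerUp = scored.length > 1 ? scored[1].score : -Infinity;
  if (top.score < 0.82 && top.score - runnerUp < 0.04) return null;

  return top.profile.id;
}
```

Having the rule in one small function makes it easy to treat every constant (0.7, 0.74, 0.04, 0.82) as a parameter the optimization loop can vary.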

## Approach: Autoresearch loop (Evo-style)

Use [evo](https://github.com/evo-hq/evo), or an approach modeled on it. Evo is an autoresearch loop that discovers a benchmark, runs a baseline, and then spawns parallel agents to beat it. Its key properties:

- **Tree search over greedy hill-climb** — multiple forks from any committed improvement
- **N parallel agents in git worktrees** — each tries a different hypothesis
- **Shared failure traces** — agents don't repeat each other's mistakes
- **Regression gates** — changes that break existing correct matches get discarded
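
The regression gate in the last bullet could be checked with something like the sketch below. The `MatchResult` shape and both function names are hypothetical; the sketch only assumes that the benchmark emits one result per labeled segment for both the baseline and the candidate change.

```typescript
interface MatchResult {
  segmentId: string;
  expected: string | null; // ground-truth profile id, null for an unknown speaker
  predicted: string | null; // profile id suggested by the matcher, null for no suggestion
}

// Returns the ids of segments the baseline got right but the candidate gets wrong.
function findRegressions(baseline: MatchResult[], candidate: MatchResult[]): string[] {
  const candidateById: { [id: string]: MatchResult | undefined } = {};
  for (const r of candidate) candidateById[r.segmentId] = r;

  const regressions: string[] = [];
  for (const b of baseline) {
    if (b.predicted !== b.expected) continue; // baseline was already wrong here
    const c = candidateById[b.segmentId];
    // A missing result counts as a regression too.
    if (c === undefined || c.predicted !== c.expected) regressions.push(b.segmentId);
  }
  return regressions;
}

function passesRegressionGate(baseline: MatchResult[], candidate: MatchResult[]): boolean {
  return findRegressions(baseline, candidate).length === 0;
}
```

Returning the regressed segment ids, rather than only a boolean, gives the shared failure trace something concrete to record.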

## What to measure

Build a benchmark dataset of meetings with ground-truth speaker labels. Metrics:

- **Accuracy** — % of speakers correctly identified against known profiles
- **Precision/Recall** — false matches vs missed matches
- **Confidence calibration** — does a 0.85 confidence actually mean 85% correct?
- **Threshold sensitivity** — how much do results change with threshold tweaks?
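
A minimal sketch of how the first three metrics might be computed from benchmark output. The `Trial` shape is an assumption, precision and recall are defined here as "correct among suggested" and "correct among known speakers", and calibration is summarized with a simple binned expected calibration error (ECE), which is one common choice rather than anything this issue prescribes.

```typescript
interface Trial {
  expected: string | null; // ground-truth profile id, null if the speaker has no profile
  predicted: string | null; // suggested profile id, null if nothing was suggested
  confidence: number; // matcher confidence in [0, 1]; ignored when predicted is null
}

interface Metrics {
  accuracy: number;
  precision: number;
  recall: number;
  ece: number;
}

function evaluate(trials: Trial[], bins: number = 10): Metrics {
  let correct = 0;
  let suggested = 0;
  let truePositives = 0;
  let known = 0;
  const binConfidence: number[] = [];
  const binHits: number[] = [];
  const binCount: number[] = [];
  for (let b = 0; b < bins; b++) {
    binConfidence.push(0);
    binHits.push(0);
    binCount.push(0);
  }

  for (const t of trials) {
    const hit = t.predicted === t.expected;
    if (hit) correct++;
    if (t.expected !== null) known++;
    if (t.predicted !== null) {
      suggested++;
      if (hit) truePositives++;
      const b = Math.min(bins - 1, Math.max(0, Math.floor(t.confidence * bins)));
      binConfidence[b] += t.confidence;
      binHits[b] += hit ? 1 : 0;
      binCount[b]++;
    }
  }

  // ECE: weighted gap between mean confidence and observed accuracy per bin.
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    if (binCount[b] === 0) continue;
    const gap = Math.abs(binHits[b] / binCount[b] - binConfidence[b] / binCount[b]);
    ece += (binCount[b] / suggested) * gap;
  }

  // Undefined ratios (empty denominators) are reported as 0.
  return {
    accuracy: trials.length > 0 ? correct / trials.length : 0,
    precision: suggested > 0 ? truePositives / suggested : 0,
    recall: known > 0 ? truePositives / known : 0,
    ece,
  };
}
```

An ECE near 0 would mean that a 0.85 confidence does correspond to roughly 85% correct suggestions.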

## What to explore

The autoresearch loop should explore improvements across the full stack:

### Scoring strategy (`store.ts`)
- Weighted combination of centroid + sample scores instead of simple `Math.max`
- Top-K sample averaging instead of single best sample
- Score normalization across profiles (relative ranking vs absolute threshold)
- Adaptive thresholds based on profile quality (sample count, embedding variance)
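
Two of these ideas, top-K averaging and a weighted blend, could look like the sketch below. `topKMean` and `blendedScore` are hypothetical helpers, and the defaults (`k = 3`, `centroidWeight = 0.5`) are placeholders for the loop to tune, not recommendations.

```typescript
// Mean of the k highest per-sample similarity scores.
function topKMean(sampleScores: number[], k: number): number {
  if (sampleScores.length === 0 || k <= 0) return 0;
  const sorted = sampleScores.slice().sort((a, b) => b - a);
  const n = Math.min(k, sorted.length);
  let sum = 0;
  for (let i = 0; i < n; i++) sum += sorted[i];
  return sum / n;
}

// Convex combination of the centroid score and the top-K sample score.
function blendedScore(
  sampleScores: number[],
  centroidScore: number,
  k: number = 3,
  centroidWeight: number = 0.5
): number {
  const sampleScore = topKMean(sampleScores, k);
  return centroidWeight * centroidScore + (1 - centroidWeight) * sampleScore;
}
```

With `k = 1` and `centroidWeight = 0` this reduces to the best single-sample score, so the search space contains one half of the existing `Math.max` rule as a special case.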

### Embedding quality (Swift layer)
- Segment selection strategy — which diarization segments to embed (currently all)
- Minimum segment duration filtering
- Embedding aggregation — mean vs weighted mean vs attention-pooled centroids
- Per-sample quality scoring (reject noisy/short segments)
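
As an illustration of duration filtering combined with a weighted mean, here is a sketch that weights each segment embedding by its duration. It is written in TypeScript for consistency with the other examples, although the real change would live in the Swift layer. The `SegmentEmbedding` shape, the function name, and the 1-second default are all assumptions.

```typescript
interface SegmentEmbedding {
  durationSec: number;
  vector: number[];
}

// Drops segments shorter than minDurationSec, then returns the L2-normalized,
// duration-weighted mean of the remaining embeddings (or null if none remain).
function durationWeightedCentroid(
  segments: SegmentEmbedding[],
  minDurationSec: number = 1.0
): number[] | null {
  const kept = segments.filter((s) => s.durationSec >= minDurationSec);
  if (kept.length === 0) return null;

  const dim = kept[0].vector.length;
  const sum: number[] = [];
  for (let i = 0; i < dim; i++) sum.push(0);
  for (const s of kept) {
    for (let i = 0; i < dim; i++) sum[i] += s.durationSec * s.vector[i];
  }

  let norm = 0;
  for (const v of sum) norm += v * v;
  norm = Math.sqrt(norm);
  if (norm === 0) return null;
  return sum.map((v) => v / norm);
}
```

Because the result is normalized, dividing by the total duration is unnecessary: only the relative weights matter.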

### Profile management
- Automatic outlier detection in stored samples
- Profile convergence metrics — when does a profile have "enough" samples?
- Cross-meeting consistency checks
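
One possible form of outlier detection is a leave-one-out check: compare each stored sample with the centroid of the other samples and flag it if the similarity falls below a floor. The function name and the 0.5 floor are hypothetical; a suitable floor would need to come out of the benchmark.

```typescript
function cosineSim(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the indices of samples that disagree with the rest of the profile.
function findOutlierSamples(samples: number[][], minSimilarity: number = 0.5): number[] {
  if (samples.length < 3) return []; // too few samples to judge
  const dim = samples[0].length;
  const outliers: number[] = [];

  for (let i = 0; i < samples.length; i++) {
    // Centroid of every sample except sample i.
    const others: number[] = [];
    for (let d = 0; d < dim; d++) others.push(0);
    for (let j = 0; j < samples.length; j++) {
      if (j === i) continue;
      for (let d = 0; d < dim; d++) others[d] += samples[j][d];
    }
    if (cosineSim(samples[i], others) < minSimilarity) outliers.push(i);
  }
  return outliers;
}
```

Leaving the candidate out of the centroid keeps a single bad sample from pulling the reference point toward itself, which matters most for small profiles.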

### Matching logic
- Two-stage matching: fast centroid screen → detailed sample comparison
- Speaker verification (1:1) vs identification (1:N) distinction
- Temporal priors — if speaker A was in the last 3 meetings, they're likely in this one
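
The two-stage idea in the first bullet could be sketched as below. `IndexedProfile`, `twoStageMatch`, the 0.5 screen threshold, and the shortlist size of 3 are all assumptions made for illustration; the final accept/reject threshold is left to the caller.

```typescript
interface IndexedProfile {
  id: string;
  centroid: number[]; // precomputed profile centroid
  samples: number[][]; // stored sample embeddings
}

function similarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

function twoStageMatch(
  query: number[],
  profiles: IndexedProfile[],
  screenThreshold: number = 0.5,
  shortlistSize: number = 3
): { id: string; score: number } | null {
  // Stage 1: cheap screen, one comparison per profile.
  const shortlist = profiles
    .map((p) => ({ profile: p, centroidScore: similarity(query, p.centroid) }))
    .filter((c) => c.centroidScore >= screenThreshold)
    .sort((a, b) => b.centroidScore - a.centroidScore)
    .slice(0, shortlistSize);

  // Stage 2: detailed comparison against every sample of the shortlisted profiles.
  let best: { id: string; score: number } | null = null;
  for (const c of shortlist) {
    let score = c.centroidScore;
    for (const s of c.profile.samples) score = Math.max(score, similarity(query, s));
    if (best === null || score > best.score) best = { id: c.profile.id, score };
  }
  return best;
}
```

The screen only pays off when there are many profiles or many samples per profile, so the benchmark should report latency alongside accuracy for this variant.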

## Current architecture reference

```
Diarization → Segments with speaker IDs (Swift/CoreML sortformer)
     ↓
Embedding extraction → Per-speaker embedding vectors (Swift speech_bridge)
     ↓
Profile matching → cosine similarity against stored profiles (TS store.ts)
     ↓
Suggestion → recommendSpeakerProfile() returns best match above threshold
```

Key files:
- `src/store.ts` — `cosineSimilarity()`, `scoreSpeakerProfile()`, `recommendSpeakerProfile()`, `normalizedEmbeddingCentroid()`
- `src-tauri/swift-permissions/src/speech_bridge.swift` — embedding extraction, diarization segment processing, centroid computation
- `src-tauri/src/lib.rs` — `StoredSpeakerProfile`, `analyze_speaker_embeddings` command
- `src-tauri/src/asr.rs` — `SpeakerEmbeddingPayload`, `FileSpeakerEmbeddingPayload`

## Implementation plan

1. **Build benchmark harness** — collect ground-truth labeled meetings, define metrics, run baseline
2. **Set up evo** — point it at the speaker ID codebase, configure benchmark as the optimization target
3. **Run optimization loop** — let parallel agents explore scoring, thresholds, embedding strategies
4. **Gate on regression** — any change must not regress existing correct matches
5. **Ship the winner** — commit the best-performing configuration
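
For step 1, the baseline run could also record how precision and recall move as the minimum score varies, which covers the threshold-sensitivity metric. The sketch below assumes the harness stores the top-scoring profile and its score for each labeled segment; `ScoredTrial` and `sweepThreshold` are hypothetical names.

```typescript
interface ScoredTrial {
  expected: string | null; // ground-truth profile id, null for an unknown speaker
  bestProfile: string; // top-scoring profile id before any threshold is applied
  bestScore: number;
}

interface SweepPoint {
  threshold: number;
  precision: number;
  recall: number;
}

// Re-applies a single minimum-score threshold to already-scored trials.
function sweepThreshold(trials: ScoredTrial[], thresholds: number[]): SweepPoint[] {
  return thresholds.map((threshold) => {
    let suggested = 0;
    let truePositives = 0;
    let known = 0;
    for (const t of trials) {
      if (t.expected !== null) known++;
      if (t.bestScore >= threshold) {
        suggested++;
        if (t.bestProfile === t.expected) truePositives++;
      }
    }
    // Undefined ratios (empty denominators) are reported as 0.
    return {
      threshold,
      precision: suggested > 0 ? truePositives / suggested : 0,
      recall: known > 0 ? truePositives / known : 0,
    };
  });
}
```

Because scoring is done once and only the threshold is re-applied, a dense sweep is cheap enough to run on every candidate change.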

