[Stack 6/17] Fix D2: in-conv participant threshold + D2c vote count source #2513

Open
jucor wants to merge 1 commit into spr/edge/bdc830db from spr/edge/c0a682ec
@jucor
Collaborator

@jucor jucor commented Mar 30, 2026

Summary

Fixes the in-conv participant threshold (D2), vote count source (D2c), and base-cluster sort order (D2b) to match Clojure. Adds monotonicity guard tests (D2d).

D2: In-conv threshold

  • Before: threshold = 7 + sqrt(n_cmts) * 0.1 — increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)
  • After: threshold = min(7, n_cmts) — matches Clojure exactly
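
To make the difference concrete, here is a minimal sketch of the two formulas quoted above. The function names and the `>=` comparison in `is_in_conv` are assumptions made for illustration; they are not taken from the Delphi code.

```python
import math


def old_threshold(n_cmts: int) -> float:
    """Pre-fix formula as quoted in this PR: grows with the comment count."""
    return 7 + math.sqrt(n_cmts) * 0.1


def new_threshold(n_cmts: int) -> int:
    """Post-fix formula as quoted in this PR: capped at 7."""
    return min(7, n_cmts)


def is_in_conv(vote_count: int, threshold: float) -> bool:
    # Assumption: a participant is in-conv when their vote count
    # is greater than or equal to the threshold.
    return vote_count >= threshold


n_cmts = 314  # biodiversity's comment count, from the PR description
print(round(old_threshold(n_cmts), 1))  # 8.8
print(new_threshold(n_cmts))            # 7
```

Under that assumed comparison, a participant with 7 or 8 votes on biodiversity is excluded by the old formula and admitted by the new one, which is consistent with the in-conv count going up after the fix.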

D2b: Base-cluster sort order (from Copilot review)

  • Before: Base clusters sorted by size (descending) with IDs reassigned — changes encounter order of centers fed into group-level k-means
  • After: Keep k-means ID order, matching Clojure's (sort-by :id ...)
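
The effect of the two orderings can be sketched on a toy list of base clusters. The dictionary layout (`id`, `members`, `center`) and both function names are hypothetical and only illustrate why the encounter order of centers changes.

```python
# Hypothetical base clusters as produced by a first k-means pass.
clusters = [
    {"id": 0, "members": [1, 2], "center": (0.1, 0.2)},
    {"id": 1, "members": [3, 4, 5, 6], "center": (0.9, 0.8)},
    {"id": 2, "members": [7, 8, 9], "center": (-0.5, 0.4)},
]


def order_by_size_and_reassign(base_clusters):
    """Old behaviour: largest cluster first, IDs renumbered from 0."""
    ordered = sorted(base_clusters, key=lambda c: len(c["members"]), reverse=True)
    return [{**c, "id": new_id} for new_id, c in enumerate(ordered)]


def order_by_id(base_clusters):
    """New behaviour: keep k-means IDs, analogous to (sort-by :id ...)."""
    return sorted(base_clusters, key=lambda c: c["id"])


centers_old = [c["center"] for c in order_by_size_and_reassign(clusters)]
centers_new = [c["center"] for c in order_by_id(clusters)]
print(centers_old)  # [(0.9, 0.8), (-0.5, 0.4), (0.1, 0.2)]
print(centers_new)  # [(0.1, 0.2), (0.9, 0.8), (-0.5, 0.4)]
```

The same set of centers reaches the group-level step in a different order, which matters for any algorithm whose result depends on the order of its initial centers.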

D2c: Vote count source (raw vs filtered matrix)

  • Before: _compute_user_vote_counts and n_cmts used self.rating_mat (filtered — moderated-out comment columns removed). A participant who voted on 8 comments could drop to 5 visible votes after 3 comments were moderated-out, falling below threshold.
  • After: Both use self.raw_rating_mat (includes all votes, even on moderated-out comments), matching Clojure's user-vote-counts (conversation.clj:217-225) which reads from raw-rating-mat.
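
The 8-votes-to-5 scenario above can be reproduced with a small pure-Python model. The nested-dict representation and the helper names are assumptions made for this sketch; the real code works on rating matrices.

```python
# Hypothetical raw votes: participant -> {comment_id: vote}.
raw_votes = {
    "alice": {0: 1, 1: -1, 2: 1, 3: 1, 4: -1, 5: 1, 6: 1, 7: -1},  # 8 votes
}
all_comments = set(range(10))  # 10 comments in the conversation
moderated_out = {5, 6, 7}      # 3 comments later moderated out


def vote_count(votes, pid, hidden=frozenset()):
    """Count a participant's votes, optionally ignoring hidden comments."""
    return sum(1 for tid in votes[pid] if tid not in hidden)


def threshold(n_cmts):
    return min(7, n_cmts)


# Before the fix: counts and n_cmts come from the filtered view.
filtered_count = vote_count(raw_votes, "alice", hidden=moderated_out)
filtered_n_cmts = len(all_comments - moderated_out)

# After the fix: counts and n_cmts come from the raw view.
raw_count = vote_count(raw_votes, "alice")
raw_n_cmts = len(all_comments)

print(filtered_count, threshold(filtered_n_cmts))  # 5 7
print(raw_count, threshold(raw_n_cmts))            # 8 7
```

With the filtered view the participant falls below the threshold after moderation; with the raw view they stay above it.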

D2d: In-conv monotonicity (design decision)

Python does a full recompute from raw_rating_mat every time, so monotonicity ("once in, always in") is guaranteed without persistence: votes are immutable in PostgreSQL, so a participant's count never decreases. This is strictly better than Clojure's approach, which persists in-conv to math_main because it uses delta vote processing.

5 guard tests (T1-T5) document this invariant and warn that switching to delta processing would require persisting in-conv to DynamoDB (ref: #2358).
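
The invariant these guard tests describe can be sketched as follows. This is not the PR's test code: the append-only `vote_log`, the function names, and the choice to hold `n_cmts` fixed at a value of at least 7 (so the threshold itself does not move) are all assumptions of the sketch.

```python
def in_conv(vote_log, n_cmts):
    """Full recompute: rebuild per-participant counts from the whole log."""
    voted_on = {}
    for pid, tid in vote_log:
        voted_on.setdefault(pid, set()).add(tid)
    threshold = min(7, n_cmts)
    return {pid for pid, tids in voted_on.items() if len(tids) >= threshold}


def is_monotone(vote_log, n_cmts):
    """Check that the in-conv set never loses a member as the log grows."""
    previous = set()
    for end in range(len(vote_log) + 1):
        current = in_conv(vote_log[:end], n_cmts)
        if not previous <= current:
            return False
        previous = current
    return True


# Append-only log of (participant_id, comment_id) pairs.
vote_log = [("p1", t) for t in range(8)] + [("p2", t) for t in range(3)]
print(is_monotone(vote_log, n_cmts=10))     # True
print(sorted(in_conv(vote_log, n_cmts=10)))  # ['p1']
```

Because the log only grows, each participant's count only grows, so a recompute over any longer prefix keeps everyone who was already admitted.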

Impact

  • biodiversity: 428 → 441 in-conv participants (now matches Clojure)
  • Verified on 4 datasets with complete Clojure cold-start blobs

Incremental vs cold-start blob testing

D2 tests run against both cold-start and incremental Clojure blobs (infrastructure from #2420):

  • Cold-start blobs are computed in one pass on the full dataset. The in-conv threshold min(7, n_cmts) is evaluated once with the final n_cmts. Python matches these exactly.
  • Incremental blobs were built progressively as votes trickled in over the conversation's lifetime. The threshold was evaluated at each iteration with a smaller n_cmts, admitting a few extra participants during earlier iterations. The difference is tiny (1–2 participants).

D2 tests on incremental blobs are currently xfailed with an explanatory comment. Matching incremental behaviour exactly would require simulating the progressive threshold — tracked as future work under Replay Infrastructure.
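
The gap between cold-start and incremental blobs can be illustrated with a simplified model. The snapshot format and both function names are hypothetical, and the persistence step is only a stand-in for how an incremental run carries in-conv membership forward.

```python
def cold_start_in_conv(snapshots):
    """One pass on the full dataset: only the final n_cmts is used."""
    n_cmts, counts = snapshots[-1]
    threshold = min(7, n_cmts)
    return {pid for pid, count in counts.items() if count >= threshold}


def incremental_in_conv(snapshots):
    """Progressive run: the threshold is re-evaluated at each iteration
    and membership, once granted, is kept (assumed persistence)."""
    admitted = set()
    for n_cmts, counts in snapshots:
        threshold = min(7, n_cmts)
        admitted |= {pid for pid, count in counts.items() if count >= threshold}
    return admitted


# Each snapshot: (n_cmts at that time, cumulative vote counts so far).
snapshots = [
    (3, {"early": 3}),               # threshold is 3, "early" is admitted
    (10, {"early": 3, "late": 8}),   # threshold is 7, "early" would not qualify
]
print(sorted(cold_start_in_conv(snapshots)))   # ['late']
print(sorted(incremental_in_conv(snapshots)))  # ['early', 'late']
```

In this toy case the incremental run admits one extra participant, the same kind of small discrepancy the xfailed tests document.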

Test results

253 passed, 5 skipped, 36 xfailed (0 failures)

Test plan

  • D2 tests pass on all datasets with complete Clojure cold-start blobs
  • D2c: 3 synthetic tests verify vote counts include moderated-out votes, n_cmts includes moderated-out comments, participants stay in-conv after moderation
  • D2d: 5 monotonicity tests (basic across updates, survives moderation, worker restart + moderation, restart without new votes, mixed participants)
  • D2 tests xfail on incremental blobs (with explanatory comments)
  • Full test suite: 253 passed, 0 failures
  • Golden snapshots re-recorded for affected datasets

🤖 Generated with Claude Code

Squashed commits

  • Fix D2: in-conv threshold min(7, n_cmts) to match Clojure
  • Skip D2 tests on datasets with incomplete Clojure blobs
  • Address Copilot review: fix base-cluster sort order (D2b) and stale comment
  • Add PR 1 test results to journal
  • Plan: add D2c (vote count source) and D2d (in-conv monotonicity) to fix plan
  • Journal: add session 3 findings (D2c vote count source, D2d monotonicity)
  • Re-record golden snapshots and remove passing xfail markers
  • xfail D2 in-conv tests on incremental blobs
  • Journal: add session 4, update plan with D2 incremental in Replay PR B
  • Fix D2c: use raw_rating_mat for vote counts and n_cmts threshold

commit-id:c0a682ec


Stack:


⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

@jucor jucor changed the title Fix D2: in-conv participant threshold + D2c vote count source [Stack 6/17] Fix D2: in-conv participant threshold + D2c vote count source Mar 30, 2026
@jucor jucor force-pushed the spr/edge/c0a682ec branch 2 times, most recently from 43304ef to 02284b0 Compare March 30, 2026 22:47
@jucor jucor force-pushed the spr/edge/c0a682ec branch from 02284b0 to 7f20a34 Compare March 31, 2026 00:35
@github-actions

Delphi Coverage Report

| File | Stmts | Miss | Cover |
|---|---:|---:|---:|
| __init__.py | 2 | 0 | 100% |
| benchmarks/bench_pca.py | 76 | 76 | 0% |
| benchmarks/bench_repness.py | 81 | 81 | 0% |
| benchmarks/bench_update_votes.py | 38 | 38 | 0% |
| benchmarks/benchmark_utils.py | 34 | 34 | 0% |
| components/__init__.py | 1 | 0 | 100% |
| components/config.py | 165 | 133 | 19% |
| conversation/__init__.py | 2 | 0 | 100% |
| conversation/conversation.py | 1117 | 328 | 71% |
| conversation/manager.py | 131 | 42 | 68% |
| database/__init__.py | 1 | 0 | 100% |
| database/dynamodb.py | 387 | 234 | 40% |
| database/postgres.py | 305 | 205 | 33% |
| pca_kmeans_rep/__init__.py | 5 | 0 | 100% |
| pca_kmeans_rep/clusters.py | 257 | 22 | 91% |
| pca_kmeans_rep/corr.py | 98 | 17 | 83% |
| pca_kmeans_rep/pca.py | 52 | 16 | 69% |
| pca_kmeans_rep/repness.py | 361 | 47 | 87% |
| pca_kmeans_rep/stats.py | 107 | 22 | 79% |
| regression/__init__.py | 4 | 0 | 100% |
| regression/clojure_comparer.py | 188 | 17 | 91% |
| regression/comparer.py | 887 | 720 | 19% |
| regression/datasets.py | 135 | 27 | 80% |
| regression/recorder.py | 36 | 27 | 25% |
| regression/utils.py | 137 | 118 | 14% |
| run_math_pipeline.py | 260 | 114 | 56% |
| umap_narrative/500_generate_embedding_umap_cluster.py | 210 | 109 | 48% |
| umap_narrative/501_calculate_comment_extremity.py | 112 | 54 | 52% |
| umap_narrative/502_calculate_priorities.py | 135 | 135 | 0% |
| umap_narrative/700_datamapplot_for_layer.py | 502 | 502 | 0% |
| umap_narrative/701_static_datamapplot_for_layer.py | 310 | 310 | 0% |
| umap_narrative/702_consensus_divisive_datamapplot.py | 432 | 432 | 0% |
| umap_narrative/801_narrative_report_batch.py | 785 | 785 | 0% |
| umap_narrative/802_process_batch_results.py | 265 | 265 | 0% |
| umap_narrative/803_check_batch_status.py | 175 | 175 | 0% |
| umap_narrative/llm_factory_constructor/__init__.py | 2 | 2 | 0% |
| umap_narrative/llm_factory_constructor/model_provider.py | 157 | 157 | 0% |
| umap_narrative/polismath_commentgraph/__init__.py | 1 | 0 | 100% |
| umap_narrative/polismath_commentgraph/cli.py | 270 | 270 | 0% |
| umap_narrative/polismath_commentgraph/core/__init__.py | 3 | 3 | 0% |
| umap_narrative/polismath_commentgraph/core/clustering.py | 108 | 108 | 0% |
| umap_narrative/polismath_commentgraph/core/embedding.py | 104 | 104 | 0% |
| umap_narrative/polismath_commentgraph/lambda_handler.py | 219 | 219 | 0% |
| umap_narrative/polismath_commentgraph/schemas/__init__.py | 2 | 0 | 100% |
| umap_narrative/polismath_commentgraph/schemas/dynamo_models.py | 160 | 9 | 94% |
| umap_narrative/polismath_commentgraph/tests/conftest.py | 17 | 17 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_clustering.py | 74 | 74 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_embedding.py | 55 | 55 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_storage.py | 87 | 87 | 0% |
| umap_narrative/polismath_commentgraph/utils/__init__.py | 3 | 0 | 100% |
| umap_narrative/polismath_commentgraph/utils/converter.py | 283 | 237 | 16% |
| umap_narrative/polismath_commentgraph/utils/group_data.py | 354 | 336 | 5% |
| umap_narrative/polismath_commentgraph/utils/storage.py | 584 | 477 | 18% |
| umap_narrative/reset_conversation.py | 159 | 50 | 69% |
| umap_narrative/run_pipeline.py | 453 | 312 | 31% |
| utils/general.py | 62 | 41 | 34% |
| Total | 10950 | 7643 | 30% |
