Generalize make_txt_embedding{,_json}.py #56

Open
NetZissou wants to merge 3 commits into main from update-txt-emb

Conversation

@NetZissou
Contributor

make_txt_embedding.py:

  • Add PRESETS dict (bioclip-2, bioclip-2.5-vith14) and --preset CLI flag
  • Add --model / --tokenizer / --embed-dim CLI flags for arbitrary models (e.g. BioCAP, future BioCLIP releases)
  • Replace hardcoded model_str/tokenizer_str/768 with parameterized values
  • Add Usage / Examples block to the module docstring
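For readers who have not opened the diff, here is a minimal sketch of how a preset table and explicit override flags could fit together. The flag names and preset keys come from the description above; the model and tokenizer strings are placeholders, the 1024 embedding dimension for `bioclip-2.5-vith14` is an assumption, and `build_parser` / `resolve_config` are hypothetical helper names, so the actual script may be organized differently.

```python
import argparse

# Preset keys are taken from the PR description. The model/tokenizer strings
# are placeholders and the 1024 dimension is an assumption, not the real values.
PRESETS = {
    "bioclip-2": {
        "model": "placeholder/bioclip-2",
        "tokenizer": "placeholder/bioclip-2",
        "embed_dim": 768,
    },
    "bioclip-2.5-vith14": {
        "model": "placeholder/bioclip-2.5-vith14",
        "tokenizer": "placeholder/bioclip-2.5-vith14",
        "embed_dim": 1024,
    },
}


def build_parser():
    """Hypothetical parser exposing the four flags named in the description."""
    parser = argparse.ArgumentParser(description="Text embedding config sketch")
    # Defaulting to bioclip-2 is an assumption made for this sketch.
    parser.add_argument("--preset", choices=sorted(PRESETS), default="bioclip-2")
    parser.add_argument("--model", default=None)
    parser.add_argument("--tokenizer", default=None)
    parser.add_argument("--embed-dim", type=int, default=None)
    return parser


def resolve_config(argv=None):
    """Start from the chosen preset, then apply any explicit overrides."""
    args = build_parser().parse_args(argv)
    config = dict(PRESETS[args.preset])  # copy so PRESETS is never mutated
    if args.model is not None:
        config["model"] = args.model
    if args.tokenizer is not None:
        config["tokenizer"] = args.tokenizer
    if args.embed_dim is not None:
        config["embed_dim"] = args.embed_dim
    return config
```

In this sketch, `resolve_config(["--preset", "bioclip-2.5-vith14"])` returns that preset unchanged, while passing `--model`, `--tokenizer`, and `--embed-dim` together replaces every preset value, which is one way to support models that have no preset.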

make_txt_embedding_json.py:

  • Add drop_corrupted_rows() that removes rows in which any taxonomic rank matches an ISO-8601 timestamp or the literal string 'true' / 'false'
  • Filter is on by default; --no-corruption-filter reproduces the pre-existing upstream behavior used for BioCLIP 2 taxon JSON generation
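The filter described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: it works on a list of dicts rather than whatever table type the script uses, the seven rank names in `RANKS` are assumed, the timestamp pattern is a simplified ISO-8601 check, and matching 'true' / 'false' case-insensitively is also an assumption.

```python
import re

# Assumed rank columns; the real script may use a different set or names.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Simplified ISO-8601 pattern: a date, optionally followed by a time and
# a UTC offset. It does not cover every form the standard allows.
ISO_8601 = re.compile(
    r"\d{4}-\d{2}-\d{2}"
    r"([T ]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})?)?"
)


def is_corrupted(value):
    """Return True if a rank value looks like a timestamp or a boolean."""
    if not isinstance(value, str):
        return False
    text = value.strip()
    # Case-insensitive boolean matching is an assumption of this sketch.
    return text.lower() in {"true", "false"} or ISO_8601.fullmatch(text) is not None


def drop_corrupted_rows(rows, ranks=RANKS):
    """Keep only rows in which no taxonomic rank value looks corrupted."""
    kept = [
        row for row in rows
        if not any(is_corrupted(row.get(rank)) for rank in ranks)
    ]
    # Report the removal so the filtering is never silent.
    print(f"Dropped {len(rows) - len(kept)} corrupted row(s) of {len(rows)}")
    return kept
```

A caller honoring `--no-corruption-filter` would simply skip the call to `drop_corrupted_rows` and pass the rows through unchanged.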

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@NetZissou NetZissou requested a review from egrace479 May 29, 2026 14:10
@NetZissou NetZissou self-assigned this May 29, 2026
Member

@egrace479 egrace479 left a comment

A couple of consistency notes.

I think we may want the default behavior to be no-corruption-filter since that wouldn't change existing outputs; it's probably better not to 'quietly' do extra filtering. We should watch for these items in future catalog iterations.

The newest catalog is the only one with the ISO and boolean issues, so this change will not impact any existing text embeddings. As such, the default to remove such entries is fine, especially since it prints a statement about their removal.

Comment thread processing/scripts/make_txt_embedding.py Outdated
Comment thread processing/scripts/make_txt_embedding.py
Co-Authored-By: Elizabeth Campolongo <egrace479@users.noreply.github.com>
@NetZissou NetZissou requested a review from egrace479 May 29, 2026 17:40
@NetZissou
Contributor Author

@egrace479 reverted to ab54735

Comment thread processing/scripts/make_txt_embedding.py Outdated
Co-authored-by: Elizabeth Campolongo <38985481+egrace479@users.noreply.github.com>
Member

@egrace479 egrace479 left a comment

LGTM!
