Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds an in-memory loading entrypoint for StaticModel and refactors the existing from_pretrained path to reuse a shared byte-parsing implementation, improving embeddability in environments that already provide model assets as bytes.
Changes:
- Added
StaticModel::from_bytes(...)to construct a model from in-memorytokenizer.json,model.safetensors, andconfig.jsonbytes. - Extracted
resolve_model_files(...)to centralize local-path vs Hugging Face Hub file resolution and updatedfrom_pretrained(...)to reuse it. - Added a regression test ensuring
from_bytes(...)matchesfrom_pretrained(...)for a local fixture model.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_model.rs | Adds a regression test comparing embeddings produced by byte-loaded vs path-loaded models. |
| src/model.rs | Introduces from_bytes, refactors from_pretrained to read bytes and delegate, and adds reusable model-file resolution helper/types. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| mod common; | ||
| use common::load_test_model; | ||
| use model2vec_rs::model::StaticModel; | ||
| use std::fs; |
There was a problem hiding this comment.
A module-level use std::fs; was added, but test_from_borrowed already imports std::fs inside the function. This duplication is easy to miss and can trigger lint noise; consider removing the inner import and using the module-level one consistently (or vice versa).
src/model.rs
Outdated
| pub struct ModelFiles { | ||
| pub tokenizer: std::path::PathBuf, | ||
| pub model: std::path::PathBuf, | ||
| pub config: std::path::PathBuf, | ||
| } | ||
|
|
There was a problem hiding this comment.
ModelFiles is a new public struct but it has no rustdoc and exposes its fields as pub, which locks in the exact file naming/layout as part of the public API. Consider making this pub(crate) (if only intended for internal reuse) or adding rustdoc + keeping fields private (with accessors) to preserve API flexibility.
| pub struct ModelFiles { | |
| pub tokenizer: std::path::PathBuf, | |
| pub model: std::path::PathBuf, | |
| pub config: std::path::PathBuf, | |
| } | |
| /// Resolved file paths for a Model2Vec model. | |
| /// | |
| /// This struct groups together the paths to the tokenizer, model weights, and | |
| /// configuration files, regardless of whether they were loaded from a local | |
| /// directory or downloaded from the Hugging Face Hub. | |
| pub struct ModelFiles { | |
| tokenizer: std::path::PathBuf, | |
| model: std::path::PathBuf, | |
| config: std::path::PathBuf, | |
| } | |
| impl ModelFiles { | |
| /// Returns the path to the tokenizer JSON file. | |
| pub fn tokenizer(&self) -> &std::path::Path { | |
| &self.tokenizer | |
| } | |
| /// Returns the path to the model weights file (in safetensors format). | |
| pub fn model(&self) -> &std::path::Path { | |
| &self.model | |
| } | |
| /// Returns the path to the model configuration JSON file. | |
| pub fn config(&self) -> &std::path::Path { | |
| &self.config | |
| } | |
| } |
src/model.rs
Outdated
| pub fn resolve_model_files<P: AsRef<Path>>( | ||
| repo_or_path: P, | ||
| token: Option<&str>, | ||
| subfolder: Option<&str>, | ||
| ) -> Result<ModelFiles> { | ||
| if let Some(tok) = token { | ||
| env::set_var("HF_HUB_TOKEN", tok); | ||
| } |
There was a problem hiding this comment.
resolve_model_files is now pub and has a global side effect (env::set_var("HF_HUB_TOKEN", ...)). As a reusable API this can create surprising cross-call behavior (e.g., concurrent callers, different tokens, test isolation). Prefer constructing the hf-hub client with the token (e.g., an ApiBuilder/token option) or keep the env-var mutation confined to from_pretrained and make this helper pub(crate).
b6f7781 to
25272fb
Compare
|
Followed up on the initial review pass: the helper pieces introduced in this PR are now kept internal, and I also cleaned up the duplicated test import. The PR remains intentionally narrow: it adds in-memory model loading plus the loading-path refactor, and I re-ran cargo test after the cleanup. |
This PR adds a small, self-contained loading improvement that makes Model2Vec easier to embed in runtimes that already fetch model assets as bytes.
What changes
StaticModel::from_bytes(...)for loading a model from in-memorytokenizer.json,model.safetensors, andconfig.jsonbytesresolve_model_files(...)so path / Hugging Face file resolution is reusablefrom_pretrained(...)to reuse the same byte-loading path instead of duplicating model parsing logicfrom_bytes(...)matches localfrom_pretrained(...)Why
This keeps the existing API intact, but makes it possible to integrate
model2vec-rsinto environments where model assets are fetched or embedded outside a local filesystem-oriented flow.I kept this PR intentionally narrow and did not include any feature-flag or wasm-specific dependency changes yet.
Validation
cargo test