Add in-memory model loading API by jolestar · Pull Request #35 · MinishLab/model2vec-rs

jolestar · 2026-03-25T07:00:35Z

This PR adds a small, self-contained loading improvement that makes Model2Vec easier to embed in runtimes that already fetch model assets as bytes.

What changes

adds StaticModel::from_bytes(...) for loading a model from in-memory tokenizer.json, model.safetensors, and config.json bytes
extracts resolve_model_files(...) so path / Hugging Face file resolution is reusable
rewrites from_pretrained(...) to reuse the same byte-loading path instead of duplicating model parsing logic
adds a regression test that verifies from_bytes(...) matches local from_pretrained(...)

Why

This keeps the existing API intact, but makes it possible to integrate model2vec-rs into environments where model assets are fetched or embedded outside a local filesystem-oriented flow.

I kept this PR intentionally narrow and did not include any feature-flag or wasm-specific dependency changes yet.

Validation

cargo test

Copilot

Pull request overview

This PR adds an in-memory loading entrypoint for StaticModel and refactors the existing from_pretrained path to reuse a shared byte-parsing implementation, improving embeddability in environments that already provide model assets as bytes.

Changes:

Added StaticModel::from_bytes(...) to construct a model from in-memory tokenizer.json, model.safetensors, and config.json bytes.
Extracted resolve_model_files(...) to centralize local-path vs Hugging Face Hub file resolution and updated from_pretrained(...) to reuse it.
Added a regression test ensuring from_bytes(...) matches from_pretrained(...) for a local fixture model.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
tests/test_model.rs	Adds a regression test comparing embeddings produced by byte-loaded vs path-loaded models.
src/model.rs	Introduces `from_bytes`, refactors `from_pretrained` to read bytes and delegate, and adds reusable model-file resolution helper/types.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-25T07:03:44Z

tests/test_model.rs

 mod common;
 use common::load_test_model;
 use model2vec_rs::model::StaticModel;
+use std::fs;


A module-level use std::fs; was added, but test_from_borrowed already imports std::fs inside the function. This duplication is easy to miss and can trigger lint noise; consider removing the inner import and using the module-level one consistently (or vice versa).

Copilot · 2026-03-25T07:03:44Z

src/model.rs

+pub struct ModelFiles {
+    pub tokenizer: std::path::PathBuf,
+    pub model: std::path::PathBuf,
+    pub config: std::path::PathBuf,
+}
+


ModelFiles is a new public struct but it has no rustdoc and exposes its fields as pub, which locks in the exact file naming/layout as part of the public API. Consider making this pub(crate) (if only intended for internal reuse) or adding rustdoc + keeping fields private (with accessors) to preserve API flexibility.

Suggested change

pub struct ModelFiles {

pub tokenizer: std::path::PathBuf,

pub model: std::path::PathBuf,

pub config: std::path::PathBuf,

}

/// Resolved file paths for a Model2Vec model.

///

/// This struct groups together the paths to the tokenizer, model weights, and

/// configuration files, regardless of whether they were loaded from a local

/// directory or downloaded from the Hugging Face Hub.

pub struct ModelFiles {

tokenizer: std::path::PathBuf,

model: std::path::PathBuf,

config: std::path::PathBuf,

}

impl ModelFiles {

/// Returns the path to the tokenizer JSON file.

pub fn tokenizer(&self) -> &std::path::Path {

&self.tokenizer

}

/// Returns the path to the model weights file (in safetensors format).

pub fn model(&self) -> &std::path::Path {

&self.model

}

/// Returns the path to the model configuration JSON file.

pub fn config(&self) -> &std::path::Path {

&self.config

}

}

Copilot · 2026-03-25T07:03:44Z

src/model.rs

+pub fn resolve_model_files<P: AsRef<Path>>(
+    repo_or_path: P,
+    token: Option<&str>,
+    subfolder: Option<&str>,
+) -> Result<ModelFiles> {
+    if let Some(tok) = token {
+        env::set_var("HF_HUB_TOKEN", tok);
+    }


resolve_model_files is now pub and has a global side effect (env::set_var("HF_HUB_TOKEN", ...)). As a reusable API this can create surprising cross-call behavior (e.g., concurrent callers, different tokens, test isolation). Prefer constructing the hf-hub client with the token (e.g., an ApiBuilder/token option) or keep the env-var mutation confined to from_pretrained and make this helper pub(crate).

jolestar · 2026-03-25T09:11:55Z

Followed up on the initial review pass: the helper pieces introduced in this PR are now kept internal, and I also cleaned up the duplicated test import. The PR remains intentionally narrow: it adds in-memory model loading plus the loading-path refactor, and I re-ran cargo test after the cleanup.

Pringled

Hey @jolestar, thanks for adding this, LGTM. CI is failing, can you run clippy to fix the formatting? After that it's good to go.

Copilot AI review requested due to automatic review settings March 25, 2026 07:00

Copilot started reviewing on behalf of jolestar March 25, 2026 07:01 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

Add in-memory model loading API

25272fb

jolestar force-pushed the add-from-bytes-loading branch from b6f7781 to 25272fb Compare March 25, 2026 08:47

Pringled requested changes Mar 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add in-memory model loading API#35

Add in-memory model loading API#35
jolestar wants to merge 1 commit intoMinishLab:mainfrom
jolestar:add-from-bytes-loading

jolestar commented Mar 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

jolestar commented Mar 25, 2026

Uh oh!

Pringled left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-pub struct ModelFiles {
-    pub tokenizer: std::path::PathBuf,
-    pub model: std::path::PathBuf,
-    pub config: std::path::PathBuf,
-}
+/// Resolved file paths for a Model2Vec model.
+///
+/// This struct groups together the paths to the tokenizer, model weights, and
+/// configuration files, regardless of whether they were loaded from a local
+/// directory or downloaded from the Hugging Face Hub.
+pub struct ModelFiles {
+    tokenizer: std::path::PathBuf,
+    model: std::path::PathBuf,
+    config: std::path::PathBuf,
+}
+impl ModelFiles {
+    /// Returns the path to the tokenizer JSON file.
+    pub fn tokenizer(&self) -> &std::path::Path {
+        &self.tokenizer
+    }
+    /// Returns the path to the model weights file (in safetensors format).
+    pub fn model(&self) -> &std::path::Path {
+        &self.model
+    }
+    /// Returns the path to the model configuration JSON file.
+    pub fn config(&self) -> &std::path::Path {
+        &self.config
+    }
+}

Conversation

jolestar commented Mar 25, 2026

What changes

Why

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

jolestar commented Mar 25, 2026

Uh oh!

Pringled left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants