[Feature] Enable safe per-request LoRA loading (implementation included) #1143

@MateusGPe

Feature Summary

Support for safe, per-request LoRA loading when running in HTTP Server mode

Detailed Description

I would like to propose a patch that enables per-request LoRA loading in HTTP server mode. Currently, the server ignores LoRA tags in prompts (e.g., <lora:name:0.5>) because gen_params.process_and_check is called with an empty string for the LoRA directory.
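For readers unfamiliar with prompt-embedded LoRA tags, here is a minimal sketch of how such tags could be pulled out of a prompt. It assumes the tag form <lora:name:multiplier>; LoraTag and extract_lora_tags are hypothetical names used for illustration only and are not the project's actual parser.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration type: one LoRA reference found in a prompt.
struct LoraTag {
    std::string name;  // file stem to look up in the LoRA directory
    float weight;      // multiplier given in the tag
};

// Hypothetical helper, not the project's parser: collects every
// <lora:name:weight> tag and returns the prompt with the tags removed.
std::pair<std::string, std::vector<LoraTag>> extract_lora_tags(const std::string& prompt) {
    static const std::regex tag_re(R"(<lora:([^:>]+):([0-9]*\.?[0-9]+)>)");
    std::vector<LoraTag> tags;
    for (auto it = std::sregex_iterator(prompt.begin(), prompt.end(), tag_re);
         it != std::sregex_iterator(); ++it) {
        tags.push_back({(*it)[1].str(), std::stof((*it)[2].str())});
    }
    // Strip the tags so only the descriptive text remains in the prompt.
    std::string cleaned = std::regex_replace(prompt, tag_re, "");
    return {cleaned, tags};
}
```

Without a LoRA directory to resolve each name against, a parser like this has nowhere to look for the referenced files, which matches the behavior described above.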

I am unsure of the original motivation for explicitly disabling this path, but simply enabling it causes "weight stacking" (state pollution): LoRA weights applied for one request remain in the model and accumulate across later requests in the persistent server process.

The attached diff implements the following:

  1. Enables Parsing: Passes ctx_params.lora_model_dir to gen_params.process_and_check instead of an empty string, allowing the backend to find and load the LoRA files.
  2. Enables Safety: Forces ctx_params.lora_apply_mode = LORA_APPLY_AT_RUNTIME during server initialization. This ensures LoRA contributions are computed dynamically during graph execution without permanently altering the base model weights.
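The difference between the two application strategies can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the library's implementation: Weights, apply_merged, and apply_at_runtime are hypothetical names, and a real runtime path would evaluate the low-rank LoRA branch inside the compute graph rather than copy a tensor.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a single weight tensor; not the project's data structures.
using Weights = std::vector<float>;

// Merge-style application: the LoRA delta is written into the base weights,
// so a persistent process carries it into every later request.
void apply_merged(Weights& base, const Weights& delta, float scale) {
    for (std::size_t i = 0; i < base.size(); ++i) {
        base[i] += scale * delta[i];
    }
}

// Runtime-style application: the effective weights are computed for one
// request only, and the base weights are left untouched.
Weights apply_at_runtime(const Weights& base, const Weights& delta, float scale) {
    Weights effective(base);
    for (std::size_t i = 0; i < effective.size(); ++i) {
        effective[i] += scale * delta[i];
    }
    return effective;
}
```

Calling apply_merged twice with the same delta doubles its contribution to the base weights, which is the stacking effect described in this issue. Calling apply_at_runtime twice yields identical results and leaves the base unchanged.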

Proposed Change

Diff

diff --git a/examples/server/main.cpp b/examples/server/main.cpp
index 5c951c0..4fb57b0 100644
--- a/examples/server/main.cpp
+++ b/examples/server/main.cpp
@@ -282,6 +282,7 @@ int main(int argc, const char** argv) {
     LOG_DEBUG("%s", default_gen_params.to_string().c_str());
 
+    ctx_params.lora_apply_mode    = LORA_APPLY_AT_RUNTIME;
     sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(false, false, false);
     sd_ctx_t* sd_ctx              = new_sd_ctx(&sd_ctx_params);
 
     if (sd_ctx == nullptr) {
@@ -392,7 +393,7 @@ int main(int argc, const char** argv) {
                 return;
             }
 
-            if (!gen_params.process_and_check(IMG_GEN, "")) {
+            if (!gen_params.process_and_check(IMG_GEN, ctx_params.lora_model_dir)) {
                 res.status = 400;
                 res.set_content(R"({"error":"invalid params"})", "application/json");
                 return;
@@ -570,7 +571,7 @@ int main(int argc, const char** argv) {
                 return;
             }
 
-            if (!gen_params.process_and_check(IMG_GEN, "")) {
+            if (!gen_params.process_and_check(IMG_GEN, ctx_params.lora_model_dir)) {
                 res.status = 400;
                 res.set_content(R"({"error":"invalid params"})", "application/json");
                 return;

Alternatives you considered

  • Naive Implementation: Passing the directory without changing the apply mode. This corrupted the model output after a few requests due to weight accumulation.
  • External Management: Running separate server instances for different LoRA configurations, which is resource-heavy and inflexible.

Additional context

  • File: examples/server/main.cpp
  • Logic: common.hpp (around line 440) describes at_runtime as the method that avoids precision issues and permanent weight modification, making it suitable for a long-running server process.

Metadata

Labels

enhancement: New feature or request
