Error when I use larger batch size for spec-infer #6

@lhr-30

Description

Spec-infer works well for batch sizes 1, 2, 4, 8, and 16, but when I change the batch size to 32, it crashes with "*** stack smashing detected ***":

+ ngpus=1
+ fsize=30000
+ zsize=60000
+ max_sequence_length=256
+ max_tokens_per_batch=512
+ llm_model_name=huggyllama/llama-7b
+ ssm_model_name=JackFram/llama-68m
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 16 -ll:util 16 -ll:gpu 1 -ll:fsize 30000 -ll:zsize 60000 -llm-model huggyllama/llama-7b -ssm-model JackFram/llama-68m -prompt ./FlexFlow/inference/prompt/chatgpt_32.json --verbose --max-requests-per-batch 32 --max-sequence-length 256 --max-tokens-per-batch 512 -tensor-parallelism-degree 1 --fusion -output-file ./FlexFlow/inference/output/server_small-32_batchsize-tree_specinfer_tree_16core.txt
Applying fusion optimizations during compilation...
424 operators before fusion...
198 operators after fusion...
Applying fusion optimizations during compilation...
35 operators before fusion...
18 operators after fusion...
*** stack smashing detected ***: terminated
./server_gpu_experiments.sh: line 31: 1088568 Aborted                 (core dumped) ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus -ll:fsize $fsize -ll:zsize $zsize -llm-model $llm_model_name -ssm-model $ssm_model_name -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json --verbose --max-requests-per-batch $bs --max-sequence-length $max_sequence_length --max-tokens-per-batch $max_tokens_per_batch -tensor-parallelism-degree $ngpus --fusion -output-file ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.txt > ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.ou

When I set the number of CPU cores to 1, it gets stuck, probably here (./FlexFlow/src/runtime/request_manager.cc:283):

if (get_num_ssms() == 0) {
    xxx
  } else {
    std::cout << "Num of SSMs: " << get_num_ssms() << std::endl;
    for (int i = 0; i < get_num_ssms(); i++) {
      BeamTree beam_tree = BeamTree{};
      request.beam_trees.push_back(beam_tree);
    }
  }

  pending_request_queue.push(request);
  all_requests[request.guid] = request;
  {
    const std::lock_guard<std::mutex> lock(request_to_promise_mutex);
    request_to_promise[request.guid] = new std::promise<void>();
  }

  {
    std::string output = "New request tokens:";
    output = "[" + std::to_string(request.guid) + "]" + output;
    for (int i = 0; i < request.tokens.size(); i++) {
      output = output + " " + std::to_string(request.tokens[i]);
    }
    log_req_mgr.print("%s", output.c_str());
  }

below is the log:

[0 - 7efdb03fc000]    1.025782 {3}{RequestManager}: [1011486]New request tokens: 1 14350 263 26228 21256 1048 7535 17770 363 596 10462 29889
[0]14350
[1]263
[2]26228
[3]21256
[4]1048
[5]7535
[6]17770
[7]363
[8]596
[9]10462
[10]29889
Num of SSMs: 1

It gets stuck on the last prompt: "Write a short re-engagement email for a newsletter that's about tips for starting an online business. Use a friendly tone."
