spec_infer works well for batch sizes 1, 2, 4, 8, and 16, but when I change the batch size to 32, it aborts with "stack smashing detected":
+ ngpus=1
+ fsize=30000
+ zsize=60000
+ max_sequence_length=256
+ max_tokens_per_batch=512
+ llm_model_name=huggyllama/llama-7b
+ ssm_model_name=JackFram/llama-68m
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 16 -ll:util 16 -ll:gpu 1 -ll:fsize 30000 -ll:zsize 60000 -llm-model huggyllama/llama-7b -ssm-model JackFram/llama-68m -prompt ./FlexFlow/inference/prompt/chatgpt_32.json --verbose --max-requests-per-batch 32 --max-sequence-length 256 --max-tokens-per-batch 512 -tensor-parallelism-degree 1 --fusion -output-file ./FlexFlow/inference/output/server_small-32_batchsize-tree_specinfer_tree_16core.txt
Applying fusion optimizations during compilation...
424 operators before fusion...
198 operators after fusion...
Applying fusion optimizations during compilation...
35 operators before fusion...
18 operators after fusion...
*** stack smashing detected ***: terminated
./server_gpu_experiments.sh: line 31: 1088568 Aborted (core dumped) ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus -ll:fsize $fsize -ll:zsize $zsize -llm-model $llm_model_name -ssm-model $ssm_model_name -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json --verbose --max-requests-per-batch $bs --max-sequence-length $max_sequence_length --max-tokens-per-batch $max_tokens_per_batch -tensor-parallelism-degree $ngpus --fusion -output-file ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.txt > ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.ou
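For context, glibc prints "stack smashing detected" when the compiler's stack protector finds that a function wrote past the end of a fixed-size buffer in its stack frame. My guess (an assumption, not something I have verified in the FlexFlow sources) is that some per-batch structure in spec_infer is sized by a compile-time cap around 16 requests, so --max-requests-per-batch 32 overruns it. Below is a minimal, standalone sketch of that failure mode; it is not FlexFlow code, and kMaxRequestsPerBatch is a hypothetical constant.

// illustration.cpp -- not FlexFlow code; a minimal repro of the failure mode.
// Build with: g++ -O0 -fstack-protector-all illustration.cpp -o illustration
#include <cstdio>

// Hypothetical compile-time cap, analogous to a fixed max-requests-per-batch.
constexpr int kMaxRequestsPerBatch = 16;

void fill_batch(int num_requests) {
  // Fixed-size array on the stack, sized by the compile-time cap.
  int tokens_per_request[kMaxRequestsPerBatch];
  // If num_requests exceeds the cap, this loop writes past the array and
  // clobbers the stack canary; glibc then aborts with
  // "*** stack smashing detected ***: terminated" when the function returns.
  for (int i = 0; i < num_requests; i++) {
    tokens_per_request[i] = i;
  }
  std::printf("filled %d requests\n", num_requests);
}

int main() {
  fill_batch(16); // fine
  fill_batch(32); // overflows -> stack smashing detected on return
  return 0;
}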
When I set the number of CPU cores to 1, it gets stuck instead.
It is probably stuck around here, at ./FlexFlow/src/runtime/request_manager.cc:283:
if (get_num_ssms() == 0) {
  xxx
} else {
  std::cout << "Num of SSMs: " << get_num_ssms() << std::endl;
  for (int i = 0; i < get_num_ssms(); i++) {
    BeamTree beam_tree = BeamTree{};
    request.beam_trees.push_back(beam_tree);
  }
}
pending_request_queue.push(request);
all_requests[request.guid] = request;
{
  const std::lock_guard<std::mutex> lock(request_to_promise_mutex);
  request_to_promise[request.guid] = new std::promise<void>();
}
{
  std::string output = "New request tokens:";
  output = "[" + std::to_string(request.guid) + "]" + output;
  for (int i = 0; i < request.tokens.size(); i++) {
    output = output + " " + std::to_string(request.tokens[i]);
  }
  log_req_mgr.print("%s", output.c_str());
}
Below is the log:
[0 - 7efdb03fc000] 1.025782 {3}{RequestManager}: [1011486]New request tokens: 1 14350 263 26228 21256 1048 7535 17770 363 596 10462 29889
[0]14350
[1]263
[2]26228
[3]21256
[4]1048
[5]7535
[6]17770
[7]363
[8]596
[9]10462
[10]29889
Num of SSMs: 1
It gets stuck at the last prompt: "Write a short re-engagement email for a newsletter that's about tips for starting an online business. Use a friendly tone."
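One hypothesis for the single-core hang (an assumption on my part, not confirmed against the Legion/FlexFlow runtime): the snippet above registers a std::promise per request, and some consumer presumably blocks on the matching future until the request completes. If that blocking wait lands on the only available CPU/utility thread (-ll:cpu 1 / -ll:util 1), the task that would fulfill the promise can never be scheduled, and the process hangs. The following is a generic sketch of that pattern with plain C++ threads, not FlexFlow code; it deliberately hangs forever when run.

// deadlock_sketch.cpp -- generic illustration, not FlexFlow's actual runtime.
// Shows how a blocking future wait on the only worker thread deadlocks:
// the task that would fulfill the promise is queued behind the waiter.
#include <functional>
#include <future>
#include <iostream>
#include <queue>
#include <thread>

int main() {
  std::promise<void> request_done;
  std::queue<std::function<void()>> tasks;

  // Task 1: blocks until the request is marked complete.
  tasks.push([&] {
    std::cout << "waiting for request to finish...\n";
    request_done.get_future().wait(); // never returns with one worker
  });
  // Task 2: would mark the request complete, but never gets to run.
  tasks.push([&] { request_done.set_value(); });

  // A single worker thread drains the queue sequentially (like -ll:util 1).
  std::thread worker([&] {
    while (!tasks.empty()) {
      auto task = std::move(tasks.front());
      tasks.pop();
      task(); // task 1 blocks here forever; task 2 is never reached
    }
  });
  worker.join(); // the process hangs
  return 0;
}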