Error when I use larger batch size for spec-infer #1491

Open
lhr-30 opened this issue Sep 6, 2024 · 0 comments
lhr-30 commented Sep 6, 2024

Spec-infer works well for batch sizes 1, 2, 4, 8, and 16, but when I change the batch size to 32 it aborts with "stack smashing detected":

+ ngpus=1
+ fsize=30000
+ zsize=60000
+ max_sequence_length=256
+ max_tokens_per_batch=512
+ llm_model_name=huggyllama/llama-7b
+ ssm_model_name=JackFram/llama-68m
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 16 -ll:util 16 -ll:gpu 1 -ll:fsize 30000 -ll:zsize 60000 -llm-model huggyllama/llama-7b -ssm-model JackFram/llama-68m -prompt ./FlexFlow/inference/prompt/chatgpt_32.json --verbose --max-requests-per-batch 32 --max-sequence-length 256 --max-tokens-per-batch 512 -tensor-parallelism-degree 1 --fusion -output-file ./FlexFlow/inference/output/server_small-32_batchsize-tree_specinfer_tree_16core.txt
Applying fusion optimizations during compilation...
424 operators before fusion...
198 operators after fusion...
Applying fusion optimizations during compilation...
35 operators before fusion...
18 operators after fusion...
*** stack smashing detected ***: terminated
./server_gpu_experiments.sh: line 31: 1088568 Aborted                 (core dumped) ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus -ll:fsize $fsize -ll:zsize $zsize -llm-model $llm_model_name -ssm-model $ssm_model_name -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json --verbose --max-requests-per-batch $bs --max-sequence-length $max_sequence_length --max-tokens-per-batch $max_tokens_per_batch -tensor-parallelism-degree $ngpus --fusion -output-file ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.txt > ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.ou
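
My guess is that with 32 requests some fixed-size, stack-allocated buffer with a smaller compile-time capacity gets overrun. For reference, here is a toy example (not FlexFlow code; kMaxRequests and fill_requests are made-up names) of the pattern that produces the same "*** stack smashing detected ***" abort: writing past the end of a fixed-size stack array corrupts the canary inserted by -fstack-protector, and glibc terminates the process when the function returns.

#include <cstdio>

constexpr int kMaxRequests = 16;   // stand-in for a compile-time per-batch limit

void fill_requests(int batch_size) {
  int request_ids[kMaxRequests];   // fixed-size buffer on the stack
  for (int i = 0; i < batch_size; i++) {
    request_ids[i] = i;            // out-of-bounds writes once batch_size > kMaxRequests
  }
  std::printf("last id: %d\n", request_ids[batch_size - 1]);
}

int main() {
  // With a capacity of 16, passing 32 typically aborts with
  // "*** stack smashing detected ***" when built with -fstack-protector.
  fill_requests(32);
  return 0;
}

Building with -fstack-protector-all or AddressSanitizer should make the overflowing function show up directly.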

When I set the number of CPU cores to 1, it gets stuck, probably around here (./FlexFlow/src/runtime/request_manager.cc:283):

if (get_num_ssms() == 0) {
    xxx
  } else {
    std::cout << "Num of SSMs: " << get_num_ssms() << std::endl;
    for (int i = 0; i < get_num_ssms(); i++) {
      BeamTree beam_tree = BeamTree{};
      request.beam_trees.push_back(beam_tree);
    }
  }

  pending_request_queue.push(request);
  all_requests[request.guid] = request;
  {
    const std::lock_guard<std::mutex> lock(request_to_promise_mutex);
    request_to_promise[request.guid] = new std::promise<void>();
  }

  {
    std::string output = "New request tokens:";
    output = "[" + std::to_string(request.guid) + "]" + output;
    for (int i = 0; i < request.tokens.size(); i++) {
      output = output + " " + std::to_string(request.tokens[i]);
    }
    log_req_mgr.print("%s", output.c_str());
  }
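
If this really is a compile-time per-batch limit being exceeded, a defensive check before the request is queued would turn the silent stack corruption into an immediate, readable error. This is only a sketch: kCompileTimeMaxRequests and check_batch_capacity are placeholder names I made up, not the actual constant or function in request_manager.cc.

#include <cstdlib>
#include <iostream>

// Placeholder for whatever fixed per-batch capacity the binary was built with;
// I have not checked the real constant's name or value in FlexFlow.
constexpr int kCompileTimeMaxRequests = 16;

// Hypothetical helper: fail fast with a clear message instead of overrunning a
// fixed-size buffer when --max-requests-per-batch is larger than the build allows.
inline void check_batch_capacity(int requested) {
  if (requested > kCompileTimeMaxRequests) {
    std::cerr << "--max-requests-per-batch=" << requested
              << " exceeds the compile-time limit of "
              << kCompileTimeMaxRequests << std::endl;
    std::abort();
  }
}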

Below is the log:

[0 - 7efdb03fc000]    1.025782 {3}{RequestManager}: [1011486]New request tokens: 1 14350 263 26228 21256 1048 7535 17770 363 596 10462 29889
[0]14350
[1]263
[2]26228
[3]21256
[4]1048
[5]7535
[6]17770
[7]363
[8]596
[9]10462
[10]29889
Num of SSMs: 1

It gets stuck at the last prompt: "Write a short re-engagement email for a newsletter that's about tips for starting an online business. Use a friendly tone."
