Add t5_for_conditional_generation and gpt_for_sequence_classification benchmark #1581

Open · wants to merge 4 commits into base: develop
3 changes: 3 additions & 0 deletions .gitmodules
@@ -93,6 +93,9 @@
[submodule "frame_benchmark/pytorch/dynamic/PaddleOCR/models/DBNet_pytorch"]
path = frame_benchmark/pytorch/dynamic/PaddleOCR/models/DBNet_pytorch
url = https://github.com/PaddleBenchmark/DBNet_pytorch
[submodule "frame_benchmark/pytorch/dynamic/PaddleNLP/models/pytorch"]
path = frame_benchmark/pytorch/dynamic/PaddleNLP/models/pytorch
url = https://github.com/pytorch/pytorch.git
[submodule "frame_benchmark/pytorch/dynamic/PaddleClas/models/pytorch-image-models"]
path = frame_benchmark/pytorch/dynamic/PaddleClas/models/pytorch-image-models
url = https://github.com/rwightman/pytorch-image-models
3 changes: 3 additions & 0 deletions frame_benchmark/docker_images.yaml
@@ -47,3 +47,6 @@ pytorch:
segformer_b0: iregistry.baidu-int.com/paddle-benchmark/paddlecloud-base-image:paddlecloud-ubuntu18.04-gcc8.2-cuda11.2-cudnn8
xlnet: iregistry.baidu-int.com/paddle-benchmark/paddlecloud-base-image:paddlecloud-ubuntu18.04-gcc8.2-cuda11.2-cudnn8
gpt3: iregistry.baidu-int.com/paddlecloud/base-images:paddlecloud-ubuntu18.04-gcc8.2-cuda11.1-cudnn8
bert_for_question_answering: registry.baidubce.com/paddlepaddle/paddle:2.4.1-gpu-cuda11.7-cudnn8.4-trt8.4
gpt_for_sequence_classification: registry.baidubce.com/paddlepaddle/paddle:2.4.1-gpu-cuda11.7-cudnn8.4-trt8.4
t5_for_conditional_generation: registry.baidubce.com/paddlepaddle/paddle:2.4.1-gpu-cuda11.7-cudnn8.4-trt8.4
Contributor comment:

Please update to the image iregistry.baidu-int.com/paddlecloud/base-images:paddlecloud-ubuntu18.04-gcc8.2-cuda11.7-cudnn8.4.1-nccl2.12.12

3 changes: 3 additions & 0 deletions frame_benchmark/models_path.yaml
@@ -48,3 +48,6 @@ pytorch:
bert_base_seqlen128: benchmark/frame_benchmark/pytorch/dynamic/PaddleNLP/models/DeepLearningExamples
bert_large_seqlen512: benchmark/frame_benchmark/pytorch/dynamic/PaddleNLP/models/DeepLearningExamples
gpt3: benchmark/frame_benchmark/pytorch/dynamic/PaddleNLP/models/Megatron-LM
bert_for_question_answering: benchmark/frame_benchmark/pytorch/dynamic/PaddleNLP/models/pytorch
gpt_for_sequence_classification: benchmark/frame_benchmark/pytorch/dynamic/PaddleNLP/models/pytorch
t5_for_conditional_generation: benchmark/frame_benchmark/pytorch/dynamic/PaddleNLP/models/pytorch
1 change: 1 addition & 0 deletions frame_benchmark/pytorch/dynamic/PaddleNLP/models/pytorch
Submodule pytorch added at dfbdfb
@@ -0,0 +1,10 @@
model_item=gpt_for_sequence_classification
bs_item=8
fp_item=fp16
run_process_type=SingleP
run_mode=DP
device_num=N1C1

sed -i '/set\ -xe/d' run_benchmark.sh
bash PrepareEnv.sh;
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${model_item} ${bs_item} ${fp_item} ${run_process_type} ${run_mode} ${device_num} 2>&1;
@@ -0,0 +1,10 @@
model_item=gpt_for_sequence_classification
bs_item=8
fp_item=fp32
run_process_type=SingleP
run_mode=DP
device_num=N1C1

sed -i '/set\ -xe/d' run_benchmark.sh
bash PrepareEnv.sh;
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${model_item} ${bs_item} ${fp_item} ${run_process_type} ${run_mode} ${device_num} 2>&1;
@@ -0,0 +1,10 @@
model_item=gpt_for_sequence_classification
bs_item=8
fp_item=fp16
run_process_type=SingleP
run_mode=DP
device_num=N1C8

sed -i '/set\ -xe/d' run_benchmark.sh
bash PrepareEnv.sh;
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${model_item} ${bs_item} ${fp_item} ${run_process_type} ${run_mode} ${device_num} 2>&1;
@@ -0,0 +1,10 @@
model_item=gpt_for_sequence_classification
bs_item=8
fp_item=fp32
run_process_type=SingleP
run_mode=DP
device_num=N1C8

sed -i '/set\ -xe/d' run_benchmark.sh
bash PrepareEnv.sh;
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${model_item} ${bs_item} ${fp_item} ${run_process_type} ${run_mode} ${device_num} 2>&1;
@@ -0,0 +1,30 @@
echo "******prepare benchmark start************"

echo "https_proxy $HTTPS_PRO"
echo "http_proxy $HTTP_PRO"
export https_proxy=$HTTPS_PRO
export http_proxy=$HTTP_PRO
export no_proxy=localhost,bj.bcebos.com,su.bcebos.com

# wget ${FLAG_TORCH_WHL_URL}

# tar -xf torch_dev_whls.tar

# pip install torch_dev_whls/*

# pip install transformers pandas psutil scipy

git checkout .

# move the in-tree torch/ source directory out of the way so the installed torch is imported
rm -rf torch_tmp/
mv torch torch_tmp

sed -i '1842,1844d' benchmarks/dynamo/common.py
sed -i '929,933d' benchmarks/dynamo/common.py
sed -i '929i \ \ \ \ \ \ \ \ \ \ \ \ os.environ[\"MASTER_ADDR\"] = os.getenv(\"MASTER_ADDR\", \"localhost\")\n os.environ[\"MASTER_PORT\"] = os.getenv(\"MASTER_PORT\", \"12355\")\n os.environ[\"RANK\"] = os.getenv(\"RANK\", \"0\")\n os.environ[\"WORLD_SIZE\"] = os.getenv(\"WORLD_SIZE\", \"1\")\n torch.cuda.set_device(int(os.environ[\"RANK\"]))\n torch.distributed.init_process_group(\n \"nccl\"\n )' benchmarks/dynamo/common.py

rm -f ./speedup_eager*
rm -f ./speedup_inductor*

echo "******prepare benchmark end************"
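For readability, the block that the `sed -i '929i ...'` command above injects into `benchmarks/dynamo/common.py` is equivalent to the following Python. This is a sketch for reference only; the `torch` calls are commented out here so it can run standalone, without a GPU or a `torch` install:

```python
# Readable form of the distributed-setup block injected by sed into
# benchmarks/dynamo/common.py: default the rendezvous environment
# variables for single-node runs, then initialize NCCL.
import os

os.environ["MASTER_ADDR"] = os.getenv("MASTER_ADDR", "localhost")
os.environ["MASTER_PORT"] = os.getenv("MASTER_PORT", "12355")
os.environ["RANK"] = os.getenv("RANK", "0")
os.environ["WORLD_SIZE"] = os.getenv("WORLD_SIZE", "1")
# torch.cuda.set_device(int(os.environ["RANK"]))
# torch.distributed.init_process_group("nccl")
```

Each variable falls back to a single-process default, so the same patched script works both under `torchrun` (which sets RANK/WORLD_SIZE) and when launched directly.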
@@ -0,0 +1,71 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# encoding=utf-8 vi:ts=4:sw=4:expandtab:ft=python

import json
import os
import argparse


def analyze(args, run_info):
    log_file = args.filename
    res_log_file = args.res_log_file

    index_c = args.device_num.index('C')
    print("---index_c:-", index_c)
    gpu_num = int(args.device_num[index_c + 1:])

    # calculate and update ips
    all_speed_logs = []
    with open(log_file, 'r', encoding="utf8") as f:
        for line in f.readlines()[-gpu_num:]:
            ms_per_batch = float(line.strip().split(",")[4])
            tokens_per_second = 1000.0 / ms_per_batch * run_info["batch_size"] * args.sequence_length
            all_speed_logs.append(tokens_per_second)

    ips = sum(all_speed_logs) / len(all_speed_logs)
    run_info["ips"] = round(ips, 3)

    # write the result file
    run_info = json.dumps(run_info)
    print(run_info)
    with open(res_log_file, "w") as of:
        of.write(run_info)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--filename", type=str, help="The name of the log file to analyze.")
    parser.add_argument(
        "--sequence_length", type=int, help="The sequence length of each batch.")
    parser.add_argument(
        '--res_log_file', type=str, help='speed log file')
    parser.add_argument(
        '--model_name', type=str, default="", help='training model_name, e.g. transformer_base')
    parser.add_argument(
        '--device_num', type=str, default="N1C1", help='N1C1|N1C8|N4C32')
    parser.add_argument(
        '--run_process_type', type=str, default="SingleP", help='multi process or single process')

    args = parser.parse_args()
    base_batch_size, fp_item, run_mode = args.model_name.split("_")[-3:]
    base_batch_size = int(base_batch_size.replace("bs", ""))

    run_info = {
        "model_branch": os.getenv('model_branch'),
        "model_commit": os.getenv('model_commit'),
        "model_name": args.model_name,
        "batch_size": base_batch_size,
        "fp_item": fp_item,
        "run_process_type": args.run_process_type,
        "run_mode": run_mode,
        "convergence_value": 0,
        "convergence_key": "",
        "ips": 0,  # updated by analyze()
        "speed_unit": "tokens/s",
        "device_num": args.device_num,
        "model_run_time": os.getenv('model_run_time'),
        "frame_commit": "",
        "frame_version": os.getenv('frame_version'),
    }
    analyze(args, run_info)
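As a sanity check on the throughput formula in `analyze()` above (tokens/s = 1000 / ms_per_batch × batch_size × sequence_length, averaged over the last `gpu_num` log lines), here is a minimal worked example with made-up latencies:

```python
# Worked example of the ips calculation in analyze() above,
# using hypothetical per-batch latencies (ms) for two GPUs.
batch_size = 8
sequence_length = 1024
ms_per_batch_per_gpu = [400.0, 416.0]  # made-up CSV field values

all_speed_logs = [
    1000.0 / ms * batch_size * sequence_length
    for ms in ms_per_batch_per_gpu
]
# 400 ms -> 20480.0 tokens/s, 416 ms -> ~19692.308 tokens/s
ips = round(sum(all_speed_logs) / len(all_speed_logs), 3)
print(ips)
```

The reported `ips` is the mean across the per-GPU lines, not the sum, so it represents per-card throughput.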
@@ -0,0 +1,104 @@
#!/usr/bin/env bash
# Test training benchmark for a model.
# Usage: CUDA_VISIBLE_DEVICES=xxx bash run_benchmark.sh ${model_name} ${run_mode} ${fp_item} ${bs_item} ${max_iter} ${num_workers}
function _set_params(){
model_item=${1:-"gpt_for_sequence_classification"} # (required) model item, e.g. fastscnn|segformer_b0|ocrnet_hrnetw48
base_batch_size=${2:-"8"} # (required) batch_size per card
fp_item=${3:-"fp32"} # (required) fp32|fp16
run_process_type=${4:-"MultiP"} # (required) single process SingleP | multi process MultiP
run_mode=${5:-"DP"} # (required) MP model parallel | DP data parallel | PP pipeline parallel | hybrid parallel DP1-MP1-PP1|DP1-MP4-PP1
device_num=${6:-"N1C1"} # (required) number of cards used, N1C1|N1C8|N4C8 (4 nodes, 32 cards)
profiling=${PROFILING:-"false"} # (required) profiling switch, off by default, passed as a global variable
model_repo="pytorch" # (required) name of the model suite
speed_unit="tokens/s" # (required) unit of the speed metric
skip_steps=10 # (required) when parsing the log, skip the first few steps with unstable performance
keyword="|tokens/s" # (required) keyword that selects the line containing performance data when parsing the log
convergence_key="" # (optional) keyword that selects the line containing convergence data, e.g. convergence_key="loss:"
max_iter=${7:-"100"} # (optional) keep the model run under 5 minutes; if the code needs an early stop, submit a PR to the suite, or adjust max_epoch
num_workers=${8:-"3"} # (optional)

# Added for distributed training
node_num=${9:-"2"} # (optional) number of nodes
node_rank=${10:-"0"} # (optional) rank of this node
master_addr=${11:-"127.0.0.1"} # (optional) IP address of the master node
master_port=${12:-"1928"} # (optional) port of the master node
# End of distributed training parameters

# The following builds the common log paths; no changes needed unless there is a special case
model_name=${model_item}_bs${base_batch_size}_${fp_item}_${run_mode} # (required) do not change this format; it must match what the platform page displays
device=${CUDA_VISIBLE_DEVICES//,/ }
arr=(${device})
num_gpu_devices=${#arr[*]}
run_log_path=${TRAIN_LOG_DIR:-$(pwd)} # (required) TRAIN_LOG_DIR is set as a global variable by the benchmark framework
profiling_log_path=${PROFILING_LOG_DIR:-$(pwd)} # (required) PROFILING_LOG_DIR is set as a global variable by the benchmark framework
speed_log_path=${LOG_PATH_INDEX_DIR:-$(pwd)}
# mmsegmentation_fastscnn_bs2_fp32_MultiP_DP_N1C1_log
train_log_file=${run_log_path}/${model_repo}_${model_name}_${device_num}_log
profiling_log_file=${profiling_log_path}/${model_repo}_${model_name}_${device_num}_profiling
speed_log_file=${speed_log_path}/${model_repo}_${model_name}_${device_num}_speed
if [ ${profiling} = "true" ];then
add_options="profiler_options=\"batch_range=[50, 60]; profile_path=model.profile\""
log_file=${profiling_log_file}
else
add_options=""
log_file=${train_log_file}
fi
use_com_args="eager"
if [ "${FLAG_TORCH_COMPILE}" = "True" ];then
use_com_args="inductor"
fi
}
function _analysis_log(){
python analysis_log.py --filename "./speedup_${use_com_args}.csv" --sequence_length 1024 --model_name ${model_name} --run_process_type ${run_process_type} --device_num=${device_num} --res_log_file=${speed_log_file}
}
function _train(){
batch_size=${base_batch_size} # if the model runs multi-card with a single process, compute the batch size needed for multiple cards inside the _train function
echo "current ${model_name} CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES, gpus=${device_num}, batch_size=${batch_size}"
train_config=" --float32 "
if [ ${fp_item} = "fp16" ];then
train_config=" --amp "
fi
train_options=" --batch_size ${batch_size} \
--training \
--backend=${use_com_args} \
--performance \
--only=GPT2ForSequenceClassification \
--output-directory ./ "

train_script="benchmarks/dynamo/huggingface.py"
case ${run_process_type} in
SingleP) train_cmd="python -u ${train_script} ${train_config} ${train_options}" ;;
MultiP)
if [ ${device_num:3} = '32' ];then
train_cmd="torchrun --nproc_per_node=${num_workers} --nnodes=${node_num} --node_rank=${node_rank} --master_addr=${master_addr} --master_port=${master_port} ${train_script} ${train_config} ${train_options}"
else
train_cmd="torchrun --nproc_per_node=${num_workers} ${train_script} ${train_config} ${train_options}"
fi;;
*) echo "choose run_process_type (SingleP or MultiP)"; exit 1;
esac
# The following is the common launch logic; no changes needed unless there is a special case
echo ${train_cmd}
timeout 15m ${train_cmd} > ${log_file} 2>&1
if [ $? -ne 0 ];then
echo -e "${model_name}, FAIL"
else
echo -e "${model_name}, SUCCESS"
fi
if [ ${run_process_type} = "MultiP" -a -d mylog ]; then
rm ${log_file}
cp mylog/workerlog.0 ${log_file}
fi
echo ${train_cmd} >> ${log_file}
cat ${log_file}
#kill -9 `ps -ef|grep 'python'|awk '{print $2}'`
}
_set_params $@
export frame_version=`python -c "import torch;print(torch.__version__)"`
echo "---------frame_version is torch ${frame_version}"
echo "---------model_branch is ${model_branch}"
echo "---------model_commit is ${model_commit}"
job_bt=`date '+%Y%m%d%H%M%S'`
_train
job_et=`date '+%Y%m%d%H%M%S'`
export model_run_time=$((${job_et}-${job_bt}))
_analysis_log
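Two bash details in the script above are easy to misread: `${device_num:3}` is a substring expansion starting at character index 3 (so only four-node values like `N4C32` yield `32` and take the multi-node branch), and `model_name` is built by plain string concatenation. A small standalone sketch, using the same variable names as the script:

```shell
#!/usr/bin/env bash
# Standalone sketch of the string handling used in run_benchmark.sh above.

device_num="N4C32"
echo "${device_num:3}"   # substring from index 3 -> "32"

device_num="N1C8"
echo "${device_num:3}"   # -> "8", so the N4C32 branch is not taken

model_item="gpt_for_sequence_classification"
base_batch_size=8
fp_item="fp16"
run_mode="DP"
model_name=${model_item}_bs${base_batch_size}_${fp_item}_${run_mode}
echo "${model_name}"     # -> gpt_for_sequence_classification_bs8_fp16_DP
```

Note that `${device_num:3}` also returns `"8"` for `N1C8`, which is why the comparison against `'32'` distinguishes single-node from four-node runs.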
@@ -0,0 +1,10 @@
model_item=t5_for_conditional_generation
bs_item=4
fp_item=fp16
run_process_type=SingleP
run_mode=DP
device_num=N1C1

sed -i '/set\ -xe/d' run_benchmark.sh
bash PrepareEnv.sh;
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${model_item} ${bs_item} ${fp_item} ${run_process_type} ${run_mode} ${device_num} 2>&1;
@@ -0,0 +1,10 @@
model_item=t5_for_conditional_generation
bs_item=4
fp_item=fp32
run_process_type=SingleP
run_mode=DP
device_num=N1C1

sed -i '/set\ -xe/d' run_benchmark.sh
bash PrepareEnv.sh;
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${model_item} ${bs_item} ${fp_item} ${run_process_type} ${run_mode} ${device_num} 2>&1;
@@ -0,0 +1,10 @@
model_item=t5_for_conditional_generation
bs_item=4
fp_item=fp16
run_process_type=SingleP
Contributor comment: for N1C8, this field should be MultiP.

run_mode=DP
device_num=N1C8

sed -i '/set\ -xe/d' run_benchmark.sh
bash PrepareEnv.sh;
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${model_item} ${bs_item} ${fp_item} ${run_process_type} ${run_mode} ${device_num} 2>&1;
@@ -0,0 +1,10 @@
model_item=t5_for_conditional_generation
bs_item=4
fp_item=fp32
run_process_type=SingleP
Contributor comment: for N1C8, this field should be MultiP.

run_mode=DP
device_num=N1C8

sed -i '/set\ -xe/d' run_benchmark.sh
bash PrepareEnv.sh;
CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${model_item} ${bs_item} ${fp_item} ${run_process_type} ${run_mode} ${device_num} 2>&1;
@@ -0,0 +1,30 @@
echo "******prepare benchmark start************"

echo "https_proxy $HTTPS_PRO"
echo "http_proxy $HTTP_PRO"
export https_proxy=$HTTPS_PRO
export http_proxy=$HTTP_PRO
export no_proxy=localhost,bj.bcebos.com,su.bcebos.com

wget ${FLAG_TORCH_WHL_URL}

tar -xf torch_dev_whls.tar

pip install torch_dev_whls/*

pip install transformers pandas psutil scipy

git checkout .

# move the in-tree torch/ source directory out of the way so the installed torch is imported
rm -rf torch_tmp/
mv torch torch_tmp

sed -i '1842,1844d' benchmarks/dynamo/common.py
sed -i '929,933d' benchmarks/dynamo/common.py
sed -i '929i \ \ \ \ \ \ \ \ \ \ \ \ os.environ[\"MASTER_ADDR\"] = os.getenv(\"MASTER_ADDR\", \"localhost\")\n os.environ[\"MASTER_PORT\"] = os.getenv(\"MASTER_PORT\", \"12355\")\n os.environ[\"RANK\"] = os.getenv(\"RANK\", \"0\")\n os.environ[\"WORLD_SIZE\"] = os.getenv(\"WORLD_SIZE\", \"1\")\n torch.cuda.set_device(int(os.environ[\"RANK\"]))\n torch.distributed.init_process_group(\n \"nccl\"\n )' benchmarks/dynamo/common.py

rm -f ./speedup_eager*
rm -f ./speedup_inductor*

echo "******prepare benchmark end************"