
llama2 70B: loss decreases with a different trend under different PP settings #81

Open
ZLkanyo009 opened this issue Mar 22, 2024 · 1 comment

Comments

@ZLkanyo009
Contributor

7_gpu_pp1.log
31_gpu_pp4.log
(loss-curve plot, attached as a screenshot)

When running llama2 70B (with the number of layers reduced), PP=1 and PP=4 show different loss-descent trends. See the uploaded logs and loss-curve plot above; the script is as follows:

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=192.167.5.2
MASTER_PORT=29501
NUM_NODES=4
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH='/data/zhangling21/ckpts/'
TENSORBOARD_LOGS_PATH='/data/zhangling21/tensorboard_logs/'
TOKENIZER_PATH='/data/zhangling21/llama_00_text_document/tokenizer/tokenizer.model'
DATA_PATH='/data/zhangling21/llama_00_text_document/llama_00_text_document'

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)
# --tokenizer-type LLaMASentencePieceTokenizer \
# --rmsnorm-epsilon 1e-5

LLAMA_MODEL_ARGS=(
    --num-layers 8
    --hidden-size 8192
    --ffn-hidden-size 28672
    --num-attention-heads 64
    --seq-length 4096
    --max-position-embeddings 4096
    --group-query-attention
    --num-query-groups 8
    --tokenizer-type Llama2Tokenizer
    --tokenizer-model $TOKENIZER_PATH
    --swiglu
    --normalization RMSNorm
    --use-rotary-position-embeddings
    --no-position-embedding
    --disable-bias-linear
)
# --optimizer adam
# --adam-eps 1e-05
# --no-contiguous-buffers-in-local-ddp
# --recompute-method uniform
# --no-async-tensor-model-parallel-allreduce
# --embedding-dropout 0
# --multi-query-attention
# --multi-query-group-num 8
# --ffn-dim-multiplier 1.3
# --recompute-granularity full
# --distribute-saved-activations
# --recompute-num-layers 1
# --memory-saving

# --fp16

    # --optimizer adam
    # --adam-eps 1e-05
TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 44
    --train-samples 24414
    --weight-decay 1e-2
    --optimizer adam
    --clip-grad 1.0
    --lr 0.00015
    --lr-decay-style cosine
    --min-lr 1.0e-5
    --lr-warmup-fraction .01
    --adam-beta1 0.9
    --adam-beta2 0.95
    --attention-dropout 0.0
    --hidden-dropout 0.0
    --untie-embeddings-and-output-weights
    --multiple-of 4096
    --no-gradient-accumulation-fusion
    --recompute-granularity 'full'
    --recompute-num-layers 1
    --recompute-method 'uniform'
    --no-async-tensor-model-parallel-allreduce
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 4
)
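# Note: for the PP=1 run (7_gpu_pp1.log), presumably the same script is used with
# --pipeline-model-parallel-size set to 1 and the node/GPU count reduced to match,
# so that only the pipeline split differs between the two runs.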

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 1
    --init-method-std 0.02
    --seed 1234
    --eval-iters 0
    --use-cpu-initialization
)
    #--load "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
    #--no-load-rng
    #--save "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
    #--save-interval 1

cmd="torchrun ${DISTRIBUTED_ARGS[@]} pretrain_llama.py \
        ${LLAMA_MODEL_ARGS[@]} \
        ${TRAINING_ARGS[@]} \
        ${MODEL_PARALLEL_ARGS[@]} \
        ${DATA_ARGS[@]} \
        ${EVAL_AND_LOGGING_ARGS[@]}"
echo $cmd
eval $cmd
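To compare the two runs numerically rather than only from the plot, here is a minimal sketch, assuming the logs use Megatron-style progress lines such as "iteration N/... | lm loss: X.XXXXXXE+00 | ..." (the extract_loss helper and the pp1_loss.txt / pp4_loss.txt names are illustrative; adjust the grep patterns if this fork prints a different field name):

#!/usr/bin/env bash
# Pull the per-iteration training loss out of both logs so the two curves
# can be compared line by line.

extract_loss() {
    # Print "iteration loss" pairs, one per line.
    grep -oE 'iteration +[0-9]+/|lm loss: [0-9.eE+-]+' "$1" \
        | paste - - \
        | awk '{gsub(/iteration +|\/|lm loss: /, ""); print $1, $2}'
}

extract_loss 7_gpu_pp1.log  > pp1_loss.txt
extract_loss 31_gpu_pp4.log > pp4_loss.txt

# Columns: iteration, PP=1 loss, PP=4 loss (iteration taken from the PP=1 log).
paste pp1_loss.txt pp4_loss.txt | awk '{printf "%s  %s  %s\n", $1, $2, $4}'

A side-by-side listing like this makes it easier to see whether the two curves already differ at the very first iterations (pointing at initialization or data ordering) or only drift apart later.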
@ZLkanyo009
Contributor Author

@zhaoyinglia Hello, could I trouble you to take a look at this issue? @aoyulong has been quite busy recently.
