v3.0.0-beta1
Pre-release
PaddleNLP v3.0.0-beta1 brings a number of important updates and enhancements over v3.0.0-beta0. It introduces the new Yuan, Mamba, and Jamba models and overhauls the LLM inference code for better compatibility and efficiency.
On the performance side, this release adds a fast tokenizer, implements broadcasting of MoE optimizer parameters, and accelerates layer normalization. It also fixes a number of bugs, including a safetensors shape-slicing issue and an mmap issue on Windows, improving stability and compatibility.
Documentation and tests were thoroughly revised for accuracy and readability. Support for domestic hardware has also been strengthened, including DCU and XPU optimizations as well as updates to the PIR-mode and auto-parallel configurations.
Major Changes and New Features
1. New Models and Features
- New models: #8654 introduces the Yuan model, and #8513 and #8517 add the new Mamba and Jamba models respectively; follow-up pull requests fix related bugs to keep these models running stably.
- LLM inference optimization: several pull requests refactor the LLM inference code and add support for new models and parameters, further improving inference efficiency and compatibility. A loading sketch for the new models follows below.
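As a quick orientation, the snippet below loads one of the newly added models through the Auto classes. It is a minimal sketch: the Auto-class wiring follows how PaddleNLP usually registers new architectures, and the checkpoint name is an illustrative assumption, not taken from these notes.

```python
# Minimal sketch: load one of the newly added models via the Auto classes.
# The checkpoint name "state-spaces/mamba-2.8b-hf" is an illustrative
# assumption; substitute whichever Yuan/Mamba/Jamba weights you actually use.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf")

input_ids = tokenizer("Hello, Mamba!", return_tensors="pd")["input_ids"]
output_ids, _ = model.generate(input_ids=input_ids, max_length=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```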
2. Core Performance Optimizations
- Fast tokenizer: #8832 adds a fast tokenizer built on the `tokenizers` library, significantly improving tokenization speed (see the first sketch after this list).
- MoE optimization: #8810 implements broadcasting of MoE (Mixture of Experts) optimizer parameters, improving training efficiency.
- Layer-norm acceleration: several pull requests add `fast_rmsnorm`, enable `use_fast_layer_norm`, and update the benchmark configurations, further speeding up training. In particular, #8717 supports `use_fast_layer_norm` during fine-tuning, giving users more flexibility.
- Training performance: #8803 adds the `enable_sp_async_reduce_scatter` option, which improves training performance.
- Dict arguments: #8446 adds support for dict-typed parameters in the trainer's argument parser, making argument passing more flexible (see the second sketch after this list). In addition, #8904 updates the tensorboard requirement for compatibility with the latest version.
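To illustrate the fast tokenizer, here is a minimal sketch. It assumes the new tokenizer is exposed through a `use_fast` switch on `AutoTokenizer`, following the Hugging Face convention; the model name is illustrative.

```python
# Minimal sketch: opt into the tokenizers-based fast tokenizer.
# The use_fast flag and the model name are assumptions for illustration.
from paddlenlp.transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b", use_fast=True)
print(tok("PaddleNLP fast tokenization")["input_ids"])
```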
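And for the dict-parameter support, a minimal sketch of how a dict-typed field might be filled from a JSON string on the command line; the field name `lr_scales` and the exact CLI syntax are assumptions, not the precise interface from #8446.

```python
# Minimal sketch: a dict-typed trainer argument parsed from a JSON string.
# The field name and CLI value below are illustrative assumptions.
from dataclasses import dataclass, field

from paddlenlp.trainer import PdArgumentParser


@dataclass
class MyArguments:
    # e.g. per-module learning-rate scales, passed as JSON on the CLI
    lr_scales: dict = field(default_factory=dict)


parser = PdArgumentParser(MyArguments)
(args,) = parser.parse_args_into_dataclasses(
    ["--lr_scales", '{"embedding": 0.1, "head": 1.0}']
)
print(args.lr_scales["head"])  # 1.0
```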
3. Bug Fixes
- safetensors: #8702 fixes the safetensors shape issue.
- mmap on Windows: #8734 fixes an mmap issue, improving compatibility on Windows.
- Other fixes: additional bugs are fixed in #8687, #8730, and several other pull requests.
4. Documentation and Test Updates
- Documentation polish: multiple pull requests update the docs, clean up code style, and refresh version information, keeping the documentation accurate and readable.
- README fixes and enhancements: #8741 fixes broken links in the README; several contributors also updated the README and added new test cases, keeping the docs in sync with the code.
5. Other Notable Changes
Domestic Hardware Support
- DCU support: #8580 implements high-performance LLM training and inference for DCU, broadening PaddleNLP's hardware coverage.
- XPU optimization: #8527 adds a LoRA optimization for XPU; #8697 implements allgather on XPU and #8710 fixes the unified-checkpoint gather, further improving training efficiency on XPU.
PIR Mode Support
- Export and loading: #8689 changes how the llama model is exported in PIR mode; #8712 and #8766 add support for loading and saving Llama2-7b in three patterns (legacy IR, PIR model file, PIR JSON file), giving users more flexibility and compatibility. A sketch of toggling PIR follows below.
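The sketch below shows the PIR toggle at export time. `FLAGS_enable_pir_api` is a real PaddlePaddle flag, but the tiny network and output path are illustrative, and the exact artifact layout for Llama2-7b is handled by PaddleNLP's own export scripts rather than this snippet.

```python
# Minimal sketch: export a static graph with PIR enabled. Under PIR,
# paddle.jit.save emits a JSON program file; under the legacy IR it emits
# *.pdmodel. The network and output path are illustrative.
import os

os.environ["FLAGS_enable_pir_api"] = "1"  # must be set before importing paddle

import paddle


class TinyNet(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(8, 8)

    def forward(self, x):
        return self.fc(x)


net = paddle.jit.to_static(
    TinyNet(),
    input_spec=[paddle.static.InputSpec(shape=[None, 8], dtype="float32")],
)
paddle.jit.save(net, "./tiny_net/inference")
```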
Auto-Parallel Improvements
- Config updates: #8679 changes `max_steps` in the Llama2-7b config to suit auto-parallel (see the sketch below); #8767 and #8828 refine the auto trainer's save and load; #8750 updates the loss function for global clipping, further improving auto-parallel efficiency and accuracy.
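For context, `max_steps` is the trainer field that caps a run by optimizer steps. A minimal sketch, assuming the standard `TrainingArguments` dataclass from `paddlenlp.trainer`; the concrete values are illustrative, not the ones chosen in #8679.

```python
# Minimal sketch: cap a run at a fixed number of optimizer steps via
# max_steps. Values are illustrative, not the ones from #8679.
from paddlenlp.trainer import TrainingArguments

args = TrainingArguments(
    output_dir="./ckpt",
    max_steps=50,                    # stop after 50 steps regardless of epochs
    per_device_train_batch_size=1,
)
print(args.max_steps)
```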
What's Changed
- [DCU] high performance LLM train and inference for DCU by @yuguo-Jack in #8580
- fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
- bug fix by @wtmlon in #8687
- [XPU] add lora optimization by @dynamicheart in #8527
- [pir save] Modiy export llama model file in pir mode by @xiaoguoguo626807 in #8689
- [AutoParallel] Change `max_steps` in Llama2-7b config for auto-parallel. by @heavyrain-lzy in #8679
- [benchmark] Change the mirror source for pip by @mmglove in #8699
- update loss base of auto-parallel tests by @zhiqiu in #8701
- Add new mistral by @wtmlon in #7425
- [Safetensors] Fix safetensors shape by @DesmonDay in #8702
- [BUG] Round num_samples down to prevent prefetch from exceeding the maximum dataset length... by @JunnYu in #8690
- xpu use allgather by @FeixLiu in #8697
- add fast_rmsnorm by @deepllz in #8680
- enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
- fix xpu gather for unified ckpt by @FeixLiu in #8710
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
- fix fast_ln backward by @deepllz in #8719
- finetune support use_fast_layer_norm by @tianhaodongbd in #8717
- bug fix by @FeixLiu in #8730
- disable lora by @lugimzzz in #8674
- [Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
- correct broken links in readme by @jzhang533 in #8741
- revert benchmark fix by @ronny1996 in #8747
- [LLM] Add Yuan model by @zhaogf01 in #8654
- fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
- [LLM] Update sequence parallel linear import by @DrownFish19 in #8706
- [Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
- update a100 loss by @zhiqiu in #8708
- [PaddleNLP 3.0] Update README by @DrownFish19 in #8681
- [AutoParallel] update loss for global clip by @JZ-LIANG in #8750
- [NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
- [DEV] Update develop version show by @DrownFish19 in #8754
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
- add benchmark baichuan2 scripts by @fightfat in #8683
- Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
- fix the ce for the unittest by @wawltor in #8772
- Enable parallel_config to use commas as delimiters. by @Difers in #8677
- fix incorrect token counting in `llm/predictor.py` by @lszxb in #8769
- Refine savable by @ZHUI in #8758
- [CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
- [XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
- fix version show by @DrownFish19 in #8791
- [BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
- vera-pissa method added by @TranscenderNing in #8722
- update version by @DrownFish19 in #8792
- [Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
- [DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
- [Prediction] Update LLM prediction. by @DesmonDay in #8778
- [Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
- [AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
- [MoE] Optimizer parameter broadcast by @DesmonDay in #8810
- [Doc] Update README by @DrownFish19 in #8817
- support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
- add paddle nv-embed-v1 by @Li-Z-Q in #8785
- fix pad_token_id bug by @yuanlehome in #8814
- [DCU] fix llama inference bug on DCU by @Deleter-D in #8815
- [Doc] Add LLaMA3.1 by @DrownFish19 in #8824
- [BUG] Fix build train valid test datasets by @JunnYu in #8826
- Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
- fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
- [AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
- [Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
- [Trainer] update clear_grad by @DesmonDay in #8829
- [Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
- [Inference LLM] support static c8 by @yuanlehome in #8833
- support sft mapdataset by @greycooker in #8840
- Cherry pick some changes from incubate branch by @sneaxiy in #8862
- support nested list of dict inputs by @deepllz in #8876
- Fix the bug with issues code 8641. by @smallbenxiong in #8880
- Fix the issue of P-tuning official sample error by @guangyunms in #8884
- modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
- [llm]fix zeropadding by @lugimzzz in #8895
- Fix fast_ln op error after enabling dynamic semi-auto parallel by @Wennie396 in #8891
- enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
- Update run_pretrain.py by @ZHUI in #8902
- [doc] Update readme by @DrownFish19 in #8905
- [AutoParallel] Bugfix auto parallel FA by @JZ-LIANG in #8903
- [Readme] Update README.md by @ZHUI in #8908
- [cherry-pick] Optimize async save by @ForFishes in #8878
- [LLM Inference] Refactor BlockInferencePredictor by @yuanlehome in #8879
- [Fix] modify tensorboard requirements by @greycooker in #8904
- [LLM Inference] Support qwen2 by @yuanlehome in #8893
- modify dict include none to aviod pir dytostatic bug in while op by @xiaoguoguo626807 in #8898
- [LLM]Update yuan model by @zhaogf01 in #8786
- update qwen && baichuan benchmark config by @deepllz in #8920
- [doc] Update README by @DrownFish19 in #8922
- [ New features]Trainer support dict parameter by @greycooker in #8446
- set logging_step to 5 with baichuan && qwen benchmark by @deepllz in #8928
- [Cherry-pick]fix pipeline eval by @gongel in #8924
- fix test_wint8 ut by @yuanlehome in #8930
- [LLM Inference] support llama3.1 by @yuanlehome in #8929
- Fix tokens count for benchmark by @DrownFish19 in #8938
- [bug fix] fix create_optimizer_and_scheduler for auto_parallel by @zhangyuqin1998 in #8937
- [LLM Inference] fix _get_tensor_parallel_mappings in llama by @yuanlehome in #8939
- [Unified Checkpoint] Fix load best checkpoint by @DesmonDay in #8935
- fix bug by @yuanlehome in #8947
- [LLM Inference] move llm.utils.utils.py to paddlenlp.utils.llm_utils.py by @yuanlehome in #8946
- support amp in pir dy2st mode. by @winter-wang in #8485
- [Trainer] Fix distributed dataloader by @DesmonDay in #8932
- [Tokenizer] Add Fast Tokenizer by @DrownFish19 in #8832
- [ZeroPadding] add greedy_zero_padding by @DesmonDay in #8933
- [NEW Model] Add mamba by @JunnYu in #8513
- [BUG] fix mamba tokenizer by @JunnYu in #8958
- [NEW Model] add jamba by @JunnYu in #8517
- [LLM Inference] add --use_fake_parameter option for ptq fake scales and fix compute error of total_max_length by @yuanlehome in #8955
- [LLM Inference] support qwen2 a8w8c8 inference by @ckl117 in #8925
- fix JambaModelIntegrationTest by @JunnYu in #8965
- [Fix] Enable tensor parallel tests. by @ZHUI in #8757
- [CI] Fix by @DrownFish19 in #8793
- [Unified Checkpoint] update async save by @DesmonDay in #8801
- [AutoParallel] Support save model for auto trainer by @zhangbo9674 in #8927
- fix qwen benchmark by @deepllz in #8969
- [ZeroPadding] padding to max_length for sequence parallel by @DrownFish19 in #8973
- add amp unit test case for auto_parallel ci. by @winter-wang in #8966
- [New Version] Upgrade to 3.0 b1 by @ZHUI in #8977
New Contributors
- @yuguo-Jack made their first contribution in #8580
- @ruisunyc made their first contribution in #8698
- @xiaoguoguo626807 made their first contribution in #8689
- @lizexu123 made their first contribution in #8712
- @jzhang533 made their first contribution in #8741
- @zhaogf01 made their first contribution in #8654
- @lszxb made their first contribution in #8768
- @TranscenderNing made their first contribution in #8722
- @Deleter-D made their first contribution in #8800
- @Li-Z-Q made their first contribution in #8785
- @Hanyonggong made their first contribution in #8799
- @smallbenxiong made their first contribution in #8880
- @guangyunms made their first contribution in #8884
- @winter-wang made their first contribution in #8485
- @ckl117 made their first contribution in #8925
Full Changelog: v3.0.0-beta0...v3.0.0-beta1