v3.0.0-beta1
Pre-release
PaddleNLP v3.0.0-beta1 brings a number of important updates and enhancements over v3.0.0-beta0. It introduces the new Yuan, Mamba, and Jamba models and overhauls the LLM inference code for better compatibility and efficiency.
On the performance side, this release adds a fast tokenizer, implements broadcasting of MoE optimizer parameters, and accelerates layer normalization. It also fixes a number of bugs, including a safetensors shape-slicing issue and an mmap issue on Windows, improving stability and compatibility.
Documentation and tests were thoroughly revised for accuracy and readability. Support for domestic hardware has also been strengthened, including DCU and XPU optimizations as well as updates to the PIR-mode and auto-parallel configurations.
Major Changes and New Features
1. New Models and Features
- New models: #8654 introduces the Yuan model, and #8513 and #8517 add the new Mamba and Jamba models respectively; follow-up pull requests fix related bugs to keep these models running stably.
- LLM inference optimization: several pull requests refactor the LLM inference code and add support for new models and parameters, further improving inference efficiency and compatibility. A loading sketch for the new models follows below.
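As a quick orientation, the snippet below loads one of the newly added models through the Auto classes. It is a minimal sketch: the Auto-class wiring follows how PaddleNLP usually registers new architectures, and the checkpoint name is an illustrative assumption, not taken from these notes.

```python
# Minimal sketch: load one of the newly added models via the Auto classes.
# The checkpoint name "state-spaces/mamba-2.8b-hf" is an illustrative
# assumption; substitute whichever Yuan/Mamba/Jamba weights you actually use.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf")

input_ids = tokenizer("Hello, Mamba!", return_tensors="pd")["input_ids"]
output_ids, _ = model.generate(input_ids=input_ids, max_length=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```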
2. Core Performance Optimizations
- Fast tokenizer: #8832 adds a fast tokenizer built on the `tokenizers` library, significantly improving tokenization speed (see the first sketch after this list).
- MoE optimization: #8810 implements broadcasting of MoE (Mixture of Experts) optimizer parameters, improving training efficiency.
- Layer-norm acceleration: several pull requests add `fast_rmsnorm`, enable `use_fast_layer_norm`, and update the benchmark configurations, further speeding up training. In particular, #8717 supports `use_fast_layer_norm` during fine-tuning, giving users more flexibility.
- Training performance: #8803 adds the `enable_sp_async_reduce_scatter` option, which improves training performance.
- Dict arguments: #8446 adds support for dict-typed parameters in the trainer's argument parser, making argument passing more flexible (see the second sketch after this list). In addition, #8904 updates the tensorboard requirement for compatibility with the latest version.
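To illustrate the fast tokenizer, here is a minimal sketch. It assumes the new tokenizer is exposed through a `use_fast` switch on `AutoTokenizer`, following the Hugging Face convention; the model name is illustrative.

```python
# Minimal sketch: opt into the tokenizers-based fast tokenizer.
# The use_fast flag and the model name are assumptions for illustration.
from paddlenlp.transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b", use_fast=True)
print(tok("PaddleNLP fast tokenization")["input_ids"])
```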
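And for the dict-parameter support, a minimal sketch of how a dict-typed field might be filled from a JSON string on the command line; the field name `lr_scales` and the exact CLI syntax are assumptions, not the precise interface from #8446.

```python
# Minimal sketch: a dict-typed trainer argument parsed from a JSON string.
# The field name and CLI value below are illustrative assumptions.
from dataclasses import dataclass, field

from paddlenlp.trainer import PdArgumentParser


@dataclass
class MyArguments:
    # e.g. per-module learning-rate scales, passed as JSON on the CLI
    lr_scales: dict = field(default_factory=dict)


parser = PdArgumentParser(MyArguments)
(args,) = parser.parse_args_into_dataclasses(
    ["--lr_scales", '{"embedding": 0.1, "head": 1.0}']
)
print(args.lr_scales["head"])  # 1.0
```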
3. Bug Fixes
- safetensors: #8702 fixes the safetensors shape issue.
- mmap on Windows: #8734 fixes an mmap issue, improving compatibility on Windows.
- Other fixes: additional bugs are fixed in #8687, #8730, and several other pull requests.
4. Documentation and Test Updates
- Documentation polish: multiple pull requests update the docs, clean up code style, and refresh version information, keeping the documentation accurate and readable.
- README fixes and enhancements: #8741 fixes broken links in the README; several contributors also updated the README and added new test cases, keeping the docs in sync with the code.
5. Other Notable Changes
Domestic Hardware Support
- DCU support: #8580 implements high-performance LLM training and inference for DCU, broadening PaddleNLP's hardware coverage.
- XPU optimization: #8527 adds a LoRA optimization for XPU; #8697 implements allgather on XPU and #8710 fixes the unified-checkpoint gather, further improving training efficiency on XPU.
PIR Mode Support
- Export and loading: #8689 changes how the llama model is exported in PIR mode; #8712 and #8766 add support for loading and saving Llama2-7b in three patterns (legacy IR, PIR model file, PIR JSON file), giving users more flexibility and compatibility. A sketch of toggling PIR follows below.
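The sketch below shows the PIR toggle at export time. `FLAGS_enable_pir_api` is a real PaddlePaddle flag, but the tiny network and output path are illustrative, and the exact artifact layout for Llama2-7b is handled by PaddleNLP's own export scripts rather than this snippet.

```python
# Minimal sketch: export a static graph with PIR enabled. Under PIR,
# paddle.jit.save emits a JSON program file; under the legacy IR it emits
# *.pdmodel. The network and output path are illustrative.
import os

os.environ["FLAGS_enable_pir_api"] = "1"  # must be set before importing paddle

import paddle


class TinyNet(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(8, 8)

    def forward(self, x):
        return self.fc(x)


net = paddle.jit.to_static(
    TinyNet(),
    input_spec=[paddle.static.InputSpec(shape=[None, 8], dtype="float32")],
)
paddle.jit.save(net, "./tiny_net/inference")
```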
Auto-Parallel Improvements
- Config updates: #8679 changes `max_steps` in the Llama2-7b config to suit auto-parallel (see the sketch below); #8767 and #8828 refine the auto trainer's save and load; #8750 updates the loss function for global clipping, further improving auto-parallel efficiency and accuracy.
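For context, `max_steps` is the trainer field that caps a run by optimizer steps. A minimal sketch, assuming the standard `TrainingArguments` dataclass from `paddlenlp.trainer`; the concrete values are illustrative, not the ones chosen in #8679.

```python
# Minimal sketch: cap a run at a fixed number of optimizer steps via
# max_steps. Values are illustrative, not the ones from #8679.
from paddlenlp.trainer import TrainingArguments

args = TrainingArguments(
    output_dir="./ckpt",
    max_steps=50,                    # stop after 50 steps regardless of epochs
    per_device_train_batch_size=1,
)
print(args.max_steps)
```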
What's Changed
- [DCU] high performance LLM train and inference for DCU by @yuguo-Jack in #8580
- fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by @fightfat in #8678
- bug fix by @wtmlon in #8687
- [XPU] add lora optimization by @dynamicheart in #8527
- [pir save] Modiy export llama model file in pir mode by @xiaoguoguo626807 in #8689
- [AutoParallel] Change `max_steps` in Llama2-7b config for auto-parallel. by @heavyrain-lzy in #8679
- [benchmark] Change the mirror source for pip by @mmglove in #8699
- update loss base of auto-parallel tests by @zhiqiu in #8701
- Add new mistral by @wtmlon in #7425
- [Safetensors] Fix safetensors shape by @DesmonDay in #8702
- [BUG] Round num_samples down to prevent prefetch from exceeding the maximum dataset length... by @JunnYu in #8690
- xpu use allgather by @FeixLiu in #8697
- add fast_rmsnorm by @deepllz in #8680
- enable use_fast_layer_norm for llama2 benchmark by @deepllz in #8714
- fix xpu gather for unified ckpt by @FeixLiu in #8710
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8712
- fix fast_ln backward by @deepllz in #8719
- finetune support use_fast_layer_norm by @tianhaodongbd in #8717
- bug fix by @FeixLiu in #8730
- disable lora by @lugimzzz in #8674
- [Safetensors] Fix mmap for Windows system by @DrownFish19 in #8734
- correct broken links in readme by @jzhang533 in #8741
- revert benchmark fix by @ronny1996 in #8747
- [LLM] Add Yuan model by @zhaogf01 in #8654
- fix nlp dir and auto_parallel_ci exit -6 by @fightfat in #8744
- [LLM] Update sequence parallel linear import by @DrownFish19 in #8706
- [Bug fixes] Fix ring attention by @zhangyuqin1998 in #8740
- update a100 loss by @zhiqiu in #8708
- [PaddleNLP 3.0] Update README by @DrownFish19 in #8681
- [AutoParallel] update loss for global clip by @JZ-LIANG in #8750
- [NPU] Fix sequence parallel lib import by @DrownFish19 in #8760
- [DEV] Update develop version show by @DrownFish19 in #8754
- [inference] support load or save Llama2-7b in three patterns by @lizexu123 in #8766
- add benchmark baichuan2 scripts by @fightfat in #8683
- Add the missing truncation=True in llm/predictor.py by @lszxb in #8768
- fix the ce for the unittest by @wawltor in #8772
- Enable parallel_config to use commas as delimiters. by @Difers in #8677
- fix incorrect token counting in `llm/predictor.py` by @lszxb in #8769
- Refine savable by @ZHUI in #8758
- [CodeStyle] remove markdownlint-cli by @DrownFish19 in #8779
- [XPU] use allgather and fp32 multinomial for XPU by @houj04 in #8787
- fix version show by @DrownFish19 in #8791
- [BUG] Add 20 redundant data in post pretrain by @JunnYu in #8789
- vera-pissa method added by @TranscenderNing in #8722
- update version by @DrownFish19 in #8792
- [Inference LLM] refine some code in llama wint8/4 by @yuanlehome in #8796
- [DCU] Llama a8w8 inference performance optimization by @Deleter-D in #8800
- [Prediction] Update LLM prediction. by @DesmonDay in #8778
- [Trainer] Add enable_sp_async_reduce_scatter by @DesmonDay in #8803
- [AutoParallel] Refine auto_trainer save load by @zhangbo9674 in #8767
- [MoE] Optimizer parameter broadcast by @DesmonDay in #8810
- [Doc] Update README by @DrownFish19 in #8817
- support Llama3.1 8B 128K generation on single GPU 80GB by @GuoxiaWang in #8811
- add paddle nv-embed-v1 by @Li-Z-Q in #8785
- fix pad_token_id bug by @yuanlehome in #8814
- [DCU] fix llama inference bug on DCU by @Deleter-D in #8815
- [Doc] Add LLaMA3.1 by @DrownFish19 in #8824
- [BUG] Fix build train valid test datasets by @JunnYu in #8826
- Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by @Hanyonggong in #8799
- fix tune_cublaslt_gemm compile bug by @yuanlehome in #8844
- [AutoParallel] Refine save and load ckpt for auto_trainer by @zhangbo9674 in #8828
- [Unified Checkpoint] update merge tensor parallel by @DesmonDay in #8856
- [Trainer] update clear_grad by @DesmonDay in #8829
- [Unified Checkpoint] Fix tie_word_embeddings by @DesmonDay in #8795
- [Inference LLM] support static c8 by @yuanlehome in #8833
- support sft mapdataset by @greycooker in #8840
- Cherry pick some changes from incubate branch by @sneaxiy in #8862
- support nested list of dict inputs by @deepllz in #8876
- Fix the bug with issues code 8641. by @smallbenxiong in #8880
- Fix the issue of P-tuning official sample error by @guangyunms in #8884
- modify Paddlemix qwen dytostatic by @xiaoguoguo626807 in #8869
- [llm]fix zeropadding by @lugimzzz in #8895
- Fix fast_ln op error after enabling dynamic semi-auto parallel by @Wennie396 in #8891
- enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by @deepllz in #8897
- Update run_pretrain.py by @ZHUI in #8902
- [doc] Update readme by @DrownFish19 in #8905
- [AutoParallel] Bugfix auto parallel FA by @JZ-LIANG in #8903
- [Readme] Update README.md by @ZHUI in #8908
- [cherry-pick] Optimize async save by @ForFishes in #8878
- [LLM Inference] Refactor BlockInferencePredictor by @yuanlehome in #8879
- [Fix] modify tensorboard requirements by @greycooker in #8904
- [LLM Inference] Support qwen2 by @yuanlehome in #8893
- modify dict include none to aviod pir dytostatic bug in while op by @xiaoguoguo626807 in #8898
- [LLM]Update yuan model by @zhaogf01 in #8786
- update qwen && baichuan benchmark config by @deepllz in #8920
- [doc] Update README by @DrownFish19 in #8922
- [ New features]Trainer support dict parameter by @greycooker in #8446
- set logging_step to 5 with baichuan && qwen benchmark by @deepllz in #8928
- [Cherry-pick]fix pipeline eval by @gongel in #8924
- fix test_wint8 ut by @yuanlehome in #8930
- [LLM Inference] support llama3.1 by @yuanlehome in #8929
- Fix tokens count for benchmark by @DrownFish19 in #8938
- [bug fix] fix create_optimizer_and_scheduler for auto_parallel by @zhangyuqin1998 in #8937
- [LLM Inference] fix _get_tensor_parallel_mappings in llama by @yuanlehome in #8939
- [Unified Checkpoint] Fix load best checkpoint by @DesmonDay in #8935
- fix bug by @yuanlehome in #8947
- [LLM Inference] move llm.utils.utils.py to paddlenlp.utils.llm_utils.py by @yuanlehome in #8946
- support amp in pir dy2st mode. by @winter-wang in #8485
- [Trainer] Fix distributed dataloader by @DesmonDay in #8932
- [Tokenizer] Add Fast Tokenizer by @DrownFish19 in #8832
- [ZeroPadding] add greedy_zero_padding by @DesmonDay in #8933
- [NEW Model] Add mamba by @JunnYu in #8513
- [BUG] fix mamba tokenizer by @JunnYu in #8958
- [NEW Model] add jamba by @JunnYu in #8517
- [LLM Inference] add --use_fake_parameter option for ptq fake scales and fix compute error of total_max_length by @yuanlehome in #8955
- [LLM Inference] support qwen2 a8w8c8 inference by @ckl117 in #8925
- fix JambaModelIntegrationTest by @JunnYu in #8965
- [Fix] Enable tensor parallel tests. by @ZHUI in #8757
- [CI] Fix by @DrownFish19 in #8793
- [Unified Checkpoint] update async save by @DesmonDay in #8801
- [AutoParallel] Support save model for auto trainer by @zhangbo9674 in #8927
- fix qwen benchmark by @deepllz in #8969
- [ZeroPadding] padding to max_length for sequence parallel by @DrownFish19 in #8973
- add amp unit test case for auto_parallel ci. by @winter-wang in #8966
- [New Version] Upgrade to 3.0 b1 by @ZHUI in #8977
New Contributors
- @yuguo-Jack made their first contribution in #8580
- @ruisunyc made their first contribution in #8698
- @xiaoguoguo626807 made their first contribution in #8689
- @lizexu123 made their first contribution in #8712
- @jzhang533 made their first contribution in #8741
- @zhaogf01 made their first contribution in #8654
- @lszxb made their first contribution in #8768
- @TranscenderNing made their first contribution in #8722
- @Deleter-D made their first contribution in #8800
- @Li-Z-Q made their first contribution in #8785
- @Hanyonggong made their first contribution in #8799
- @smallbenxiong made their first contribution in #8880
- @guangyunms made their first contribution in #8884
- @winter-wang made their first contribution in #8485
- @ckl117 made their first contribution in #8925
Full Changelog: v3.0.0-beta0...v3.0.0-beta1