Releases: hpcaitech/ColossalAI
v0.1.4 Released!
Main Features
Here are the main improvements of this release:
- ColoTensor: A data structure that unifies the Tensor representation of different parallel methods.
- Gemini: A more efficient Gemini implementation that reduces the overhead of model data statistics collection.
- CLI: A command-line tool that helps users launch distributed training tasks more easily (a minimal launch sketch follows this list).
- Pipeline Parallelism (PP): A more user-friendly API for PP.
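To show how these pieces fit together, here is a minimal, hedged sketch of a script launched through the new CLI. The colossalai run subcommand and the --nproc_per_node flag shown in the comment are assumptions inferred from the CLI pull requests below; launch_from_torch is the Python-side entry point that picks up the environment prepared by a torchrun-style launcher.

```python
# Assumed CLI invocation (subcommand and flag names are illustrative):
#   colossalai run --nproc_per_node 4 train.py
import colossalai
from colossalai.core import global_context as gpc


def main():
    # launch_from_torch reads the rank/world-size environment variables that a
    # torchrun-style launcher (such as the new CLI) sets for each worker process.
    colossalai.launch_from_torch(config=dict())
    print(f'distributed environment ready on global rank {gpc.get_global_rank()}')


if __name__ == '__main__':
    main()
```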
What's Changed
ColoTensor
- [tensor]fix colo_tensor torch_function by @Wesley-Jzy in #825
- [tensor]fix test_linear by @Wesley-Jzy in #826
- [tensor] ZeRO use ColoTensor as the base class. by @feifeibear in #828
- [tensor] revert zero tensors back by @feifeibear in #829
- [Tensor] overriding parameters() for Module using ColoTensor by @feifeibear in #889
- [tensor] refine linear and add gather for layernorm by @Wesley-Jzy in #893
- [Tensor] test parameters() as member function by @feifeibear in #896
- [Tensor] activation is an attr of ColoTensor by @feifeibear in #897
- [Tensor] initialize the ColoOptimizer by @feifeibear in #898
- [tensor] reorganize files by @feifeibear in #820
- [Tensor] apply ColoTensor on Torch functions by @feifeibear in #821
- [Tensor] update ColoTensor torch_function by @feifeibear in #822
- [tensor] lazy init by @feifeibear in #823
- [WIP] Applying ColoTensor on TP-1D-row Linear. by @feifeibear in #831
- Init Context supports lazy allocation of model memory by @feifeibear in #842
- [Tensor] TP Linear 1D row by @Wesley-Jzy in #843
- [Tensor] add assert for colo_tensor 1Drow by @Wesley-Jzy in #846
- [Tensor] init a simple network training with ColoTensor by @feifeibear in #849
- [Tensor] Add 1Drow weight reshard by spec by @Wesley-Jzy in #854
- [Tensor] add layer norm Op by @feifeibear in #852
- [tensor] an initial idea of tensor spec by @feifeibear in #865
- [Tensor] colo init context add device attr. by @feifeibear in #866
- [tensor] add cross_entropy_loss by @feifeibear in #868
- [Tensor] Add function to spec and update linear 1Drow and unit tests by @Wesley-Jzy in #869
- [tensor] customized op returns ColoTensor by @feifeibear in #875
- [Tensor] get named parameters for model using ColoTensors by @feifeibear in #874
- [Tensor] Add some attributes to ColoTensor by @feifeibear in #877
- [Tensor] make a simple net works with 1D row TP by @feifeibear in #879
- [tensor] wrap function in the torch_tensor to ColoTensor by @Wesley-Jzy in #881
- [Tensor] make ColoTensor more robust for getattr by @feifeibear in #886
- [Tensor] test model check results for a simple net by @feifeibear in #887
- [tensor] add ColoTensor 1Dcol by @Wesley-Jzy in #888
Gemini + ZeRO
- [zero] add zero tensor shard strategy by @1SAA in #793
- Revert "[zero] add zero tensor shard strategy" by @feifeibear in #806
- [gemini] a new tensor structure by @feifeibear in #818
- [gemini] APIs to set cpu memory capacity by @feifeibear in #809
- [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in #808
- [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in #813
- [gemini] add GeminiMemoryManager by @1SAA in #832
- [zero] use GeminiMemoryManager when sampling model data by @ver217 in #850
- [gemini] polish code by @1SAA in #855
- [gemini] add stateful tensor container by @1SAA in #867
- [gemini] polish stateful_tensor_mgr by @1SAA in #876
- [gemini] accelerate adjust_layout() by @ver217 in #878
CLI
- [cli] added distributed launcher command by @YuliangLiu0306 in #791
- [cli] added micro benchmarking for tp by @YuliangLiu0306 in #789
- [cli] add missing requirement by @FrankLeeeee in #805
- [cli] fixed a bug in user args and refactored the module structure by @FrankLeeeee in #807
- [cli] fixed single-node process launching by @FrankLeeeee in #812
- [cli] added check installation cli by @FrankLeeeee in #815
- [CLI] refactored the launch CLI and fixed bugs in multi-node launching by @FrankLeeeee in #844
- [cli] refactored micro-benchmarking cli and added more metrics by @FrankLeeeee in #858
Pipeline Parallelism
- [pipelinable]use pipelinable context to initialize non-pipeline model by @YuliangLiu0306 in #816
- [pipelinable]use ColoTensor to replace dummy tensor. by @YuliangLiu0306 in #853
Misc
- [hotfix] fix auto tensor placement policy by @ver217 in #775
- [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in #772
- [hotfix] fix bugs in zero by @1SAA in #781
- [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in #784
- [refactor] moving memtracer to gemini by @feifeibear in #801
- [log] display tflops if available by @feifeibear in #802
- [refactor] moving grad acc logic to engine by @feifeibear in #804
- [log] local throughput metrics by @feifeibear in #811
- [Bot] Synchronize Submodule References by @github-actions in #810
- [Bot] Synchronize Submodule References by @github-actions in #819
- [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in #824
- [setup] allow installation with python 3.6 by @FrankLeeeee in #834
- Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in #835
- [dependency] removed torchvision by @FrankLeeeee in #833
- [Bot] Synchronize Submodule References by @github-actions in #827
- [unittest] refactored unit tests for change in dependency by @FrankLeeeee in #838
- [setup] use env var instead of option for cuda ext by @FrankLeeeee in #839
- [hotfix] ColoTensor pin_memory by @feifeibear in #840
- modified the pp build for ckpt adaptation by @Gy-Lu in #803
- [hotfix] the bug of numel() in ColoTensor by @feifeibear in #845
- [hotfix] fix _post_init_method of zero init ctx by @ver217 in #847
- [hotfix] add deconstructor for stateful tensor by @ver217 in #848
- [utils] refactor profiler by @ver217 in #837
- [ci] cache cuda extension by @FrankLeeeee in #860
- hotfix tensor unittest bugs by @feifeibear in #862
- [usability] added assertion message in registry by @FrankLeeeee in #864
- [doc] improved docstring in the communication module by @FrankLeeeee in #863
- [doc] improved docstring in the logging module by @FrankLeeeee in #861
- [doc] improved docstring in the amp module by @FrankLeeeee in #857
- [usability] improved error messages in the context modu...
V0.1.3 Released!
Overview
Here are the main improvements of this release:
- Gemini: Heterogeneous memory space manager (a hedged configuration sketch follows this list)
- Refactored pipeline parallelism API
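Below is a hedged sketch of what a ZeRO configuration exercising the Gemini heterogeneous memory manager might look like. The field names (model_config, tensor_placement_policy, optimizer_config) are assumptions inferred from the pull requests listed under Features, not a verified API.

```python
# config.py -- a hedged configuration sketch; field names are assumptions
# inferred from the PRs below (e.g. "[zero] add tensor placement policies").
from colossalai.zero.shard_utils import TensorShardStrategy

zero = dict(
    model_config=dict(
        shard_strategy=TensorShardStrategy(),
        # 'cpu', 'cuda' or 'auto': 'auto' lets the stateful tensor manager move
        # model data between host and device memory based on runtime statistics.
        tensor_placement_policy='auto',
    ),
    optimizer_config=dict(
        initial_scale=2 ** 5,
    ),
)
```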
What's Changed
Features
- [zero] initialize a stateful tensor manager by @feifeibear in #614
- [pipeline] refactor pipeline by @YuliangLiu0306 in #679
- [zero] stateful tensor manager by @ver217 in #687
- [zero] adapt zero hooks for unsharded module by @1SAA in #699
- [zero] refactor memstats collector by @ver217 in #706
- [zero] improve adaptability for not-shard parameters by @1SAA in #708
- [zero] check whether gradients have inf and nan in gpu by @1SAA in #712
- [refactor] refactor the memory utils by @feifeibear in #715
- [util] support detection of number of processes on current node by @FrankLeeeee in #723
- [utils] add synchronized cuda memory monitor by @1SAA in #740
- [zero] refactor ShardedParamV2 by @1SAA in #742
- [zero] add tensor placement policies by @ver217 in #743
- [zero] use factory pattern for tensor_placement_policy by @feifeibear in #752
- [zero] refactor memstats_collector by @1SAA in #746
- [gemini] init gemini individual directory by @feifeibear in #754
- refactor shard and gather operation by @1SAA in #773
Bug Fix
- [zero] fix init bugs in zero context by @1SAA in #686
- [hotfix] update requirements-test by @ver217 in #701
- [hotfix] fix a bug in 3d vocab parallel embedding by @kurisusnowdeng in #707
- [compatibility] fixed tensor parallel compatibility with torch 1.9 by @FrankLeeeee in #700
- [hotfix]fixed bugs of assigning grad states to non leaf nodes by @Gy-Lu in #711
- [hotfix] fix stateful tensor manager's cuda model data size by @ver217 in #710
- [bug] fixed broken test_found_inf by @FrankLeeeee in #725
- [util] fixed activation checkpointing on torch 1.9 by @FrankLeeeee in #719
- [util] fixed communication API with PyTorch 1.9 by @FrankLeeeee in #721
- [bug] removed zero installation requirements by @FrankLeeeee in #731
- [hotfix] remove duplicated param register to stateful tensor manager by @feifeibear in #728
- [utils] correct cpu memory used and capacity in the context of multi-process by @feifeibear in #726
- [bug] fixed grad scaler compatibility with torch 1.8 by @FrankLeeeee in #735
- [bug] fixed DDP compatibility with torch 1.8 by @FrankLeeeee in #739
- [hotfix] fix memory leak in backward of sharded model by @ver217 in #741
- [hotfix] fix initialize about zero by @ver217 in #748
- [hotfix] fix prepare grads in sharded optim by @ver217 in #749
- [hotfix] layernorm by @kurisusnowdeng in #750
- [hotfix] fix auto tensor placement policy by @ver217 in #753
- [hotfix] fix reuse_fp16_shard of sharded model by @ver217 in #756
- [hotfix] fix test_stateful_tensor_mgr by @ver217 in #762
- [compatibility] used backward-compatible API for global process group by @FrankLeeeee in #758
- [hotfix] fix the ckpt hook bugs when using DDP by @Gy-Lu in #769
- [hotfix] polish sharded optim docstr and warning by @ver217 in #770
Unit Testing
- [ci] replace the ngc docker image with self-built pytorch image by @FrankLeeeee in #672
- [ci] fixed compatibility workflow by @FrankLeeeee in #678
- [ci] update workflow trigger condition and support options by @FrankLeeeee in #691
- [ci] added missing field in workflow by @FrankLeeeee in #692
- [ci] remove ipc config for rootless docker by @FrankLeeeee in #694
- [test] added missing decorators to model checkpointing tests by @FrankLeeeee in #727
- [unitest] add checkpoint for moe zero test by @1SAA in #729
- [test] added a decorator for address already in use error with backward compatibility by @FrankLeeeee in #760
- [test] refactored with the new rerun decorator by @FrankLeeeee in #763
Documentation
- add PaLM link by @binmakeswell in #704
- [doc] removed outdated installation command by @FrankLeeeee in #730
- add video by @binmakeswell in #732
- [readme] polish readme by @feifeibear in #764
- [readme] sync CN readme by @binmakeswell in #766
Miscellaneous
- [Bot] Synchronize Submodule References by @github-actions in #556
- [Bot] Synchronize Submodule References by @github-actions in #695
- [refactor] zero directory by @feifeibear in #724
- [Bot] Synchronize Submodule References by @github-actions in #751
Full Changelog: v0.1.2...v0.1.3
V0.1.2 Released!
Overview
Here are the main improvements of this release:
- MOE and BERT models can be trained with ZeRO.
- Provided a uniform checkpoint mechanism for all kinds of parallelism.
- Optimized ZeRO-offload and improved model scaling.
- Designed a uniform model memory tracer.
- Implemented an efficient hybrid Adam optimizer with both CPU and CUDA kernels (see the sketch after this list).
- Improved activation offloading.
- Released a beta version of the profiler TensorBoard plugin.
- Refactored the pipeline module for closer integration with the engine.
- Added Chinese tutorials and opened WeChat and Slack user groups.
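As an illustration of the hybrid Adam item above, here is a minimal sketch, assuming the optimizer is exposed as colossalai.nn.optimizer.HybridAdam and behaves as a drop-in replacement for torch.optim.Adam:

```python
import torch
import torch.nn as nn
from colossalai.nn.optimizer import HybridAdam

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
# Optimizer states kept on the CPU are updated by the CPU kernel and states on
# the GPU by the CUDA kernel, avoiding redundant host-device copies.
optimizer = HybridAdam(model.parameters(), lr=1e-3, weight_decay=1e-2)

x = torch.randn(32, 512).cuda()
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```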
What's Changed
Features
- [zero] get memory usage for sharded param by @feifeibear in #536
- [zero] improve the accuracy of get_memory_usage of sharded param by @feifeibear in #538
- [zero] refactor model data tracing by @feifeibear in #537
- [zero] get memory usage of sharded optim v2. by @feifeibear in #542
- [zero] polish ZeroInitContext by @ver217 in #540
- [zero] optimize grad offload by @ver217 in #539
- [zero] non model data tracing by @feifeibear in #545
- [zero] add zero config to neutralize zero context init by @1SAA in #546
- [zero] dump memory stats for sharded model by @feifeibear in #548
- [zero] add stateful tensor by @feifeibear in #549
- [zero] label state for param fp16 and grad by @feifeibear in #551
- [zero] hijack p.grad in sharded model by @ver217 in #554
- [utils] update colo tensor moving APIs by @feifeibear in #553
- [polish] rename col_attr -> colo_attr by @feifeibear in #558
- [zero] trace states of fp16/32 grad and fp32 param by @ver217 in #571
- [zero] adapt zero for unsharded parameters by @1SAA in #561
- [refactor] memory utils by @feifeibear in #577
- Feature/checkpoint gloo by @kurisusnowdeng in #589
- [zero] add sampling time for memstats collector by @Gy-Lu in #610
- [model checkpoint] checkpoint utils by @kurisusnowdeng in #592
- [model checkpoint][hotfix] unified layers for save&load by @kurisusnowdeng in #593
- Feature/checkpoint 2D by @kurisusnowdeng in #595
- Feature/checkpoint 1D by @kurisusnowdeng in #594
- [model checkpoint] CPU communication ops by @kurisusnowdeng in #590
- Feature/checkpoint 2.5D by @kurisusnowdeng in #596
- Feature/Checkpoint 3D by @kurisusnowdeng in #597
- [model checkpoint] checkpoint hook by @kurisusnowdeng in #598
- Feature/Checkpoint tests by @kurisusnowdeng in #599
- [zero] adapt zero for unsharded parameters (Optimizer part) by @1SAA in #601
- [zero] polish init context by @feifeibear in #645
- refactor pipeline---put runtime schedule into engine. by @YuliangLiu0306 in #627
Bug Fix
- [Zero] process no-leaf-module in Zero by @1SAA in #535
- Add gather_out arg to Linear by @Wesley-Jzy in #541
- [hotfix] fix parallel_input flag for Linear1D_Col gather_output by @Wesley-Jzy in #579
- [hotfix] add hybrid adam to init by @ver217 in #584
- Hotfix/path check util by @kurisusnowdeng in #591
- [hotfix] fix sharded optim zero grad by @ver217 in #604
- Add tensor parallel input check by @Wesley-Jzy in #621
- [hotfix] Raise messages for indivisible batch sizes with tensor parallelism by @number1roy in #622
- [zero] fixed the activation offload by @Gy-Lu in #647
- fixed bugs in CPU adam by @1SAA in #633
- Revert "[zero] polish init context" by @feifeibear in #657
- [hotfix] fix a bug in model data stats tracing by @feifeibear in #655
- fix bugs for unsharded parameters when restore data by @1SAA in #664
Unit Testing
- [zero] test zero tensor utils by @FredHuang99 in #609
- remove hybrid adam in test_moe_zero_optim by @1SAA in #659
Documentation
- Refactored docstring to google style by @number1roy in #532
- [docs] updated docs of hybrid adam and cpu adam by @Gy-Lu in #552
- html refactor by @number1roy in #555
- [doc] polish docstring of zero by @ver217 in #612
- [doc] update rst by @ver217 in #615
- [doc] polish amp docstring by @ver217 in #616
- [doc] polish moe docstring by @ver217 in #618
- [doc] polish optimizer docstring by @ver217 in #619
- [doc] polish utils docstring by @ver217 in #620
- [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu … by @GaryGky in #625
- [doc] polish checkpoint docstring by @ver217 in #637
- update GPT-2 experiment result by @Sze-qq in #666
- [NFC] polish code by @binmakeswell in #646
Model Zoo
Miscellaneous
- [logging] polish logger format by @feifeibear in #543
- [profiler] add MemProfiler by @raejaf in #356
- [Bot] Synchronize Submodule References by @github-actions in #501
- [tool] create .clang-format for pre-commit by @BoxiangW in #578
- [GitHub] Add prefix and label in issue template by @binmakeswell in #652
Full Changelog: v0.1.1...v0.1.2
V0.1.1 Released Today!
What's Changed
Features
- [MOE] changed parallelmode to dist process group by @1SAA in #460
- [MOE] redirect moe_env from global_variables to core by @1SAA in #467
- [zero] zero init ctx receives a dp process group by @ver217 in #471
- [zero] ZeRO supports pipeline parallel by @ver217 in #477
- add LinearGate for MOE in NaiveAMP context by @1SAA in #480
- [zero] polish sharded param name by @feifeibear in #484
- [zero] sharded optim support hybrid cpu adam by @ver217 in #486
- [zero] polish sharded optimizer v2 by @ver217 in #490
- [MOE] support PR-MOE by @1SAA in #488
- [zero] sharded model manages ophooks individually by @ver217 in #492
- [MOE] remove old MoE legacy by @1SAA in #493
- [zero] sharded model support the reuse of fp16 shard by @ver217 in #495
- [polish] polish singleton and global context by @feifeibear in #500
- [memory] add model data tensor moving api by @feifeibear in #503
- [memory] set cuda mem frac by @feifeibear in #506
- [zero] use colo model data api in sharded optimv2 by @feifeibear in #511
- [MOE] add MOEGPT model by @1SAA in #510
- [zero] zero init ctx enable rm_torch_payload_on_the_fly by @ver217 in #512
- [zero] show model data cuda memory usage after zero context init. by @feifeibear in #515
- [log] polish disable_existing_loggers by @ver217 in #519
- [zero] add model data tensor inline moving API by @feifeibear in #521
- [cuda] modify the fused adam, support hybrid of fp16 and fp32 by @Gy-Lu in #497
- [zero] refactor model data tracing by @feifeibear in #522
- [zero] added hybrid adam, removed loss scale in adam by @Gy-Lu in #527
Bug Fix
- fix discussion button in issue template by @binmakeswell in #504
- [zero] fix grad offload by @feifeibear in #528
Unit Testing
- [MOE] add unitest for MOE experts layout, gradient handler and kernel by @1SAA in #469
- [test] added rerun on exception for testing by @FrankLeeeee in #475
- [zero] fix init device bug in zero init context unittest by @feifeibear in #516
- [test] fixed rerun_on_exception and adapted test cases by @FrankLeeeee in #487
CI/CD
- [devops] remove tsinghua source for pip by @FrankLeeeee in #505
- [devops] remove tsinghua source for pip by @FrankLeeeee in #507
- [devops] recover tsinghua pip source due to proxy issue by @FrankLeeeee in #509
Documentation
- [doc] update rst by @ver217 in #470
- Update Experiment result about Colossal-AI with ZeRO by @Sze-qq in #479
- [doc] docs get correct release version by @ver217 in #489
- Update README.md by @fastalgo in #514
- [doc] update apidoc by @ver217 in #530
Model Zoo
- [model zoo] fix attn mask shape of gpt by @ver217 in #472
- [model zoo] gpt embedding remove attn mask by @ver217 in #474
Miscellaneous
- [install] run without rich by @feifeibear in #513
- [refactor] remove old zero code by @feifeibear in #517
- [format] polish name format for MOE by @feifeibear in #481
New Contributors
Full Changelog: v0.1.0...v0.1.1
V0.1.0 Released Today!
Overview
We are happy to release version v0.1.0 today. Compared to the previous version, it ships a brand-new ZeRO module and updates many aspects of the system for better performance and usability. The latest version can now be installed with pip install colossalai. We will update our examples and documentation accordingly over the next few days.
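A hedged sketch of initializing a model under the new ZeRO module is shown below; the constructor arguments of ZeroInitContext are illustrative and may differ slightly from the released signature.

```python
import torch
import torch.nn as nn
import colossalai
from colossalai.zero.init_ctx import ZeroInitContext
from colossalai.zero.shard_utils import TensorShardStrategy

colossalai.launch_from_torch(config=dict())

# Parameters are sharded across data-parallel workers as they are created, so the
# full model never needs to fit on a single device.
with ZeroInitContext(target_device=torch.cuda.current_device(),
                     shard_strategy=TensorShardStrategy(),
                     shard_param=True):
    model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
```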
Highlights:
Note:
a. Only the major base commits are chosen to display. Successive commits which enhance/update the base commit are not shown.
b. Some commits do not have an associated pull request ID for unknown reasons.
c. The list is ordered by time.
Features
- add moe context, moe utilities and refactor gradient handler (#455 ) By @1SAA
- [zero] Update initialize for ZeRO (#458 ) By @ver217
- [zero] hybrid cpu adam (#445 ) By @feifeibear
- added Multiply Jitter and capacity factor eval for MOE (#434 ) By @1SAA
- [fp16] refactored fp16 optimizer (#392 ) By @FrankLeeeee
- [zero] memtracer to record cuda memory usage of model data and overall system (#395 ) By @feifeibear
- Added tensor detector (#393 ) By @Gy-Lu
- Added activation offload (#331 ) By @Gy-Lu
- [zero] zero init context collect numel of model (#375 ) By @feifeibear
- Added PCIE profiler to detect data transmission (#373 ) By @1SAA
- Added Profiler Context to manage all profilers (#340 ) By @1SAA
- set criterion as optional in colossalai initialize (#336 ) By @FrankLeeeee
- [zero] Update sharded model v2 using sharded param v2 (#323 ) By @ver217
- [zero] zero init context (#321 ) By @feifeibear
- Added profiler communication operations By @1SAA
- added buffer sync to naive amp model wrapper (#291 ) By @FrankLeeeee
- [zero] cpu adam kernel (#288 ) By @Gy-Lu
- Feature/zero (#279 ) By @feifeibear @FrankLeeeee @ver217
- impl shard optim v2 and add unit test By @ver217
- [profiler] primary memory tracer By @raejaf
- add sharded adam By @ver217
Unit Testing
- [test] fixed amp convergence comparison test (#454 ) By @FrankLeeeee
- [test] optimized zero data parallel test (#452 ) By @FrankLeeeee
- [test] make zero engine test really work (#447 ) By @feifeibear
- optimized context test time consumption (#446 ) By @FrankLeeeee
- [unitest] polish zero config in unittest (#438 ) By @feifeibear
- added testing module (#435 ) By @FrankLeeeee
- [zero] polish ShardedOptimV2 unittest (#385 ) By @feifeibear
- [unit test] Refactored test cases with component func (#339 ) By @FrankLeeeee
Documentation
- [doc] Update docstring for ZeRO (#459 ) By @ver217
- update README and images path (#384 ) By @binmakeswell
- add badge and contributor list By @FrankLeeeee
- add community group and update issue template (#271 ) By @binmakeswell
- update experimental visualization (#253 ) By @Sze-qq
- add Chinese README By @binmakeswell
CI/CD
- update github CI with the current workflow (#441 ) By @FrankLeeeee
- update unit testing CI rules By @FrankLeeeee
- added compatibility CI and options for release ci By @FrankLeeeee
- added pypi publication CI and remove formatting CI By @FrankLeeeee
Bug Fix
- fix gpt attention mask (#461 ) By @ver217
- [bug] Fixed device placement bug in memory monitor thread (#433 ) By @FrankLeeeee
- fixed fp16 optimizer none grad bug (#432 ) By @FrankLeeeee
- fixed gpt attention mask in pipeline (#430 ) By @FrankLeeeee
- [hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394 ) By @1SAA
- fixed bug in activation checkpointing test (#387 ) By @FrankLeeeee
- [profiler] Fixed bugs in CommProfiler and PcieProfiler (#377 ) By @1SAA
- fixed CI dataset directory; fixed import error of 2.5d accuracy (#255 ) By @kurisusnowdeng
- fixed padding index issue for vocab parallel embedding layers; updated 3D linear to be compatible with examples in the tutorial By @kurisusnowdeng
Miscellaneous
- [log] better logging display with rich (#426 ) By @feifeibear
V0.0.2 Released Today!
Change Log
Added
- Unified distributed layers
- MoE support
- DevOps tools such as GitHub Actions, code review automation, etc.
- New project official website
Changes
- Refactored the APIs for usability, flexibility and modularity
- Adapted PyTorch AMP for tensor parallelism
- Refactored utilities for tensor parallelism and pipeline parallelism
- Separated benchmarks and examples as independent repositories
- Updated pipeline parallelism to support non-interleaved and interleaved versions
- Refactored installation scripts for convenience
Fixed
- ZeRO level 3 runtime error
- incorrect calculation in gradient clipping
v0.0.1 Colossal-AI Beta Release
Features
- Data Parallelism
- Pipeline Parallelism (experimental)
- 1D, 2D, 2.5D, 3D and sequence tensor parallelism
- Easy-to-use trainer and engine (see the training-loop sketch after this list)
- Extensibility for user-defined parallelism
- Mixed Precision Training
- Zero Redundancy Optimizer (ZeRO)
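To give a feel for the trainer/engine workflow, here is a minimal, hedged sketch written against the launch and initialize entry points of later releases; the exact signatures at the time of the beta may have differed.

```python
import torch
import torch.nn as nn
import colossalai

colossalai.launch_from_torch(config=dict())

model = nn.Linear(256, 256).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# The engine wraps forward, backward and optimizer steps so that the same loop
# works under data, pipeline or tensor parallelism and mixed precision.
engine, *_ = colossalai.initialize(model=model, optimizer=optimizer, criterion=criterion)

engine.train()
data = torch.randn(16, 256).cuda()
target = torch.randn(16, 256).cuda()

engine.zero_grad()
output = engine(data)
loss = criterion(output, target)
engine.backward(loss)
engine.step()
```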