Releases: hpcaitech/ColossalAI

v0.1.4 Released!

28 Apr 07:56
e1108ca

Main Features

Here are the main improvements of this release:

  1. ColoTensor: a data structure that unifies the tensor representation across different parallel methods.
  2. Gemini: a more efficient Gemini implementation that reduces the overhead of collecting model data statistics (see the sketch after this list).
  3. CLI: a command-line tool that helps users launch distributed training tasks more easily.
  4. Pipeline Parallelism (PP): a more user-friendly API for PP.
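
As a quick illustration of how the Gemini/ZeRO features above are typically used, here is a minimal sketch. The `ZeroInitContext` and `TensorShardStrategy` import paths and arguments follow the v0.1.x documentation and should be read as assumptions rather than a stable API; the snippet also assumes the distributed environment has already been set up (e.g. via `colossalai.launch`).

```python
# Minimal sketch (assumed v0.1.x API): build a model whose parameters are
# sharded at creation time, so the full fp32 model never has to fit on one device.
import torch
import torch.nn as nn

from colossalai.zero.init_ctx import ZeroInitContext
from colossalai.zero.shard_utils import TensorShardStrategy

# assumes colossalai.launch(...) has already initialized the process groups
with ZeroInitContext(target_device=torch.device('cuda'),
                     shard_strategy=TensorShardStrategy(),
                     shard_param=True):
    model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
```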

What's Changed

ColoTensor

Gemini + ZeRO

  • [zero] add zero tensor shard strategy by @1SAA in #793
  • Revert "[zero] add zero tensor shard strategy" by @feifeibear in #806
  • [gemini] a new tensor structure by @feifeibear in #818
  • [gemini] APIs to set cpu memory capacity by @feifeibear in #809
  • [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in #808
  • [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in #813
  • [gemini] add GeminiMemoryManger by @1SAA in #832
  • [zero] use GeminiMemoryManager when sampling model data by @ver217 in #850
  • [gemini] polish code by @1SAA in #855
  • [gemini] add stateful tensor container by @1SAA in #867
  • [gemini] polish stateful_tensor_mgr by @1SAA in #876
  • [gemini] accelerate adjust_layout() by @ver217 in #878

CLI

Pipeline Parallelism

Misc

  • [hotfix] fix auto tensor placement policy by @ver217 in #775
  • [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in #772
  • [hotfix] fix bugs in zero by @1SAA in #781
  • [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in #784
  • [refactor] moving memtracer to gemini by @feifeibear in #801
  • [log] display tflops if available by @feifeibear in #802
  • [refactor] moving grad acc logic to engine by @feifeibear in #804
  • [log] local throughput metrics by @feifeibear in #811
  • [Bot] Synchronize Submodule References by @github-actions in #810
  • [Bot] Synchronize Submodule References by @github-actions in #819
  • [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in #824
  • [setup] allow installation with python 3.6 by @FrankLeeeee in #834
  • Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in #835
  • [dependency] removed torchvision by @FrankLeeeee in #833
  • [Bot] Synchronize Submodule References by @github-actions in #827
  • [unittest] refactored unit tests for change in dependency by @FrankLeeeee in #838
  • [setup] use env var instead of option for cuda ext by @FrankLeeeee in #839
  • [hotfix] ColoTensor pin_memory by @feifeibear in #840
  • modefied the pp build for ckpt adaptation by @Gy-Lu in #803
  • [hotfix] the bug of numel() in ColoTensor by @feifeibear in #845
  • [hotfix] fix _post_init_method of zero init ctx by @ver217 in #847
  • [hotfix] add deconstructor for stateful tensor by @ver217 in #848
  • [utils] refactor profiler by @ver217 in #837
  • [ci] cache cuda extension by @FrankLeeeee in #860
  • hotfix tensor unittest bugs by @feifeibear in #862
  • [usability] added assertion message in registry by @FrankLeeeee in #864
  • [doc] improved docstring in the communication module by @FrankLeeeee in #863
  • [doc] improved docstring in the logging module by @FrankLeeeee in #861
  • [doc] improved docstring in the amp module by @FrankLeeeee in #857
  • [usability] improved error messages in the context modu...

v0.1.3 Released!

16 Apr 09:13
38102cf

Overview

Here are the main improvements of this release:

  1. Gemini: a heterogeneous memory space manager (see the sketch after this list)
  2. Refactored the pipeline parallelism API
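
To make the idea of a heterogeneous memory space manager concrete, the toy sketch below shows a hypothetical "auto" placement rule that keeps a tensor on the GPU only while enough CUDA memory is free. It is purely illustrative and is not Gemini's implementation; Gemini's actual placement policies (see the tensor placement policy PRs listed under Features) are driven by the memory statistics it collects at runtime.

```python
# Toy illustration only (not the Colossal-AI API): decide where a tensor should
# live based on how much CUDA memory is currently free.
import torch

def auto_place(tensor: torch.Tensor, margin_bytes: int = 1 << 30) -> torch.Tensor:
    """Keep the tensor on GPU if enough memory remains free, else offload to CPU."""
    if not torch.cuda.is_available():
        return tensor.cpu()
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    needed = tensor.element_size() * tensor.numel()
    device = 'cuda' if free_bytes - needed > margin_bytes else 'cpu'
    return tensor.to(device)
```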

What's Changed

Features

  • [zero] initialize a stateful tensor manager by @feifeibear in #614
  • [pipeline] refactor pipeline by @YuliangLiu0306 in #679
  • [zero] stateful tensor manager by @ver217 in #687
  • [zero] adapt zero hooks for unsharded module by @1SAA in #699
  • [zero] refactor memstats collector by @ver217 in #706
  • [zero] improve adaptability for not-shard parameters by @1SAA in #708
  • [zero] check whether gradients have inf and nan in gpu by @1SAA in #712
  • [refactor] refactor the memory utils by @feifeibear in #715
  • [util] support detection of number of processes on current node by @FrankLeeeee in #723
  • [utils] add synchronized cuda memory monitor by @1SAA in #740
  • [zero] refactor ShardedParamV2 by @1SAA in #742
  • [zero] add tensor placement policies by @ver217 in #743
  • [zero] use factory pattern for tensor_placement_policy by @feifeibear in #752
  • [zero] refactor memstats_collector by @1SAA in #746
  • [gemini] init genimi individual directory by @feifeibear in #754
  • refactor shard and gather operation by @1SAA in #773

Bug Fix

  • [zero] fix init bugs in zero context by @1SAA in #686
  • [hotfix] update requirements-test by @ver217 in #701
  • [hotfix] fix a bug in 3d vocab parallel embedding by @kurisusnowdeng in #707
  • [compatibility] fixed tensor parallel compatibility with torch 1.9 by @FrankLeeeee in #700
  • [hotfix]fixed bugs of assigning grad states to non leaf nodes by @Gy-Lu in #711
  • [hotfix] fix stateful tensor manager's cuda model data size by @ver217 in #710
  • [bug] fixed broken test_found_inf by @FrankLeeeee in #725
  • [util] fixed activation checkpointing on torch 1.9 by @FrankLeeeee in #719
  • [util] fixed communication API with PyTorch 1.9 by @FrankLeeeee in #721
  • [bug] removed zero installation requirements by @FrankLeeeee in #731
  • [hotfix] remove duplicated param register to stateful tensor manager by @feifeibear in #728
  • [utils] correct cpu memory used and capacity in the context of multi-process by @feifeibear in #726
  • [bug] fixed grad scaler compatibility with torch 1.8 by @FrankLeeeee in #735
  • [bug] fixed DDP compatibility with torch 1.8 by @FrankLeeeee in #739
  • [hotfix] fix memory leak in backward of sharded model by @ver217 in #741
  • [hotfix] fix initialize about zero by @ver217 in #748
  • [hotfix] fix prepare grads in sharded optim by @ver217 in #749
  • [hotfix] layernorm by @kurisusnowdeng in #750
  • [hotfix] fix auto tensor placement policy by @ver217 in #753
  • [hotfix] fix reuse_fp16_shard of sharded model by @ver217 in #756
  • [hotfix] fix test_stateful_tensor_mgr by @ver217 in #762
  • [compatibility] used backward-compatible API for global process group by @FrankLeeeee in #758
  • [hotfix] fix the ckpt hook bugs when using DDP by @Gy-Lu in #769
  • [hotfix] polish sharded optim docstr and warning by @ver217 in #770

Unit Testing

Documentation

Miscellaneous

  • [Bot] Synchronize Submodule References by @github-actions in #556
  • [Bot] Synchronize Submodule References by @github-actions in #695
  • [refactor] zero directory by @feifeibear in #724
  • [Bot] Synchronize Submodule References by @github-actions in #751

Full Changelog: v0.1.2...v0.1.3

v0.1.2 Released!

06 Apr 05:45
03e1d35

Overview

Here are the main improvements of this release:

  1. MoE and BERT models can now be trained with ZeRO.
  2. Provided a uniform checkpoint format for all kinds of parallelism.
  3. Optimized ZeRO-offload and improved model scaling.
  4. Designed a uniform model memory tracer.
  5. Implemented an efficient hybrid Adam optimizer with CPU and CUDA kernels (see the sketch after this list).
  6. Improved activation offloading.
  7. Released a beta version of the profiler TensorBoard plugin.
  8. Refactored the pipeline module for closer integration with the engine.
  9. Added Chinese tutorials as well as WeChat and Slack user groups.
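
For reference, here is a minimal, hedged sketch of how the hybrid Adam optimizer is typically constructed. The `colossalai.nn.optimizer.HybridAdam` import path matches later Colossal-AI releases and may differ slightly in v0.1.2; using it also requires the compiled CPU/CUDA extensions.

```python
# Minimal sketch (assumed import path): HybridAdam applies a CPU Adam kernel to
# parameters resident in CPU memory and a fused CUDA kernel to parameters on GPU.
import torch
from colossalai.nn.optimizer import HybridAdam

model = torch.nn.Linear(1024, 1024)
optimizer = HybridAdam(model.parameters(), lr=1e-3, weight_decay=0.0)

loss = model(torch.randn(8, 1024)).sum()
loss.backward()
optimizer.step()
```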

What's Changed

Features

Bug Fix

Unit Testing

Documentation

Model Zoo

  • [model zoo] add activation offload for gpt model by @Gy-Lu in #582

Miscellaneous

  • [logging] polish logger format by @feifeibear in #543
  • [profiler] add MemProfiler by @raejaf in #356
  • [Bot] Synchronize Submodule References by @github-actions in #501
  • [tool] create .clang-format for pre-commit by @BoxiangW in #578
  • [GitHub] Add prefix and label in issue template by @binmakeswell in #652

Full Changelog: v0.1.1...v0.1.2

v0.1.1 Released Today!

26 Mar 07:19
56ad945

What's Changed

Features

  • [MOE] changed parallelmode to dist process group by @1SAA in #460
  • [MOE] redirect moe_env from global_variables to core by @1SAA in #467
  • [zero] zero init ctx receives a dp process group by @ver217 in #471
  • [zero] ZeRO supports pipeline parallel by @ver217 in #477
  • add LinearGate for MOE in NaiveAMP context by @1SAA in #480
  • [zero] polish sharded param name by @feifeibear in #484
  • [zero] sharded optim support hybrid cpu adam by @ver217 in #486
  • [zero] polish sharded optimizer v2 by @ver217 in #490
  • [MOE] support PR-MOE by @1SAA in #488
  • [zero] sharded model manages ophooks individually by @ver217 in #492
  • [MOE] remove old MoE legacy by @1SAA in #493
  • [zero] sharded model support the reuse of fp16 shard by @ver217 in #495
  • [polish] polish singleton and global context by @feifeibear in #500
  • [memory] add model data tensor moving api by @feifeibear in #503
  • [memory] set cuda mem frac by @feifeibear in #506
  • [zero] use colo model data api in sharded optimv2 by @feifeibear in #511
  • [MOE] add MOEGPT model by @1SAA in #510
  • [zero] zero init ctx enable rm_torch_payload_on_the_fly by @ver217 in #512
  • [zero] show model data cuda memory usage after zero context init. by @feifeibear in #515
  • [log] polish disable_existing_loggers by @ver217 in #519
  • [zero] add model data tensor inline moving API by @feifeibear in #521
  • [cuda] modify the fused adam, support hybrid of fp16 and fp32 by @Gy-Lu in #497
  • [zero] refactor model data tracing by @feifeibear in #522
  • [zero] added hybrid adam, removed loss scale in adam by @Gy-Lu in #527

Bug Fix

Unit Testing

  • [MOE] add unitest for MOE experts layout, gradient handler and kernel by @1SAA in #469
  • [test] added rerun on exception for testing by @FrankLeeeee in #475
  • [zero] fix init device bug in zero init context unittest by @feifeibear in #516
  • [test] fixed rerun_on_exception and adapted test cases by @FrankLeeeee in #487

CI/CD

Documentation

Model Zoo

  • [model zoo] fix attn mask shape of gpt by @ver217 in #472
  • [model zoo] gpt embedding remove attn mask by @ver217 in #474

Miscellaneous

New Contributors

Full Changelog: v0.1.0...v0.1.1

v0.1.0 Released Today!

19 Mar 03:18
8f9617c

Overview

We are happy to release version v0.1.0 today. Compared to the previous version, it ships a brand-new ZeRO module and updates many aspects of the system for better performance and usability. The latest version can now be installed with `pip install colossalai`. We will update our examples and documentation accordingly over the next few days.
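
After installation, a typical entry point looks roughly like the sketch below. It is based on the v0.1.x `colossalai.launch_from_torch` / `colossalai.initialize` APIs and is meant as an orientation sketch, not a definitive example; the script expects to be started with a distributed launcher (e.g. `torch.distributed.launch` or torchrun) so that the rank environment variables are set.

```python
# Rough usage sketch (v0.1.x-style API; details may differ): launch the
# distributed environment, then wrap model/optimizer/criterion into an engine.
import colossalai
import torch

def main():
    colossalai.launch_from_torch(config={})  # config dict or path to a config file

    model = torch.nn.Linear(1024, 1024)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.MSELoss()

    engine, *_ = colossalai.initialize(model=model, optimizer=optimizer, criterion=criterion)
    # engine.forward / engine.backward / engine.step drive the training loop

if __name__ == '__main__':
    main()
```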

Highlights:

Note:
a. Only the major base commits are shown; successive commits that enhance or update a base commit are not listed.
b. Some commits have no associated pull request ID for unknown reasons.
c. The list is ordered by time.

Features

Unit Testing

Documentation

CI/CD

Bug Fix

  • fix gpt attention mask (#461) by @ver217
  • [bug] Fixed device placement bug in memory monitor thread (#433) by @FrankLeeeee
  • fixed fp16 optimizer none grad bug (#432) by @FrankLeeeee
  • fixed gpt attention mask in pipeline (#430) by @FrankLeeeee
  • [hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394) by @1SAA
  • fixed bug in activation checkpointing test (#387) by @FrankLeeeee
  • [profiler] Fixed bugs in CommProfiler and PcieProfiler (#377) by @1SAA
  • fixed CI dataset directory; fixed import error of 2.5d accuracy (#255) by @kurisusnowdeng
  • fixed padding index issue for vocab parallel embedding layers; updated 3D linear to be compatible with examples in the tutorial by @kurisusnowdeng

Miscellaneous

v0.0.2 Released Today!

15 Feb 03:31

Change Log

Added

  • Unified distributed layers
  • MoE support
  • DevOps tools such as GitHub Actions, code review automation, etc.
  • New official project website

Changed

  • Refactored the APIs for usability, flexibility, and modularity
  • Adapted PyTorch AMP for tensor parallelism
  • Refactored utilities for tensor parallelism and pipeline parallelism
  • Separated benchmarks and examples into independent repositories
  • Updated pipeline parallelism to support both non-interleaved and interleaved schedules
  • Refactored installation scripts for convenience

Fixed

  • ZeRO level 3 runtime error
  • Incorrect calculation in gradient clipping

v0.0.1 Colossal-AI Beta Release

28 Oct 16:49
3245a69

Features

  • Data Parallelism
  • Pipeline Parallelism (experimental)
  • 1D, 2D, 2.5D, 3D and sequence tensor parallelism
  • Easy-to-use trainer and engine
  • Extensibility for user-defined parallelism
  • Mixed Precision Training
  • Zero Redundancy Optimizer (ZeRO)