[AutoTuner] Add first version of autotuner #124

Merged
merged 23 commits into from
Jun 6, 2024

@Caozhou1995 Caozhou1995 commented May 27, 2024

This PR adds the AutoTuner module. It is enabled by setting action=auto_tune on the usual launch command, for example:
python run.py --config-path ./examples/aquila/conf --config-name config action=auto_tune
AutoTuner currently supports searching over all major parallel strategies, including:

  • data parallel
  • tensor parallel
  • pipeline parallel
  • context parallel
  • expert parallel
  • recompute
  • etc.

AutoTuner is configured by adding an auto_tuner section to the existing training YAML, for example:

auto_tuner:
  space:
    num_layers_per_virtual_pipeline_stage: [1]
    use_recompute: [false]
  control:
    max_time_per_task: 300
    train_iters: 5
    max_time: 600

The current implementation is a heuristic grid search with built-in pruning strategies that use the results of earlier tasks to skip unpromising candidates. More search algorithms will be added in the future, so users do not need to choose or configure the search algorithm for now.
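To make the idea of grid search with history-based pruning concrete, here is a minimal sketch. It is not the code in this PR: `grid_search`, `prune_by_history`, `fake_run`, the status strings and the pruning rule itself (skip a larger micro_batch_size once a smaller one has run out of memory with otherwise identical settings) are all assumptions chosen for illustration.

```python
from itertools import product


def prune_by_history(config, history):
    """Hypothetical rule: if a config that differs only by a smaller (or equal)
    micro_batch_size already ran out of memory, assume this one will too."""
    for past, status in history:
        if status != "oom":
            continue
        same_rest = all(
            past[k] == config[k] for k in config if k != "micro_batch_size"
        )
        if same_rest and past["micro_batch_size"] <= config["micro_batch_size"]:
            return True
    return False


def grid_search(space, run_task):
    """Visit every combination in `space`, skipping those that history rules out."""
    names = list(space)
    history = []  # (config, status) pairs for candidates already visited
    for values in product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        if prune_by_history(config, history):
            history.append((config, "pruned"))
            continue
        history.append((config, run_task(config)))
    return history


def fake_run(config):
    """Stand-in for launching a real training task (assumption for the demo)."""
    too_big = (
        config["tensor_model_parallel_size"] == 1
        and config["micro_batch_size"] >= 2
    )
    return "oom" if too_big else "ok"


space = {"tensor_model_parallel_size": [1, 2], "micro_batch_size": [1, 2, 4]}
results = grid_search(space, fake_run)
statuses = [status for _, status in results]
```

In this toy run the candidate with tensor_model_parallel_size=1 and micro_batch_size=4 is never launched, because the micro_batch_size=2 variant has already failed.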

In the auto_tuner section, space is the search space. Users can customize the candidate values of each dimension; any dimension left undefined falls back to a default supplied by the framework. The following search dimensions are built in:

  • data_parallel_size
  • use_distributed_optimizer
  • tensor_model_parallel_size
  • sequence_parallel
  • pipeline_model_parallel_size
  • num_layers_per_virtual_pipeline_stage
  • use_recompute
  • recompute_method
  • recompute_granularity
  • recompute_num_layers
  • micro_batch_size
  • context_parallel_size
  • expert_model_parallel_size
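The way user-defined dimensions could override framework defaults can be sketched as follows. The default values in `DEFAULT_SPACE`, the helper names `build_space` and `valid_configs`, and the validity check are assumptions for illustration only. The check assumes a Megatron-style layout in which the data, tensor, pipeline and context parallel sizes multiply to the number of devices.

```python
from itertools import product

# Hypothetical defaults; the framework's real default values are not shown in this PR.
DEFAULT_SPACE = {
    "data_parallel_size": [1, 2, 4, 8],
    "tensor_model_parallel_size": [1, 2, 4, 8],
    "pipeline_model_parallel_size": [1, 2, 4, 8],
    "context_parallel_size": [1],
    "micro_batch_size": [1, 2],
}


def build_space(user_space):
    """Dimensions given by the user replace the defaults; the rest are kept."""
    return {**DEFAULT_SPACE, **user_space}


def valid_configs(space, num_devices):
    """Yield combinations whose parallel sizes multiply to num_devices."""
    names = list(space)
    for values in product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        used = (
            config["data_parallel_size"]
            * config["tensor_model_parallel_size"]
            * config["pipeline_model_parallel_size"]
            * config["context_parallel_size"]
        )
        if used == num_devices:
            yield config


# The user pins micro_batch_size; every other dimension uses the defaults.
space = build_space({"micro_batch_size": [1]})
configs = list(valid_configs(space, num_devices=8))
```

Filtering invalid combinations before launching anything keeps the grid small, which matters because each remaining candidate costs a real training run.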

control governs the search process: the maximum running time of each task (max_time_per_task), how many training steps each task runs (train_iters), and the maximum total running time of the autotuner (max_time).
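A rough sketch of how such limits might be enforced is shown below. It is only an illustration: `run_with_budget` is a hypothetical helper, the limits are assumed to be in seconds (the PR does not state the unit), and train_iters is left out because it would be passed to the training task itself.

```python
import subprocess
import sys
import time


def run_with_budget(commands, max_time_per_task, max_time):
    """Run commands in order. A task is stopped after max_time_per_task seconds,
    and no new task starts once max_time seconds have elapsed overall."""
    start = time.monotonic()
    statuses = []
    for cmd in commands:
        remaining = max_time - (time.monotonic() - start)
        if remaining <= 0:
            break  # overall budget used up
        try:
            proc = subprocess.run(cmd, timeout=min(max_time_per_task, remaining))
            statuses.append("ok" if proc.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this exception
            statuses.append("timeout")
    return statuses


# Two stand-in "tasks": one finishes at once, one would run for 30 seconds.
quick = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]
statuses = run_with_budget([quick, slow], max_time_per_task=2, max_time=600)
```

Capping each task at the smaller of its own limit and the remaining overall budget means the final task cannot push the whole search past max_time.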

While the autotuner is running, each task writes to its own log directory. The results of all tasks are summarized and sorted into a CSV file, so users only need to read that file to see the detailed data for each task.
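Reading such a summary programmatically could look like the sketch below. The column names and the sample rows are invented for illustration; the real CSV layout produced by the autotuner may differ.

```python
import csv
import io

# Hypothetical CSV content; the real file and its column names may differ.
SAMPLE = """task_id,tensor_model_parallel_size,micro_batch_size,elapsed_time_per_iter_ms
1,1,1,950.0
2,2,1,720.5
3,2,2,
"""


def best_tasks(csv_text, metric):
    """Return rows that have a measurement, fastest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    finished = [r for r in rows if r[metric]]  # drop tasks with no measurement
    return sorted(finished, key=lambda r: float(r[metric]))


ranked = best_tasks(SAMPLE, "elapsed_time_per_iter_ms")
best_ids = [row["task_id"] for row in ranked]
```

Task 3 has no timing in this sample, as a failed or timed-out task might, so it is excluded from the ranking rather than sorted as zero.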

@Caozhou1995 Caozhou1995 force-pushed the base_autotuner branch 6 times, most recently from 69abfe8 to 1e7a558 Compare May 27, 2024 11:03
@Caozhou1995 Caozhou1995 force-pushed the base_autotuner branch 2 times, most recently from 373175b to 6a2c26f Compare May 28, 2024 10:03
@Caozhou1995 Caozhou1995 force-pushed the base_autotuner branch 2 times, most recently from b318cad to 47d89ec Compare May 30, 2024 12:25
@Caozhou1995 Caozhou1995 force-pushed the base_autotuner branch 4 times, most recently from 8484221 to 4e7b289 Compare June 5, 2024 06:09
phoenixdong previously approved these changes Jun 5, 2024
@aoyulong aoyulong left a comment

LGTM

@aoyulong aoyulong merged commit ac373cb into FlagOpen:main Jun 6, 2024
2 checks passed

3 participants