[AutoTuner] Add first version of autotuner #124

Caozhou1995 · 2024-05-27T09:42:05Z

This PR adds the autotuner module, which can be launched with a single command by setting action=auto_tune, for example:
python run.py --config-path ./examples/aquila/conf --config-name config action=auto_tune
AutoTuner currently supports searching over all major parallel strategies, including:

data parallel
tensor parallel
pipeline parallel
context parallel
expert parallel
recompute
etc.

AutoTuner is easy to configure: users add an auto_tuner section to their existing training YAML to customize the search, for example:

auto_tuner:
  space:
    num_layers_per_virtual_pipeline_stage: [1]
    use_recompute: [false]
  control:
    max_time_per_task: 300
    train_iters: 5
    max_time: 600

The current implementation is a heuristic grid search with built-in pruning strategies that use historical results to skip unpromising candidates. More search algorithms will be added in the future, so users do not need to configure the search algorithm for now.
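To make the idea of a grid search with history-based pruning concrete, here is a minimal sketch. It is not the code from this PR: the function names `grid_search` and `prune_by_oom`, the `oom` result field, and the pruning rule itself are all assumptions chosen for illustration.

```python
from itertools import product


def grid_search(space, run_task, prune_rules):
    """Try every combination in `space`, skipping those a rule prunes.

    `space` maps a dimension name to its candidate values.
    `run_task` is a hypothetical callback that runs one strategy
    and returns a result dict.
    """
    history = []  # (strategy, result) pairs for tasks that actually ran
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        strategy = dict(zip(keys, values))
        # A rule inspects earlier results to decide whether to skip.
        if any(rule(strategy, history) for rule in prune_rules):
            continue
        history.append((strategy, run_task(strategy)))
    return history


def prune_by_oom(strategy, history):
    """Hypothetical rule: if an equal or smaller micro batch size already
    ran out of memory with otherwise identical settings, skip this one."""
    for past, result in history:
        if not result["oom"]:
            continue
        if past["micro_batch_size"] > strategy["micro_batch_size"]:
            continue
        if all(past[k] == strategy[k]
               for k in strategy if k != "micro_batch_size"):
            return True
    return False
```

With a rule like this, one failed task can remove a whole family of larger configurations from the grid, which is where the savings over a plain exhaustive search would come from.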

space is the search space. Users can customize the candidate values for each dimension; if a dimension is not defined, the framework supplies default values. The following search dimensions are built in:

data_parallel_size
use_distributed_optimizer
tensor_model_parallel_size
sequence_parallel
pipeline_model_parallel_size
num_layers_per_virtual_pipeline_stage
use_recompute
recompute_method
recompute_granularity
recompute_num_layers
micro_batch_size
context_parallel_size
expert_model_parallel_size
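As a rough illustration of how candidates for the parallel-size dimensions could be enumerated, the sketch below fills in defaults for dimensions the user did not specify and derives data_parallel_size from the others. It assumes the Megatron-style relation data_parallel_size × tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size = number of devices; the helper names and the default choices (all divisors of the device count, context parallel fixed to 1) are hypothetical and may differ from what the framework actually does.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def parallel_candidates(world_size, user_space=None):
    """Enumerate valid (dp, tp, pp, cp) combinations for `world_size` devices.

    Dimensions missing from `user_space` fall back to hypothetical defaults.
    """
    user_space = user_space or {}
    tp_values = user_space.get("tensor_model_parallel_size", divisors(world_size))
    pp_values = user_space.get("pipeline_model_parallel_size", divisors(world_size))
    cp_values = user_space.get("context_parallel_size", [1])

    candidates = []
    for tp, pp, cp in product(tp_values, pp_values, cp_values):
        model_parallel = tp * pp * cp
        # Skip combinations that cannot evenly tile the devices.
        if world_size % model_parallel != 0:
            continue
        candidates.append({
            "data_parallel_size": world_size // model_parallel,
            "tensor_model_parallel_size": tp,
            "pipeline_model_parallel_size": pp,
            "context_parallel_size": cp,
        })
    return candidates
```

For example, `parallel_candidates(8, {"tensor_model_parallel_size": [2]})` only explores strategies with tensor parallel size 2, while the pipeline dimension still uses its default candidates.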

control governs the search process: for example, the maximum running time of each task (max_time_per_task), the number of training steps each task runs (train_iters), and the maximum total running time of the autotuner (max_time).
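The interplay between a per-task limit and an overall limit can be sketched as follows. This is an illustrative loop, not the runner from this PR: `run_with_budget`, the `run_task(task, timeout)` callback, and the injectable `clock` are assumptions, and the sketch takes no position on the time unit of the limits.

```python
import time


def run_with_budget(tasks, run_task, max_time, max_time_per_task,
                    clock=time.monotonic):
    """Run tasks in order until the overall time budget is used up.

    `run_task(task, timeout)` is a hypothetical callback that runs one
    task and is expected to stop once `timeout` has elapsed.
    """
    start = clock()
    results = []
    for task in tasks:
        remaining = max_time - (clock() - start)
        if remaining <= 0:
            break  # overall budget exhausted; later tasks are not started
        # A task never gets more than what is left of the overall budget.
        timeout = min(max_time_per_task, remaining)
        results.append(run_task(task, timeout))
    return results
```

Passing the clock as a parameter keeps the loop easy to test with a fake clock instead of real waiting.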

While the autotuner is running, each task gets its own log directory. The results are summarized and sorted in a CSV file, so users only need to read that file to see the detailed data for each task.
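Reading such a summary programmatically could look like the sketch below. The column names (`idx`, `elapsed_time_per_iteration_ms`) and the sample rows are invented for illustration; the actual CSV layout produced by the autotuner is not described in this PR text.

```python
import csv
import io


def load_sorted_results(csv_file, metric="elapsed_time_per_iteration_ms"):
    """Return rows that have a value for `metric`, fastest first.

    `metric` is a hypothetical column name; rows with an empty value
    (for example, tasks that failed) are dropped.
    """
    rows = list(csv.DictReader(csv_file))
    finished = [r for r in rows if r[metric]]
    return sorted(finished, key=lambda r: float(r[metric]))


# Made-up sample data standing in for the autotuner's summary file.
sample = io.StringIO(
    "idx,tensor_model_parallel_size,micro_batch_size,elapsed_time_per_iteration_ms\n"
    "1,1,1,850.0\n"
    "2,2,1,910.5\n"
    "3,1,2,\n"
)
ranked = load_sorted_results(sample)
best = ranked[0]
```

Here `best` is the row with the lowest time per iteration among the tasks that produced a measurement.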

aoyulong

LGTM

Caozhou1995 force-pushed the base_autotuner branch 6 times, most recently from 69abfe8 to 1e7a558 Compare May 27, 2024 11:03

caozhou added 4 commits May 28, 2024 17:56

add query mode for runner

4d66348

add the first version of autotuner

d3701d9

change autotuner task directory

68ba326

fix priority sort bug

98ef573

Caozhou1995 force-pushed the base_autotuner branch 2 times, most recently from 373175b to 6a2c26f Compare May 28, 2024 10:03

fix some bugs

da75397

Caozhou1995 force-pushed the base_autotuner branch from 6a2c26f to da75397 Compare May 28, 2024 10:51

caozhou added 3 commits May 30, 2024 18:28

fix prune bugs and add sp prune and add checkout mode

8441381

add first task time to 600 for data process

cf57d7d

add get best function

dbcc79f

Caozhou1995 force-pushed the base_autotuner branch from fe0fb2d to f98388a Compare May 30, 2024 11:33

update performance mode

1255ade

Caozhou1995 force-pushed the base_autotuner branch 2 times, most recently from b318cad to 47d89ec Compare May 30, 2024 12:25

update recompute num

ad10e26

Caozhou1995 force-pushed the base_autotuner branch from 47d89ec to ad10e26 Compare May 30, 2024 12:26

caozhou added 5 commits May 30, 2024 20:52

update auto tuner envs

1df1279

process platform envs and record start time of each task

1d55fe7

run best task

44331f7

update tuner config

fa85bd7

update autotuner env

245e6c1

Caozhou1995 force-pushed the base_autotuner branch from d27d136 to 245e6c1 Compare May 31, 2024 04:20

Caozhou1995 mentioned this pull request Jun 4, 2024

[Runner] Add query mode #113

Closed

Caozhou1995 force-pushed the base_autotuner branch from 95b63f0 to c392027 Compare June 4, 2024 07:16

update with platform

32b3e67

Caozhou1995 force-pushed the base_autotuner branch from c392027 to 32b3e67 Compare June 4, 2024 07:18

set context parallel default value

d9e80ad

Caozhou1995 force-pushed the base_autotuner branch 4 times, most recently from 8484221 to 4e7b289 Compare June 5, 2024 06:09

add mpirun mode

0fb14a3

Caozhou1995 force-pushed the base_autotuner branch from 4e7b289 to 0fb14a3 Compare June 5, 2024 06:12

fix master bug

53e9dcc

Caozhou1995 force-pushed the base_autotuner branch from 2eb7074 to 53e9dcc Compare June 5, 2024 08:56

phoenixdong previously approved these changes Jun 5, 2024

Caozhou1995 dismissed phoenixdong’s stale review via 7199af2 June 6, 2024 01:32

Caozhou1995 force-pushed the base_autotuner branch 2 times, most recently from 5778d21 to 08efd6e Compare June 6, 2024 02:48

add autotuner example and args

d9af983

Caozhou1995 force-pushed the base_autotuner branch from 08efd6e to d9af983 Compare June 6, 2024 03:31

update runner to monitor job

e5e9c7d

Caozhou1995 force-pushed the base_autotuner branch from c1f9e1e to e5e9c7d Compare June 6, 2024 08:48

caozhou and others added 2 commits June 6, 2024 17:14

add monitor interval

bbbf310

Merge branch 'main' into base_autotuner

179bd44

aoyulong approved these changes Jun 6, 2024

aoyulong merged commit ac373cb into FlagOpen:main Jun 6, 2024
2 checks passed
