
[AutoTuner] Add memory model #147

Merged · 3 commits · Jun 25, 2024

Conversation

@Caozhou1995 (Collaborator) commented Jun 13, 2024

This PR adds a memory model, which can be used to speed up pruning by filtering out strategies that would OOM as well as strategies with low memory usage.
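To make the idea concrete, here is a hypothetical sketch of such a pruning pass. The field names, the assumed device capacity, and the 20% low-utilization threshold are illustrative assumptions, not FlagScale's actual API:

```python
# Hypothetical sketch of memory-model-based pruning; names and thresholds
# are illustrative, not the PR's actual code.

GPU_MEMORY_MB = 80 * 1024    # assumed device capacity (e.g. an 80 GB card)
LOW_UTIL_THRESHOLD = 0.2     # assumed cutoff for "low memory usage"

def prune_strategies(strategies):
    """Keep only strategies whose modeled peak memory fits on the device
    and is not so low that the parallelism is clearly over-sharded."""
    kept = []
    for s in strategies:
        modeled = s["memory_model"]            # modeled peak memory in MB
        if modeled >= GPU_MEMORY_MB:           # would OOM: prune
            continue
        if modeled < LOW_UTIL_THRESHOLD * GPU_MEMORY_MB:  # wasteful: prune
            continue
        kept.append(s)
    return kept

candidates = [
    {"name": "tp8_pp1", "memory_model": 9000},    # under-utilized
    {"name": "tp2_pp2", "memory_model": 52000},   # fits
    {"name": "tp1_pp1", "memory_model": 140000},  # would OOM
]
print([s["name"] for s in prune_strategies(candidates)])  # -> ['tp2_pp2']
```

Pruning on the modeled number avoids launching (and waiting on) trial runs that are guaranteed to fail or to waste the device.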

Contributor

Please try to reuse this impl

@Caozhou1995 (Collaborator, Author) Jun 25, 2024

The impl has been reused and the activation section has been refined.



def prune_by_memory_model_util(config, strategy, history=[]):
    if "modeling_memory" in strategy:
@aoyulong (Contributor) Jun 17, 2024

It's better to rename "modeling_memory" to "memory_model" to match the other places.

Collaborator (Author)

thx, done

Comment on lines 80 to 97
if os.environ.get("AIRS_ACCELERATOR_COUNT", None):
    # Set config
    self.config.experiment.auto_tuner.nproc_per_node = (
        int(os.environ["AIRS_ACCELERATOR_COUNT"]) * 2
        if "luvatar_BI" in os.environ["AIRS_ACCELERATOR_MODEL"]
        else int(os.environ["AIRS_ACCELERATOR_COUNT"])
    )
    # Set original config
    self.orig_config.experiment.runner.nproc_per_node = (
        int(os.environ["AIRS_ACCELERATOR_COUNT"]) * 2
        if "luvatar_BI" in os.environ["AIRS_ACCELERATOR_MODEL"]
        else int(os.environ["AIRS_ACCELERATOR_COUNT"])
    )
    # Set config
    self.config.experiment.runner.nproc_per_node = (
        int(os.environ["AIRS_ACCELERATOR_COUNT"]) * 2
        if "luvatar_BI" in os.environ["AIRS_ACCELERATOR_MODEL"]
        else int(os.environ["AIRS_ACCELERATOR_COUNT"])
    )
Contributor

Is there any way to move this platform-related code into a standalone place? We may support different cloud platforms.

Collaborator (Author)

The platform code has been moved to platform.py, and code for other platforms will also live in that file.
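As an illustration of that refactor, a standalone platform module could look like the following hedged sketch. The function names are assumptions; only the environment variables and the "double nproc for luvatar_BI" rule come from the snippet above:

```python
# Hypothetical platform.py sketch: one home for per-platform quirks, as the
# reviewer suggested. Env-var names come from the reviewed diff; the helper
# names are illustrative.
import os

def _default_nproc():
    return int(os.environ["AIRS_ACCELERATOR_COUNT"])

def _luvatar_bi_nproc():
    # Per the reviewed diff, this accelerator model runs 2 processes
    # per reported accelerator.
    return int(os.environ["AIRS_ACCELERATOR_COUNT"]) * 2

def get_nproc_per_node():
    """Resolve processes per node from the platform environment."""
    model = os.environ.get("AIRS_ACCELERATOR_MODEL", "")
    if "luvatar_BI" in model:
        return _luvatar_bi_nproc()
    return _default_nproc()
```

Callers in the tuner would then set `nproc_per_node = get_nproc_per_node()` once, instead of repeating the conditional in three places.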

@Caozhou1995 force-pushed the memory_model branch 5 times, most recently from d052639 to 69510a0 on June 24, 2024 at 12:25
@aoyulong previously approved these changes Jun 25, 2024
@@ -0,0 +1,351 @@
"""
Computes the theoretical memory footprint for model training, referring to Megatron.
Activation memory is optimized by adding a block-recompute formula.
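For context, the Megatron-style theoretical accounting for weights and optimizer state can be sketched as below. This is a simplified illustration, not the PR's code: it ignores embedding and pipeline asymmetries and activation memory, and assumes mixed-precision Adam (18 bytes per parameter, or 6 + 12/dp when optimizer state is sharded across data-parallel ranks):

```python
def weight_and_optimizer_memory_gb(num_params, tp, pp, dp,
                                   use_distributed_optimizer=False):
    """Rough Megatron-style accounting for mixed-precision Adam:
    18 bytes/param = 2 (fp16 weight) + 4 (fp32 grad)
                     + 12 (fp32 master weight + Adam m and v).
    With a distributed optimizer, the 12 optimizer bytes shard over dp.
    Simplification: parameters are assumed evenly sharded over tp * pp.
    """
    sharded_params = num_params / (tp * pp)
    bytes_per_param = 18 if not use_distributed_optimizer else 6 + 12 / dp
    return sharded_params * bytes_per_param / 1024**3

# Illustrative numbers: a 7B-parameter model with tp=2, pp=2, dp=2.
print(round(weight_and_optimizer_memory_gb(7e9, 2, 2, 2), 1))  # -> 29.3
```

A model like this lets the tuner reject a parallelism strategy before launch whenever the estimate already exceeds device memory.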
Contributor

Please add a reference to the original Megatron implementation.

@@ -161,3 +167,114 @@ def compare_by_recompute(strategy1, strategy2):
result = True

return result


def convert_config_to_megatron_args(config, strategy):
Contributor

Is there any simpler way to implement this?
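One simpler pattern, offered as a hypothetical sketch rather than the PR's actual code, is a table-driven mapping from config keys to Megatron argument names, so each new field costs one table entry instead of one if/assignment pair:

```python
# Hypothetical table-driven config -> Megatron-args adapter.
# The key names are illustrative; the PR's real mapping may differ.
from types import SimpleNamespace

CONFIG_TO_MEGATRON = {
    # source key in the strategy dict -> Megatron argument name
    "tensor_model_parallel_size": "tensor_model_parallel_size",
    "pipeline_model_parallel_size": "pipeline_model_parallel_size",
    "micro_batch_size": "micro_batch_size",
    "recompute_granularity": "recompute_granularity",
}

def convert_config_to_megatron_args(strategy, defaults=None):
    """Build an args namespace from a strategy dict via one lookup table."""
    args = dict(defaults or {})
    for src, dst in CONFIG_TO_MEGATRON.items():
        if src in strategy:
            args[dst] = strategy[src]
    return SimpleNamespace(**args)

args = convert_config_to_megatron_args({"micro_batch_size": 2})
print(args.micro_batch_size)  # -> 2
```

Fields whose names match on both sides collapse into the table; only fields needing real translation (unit changes, derived values) would still need custom code.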

@aoyulong (Contributor) left a comment

LGTM

@aoyulong aoyulong merged commit f47e6d5 into FlagOpen:main Jun 25, 2024
5 checks passed