TransformerEngine Integration #1282

Merged: 30 commits, Dec 19, 2024

Conversation

aurelion-source
Contributor

Adds:

  • TELinear
  • TELayerNormMLP
  • TEColumnParallelLinear
  • TERowParallelLinear
  • TEMultiheadAttention
  • TEDelayedScaling (in progress)

@CLAassistant commented Sep 16, 2024

CLA assistant check
All committers have signed the CLA.

@Quentin-Anthony
Member

@aurelion-source

I tried to run this with an NGC container (specifically nvcr.io/nvidia/pytorch:23.10-py3)

Traceback (most recent call last):
  File "/workspace/gpt-neox-nawras/train.py", line 35, in <module>
    main()
  File "/workspace/gpt-neox-nawras/train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 251, in pretrain
    model, optimizer, lr_scheduler, reference_model = setup_model_and_optimizer(
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 1153, in setup_model_and_optimizer
    model = get_model(neox_args=neox_args, use_cache=use_cache)
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 882, in get_model
    model = GPT2ModelPipe(
  File "/workspace/gpt-neox-nawras/megatron/model/gpt2_model.py", line 131, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 212, in __init__
    self._build()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 268, in _build
    module = layer.build()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 74, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/workspace/gpt-neox-nawras/megatron/model/transformer.py", line 1030, in __init__
    from megatron.model.transformer_engine import TEMultiheadAttention
  File "/workspace/gpt-neox-nawras/megatron/model/transformer_engine.py", line 97, in <module>
    class TELinear(te.pytorch.Linear):
AttributeError: module 'transformer_engine' has no attribute 'pytorch'

Updating the import to import transformer_engine.pytorch as te changes the error to:

Traceback (most recent call last):
  File "/workspace/gpt-neox-nawras/train.py", line 35, in <module>
    main()
  File "/workspace/gpt-neox-nawras/train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 251, in pretrain
    model, optimizer, lr_scheduler, reference_model = setup_model_and_optimizer(
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 1153, in setup_model_and_optimizer
    model = get_model(neox_args=neox_args, use_cache=use_cache)
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 882, in get_model
    model = GPT2ModelPipe(                          
  File "/workspace/gpt-neox-nawras/megatron/model/gpt2_model.py", line 131, in __init__
    super().__init__(                           
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 212, in __init__
    self._build()                               
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 268, in _build
    module = layer.build()                      
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 74, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/workspace/gpt-neox-nawras/megatron/model/transformer.py", line 1030, in __init__
    from megatron.model.transformer_engine import TEMultiheadAttention
  File "/workspace/gpt-neox-nawras/megatron/model/transformer_engine.py", line 45, in <module>
    import transformer_engine.pytorch as te
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/__init__.py", line 63, in <module>
    _load_library()
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/__init__.py", line 55, in _load_library
    so_path = next(so_dir.glob(f"{module_name}.*.{extension}"))
StopIteration

Basically the same as NVIDIA/TransformerEngine#1014. It's a version issue with TE 1.9.0 and 1.10.0. It would be preferable to allow these versions to function so that NGC containers and pip installs can be used.
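One way to let those installs degrade gracefully is a guarded import, so an unusable TE build fails with an actionable message instead of an AttributeError or StopIteration deep inside model construction. A minimal sketch (not the PR's actual code; the `HAVE_TE` flag and `require_te` helper are hypothetical names):

```python
# Sketch only: guard the Transformer Engine import so that a missing or
# framework-less TE build produces a clear error at the point of use.
try:
    import transformer_engine.pytorch as te  # needs TE built with PyTorch support
    HAVE_TE = True
except (ImportError, StopIteration):
    # StopIteration: TE installed without its framework extension .so (TE #1014)
    te = None
    HAVE_TE = False


def require_te():
    """Raise an actionable error when TE layers are requested but unavailable."""
    if not HAVE_TE:
        raise RuntimeError(
            "transformer_engine.pytorch could not be imported; reinstall with "
            "NVTE_FRAMEWORK=pytorch set, or pip install 'transformer-engine[pytorch]'."
        )
```

The TE-backed classes (TELinear etc.) would then only be defined, or only instantiated, when the flag is set.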

Even installing the latest stable source build via pip install -r requirements/requirements-transformerengine.txt fails with:

Traceback (most recent call last):
  File "/workspace/gpt-neox-nawras/train.py", line 35, in <module>
    main()
  File "/workspace/gpt-neox-nawras/train.py", line 31, in main
    pretrain(neox_args=neox_args)
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 251, in pretrain
    model, optimizer, lr_scheduler, reference_model = setup_model_and_optimizer(
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 1153, in setup_model_and_optimizer
    model = get_model(neox_args=neox_args, use_cache=use_cache)
  File "/workspace/gpt-neox-nawras/megatron/training.py", line 882, in get_model
    model = GPT2ModelPipe(                      
  File "/workspace/gpt-neox-nawras/megatron/model/gpt2_model.py", line 131, in __init__
    super().__init__(                                                                                              
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 212, in __init__
    self._build()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 268, in _build
    module = layer.build()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/pipe/module.py", line 74, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/workspace/gpt-neox-nawras/megatron/model/transformer.py", line 1030, in __init__
    from megatron.model.transformer_engine import TEMultiheadAttention
  File "/workspace/gpt-neox-nawras/megatron/model/transformer_engine.py", line 97, in <module>
    class TELinear(te.pytorch.Linear):
AttributeError: module 'transformer_engine.pytorch' has no attribute 'pytorch'. Did you mean: 'torch'?

@aurelion-source
Contributor Author

> (quoting @Quentin-Anthony's full comment above)

@Quentin-Anthony
Based on NVIDIA/TransformerEngine#1014 (comment), the issue is due to TE not detecting PyTorch during the build process. It's not a version issue.

The suggested solution, NVTE_FRAMEWORK=pytorch pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable, works fine. However, it requires setting an environment variable before the pip command.

Alternatively, pip install transformer-engine[pytorch] forces TE to build with PyTorch support as well.
I've updated and tested ./requirements/requirements-transformer-engine.txt accordingly.
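Given that description, the requirements change presumably amounts to requesting TE's PyTorch extra; a sketch of what the line might look like (the exact spec merged in the PR may differ):

```
# requirements/requirements-transformerengine.txt (sketch; merged pin may differ)
transformer-engine[pytorch]
```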

@Quentin-Anthony Quentin-Anthony merged commit 8900d05 into EleutherAI:main Dec 19, 2024
1 of 2 checks passed