Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate LLaVA for multimodal pre-training #781

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from
Draft

Integrate LLaVA for multimodal pre-training #781

wants to merge 10 commits into from

Conversation

winglian
Copy link
Collaborator

@winglian winglian commented Oct 24, 2023

you'll need to download the images.zip from https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/tree/main into a llava folder to use this

this PR simply mostly reimplements this file https://github.com/haotian-liu/LLaVA/blob/66044b727e30f589c6dbf7b58fce021b73566b36/llava/train/train.py

@winglian winglian added enhancement New feature or request wip labels Oct 24, 2023
@winglian winglian marked this pull request as draft October 24, 2023 03:26
@winglian
Copy link
Collaborator Author

Anyone have any ideas around this stack trace?

  File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop                                                                                                                                                                              
    tr_loss_step = self.training_step(model, inputs)                                                                                                                                                                                                                                            
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                            
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2776, in training_step                                                                                                                                                                                     
    loss = self.compute_loss(model, inputs)                                                                                                                                                                                                                                                     
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                     
  File "/workspace/axolotl/src/axolotl/core/trainer_builder.py", line 252, in compute_loss                                                                                                                                                                                                      
    return super().compute_loss(model, inputs, return_outputs=return_outputs)                                                                                                                                                                                                                   
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                   
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2801, in compute_loss                                                                                                                                                                                      
    outputs = model(**inputs)                                                                                                                                                                                                                                                                   
              ^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                                   
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                                                                                                                                     
    return forward_call(*args, **kwargs)                                                                                                                                                                                                                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                        
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward                                                                                                                                                                                  
    output = self._run_ddp_forward(*inputs, **kwargs)                                                                                                                                                                                                                                           
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                           
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward                                                                                                                                                                         
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]                                                                                                                                                                                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                               
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                                                                                                                                                                     
    return forward_call(*args, **kwargs)                                                                                                                                                                                                                                                        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                        
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 636, in forward                                                                                                                                                                                     
    return model_forward(*args, **kwargs)                                                                                                                                                                                                                                                       
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                       
  File "/root/miniconda3/lib/python3.11/site-packages/accelerate/utils/operations.py", line 624, in __call__                                                                                                                                                                                    
    return convert_to_fp32(self.model_forward(*args, **kwargs))                                                                                                                                                                                                                                 
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                  
  File "/root/miniconda3/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)                                   
           ^^^^^^^^^^^^^^^^^^^^^                                                                                                                
  File "/workspace/axolotl/src/axolotl/models/llava/llava_mistral.py", line 99, in forward
    outputs = self.model(                                          
              ^^^^^^^^^^^                              
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)               
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py", line 863, in forward
    inputs_embeds = self.embed_tokens(input_ids)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
           ^^^^^^^^^^^^          
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
                                                                        
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
                                                                        
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc4206ff4d7 in /root/miniconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc4206c936b in /root/miniconda3/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc3f633fb58 in /root/miniconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fc38a3eeee0 in /root/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fc38a3f24b8 in /root/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x7fc38a3f3a07 in /root/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fc3f5ab0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94b43 (0x7fc420f32b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7fc420fc3bb4 in /lib/x86_64-linux-gnu/libc.so.6)

@winglian
Copy link
Collaborator Author

:1146: block: [17                                                                                                                                                                                                                                                                               
,0,0: indexSelectLargeIndex], thread: [53: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [3` failed.                                                                                                                                                                    
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [54: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [4` failed.                                                                                                           
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [55: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [5` failed.                                                                                                           
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [56: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [6` failed.                                                                                                           
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [57: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [7` failed.                                                                                                           
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [58: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [8` failed.                                                                                                           
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [59: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                               
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [602,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                                                                                                        
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex` failed.                                                                                                                                                                                                                   
: block: [602,0,0], thread: [11../aten/src/ATen/native/cuda/Indexing.cu,0,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                        
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [60: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [12` failed.                                                                                                          
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [61: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [13` failed.                                                                                                          
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [62: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [14` failed.                                                                                                          
,0../aten/src/ATen/native/cuda/Indexing.cu,0:1146] Assertion `srcIndex < srcSelectDimSize: indexSelectLargeIndex` failed.                                                                                                                                                                       
: block: [17../aten/src/ATen/native/cuda/Indexing.cu,0:1146,0: indexSelectLargeIndex], thread: [63: block: [602,0,0,0,0] Assertion `srcIndex < srcSelectDimSize], thread: [15` failed.                                                                                                          
,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                                                                                                                                                                                                           
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [602,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                                                                                                        
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [602,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                                                                                                        
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [602,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                                                                                                        
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [602,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                                                                                                        
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [602,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                                                                                                                                        
../aten/src/ATen/native/cuda/Indexing.cuterminate called after throwing an instance of 'c10::Error'                                                                                                                                                                                             
:1146: indexSelectLargeIndex: block: [602,0,0  what():  CUDA error: device-side assert triggered                                                                                                                                                                                                
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                   

@ehartford
Copy link
Collaborator

Gpt4 says

It looks like you're encountering a CUDA error related to indexing in PyTorch. This error is often caused by an invalid index being used to access tensor elements.

Here's a breakdown of the issue:

  1. The error originates from the Indexing.cu file, which is part of the ATen library in PyTorch. This library contains CUDA implementations for tensor operations.
  2. The assertion that's failing is srcIndex < srcSelectDimSize. This suggests that an index (srcIndex) being used to access a tensor is larger than the size of the dimension it's trying to access (srcSelectDimSize).
  3. This error is being triggered multiple times, which could be due to a loop or batch processing.

To troubleshoot and resolve this issue:

  1. Check the Indexing: Ensure that all tensor indexing operations in your code are within valid bounds. For example, if you're trying to access the 10th element of a tensor that only has 9 elements, you'll encounter this error.

  2. Review the Dimensions: Verify the dimensions of tensors you're working with, especially if they're being passed through functions or reshaped. Using the .size() method on a tensor can help you see its dimensions.

  3. Use Device-Side Assertions: The error message suggests compiling with TORCH_USE_CUDA_DSA to enable device-side assertions. This can give more detailed error messages that can help pinpoint the exact location and cause of the problem.

  4. Update PyTorch: Sometimes, issues can be resolved by simply updating to the latest version of PyTorch.

  5. Minimal Reproduction: If you're still stuck, try to create a minimal code example that reproduces the error. This can help you isolate the issue and might make it easier for others to assist you.

Remember, this type of error is almost always related to incorrect indexing. Start by reviewing any indexing operations, slicing, or other tensor manipulations in your code.

@ehartford
Copy link
Collaborator

Maybe you could try using nightly cuda and pytorch?

@winglian
Copy link
Collaborator Author

winglian commented Oct 28, 2023

adding some notes here from troubleshooting:

@winglian
Copy link
Collaborator Author

here's the changes to llava that need to be made upstream:

diff --git a/llava/train/train.py b/llava/train/train.py
index cbfcc1b..f418a42 100644
--- a/llava/train/train.py
+++ b/llava/train/train.py
@@ -654,7 +654,7 @@ class LazySupervisedDataset(Dataset):
         length_list = []
         for sample in self.list_data_dict:
             cur_len = sum(len(conv['value'].split()) for conv in sample['conversations'])
-            cur_len = cur_len if 'image' in sample else -cur_len
+            cur_len = cur_len if 'images' in sample else -cur_len
             length_list.append(cur_len)
         return length_list
 
@@ -700,11 +700,11 @@ class LazySupervisedDataset(Dataset):
 
         # image exist in the data
         if 'image' in self.list_data_dict[i]:
-            data_dict['image'] = image
+            data_dict['images'] = image
         elif self.data_args.is_multimodal:
             # image does not exist in the data, but the model is multimodal
             crop_size = self.data_args.image_processor.crop_size
-            data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
+            data_dict['images'] = torch.zeros(3, crop_size['height'], crop_size['width'])
         return data_dict
 
 
@@ -732,8 +732,8 @@ class DataCollatorForSupervisedDataset(object):
             attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
         )
 
-        if 'image' in instances[0]:
-            images = [instance['image'] for instance in instances]
+        if 'images' in instances[0]:
+            images = [instance['images'] for instance in instances]
             if all(x is not None and x.shape == images[0].shape for x in images):
                 batch['images'] = torch.stack(images)
             else:

@winglian
Copy link
Collaborator Author

Upstream PR here haotian-liu/LLaVA#694

@winglian
Copy link
Collaborator Author

git clone https://github.com/OpenAccess-AI-Collective/LLaVA.git
cd LLaVA
git checkout images-name-fix
pip install --no-deps -e .

@winglian
Copy link
Collaborator Author

there are definitely optimizations as the LazySupervisedDataset processes all the images on the fly, thus bouncing between the image model and the text model. We could probably preprocess the entire dataset similar to our existing workflows, and also eventually enable sample packing for this https://github.com/haotian-liu/LLaVA/blob/66044b727e30f589c6dbf7b58fce021b73566b36/llava/train/train.py#L660-L707

@ritabratamaiti
Copy link

Hey, was this the branch used for training openaccess-ai-collective/mistral-7b-llava-1_5-pretrained-projector. If so, when will it be merged into main? Is it recommended to use this branch in the meantime if we want to train multimodal models with axolotl?

@ManuelFay
Copy link

Any updates on this PR ? @winglian

@ZQ-Dev8
Copy link

ZQ-Dev8 commented Apr 4, 2024

+1 for llava finetuning with axolotl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wip
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants