Latest changes introduced for continuous batching break Mixtral model #84

dacorvo opened this issue Apr 15, 2024 · 5 comments

@dacorvo

dacorvo commented Apr 15, 2024

In the latest AWS Neuron SDK 2.18.1 release, the transformers-neuronx package has been updated to a new version 0.10.0.360, whose code is not available in this repository at the moment.

One of the changes is meant to 'fix' continuous batching, but it actually breaks the Mixtral model.

The symptom is that the first call to forward after encoding fails with:

    def forward(self, input_ids, cache_ids=None, start_ids=None):
        # Compute the window starting index for specific mask patterns
        # For other patterns we pass in a default value of 0, it won't be used
>       curr_window_start = max(0, self.num_processed_tokens - self.config.window_size) if self.config.window_size else 0
E       RuntimeError: Boolean value of Tensor with more than one value is ambiguous

The root cause is a modification in the base.py file, in the _prepare_for_par_ctx_rhs_padding method (line 265).

The returned last_token_id used to be a scalar, but can now be a vector. This causes self.num_processed_tokens to also become a vector, which triggers the error.
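
For context, here is a minimal sketch of the failure mode using plain torch tensors. The names num_processed_tokens and window_size mirror the attributes in the traceback above, but this is not the actual transformers-neuronx code:

    import torch

    window_size = 128  # hypothetical sliding-window size from the model config

    # Previous behaviour: num_processed_tokens is a scalar (0-d) tensor,
    # so max() can convert the comparison result to a Python bool.
    num_processed_tokens = torch.tensor(5)
    curr_window_start = max(0, num_processed_tokens - window_size)  # works

    # After the change: last_token_id is a vector, so num_processed_tokens
    # becomes a multi-element tensor. max() then needs the boolean value of
    # that tensor, which raises:
    # RuntimeError: Boolean value of Tensor with more than one value is ambiguous
    num_processed_tokens = torch.tensor([5, 7])
    curr_window_start = max(0, num_processed_tokens - window_size)  # raises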

@hannanjgaws
Contributor

Thank you for filing the issue. We have found a fix for the problem and it will be available in an upcoming release.

@hannanjgaws added the bug label on Apr 16, 2024
@hannanjgaws
Contributor

Currently, continuous batching support has only been officially released for Llama: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html#overview-of-continuous-batching-api-and-vllm-support

Mistral/Mixtral are planned for future releases. We will update this ticket when we have released official support for the Mixtral model.

@dacorvo
Author

dacorvo commented Apr 17, 2024

Then Mistral and Mixtral are actually not supported, because static batching with padding (the alternative to continuous batching) has been broken for all models since the introduction of continuous batching: #79. Or has it been fixed?

@aws-rhsoln

We have the 2.19 release going out this week. With this new release, we have added support for Mistral. Support for Mixtral will be added in one of the upcoming releases.

@zhouku92

Which AWS Neuron image should I roll back to in order to correctly run Mixtral?
