Latest changes introduced for continuous batching break Mixtral model #84

dacorvo opened this issue Apr 15, 2024 · 5 comments

@dacorvo

dacorvo commented Apr 15, 2024

In the latest AWS Neuron SDK 2.18.1 release, the transformers-neuronx package has been updated to a new version 0.10.0.360, whose code is not available in this repository at the moment.

One of the changes is meant to 'fix' continuous batching, but it actually breaks the Mixtral model.

The symptom is that the first call to forward after encoding fails with:

    def forward(self, input_ids, cache_ids=None, start_ids=None):
        # Compute the window starting index for specific mask patterns
        # For other patterns we pass in a default value of 0, it won't be used
>       curr_window_start = max(0, self.num_processed_tokens - self.config.window_size) if self.config.window_size else 0
E       RuntimeError: Boolean value of Tensor with more than one value is ambiguous

The root cause is a modification in the base.py file, in the _prepare_for_par_ctx_rhs_padding method (line 265).

The returned last_token_id used to be a scalar, but can now be a vector. This causes self.num_processed_tokens to also become a vector, which triggers the error.
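
For context, here is a minimal sketch of the failure mode using plain torch tensors. The names num_processed_tokens and window_size mirror the attributes in the traceback above, but this is not the actual transformers-neuronx code:

    import torch

    window_size = 128  # hypothetical sliding-window size from the model config

    # Previous behaviour: num_processed_tokens is a scalar (0-d) tensor,
    # so max() can convert the comparison result to a Python bool.
    num_processed_tokens = torch.tensor(5)
    curr_window_start = max(0, num_processed_tokens - window_size)  # works

    # After the change: last_token_id is a vector, so num_processed_tokens
    # becomes a multi-element tensor. max() then needs the boolean value of
    # that tensor, which raises:
    # RuntimeError: Boolean value of Tensor with more than one value is ambiguous
    num_processed_tokens = torch.tensor([5, 7])
    curr_window_start = max(0, num_processed_tokens - window_size)  # raises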

@hannanjgaws
Contributor

Thank you for filing the issue. We have found a fix for the problem and it will be available in an upcoming release.

@hannanjgaws added the bug label on Apr 16, 2024
@hannanjgaws
Contributor

Currently, continuous batching support has only been officially released for Llama: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html#overview-of-continuous-batching-api-and-vllm-support

Mistral/Mixtral are planned for future releases. We will update this ticket when we have released official support for the Mixtral model.

@dacorvo
Author

dacorvo commented Apr 17, 2024

Then Mistral and Mixtral are actually not supported, because static batching with padding (the alternative to continuous batching) has been broken for all models since the introduction of continuous batching: #79. Or has it been fixed?

@aws-rhsoln

We have the 2.19 release going out this week. With this new release, we have added support for Mistral. Support for Mixtral will be added in one of the upcoming releases.

@zhouku92

Which AWS Neuron image should I roll back to in order to correctly run Mixtral?
