Enable IPEXModel on XPU #663

Open · jiqing-feng opened this issue Apr 16, 2024 · 4 comments
@jiqing-feng (Collaborator) commented Apr 16, 2024

Hi @echarlaix. I want to enable all the model utils in ipex (modeling_utils) on XPU. It may need some changes, including another if-branch in forward or 2 forward functions (1 for CPU and 1 for GPU); the k-v cache is also different.

Are there any XPU issues on optimum-intel that may block our work, like the XPU version and CI tests? I also need your integration advice. Thanks!
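
To make the two options concrete, the "if-branch in forward" variant could look roughly like the sketch below (_forward_cpu and _forward_xpu are hypothetical helper names, not existing ipex or optimum-intel APIs):

    def forward(self, hidden_states, *args, **kwargs):
        # Dispatch on the device of the incoming tensor: one path for the ipex CPU ops,
        # one path for the XPU ops (where the k-v cache layout also differs).
        if hidden_states.device.type == "xpu":
            return self._forward_xpu(hidden_states, *args, **kwargs)
        return self._forward_cpu(hidden_states, *args, **kwargs)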

@echarlaix (Collaborator) commented:

Hi @jiqing-feng, would it be a similar integration to what was integrated in ipex-llm?

@jiqing-feng (Collaborator, Author) commented Apr 17, 2024

> Hi @jiqing-feng, would it be a similar integration to what was integrated in ipex-llm?

Not exactly the same. We plan to keep only one attention forward, but split it into different parts and let the tensor device choose which op should be used, for example:

llama_attn_forward:
        # prepare the k-v cache, compute qkv and apply rope, then run the device-specific sdpa
        key_cache, value_cache = preprocess_for_optimize(hidden_states, past_key_value, kwargs)
        query, key, value = self.qkv_gemm(hidden_states, key_cache, value_cache, kwargs)
        key, value = self.rope(key, value, position_ids, past_key_value, kwargs)
        present = get_present(key, value, past_key_value)
        attn_output, attn_weights, past_key_value = self.sdpa(query, key, value, attention_mask, past_key_value, kwargs)
        attn_output = attn_output.transpose(1, 2)
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
        if not output_attentions:
            attn_weights = None
        return attn_output, attn_weights, past_key_value

self.sdpa:
        # pick the implementation based on the tensor device
        if cpu:
            sdpa = self.ipex_scale_dot_product
        elif xpu:
            sdpa = self.sdpa_xpu

        attn_output, attn_weights, past_key_value = sdpa(
                query,
                key,
                value,
                math.sqrt(self.head_dim),
                past_key_value,
                None,
                attention_mask,
        )

        return attn_output, attn_weights, past_key_value
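
As a self-contained illustration of that dispatch, here is a runnable sketch that uses torch's built-in scaled_dot_product_attention as a stand-in for both self.ipex_scale_dot_product and self.sdpa_xpu (the real CPU/XPU kernels differ and also manage the k-v cache):

    import torch
    import torch.nn.functional as F

    def sdpa_dispatch(query, key, value, attention_mask=None):
        # Pick the op from the device of the tensors; both branches below are placeholders
        # standing in for self.ipex_scale_dot_product (CPU) and self.sdpa_xpu (XPU).
        if query.device.type == "xpu":
            attn_fn = F.scaled_dot_product_attention  # placeholder for the XPU kernel
        else:
            attn_fn = F.scaled_dot_product_attention  # placeholder for ipex_scale_dot_product
        return attn_fn(query, key, value, attn_mask=attention_mask)

    # Example with random CPU tensors (batch=1, heads=2, seq=4, head_dim=8):
    q = torch.randn(1, 2, 4, 8)
    out = sdpa_dispatch(q, torch.randn_like(q), torch.randn_like(q))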

@echarlaix (Collaborator) commented:

For me it would make sense to keep this integration in ipex-llm and to only enable loading of exported models in optimum-intel (through IPEXModel), what do you think?
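
For reference, loading an exported model through the IPEX model classes already looks roughly like this (a minimal sketch of the optimum-intel API; the export=True flag and the example checkpoint are assumptions):

    from transformers import AutoTokenizer
    from optimum.intel import IPEXModelForCausalLM

    model_id = "gpt2"  # example checkpoint only
    # export=True is assumed to trace/optimize the model with ipex while loading.
    model = IPEXModelForCausalLM.from_pretrained(model_id, export=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))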

@echarlaix (Collaborator) commented May 14, 2024

Hi @jiqing-feng, I see that different llama modeling (and other additional architectures) was introduced in both ipex and ipex-llm to bring in ipex optimizations. I think redefining the transformers modeling code (for different architectures and different optimizations) is not something we want to introduce in optimum-intel: it would result in significant code additions that would be difficult to maintain and, more importantly, might cause issues with future transformers releases (this happened for example after the transformers v4.40.0 release for the openvino export, as the model is patched before export, see #682). Having such additions could also result in much more constrained transformers or even torch version requirements.

For these reasons I'd be in favor of keeping modeling_utils only for changes that are required for the export (as done in https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/utils/modeling_utils.py#L25) and moving the rest to another repo (itrex or ipex-llm could be good candidates, for example) that could be used by optimum-intel: we would check for a specific transformers version and, if compatible, overwrite the modeling. What do you think?
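
A minimal sketch of that version check (the helper name, the external package, and the version bound are all hypothetical, shown only to illustrate the gating idea):

    from packaging import version
    import transformers

    # Hypothetical newest transformers release the patched modeling was validated against.
    _MAX_SUPPORTED_TRANSFORMERS = version.parse("4.39.3")

    def maybe_patch_llama_modeling():
        """Overwrite the llama attention forward only when the installed transformers version is compatible."""
        if version.parse(transformers.__version__) > _MAX_SUPPORTED_TRANSFORMERS:
            # Keep the vanilla transformers modeling rather than risking a broken patch.
            return False
        # Hypothetical external repo providing the optimized forward (e.g. itrex or ipex-llm).
        from external_repo.llama import llama_attn_forward
        transformers.models.llama.modeling_llama.LlamaAttention.forward = llama_attn_forward
        return True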
