
Support Gemma2-2b model for inference in Android #5610

Open
FranzKafkaYu opened this issue Sep 5, 2024 · 12 comments
Assignees
Labels
platform:android Issues with Android as Platform task:LLM inference Issues related to MediaPipe LLM Inference Gen AI setup type:feature Enhancement in the New Functionality or Request for a New Solution

Comments

FranzKafkaYu commented Sep 5, 2024

MediaPipe Solution (you are using)

Android library: com.google.mediapipe:tasks-genai:0.10.14

Programming language

Android Java

Are you willing to contribute it

None

Describe the feature and the current behaviour/state

Currently there is no MediaPipe-format Gemma2-2b model suitable for running on Android, and the MediaPipe Python libraries can't complete the conversion.

Will this change the current API? How?

no

Who will benefit with this feature?

all of us

Please specify the use cases for this feature

Use the latest Gemma2 model with MediaPipe.

Any Other info

No response

@FranzKafkaYu FranzKafkaYu added the type:feature Enhancement in the New Functionality or Request for a New Solution label Sep 5, 2024
@FranzKafkaYu (Author)

Related issue: #5570

From the MediaPipe official website, it says we can use AI Edge Torch to convert Gemma2-2b to a suitable format, but there are no further details:
[image]

It would be good if the MediaPipe Python conversion tool could support this conversion.

Thanks to all of you developers.

@kuaashish kuaashish added platform:android Issues with Android as Platform task:LLM inference Issues related to MediaPipe LLM Inference Gen AI setup labels Sep 5, 2024
@kuaashish kuaashish added the stat:awaiting googler Waiting for Google Engineer's Response label Sep 5, 2024
@KennethanCeyer

Hi @FranzKafkaYu,
I am currently looking into running Gemma 2 on AI Edge.
Would it be possible to verify the source of the referenced image?

I came across a similar source, which includes a guide for running with tflite, and am now validating the reproducibility of that setup.

[image]
(Source: https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference#gemma-2_2b)

Thanks in advance.


Other related issues:

#5594

@FranzKafkaYu FranzKafkaYu changed the title Support Gemma2-2b model Support Gemma2-2b model for inference in Android Sep 5, 2024

KennethanCeyer commented Sep 5, 2024

It seems that the issue was raised based on the following link:
https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android#model

[image]

The method for converting using AI Edge Torch is detailed in the guidelines provided in the above link.
Unfortunately, it seems that for now, the .tflite conversion must be done manually.

Based on this, it seems the conversion process would be as follows:
Downloading the .ckpt file via Kaggle -> Converting to .tflite using AI Edge Torch -> Implementing Android inference with the .tflite file using MediaPipe.

graph TD
    A[Kaggle .ckpt file] --> B[AI Edge Torch .tflite conversion]
    B --> C[MediaPipe Android inference]
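
A rough Python sketch of the middle step (the AI Edge Torch conversion) might look like the following. The module and function names here are assumptions based on the convert_gemma2_to_tflite.py example rather than a verified API, so check the script itself before running anything:

```python
# Rough sketch only: module/function names are assumptions based on the
# convert_gemma2_to_tflite.py example in ai-edge-torch, not a verified API.
from ai_edge_torch.generative.examples.gemma import gemma2
from ai_edge_torch.generative.utilities import converter

# Placeholder path to the Gemma 2 2B PyTorch checkpoint downloaded from Kaggle.
CHECKPOINT_PATH = "/path/to/gemma2-2b"

# Re-author the checkpoint as an ai_edge_torch generative model.
pytorch_model = gemma2.build_2b_model(CHECKPOINT_PATH, kv_cache_max_len=1024)

# Export a quantized .tflite file suitable for on-device inference.
converter.convert_to_tflite(
    pytorch_model,
    tflite_path="/path/to/gemma2_2b_q8.tflite",
    prefill_seq_len=1024,
    quantize=True,
)
```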

P.S. It seems there might be a typo in the Android guide: "AI Edge Troch" should be corrected to "AI Edge Torch" on the website.

KennethanCeyer commented Sep 5, 2024

I think the documentation should mention https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/gemma/convert_gemma2_to_tflite.py


FranzKafkaYu commented Sep 5, 2024

> Hi @FranzKafkaYu, I am currently looking into running Gemma 2 on AI Edge. Would it be possible to verify the source of the referenced image?
>
> I came across a similar source, which includes a guide for running with tflite, and am now validating the reproducibility of that setup.
>
> (Source: https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference#gemma-2_2b)
>
> Other related issues: #5594

@KennethanCeyer Hi Ken, you can find more details via this link. If you want to use a model with the MediaPipe Solutions/Framework, you need to convert the model: safetensors/pytorch format -> tflite format -> MediaPipe format.

Currently, if you use Gemma rather than Gemma2, there are suitably formatted models on Kaggle (you can check this link), while for Gemma2 there are none.

MediaPipe provides a Python library for converting safetensors/pytorch format -> MediaPipe format with two different methods (details here), but this library does not currently support Gemma2 in native model conversion, so the only choice is AI Edge model conversion, which requires first using the AI Edge Torch tool to get the TFLite format and then using the MediaPipe Python library to bundle the model.

But I have checked AI Edge Torch, and it lacks details on how to complete this conversion in the first place, and the MediaPipe LLM Inference API demonstrations give little information on how to use these "bundled models", which end with *.task; the sample code used a native model, which ends with *.bin.

I have tried other projects, like llama.cpp and gemma.cpp, but the performance is not good because they mainly use the CPU to execute inference. You can give them a try, but I think MediaPipe with a GPU backend would be better.

I am not a native English speaker, so my English is not very good. I hope this info can help you.
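
For reference, a minimal sketch of the bundling step described above (TFLite -> .task) with the MediaPipe genai bundler might look like the following. The file names and token strings are placeholders, and the exact BundleConfig fields should be checked against the MediaPipe documentation:

```python
# Minimal sketch of bundling a converted TFLite model into a MediaPipe .task file.
# File names and special tokens below are placeholders; check the Gemma 2 tokenizer
# for the actual start/stop tokens before bundling.
from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model="gemma2_2b_q8.tflite",       # output of the AI Edge Torch conversion
    tokenizer_model="tokenizer.model",        # SentencePiece tokenizer from the checkpoint
    start_token="<bos>",
    stop_tokens=["<eos>"],
    output_filename="gemma2_2b.task",         # bundle consumed by the LLM Inference API
    enable_bytes_to_unicode_mapping=False,
)
bundler.create_bundle(config)
```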

@FranzKafkaYu (Author)

> I think the documentation should mention https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/gemma/convert_gemma2_to_tflite.py

Good, I will try this script and see whether we can go to the next step.


KennethanCeyer commented Sep 5, 2024

Hi @FranzKafkaYu, thank you for the explanation; you've done an excellent job explaining the situation.

I've actually been investigating the same issue of using Gemma 2 with LiteRT (.tflite) on MediaPipe, which is what brought me to this discussion. From all the issues, code records, and documentation I've reviewed, it seems a .tflite distribution of Gemma 2 hasn't yet been registered on the Kaggle or Hugging Face registries. (It looks like they're working hard on this and it's probably on their roadmap, but there's no official file available yet.)

Based on the most recent visible documentation, it appears we need to convert the .ckpt file to .tflite using AI Edge Torch, and then use it according to each specific use case. (The documentation seems to be lacking; it doesn't look like it's been around for very long.)

The code I mentioned above seems to be the closest thing to an official guide at the moment. I'm currently working on this myself, and I'm planning to write a blog post about it when I'm done. Once it's ready, I'll make sure to share the link here in this issue for reference.

Thanks again for your helpful insights and creating this issue, Franz.


KennethanCeyer commented Sep 5, 2024

Since quite a few questions are expected around running Gemma 2 with MediaPipe, I made a Colab used for the conversion, along with related issues and PRs. The notebook will be continuously updated until the official tflite or MediaPipe tasks are released.

@kuaashish (Collaborator)

Hi @FranzKafkaYu,

Apologies for the delayed response. Support for Gemma 2-2B is now available, and ongoing discussions are happening here. Please let us know if you require any further assistance, or if we can proceed to close the issue and mark it as internally resolved, as the feature has been implemented.

Thank you!!

@kuaashish kuaashish added stat:awaiting response Waiting for user response and removed stat:awaiting googler Waiting for Google Engineer's Response labels Sep 11, 2024

jfduma commented Sep 12, 2024

Hi @FranzKafkaYu ,
I've been encountering an issue when trying to run the script ai-edge-torch/ai_edge_torch/generative/examples/gemma/convert_gemma2_to_tflite.py or gemma2-to-tflite/convert.py. In both cases, the error happens at the line where the code tries to load a file using torch.load(file).

On Google Colab:
this_file_tensors = torch.load(file)
^C
(this ^C is not caused by pressing Ctrl+C on the keyboard, it happens automatically)

On my local machine, the same line outputs a segmentation fault:
this_file_tensors = torch.load(file)
Segmentation Fault

I've checked my system's memory, and it's not an issue of insufficient memory. The same error occurs consistently in both environments. Any suggestions on what could be causing this segmentation fault or how to troubleshoot further would be greatly appreciated! Thanks in advance!

colab logs:
2024-09-12 08:01:43.412352: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1726128103.448539 2980 cuda_dnn.cc:8322] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1726128103.459941 2980 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-12 08:01:43.505942: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/local/lib/python3.10/dist-packages/torch_xla/__init__.py:202: UserWarning: tensorflow can conflict with torch-xla. Prefer tensorflow-cpu when using PyTorch/XLA. To silence this warning, pip uninstall -y tensorflow && pip install tensorflow-cpu. If you are in a notebook environment such as Colab or Kaggle, restart your notebook runtime afterwards.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/ai_edge_torch/generative/utilities/loader.py:84: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
this_file_tensors = torch.load(file)
^C
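
As a side note, the FutureWarning above points at the plain torch.load(file) call in the loader. A hedged, minimal sketch of a lower-memory load one could experiment with (assuming a recent PyTorch with mmap support in torch.load and a placeholder checkpoint path) is:

```python
# Hedged sketch: a memory-mapped, weights-only load to reduce peak RAM during
# checkpoint loading. Assumes a recent PyTorch (mmap support in torch.load)
# and a zip-format checkpoint; the path below is a placeholder.
import torch

state_dict = torch.load(
    "/path/to/gemma2-2b/model.ckpt",
    map_location="cpu",   # keep tensors on CPU for conversion
    mmap=True,            # map the file instead of reading it all into memory
    weights_only=True,    # also avoids arbitrary pickle execution
)
print(f"loaded {len(state_dict)} tensors")
```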

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Waiting for user response label Sep 12, 2024
@kuaashish kuaashish added the stat:awaiting googler Waiting for Google Engineer's Response label Sep 23, 2024
@talumbau talumbau self-assigned this Oct 28, 2024
@talumbau (Contributor)

Hi,

Just wanted to update this issue with the latest info. Previously (as discussed in this issue), Gemma 2 2B was only available in the LLM Inference API by going through a conversion pathway via ai_edge_torch. This was difficult for many people (especially due to the large memory requirements for conversion and quantization of the float checkpoint), so we have made the .task files of a quantized version of Gemma 2 available on Kaggle directly.

[image]

They have the extension .task. You use these files just like any other with the LLM Inference API. Essentially, these files contain the model weights as well as binary information on the tokenizer for the model. Please give that a try! Note: GPU and CPU models are available, but GPU is most likely to work on newer and high-end phones for now. Thanks for trying out the Inference API. We hope to have more info to share soon!

@tyrmullen

Tiny correction: the CPU model is a .task file, representing a successful conversion through ai_edge_torch, but the GPU model is a .bin file.

@talumbau talumbau removed the stat:awaiting googler Waiting for Google Engineer's Response label Oct 31, 2024