Replies: 4 comments 6 replies
-
What was the fix?
-
Amazing! It also speeds up llama-30b-4bit-128g. Before the patch, the delay was 11 seconds; after the patch, it is 3 seconds!! BTW, how do you run the 65b model? Do you happen to have two 3090s?
-
I'm still super confused as to which GPTQ branch I need to be using. Many of the models coming out seem to be based on the new one, but should we be using the old branch?
-
You need triton for act-order + group size together, I think. If your models don't use both, it's fine to use the oobabooga fork or any other v2 one.
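That rule of thumb can be sketched as a small check. This is only a hedged illustration: the `quantize_config.json` file name and the `desc_act` (act-order) / `group_size` keys follow the AutoGPTQ convention, which is an assumption here, not something this thread specifies.

```python
import json

def needs_triton(quantize_config: dict) -> bool:
    """Heuristic: triton kernels are needed when act-order (desc_act)
    and a real group size (i.e. not -1 / per-column) are used together."""
    act_order = quantize_config.get("desc_act", False)
    group_size = quantize_config.get("group_size", -1)
    return bool(act_order) and group_size != -1

# Example configs (hypothetical, in the AutoGPTQ quantize_config.json style):
old_style = {"bits": 4, "group_size": 128, "desc_act": False}
new_style = {"bits": 4, "group_size": 128, "desc_act": True}

print(needs_triton(old_style))  # False -> a CUDA fork (e.g. oobabooga's) is fine
print(needs_triton(new_style))  # True  -> the triton branch is needed
```

In practice you would load the dict with `json.load()` from the model directory instead of writing it inline.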
-
Please go update https://github.com/qwopqwop200/GPTQ-for-LLaMa and have fun.
65b 4bit used to be 50s plus; now look how fast it is.