I'm trying to do a robust quantization over 34 languages, so I have a rather large dataset for the llama-imatrix file. On my M3 Max MacBook this will take >260 hrs, and it seems to crash every time I wake the screensaver, so I decided to try vast.ai hosting, using the ghcr.io/ggerganov/llama.cpp:full-cuda image from this repo. Something is definitely amiss, though. On my MacBook M3 Max, the ETA reported by llama-imatrix is ~260 hrs. The first time I tried, the system I chose was an AMD Epyc with 4x4090; the ETA was 660 hrs. I was expecting an ~8-fold improvement, not a ~3-fold slowdown. I used a script to execute the command, though, and it didn't record the commands used, so I worried it might not have captured the number of GPUs correctly and ended up passing bad params. So I tried again, this time on a Xeon server with 4x4090s.

This ETA was 636 hours (sorry, I cut it off in the version of the log that I downloaded). Why is it taking almost 3x as long to do this on a 4x4090 machine as it does locally?
The CUDA backend also does not support BF16, so most of the model is running on the CPU. Try an F16 model instead.

Also note that `-ngl -1` does not work the way you might expect; no layers will be offloaded that way. Use a large number to offload the entire model instead, e.g. `-ngl 99`.
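For anyone following along, a minimal sketch of what that advice looks like in practice. The file names and paths are placeholders, and exact flag behavior can vary between llama.cpp versions, so treat this as a starting point rather than the exact command:

```sh
# Produce an F16 GGUF (a BF16 GGUF would fall back to the CPU on the CUDA backend).
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf

# Compute the importance matrix with all layers offloaded to the GPUs.
./llama-imatrix -m model-f16.gguf -f calibration-34-langs.txt -o imatrix.dat -ngl 99
```

The key points are the F16 model and `-ngl 99`; with both in place the imatrix pass should actually run on the 4090s instead of the host CPU.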