Fix import of llama2.c models that don't share weights between embedding layers #2685

Merged (4 commits) - Aug 23, 2023

Conversation

@ochafik (Collaborator) commented Aug 20, 2023

Edit: this is only needed to support reading llama2.c models with untied weights; models from karpathy/tinyllamas already load fine (see discussion below).

As it turns out, only the 42M model from the karpathy/tinyllamas repo shares weights between wcls and token_embedding_table.

For the 15M (10M) and 110M models, we need to load wcls separately (and also handle the negative vocab_size, which signals that the weights are not shared and currently crashes the importer - see what llama2.c does here).
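
For reference, here is a rough sketch of the llama2.c header convention this relies on (based on llama2.c's run.c; the Config field layout below is assumed from that file, not taken from this converter): a negative vocab_size is the flag that wcls is stored separately instead of being shared with token_embedding_table.

    // Sketch only (not the importer's code): read a llama2.c header and
    // interpret a negative vocab_size as "classifier weights are not shared".
    #include <cstdio>
    #include <cstdlib>

    struct Config {            // assumed to match llama2.c's Config (7 x int32)
        int dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len;
    };

    int main(int argc, char ** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }

        std::FILE * f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("fopen"); return 1; }

        Config cfg;
        if (std::fread(&cfg, sizeof(cfg), 1, f) != 1) { std::fclose(f); return 1; }
        std::fclose(f);

        // llama2.c signals unshared wcls by negating vocab_size in the header.
        const bool shared_weights = cfg.vocab_size > 0;
        cfg.vocab_size = std::abs(cfg.vocab_size);

        std::printf("vocab_size = %d, shared wcls/token_embedding_table = %s\n",
                    cfg.vocab_size, shared_weights ? "yes" : "no");
        return 0;
    }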

Tested with:

make clean && make main convert-llama2c-to-ggml

for m in stories110M stories42M stories15M ; do
    ./convert-llama2c-to-ggml --llama2c-model $m.bin --llama2c-output-model $m.ggml.bin
    ./main -m $m.ggml.bin --temp 0 -n 128 -p "Lily was always" | tee $m.txt
done

(the conversion executable was introduced by @byte-6174 in #2559)

@byte-6174 (Contributor)

Yes. This patch will be needed.
I'll test a little bit later and report back how well it works. The changes look right.

@karpathy

So actually I wasn't even aware of this. My model.py code explicitly ties these parameters; how did they get untied?

@byte-6174 (Contributor)

Hmm, trying to repro this, @ochafik.
I know that for the 7B model we have to address untied weights, and I assumed you were talking about that, my bad...

For the 42M model, I don't see the importer crashing?!
Can you attach your crash log, perhaps?

@ochafik (Collaborator, Author) commented Aug 21, 2023

For the 42M model, I don't see the importer crashing?!

Yes, the 42M model is converted fine (that one does share weights); the crash happens with the 15M and 110M models (which don't share weights - the crash comes from new-ing a vocab array with a size of -32000).
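
(For context, a minimal, hypothetical illustration - not the importer's actual code - of why a negative size is fatal: the signed value is converted to an unsigned size_t, becoming enormous, so the allocation throws.)

    // Hypothetical illustration only: what happens when -32000 reaches an allocation.
    #include <cstdio>
    #include <vector>

    int main() {
        const int vocab_size = -32000;   // as read from an "untied weights" llama2.c header

        try {
            // implicit int -> size_t conversion turns -32000 into ~1.8e19 elements
            std::vector<float> scores(vocab_size);
        } catch (const std::exception & e) {
            std::printf("allocation failed: %s\n", e.what());   // length_error / bad_alloc
        }

        // The fix in this PR's spirit: treat the sign as a flag, use the magnitude as the size.
        const bool shared_weights = vocab_size > 0;
        const int  n_vocab        = shared_weights ? vocab_size : -vocab_size;
        std::printf("shared_weights = %d, n_vocab = %d\n", shared_weights, n_vocab);
        return 0;
    }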

Can you attach your crash log, perhaps?

Will try later tonight

@byte-6174 (Contributor)

All of them share weights, and all 3 get converted as expected and give output as expected.
I'd like to see where it is failing for you.
Thanks

@ochafik (Collaborator, Author) commented Aug 21, 2023

So actually I wasn't even aware of this. My model.py code explicitly ties these parameters, how did they get untied?

@karpathy this does seem mysterious; the possibility of a negative vocab_size in model.py seems to predate its public version history.

(FWIW I first ran into this while doing a port of llama2.c to ad-hoc ggml Python bindings (ggerganov/ggml#449))

I didn't actually check whether the serialized weights are different; that might provide clues as to what happened. Will try later today.

@ochafik (Collaborator, Author) commented Aug 21, 2023

All of them share weights, and all 3 get converted as expected and give output as expected.
I'd like to see where it is failing for you.

@byte-6174 thanks for checking, I'll check again and report back!

@ochafik (Collaborator, Author) commented Aug 21, 2023

Ah, mystery solved, sorry about the confusion.

I thought these two 15M & 110M bin files were the originals, but it turns out I had converted them myself from the stories*M.pt files using a modified export_meta_llama_bin.py, and that export script used to write the two weights separately (and a negative vocab_size) no matter what (it looks like that's now conditional on the weights actually being different).

This change may still be useful for reading llama2.c-format models that were converted from other LLaMA models with untied weights, but it's not needed to read the stories models' original .bins.

@byte-6174 (Contributor)

Got it, thanks for the confirmation. Yes, the changes are useful, like I said, for models that don't share these tensors, like the 7B model.
If you get time, can you test with a 7B model, so we can potentially use this PR?

@ggerganov (Owner) left a comment

Just a heads up: after merging #2398 later today, the ggml model writing code in examples/convert-llama2c-to-ggml would need to be updated to match the new GGUF file format. A basic example of writing GGUF files is given here:

bool gguf_ex_write(const std::string & fname) {
    struct gguf_context * ctx = gguf_init_empty();

    gguf_set_val_u8  (ctx, "some.parameter.uint8",   0x12);
    gguf_set_val_i8  (ctx, "some.parameter.int8",   -0x13);
    gguf_set_val_u16 (ctx, "some.parameter.uint16",  0x1234);
    gguf_set_val_i16 (ctx, "some.parameter.int16",  -0x1235);
    gguf_set_val_u32 (ctx, "some.parameter.uint32",  0x12345678);
    gguf_set_val_i32 (ctx, "some.parameter.int32",  -0x12345679);
    gguf_set_val_f32 (ctx, "some.parameter.float32", 0.123456789f);
    gguf_set_val_bool(ctx, "some.parameter.bool",    true);
    gguf_set_val_str (ctx, "some.parameter.string",  "hello world");

    gguf_set_arr_data(ctx, "some.parameter.arr.i16", GGUF_TYPE_INT16,   std::vector<int16_t>{ 1, 2, 3, 4, }.data(), 4);
    gguf_set_arr_data(ctx, "some.parameter.arr.f32", GGUF_TYPE_FLOAT32, std::vector<float>{ 3.145f, 2.718f, 1.414f, }.data(), 3);
    gguf_set_arr_str (ctx, "some.parameter.arr.str", std::vector<const char *>{ "hello", "world", "!" }.data(), 3);

    struct ggml_init_params params = {
        /*.mem_size   =*/ 128ull*1024ull*1024ull,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };

    struct ggml_context * ctx_data = ggml_init(params);

    const int n_tensors = 10;

    // tensor infos
    for (int i = 0; i < n_tensors; ++i) {
        const std::string name = "tensor_" + to_string(i);

        int64_t ne[GGML_MAX_DIMS] = { 1 };
        int32_t n_dims = rand() % GGML_MAX_DIMS + 1;

        for (int j = 0; j < n_dims; ++j) {
            ne[j] = rand() % 10 + 1;
        }

        struct ggml_tensor * cur = ggml_new_tensor(ctx_data, GGML_TYPE_F32, n_dims, ne);
        ggml_set_name(cur, name.c_str());

        {
            float * data = (float *) cur->data;
            for (int j = 0; j < ggml_nelements(cur); ++j) {
                data[j] = 100 + i;
            }
        }

        gguf_add_tensor(ctx, cur);
    }

    gguf_write_to_file(ctx, fname.c_str(), false);

    fprintf(stdout, "%s: wrote file '%s'\n", __func__, fname.c_str());

    ggml_free(ctx_data);
    gguf_free(ctx);

    return true;
}

@ochafik (Collaborator, Author) commented Aug 21, 2023

If you get time, can you test with a 7B model, so we can potentially use this PR?

@byte-6174 so I can confirm the 7B model no longer crashes early (it was giving libc++abi: terminating due to uncaught exception of type std::bad_alloc: std::bad_alloc), but...

It requires more memory to convert (mem_model_gb = 30 did the trick), and inference with the resulting model crashed with LLAMA_ASSERT: llama.cpp:1436: n_embd_head == hparams.n_rot (n_embd_head = 128 and hparams.n_rot = 64). Setting params.n_rotmax = 128 produced a model that doesn't crash but that seems to give garbage output :-(
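
(For context, a rough sketch of the invariant behind that assert, assuming llama.cpp derives the head size as n_embd / n_head, which is what the message suggests: for 7B that's 4096 / 32 = 128, so a converter-side n_rot of 64 trips the check.)

    // Hypothetical sketch of the failing invariant, not llama.cpp's actual code.
    #include <cstdio>
    #include <cstdint>

    int main() {
        // 7B hyperparameters (values taken from the assert message above)
        const uint32_t n_embd = 4096;
        const uint32_t n_head = 32;
        const uint32_t n_rot  = 64;                    // what the converter wrote

        const uint32_t n_embd_head = n_embd / n_head;  // = 128

        if (n_embd_head != n_rot) {
            std::fprintf(stderr, "would trip LLAMA_ASSERT: n_embd_head (%u) != n_rot (%u)\n",
                         n_embd_head, n_rot);
        }
        return 0;
    }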

Happy to explore this further, but seems it may deserve a separate PR anyway

Conversion steps recap:

git clone --depth=1 https://huggingface.co/meta-llama/Llama-2-7b-chat # manual download = better

cd llama2.c
python export.py llama2_7b.bin --meta-llama  ../Llama-2-7b-chat/

cd ../llama.cpp
make clean && make main convert-llama2c-to-ggml  
./convert-llama2c-to-ggml --llama2c-model  ../llama2.c/llama2_7b.bin --llama2c-output-model out.bin
./main -m out.bin -p "Hello" --temp 0

@byte-6174 (Contributor) commented Aug 21, 2023

@ochafik

Happy to explore this further, but seems it may deserve a separate PR anyway

Yeah, the suggestion to use 7B here was because we know that model doesn't share tensors. I don't think there is a need to actually solve this for the 7B model.
Perhaps an easier route would be to train a sample model, follow llama2.c to convert that small model, and then test.
I have not trained a model myself - so I'm not sure what it takes to get it right...

@byte-6174 (Contributor)

@ggerganov

the ggml model writing code in examples/convert-llama2c-to-ggml would need to be updated to match the new GGUF file format.

Thanks for the heads-up, will take a look at this once GGUF is merged and stable.

@byte-6174 (Contributor)

@ochafik FWIW, the Meta-provided 7B model was successfully converted and quantized and produced the expected output within the llama2.c repo:
karpathy/llama2.c#298 (comment)

@ochafik (Collaborator, Author) commented Aug 21, 2023

@byte-6174 I hadn't seen the progress on quantization there, very cool stuff!

I did also manage to load (and generate good output w/) the 7B model with my weird Python hybrid of llama2.c and llama.cpp (meant to be a better example for these autogenerated python bindings), and I was thinking of quantizing the model on the fly during loading, but that bit is still WIP.

@ochafik (Collaborator, Author) commented Aug 22, 2023

I've rebased this on master. @byte-6174, let me know if I can help with the GGUF changes.

@klosax (Contributor) commented Aug 22, 2023

I guess you could follow the simple conversion script to export GGUF model files with all needed kvs:
https://github.com/ggerganov/llama.cpp/blob/master/convert-llama-7b-pth-to-gguf.py
and the C API:
https://github.com/ggerganov/llama.cpp/blob/master/examples/gguf/gguf.cpp
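
(As a complement to the writing example earlier in this thread, here is a minimal read-side sketch, assuming the C API shown in examples/gguf/gguf.cpp - gguf_init_from_file, gguf_get_n_kv, gguf_get_key, gguf_get_n_tensors, gguf_get_tensor_name. A sketch only, not the converter's code.)

    // Sketch: enumerate KV pairs and tensor names of a GGUF file via the C API.
    #include <cstdio>
    #include "ggml.h"   // at the time of this PR the gguf_* API is declared in ggml.h

    bool gguf_ex_read(const char * fname) {
        struct ggml_context * ctx_data = NULL;

        struct gguf_init_params params = {
            /*.no_alloc =*/ false,
            /*.ctx      =*/ &ctx_data,   // also load tensor data into a ggml context
        };

        struct gguf_context * ctx = gguf_init_from_file(fname, params);
        if (!ctx) {
            fprintf(stderr, "failed to load '%s'\n", fname);
            return false;
        }

        // key-value metadata
        const int n_kv = gguf_get_n_kv(ctx);
        for (int i = 0; i < n_kv; ++i) {
            printf("kv[%d]: %s\n", i, gguf_get_key(ctx, i));
        }

        // tensor names
        const int n_tensors = gguf_get_n_tensors(ctx);
        for (int i = 0; i < n_tensors; ++i) {
            printf("tensor[%d]: %s\n", i, gguf_get_tensor_name(ctx, i));
        }

        ggml_free(ctx_data);
        gguf_free(ctx);
        return true;
    }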

@klosax (Contributor) commented Aug 22, 2023

Perplexity runs on tinyllamas-stories models

Dataset: tinystories-valid.txt.1000 link

Using cuBLAS without offloading.
Since the 260k model has a max context of 256, that context size is used for all runs.

All but F32 have a Q6_K output tensor.

model   F32          F16          Q8_0         Q4_0
260k    6.32911102   6.32938244   error        error
15M     4.85355091   4.85356219   4.83956587   5.65603883
42M     4.33227600   4.33628450   4.33163189   4.60178523
110M    3.42448116   3.43129825   3.43210677   3.52336579

The error on the quantized 260k model:

using cublas w/o offload: CUDA error 12 at ggml-cuda.cu:5811: invalid pitch argument
w/o blas: ppl > 30k

Model files converted from the original pytorch models:
https://huggingface.co/klosax/tinyllamas-stories-gguf/tree/main

@byte-6174 (Contributor) commented Aug 22, 2023

FYI, I am also able to convert the previously generated .bin models to the .gguf if anyone is interested:

python3 convert-llama-ggmlv3-to-gguf.py --input stories15M.bin --output stories15M.gguf --name stories15M
python3 convert-llama-ggmlv3-to-gguf.py --input stories42M.bin --output stories42M.gguf --name stories42M
python3 convert-llama-ggmlv3-to-gguf.py --input stories110M.bin --output stories110M.gguf --name stories110M

@byte-6174 (Contributor)

generating outputs like this:

./main -m stories110M.gguf --temp 0 -n 128 -p "Once upon a time there was a boy name Timmy"
main: build = 1015 (226255b)
main: seed  = 1692673470
llama_model_loader: loaded meta data with 15 key-value pairs and 111 tensors from stories110M.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f32      [   768, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight f32      [   768, 32000,     1,     1 ]
llama_model_loader: - tensor    3:           blk.0.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor    8:            blk.0.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   10:            blk.0.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   11:              blk.0.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   12:           blk.1.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   14:              blk.1.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   17:            blk.1.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   19:            blk.1.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   20:              blk.1.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   21:           blk.2.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   22:              blk.2.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   23:              blk.2.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   25:         blk.2.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   26:            blk.2.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   28:            blk.2.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   29:              blk.2.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   30:           blk.3.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   31:              blk.3.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   32:              blk.3.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   34:         blk.3.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   35:            blk.3.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   37:            blk.3.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   38:              blk.3.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   39:           blk.4.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   40:              blk.4.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   41:              blk.4.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   43:         blk.4.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   44:            blk.4.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   46:            blk.4.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   47:              blk.4.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   48:           blk.5.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   49:              blk.5.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   50:              blk.5.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   52:         blk.5.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   53:            blk.5.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   55:            blk.5.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   56:              blk.5.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   57:           blk.6.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   58:              blk.6.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   59:              blk.6.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   61:         blk.6.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   62:            blk.6.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   64:            blk.6.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   65:              blk.6.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   66:           blk.7.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   67:              blk.7.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   68:              blk.7.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   70:         blk.7.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   71:            blk.7.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   73:            blk.7.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   74:              blk.7.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   75:           blk.8.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   76:              blk.8.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   77:              blk.8.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   79:         blk.8.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   80:            blk.8.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   82:            blk.8.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   83:              blk.8.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   84:           blk.9.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   85:              blk.9.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   86:              blk.9.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   88:         blk.9.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   89:            blk.9.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   91:            blk.9.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   92:              blk.9.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   93:          blk.10.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   94:             blk.10.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   95:             blk.10.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   97:        blk.10.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor   98:           blk.10.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  100:           blk.10.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor  101:             blk.10.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  102:          blk.11.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor  103:             blk.11.attn_q.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor  104:             blk.11.attn_k.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.attn_v.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor  106:        blk.11.attn_output.weight f32      [   768,   768,     1,     1 ]
llama_model_loader: - tensor  107:           blk.11.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_gate.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  109:           blk.11.ffn_down.weight f32      [  2048,   768,     1,     1 ]
llama_model_loader: - tensor  110:             blk.11.ffn_up.weight f32      [   768,  2048,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                        general.description str
llama_model_loader: - kv   3:                       llama.context_length u32
llama_model_loader: - kv   4:                     llama.embedding_length u32
llama_model_loader: - kv   5:                          llama.block_count u32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32
llama_model_loader: - kv   8:                 llama.attention.head_count u32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - type  f32:  111 tensors
llama_model_load_internal: format       = GGUF V1 (latest)
llama_model_load_internal: arch         = llama
llama_model_load_internal: vocab type   = SPM
llama_model_load_internal: n_vocab      = 32000
llama_model_load_internal: n_ctx_train  = 2048
llama_model_load_internal: n_ctx        = 512
llama_model_load_internal: n_embd       = 768
llama_model_load_internal: n_head       = 12
llama_model_load_internal: n_head_kv    = 12
llama_model_load_internal: n_layer      = 12
llama_model_load_internal: n_rot        = 64
llama_model_load_internal: n_gqa        = 1
llama_model_load_internal: f_norm_eps   = 5.0e-06
llama_model_load_internal: n_ff         = 2048
llama_model_load_internal: freq_base    = 10000.0
llama_model_load_internal: freq_scale   = 1
llama_model_load_internal: model type   = 7B
llama_model_load_internal: model ftype  = all F32
llama_model_load_internal: model size   = 0.13 B
llama_model_load_internal: general.name = stories110M
llama_model_load_internal: BOS token = 1 ''
llama_model_load_internal: EOS token = 2 ''
llama_model_load_internal: LF token  = 13 '<0x0A>'
llama_model_load_internal: ggml ctx size =    0.03 MB
llama_model_load_internal: mem required  =  511.61 MB (+   18.00 MB per state)
llama_new_context_with_model: kv self size  =   18.00 MB
llama_new_context_with_model: compute buffer total size =   65.41 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


 Once upon a time there was a boy name Timmy. He was three years old and loved to play outside. One day, he went to the park with his mom.
When they got there, Timmy saw a big tree with lots of leaves. He wanted to climb it, but his mom said no. She told him that it was too dangerous.
Timmy was sad, so he asked his mom if he could go and play on the swings instead. His mom said yes, so Timmy ran over to the swings.
But when he got there, he saw something strange. There was a big, green frog sitting in the middle of the swing
llama_print_timings:        load time =   456.15 ms
llama_print_timings:      sample time =   100.85 ms /   128 runs   (    0.79 ms per token,  1269.26 tokens per second)
llama_print_timings: prompt eval time =    23.10 ms /    12 tokens (    1.93 ms per token,   519.44 tokens per second)
llama_print_timings:        eval time =  1022.25 ms /   127 runs   (    8.05 ms per token,   124.24 tokens per second)
llama_print_timings:       total time =  1158.20 ms

@byte-6174 (Contributor)

Furthermore, quantization of these GGUF models works well too. I converted all versions of the models to Q8_0 and Q4_0, and the output is good!
For example:
./quantize --allow-requantize stories110M.gguf stories110M_Q4_0.gguf 2

and then:
./main -m stories110M_Q4_0.gguf --temp 0 -n 128 -p "Once upon a time there was a boy name Timmy"

produces the following story:


 Once upon a time there was a boy name Timmy. He was three years old and loved to play with his friends. One day, he went to the park with his mom.
At the park, Timmy saw a big, red ball. He ran over to it and started playing with it. Suddenly, he heard a loud noise. It was a man shouting at him.
Timmy was scared and ran back to his mom. She hugged him tight and said, "It's okay, Timmy. That man was just shouting because he was angry."
Timmy felt better and went back to playing with the ball. He had
llama_print_timings:        load time =   100.39 ms
llama_print_timings:      sample time =   101.32 ms /   128 runs   (    0.79 ms per token,  1263.31 tokens per second)
llama_print_timings: prompt eval time =    10.43 ms /    12 tokens (    0.87 ms per token,  1150.97 tokens per second)
llama_print_timings:        eval time =   235.06 ms /   127 runs   (    1.85 ms per token,   540.30 tokens per second)
llama_print_timings:       total time =   358.62 ms
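
(For reference, the trailing 2 in the quantize command above selects Q4_0; below is a rough sketch of the equivalent C API call, assuming the llama.h of this period exposes llama_model_quantize_default_params / llama_model_quantize as used by the quantize tool - a sketch, not the tool's code.)

    // Sketch (not the quantize tool's code): requantize a GGUF model to Q4_0 via the C API.
    #include <cstdio>
    #include "llama.h"

    int main() {
        llama_backend_init(false /* numa */);

        llama_model_quantize_params params = llama_model_quantize_default_params();
        params.ftype            = LLAMA_FTYPE_MOSTLY_Q4_0;  // same as passing "2" on the CLI
        params.allow_requantize = true;                     // mirrors --allow-requantize
        params.nthread          = 4;

        const int rc = llama_model_quantize("stories110M.gguf", "stories110M_Q4_0.gguf", &params);
        if (rc != 0) {
            fprintf(stderr, "quantization failed (rc = %d)\n", rc);
            return 1;
        }

        llama_backend_free();
        return 0;
    }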

@ggerganov (Owner)

Shall we merge this for now?

As mentioned by @klosax, since llama2.c outputs PyTorch models (https://huggingface.co/karpathy/tinyllamas/tree/main), an easy way to convert these to .gguf would be to adapt the convert.py script or make a dedicated convert-tinyllama-to-gguf.py script, following convert-llama-7b-pth-to-gguf.py as an example.

@klosax (Contributor) commented Aug 22, 2023

FYI, I am also able to convert the previously generated .bin models to the .gguf if anyone is interested:

Does the perplexity of the converted models match the PyTorch models' ppl I posted above?

@byte-6174 (Contributor) commented Aug 22, 2023

@klosax I tried getting perplexity with the wiki.test.raw file and I get the following:

./perplexity -m stories110M_Q4_0.gguf -f ~/Downloads/wikitext-2-raw/wiki.test.raw -c 256 -b 256
main: build = 1015 (226255b)
main: seed  = 1692674177
llama_model_loader: loaded meta data with 16 key-value pairs and 111 tensors from stories110M_Q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [   768, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [   768, 32000,     1,     1 ]
llama_model_loader: - tensor    3:           blk.0.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor    8:            blk.0.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   10:            blk.0.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   11:              blk.0.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   12:           blk.1.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   14:              blk.1.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   17:            blk.1.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   19:            blk.1.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   20:              blk.1.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   21:           blk.2.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   22:              blk.2.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   23:              blk.2.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   25:         blk.2.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   26:            blk.2.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   28:            blk.2.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   29:              blk.2.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   30:           blk.3.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   31:              blk.3.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   32:              blk.3.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   34:         blk.3.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   35:            blk.3.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   37:            blk.3.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   38:              blk.3.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   39:           blk.4.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   40:              blk.4.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   41:              blk.4.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   43:         blk.4.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   44:            blk.4.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   46:            blk.4.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   47:              blk.4.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   48:           blk.5.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   49:              blk.5.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   50:              blk.5.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   52:         blk.5.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   53:            blk.5.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   55:            blk.5.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   56:              blk.5.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   57:           blk.6.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   58:              blk.6.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   59:              blk.6.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   61:         blk.6.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   62:            blk.6.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   64:            blk.6.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   65:              blk.6.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   66:           blk.7.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   67:              blk.7.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   68:              blk.7.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   70:         blk.7.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   71:            blk.7.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   73:            blk.7.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   74:              blk.7.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   75:           blk.8.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   76:              blk.8.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   77:              blk.8.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   79:         blk.8.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   80:            blk.8.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   82:            blk.8.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   83:              blk.8.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   84:           blk.9.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   85:              blk.9.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   86:              blk.9.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   88:         blk.9.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   89:            blk.9.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   91:            blk.9.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   92:              blk.9.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   93:          blk.10.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   94:             blk.10.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   95:             blk.10.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   97:        blk.10.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   98:           blk.10.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  100:           blk.10.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor  101:             blk.10.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  102:          blk.11.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor  103:             blk.11.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor  104:             blk.11.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor  106:        blk.11.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor  107:           blk.11.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  109:           blk.11.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor  110:             blk.11.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                        general.description str
llama_model_loader: - kv   3:                       llama.context_length u32
llama_model_loader: - kv   4:                     llama.embedding_length u32
llama_model_loader: - kv   5:                          llama.block_count u32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32
llama_model_loader: - kv   8:                 llama.attention.head_count u32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:               general.quantization_version u32
llama_model_loader: - type  f32:   25 tensors
llama_model_loader: - type q4_0:   85 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_load_internal: format       = GGUF V1 (latest)
llama_model_load_internal: arch         = llama
llama_model_load_internal: vocab type   = SPM
llama_model_load_internal: n_vocab      = 32000
llama_model_load_internal: n_ctx_train  = 2048
llama_model_load_internal: n_ctx        = 256
llama_model_load_internal: n_embd       = 768
llama_model_load_internal: n_head       = 12
llama_model_load_internal: n_head_kv    = 12
llama_model_load_internal: n_layer      = 12
llama_model_load_internal: n_rot        = 64
llama_model_load_internal: n_gqa        = 1
llama_model_load_internal: f_norm_eps   = 5.0e-06
llama_model_load_internal: n_ff         = 2048
llama_model_load_internal: freq_base    = 10000.0
llama_model_load_internal: freq_scale   = 1
llama_model_load_internal: model type   = 7B
llama_model_load_internal: model ftype  = mostly Q4_0
llama_model_load_internal: model size   = 0.13 B
llama_model_load_internal: general.name = stories110M
llama_model_load_internal: BOS token = 1 ''
llama_model_load_internal: EOS token = 2 ''
llama_model_load_internal: LF token  = 13 '<0x0A>'
llama_model_load_internal: ggml ctx size =    0.03 MB
llama_model_load_internal: mem required  =   78.08 MB (+    9.00 MB per state)
llama_new_context_with_model: kv self size  =    9.00 MB
llama_new_context_with_model: compute buffer total size =   33.41 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 1311 chunks, batch_size=256
perplexity: 0.34 seconds per pass - ETA 7.45 minutes
[1]2988.4638,[2]2599.1171,[3]2375.5471,[4]2610.4550,[5]2773.8172,[6]2355.5917,[7]2290.2706,[8]2331.8393,[9]2140.4298,[10]1955.6630,[11]1833.7892,[12]1779.1181,[13]1723.2024,[14]1621.9812,[15]1713.0870,[16]1677.8643,[17]1709.2472,[18]1687.9866,[19]1753.7342,[20]1860.9248,[21]1876.1218,[22]1893.8690,[23]1900.9944,[24]1937.7533,[25]1991.7355,[26]2035.5588,[27]2073.0839,[28]2161.1252,[29]2195.9277,[30]2273.2246,[31]2371.3181,[32]2430.8977,[33]2510.1441,[34]2443.8000,[35]2510.2068,[36]2596.1156,[37]2505.7430,[38]2449.6648,[39]2473.5058,[40]2518.3635,[41]2510.5044,[42]2470.4744,[43]2479.1911,[44]2484.3221,[45]2462.5569,[46]2472.9968,[47]2444.7031,[48]2429.4334,[49]2392.0132,[50]2392.4571,[51]2352.8066,[52]2318.2795,[53]2304.9537,[54]2259.3081,[55]2228.5797,[56]2204.2618,[57]2144.1473,[58]2113.2452,[59]2091.0091,[60]2071.4765,[61]2079.4586,[62]2034.0313,[63]1969.0720,[64]1958.0079,[65]1935.9611,[66]1931.3160,[67]1933.6617,[68]1918.4183,[69]1905.3737,[70]1875.3014,[71]1853.2032,[72]1838.6163,[73]1821.1407,[74]1823.6148,[75]1846.0096,[76]1855.7104,[77]1854.7927,[78]1859.1456,[79]1837.9158,[80]1835.3114,[81]1809.6053,[82]1828.7413,[83]1814.0148,[84]1791.7548,[85]1790.2202,[86]1800.8451,[87]1811.2721,[88]1819.6137,[89]1827.7371,[90]1813.5860,[91]1818.4633,[92]1803.0302,[93]1781.8412,[94]1789.7326,[95]1775.8790,[96]1769.6812,[97]1771.6715,[98]1775.0783,[99]1766.2419,[100]1758.3387,[101]1762.6205,[102]1766.0334,[103]1744.6702,[104]1739.2806,[105]1738.7616,[106]1734.3130,[107]1727.7114,[108]1718.4700,[109]1716.8194,[110]1721.8423,[111]1719.3520,[112]1730.5221,[113]1728.5433,[114]1729.5850,[115]1724.1004,[116]1730.0769,[117]1726.2247,[118]1712.5857,[119]1710.6590,[120]1709.8021,[121]1715.9996,[122]1721.5252,[123]1711.5297,[124]1701.7351,[125]1706.0504,[126]1705.8632,[127]1705.7759,[128]1703.5934,[129]1710.0679,[130]1727.0226,[131]1725.5245,[132]1722.7976,[133]1718.3541,[134]1721.6681,[135]1717.9872,[136]1721.9864,[137]1724.2701,[138]1728.9683,[139]1738.5287,[140]1739.3179,[141]1741.4491,[142]1743.7598,[143]1741.6157,[144]1743.7151,[145]1727.1957,[146]1729.9349,[147]1730.4708,[148]1721.1405,[149]1728.8043,[150]1724.8242,[151]1726.2219,[152]1727.6053,[153]1729.8523,[154]1732.2768,[155]1726.9174,[156]1724.1944,[157]1732.9372,[158]1732.4405,[159]1739.7288,[160]1729.7288,[161]1728.5167,[162]1734.2827,[163]1744.8525,[164]1743.6662,[165]1741.9997,[166]1740.3130,[167]1738.5944,[168]1741.4539,[169]1741.8724,[170]1733.2920,[171]1740.7823,[172]1756.7300,[173]1758.8287,[174]1742.1972,[175]1730.6131,[176]1726.0326,[177]1720.8042,[178]1716.5634,[179]1716.7800,[180]1718.5679,[181]1717.4774,[182]1718.9040,[183]1713.6320,[184]1715.3911,[185]1718.6068,[186]1722.8740,[187]1724.9507,[188]1721.8434,[189]1730.0544,[190]1729.9595,[191]1735.7678,[192]1736.6712,[193]1737.1822,[194]1742.7012,[195]1738.3537,[196]1736.5396,[197]1732.1012,[198]1722.4741,[199]1723.4276,[200]1716.3108,[201]1717.5669,[202]1712.7893,[203]1713.9793,[204]1718.2156,[205]1718.7035,[206]1713.9978,[207]1710.5174,[208]1712.8051,[209]1716.3343,[210]1711.4868,[211]1706.3049,[212]1702.8580,[213]1701.7952,[214]1700.3499,[215]1702.7100,[216]1705.5604,[217]1708.5147,[218]1705.9447,[219]1707.4529,[220]1705.8071,[221]1707.8324,[222]1710.2881,[223]1713.1548,[224]1718.4506,[225]1722.4794,[226]1724.2582,[227]1725.8733,[228]1730.5623,[229]1738.5420,[230]1735.9762,[231]1732.2393,[232]1737.3390,[233]1739.9970,[234]1737.4253,[235]1733.6204,[236]1736.1650,[237]1731.9010,[238]1729.4594,[239]1732.7302,[240]1733.8647,[241]1735.1145,[242]1740.3896,[243]1742.6579,[244]1737.7027,[24
5]1739.8132,[246]1742.1345,[247]1744.9405,[248]1740.8440,[249]1738.7026,[250]1742.1227,[251]1747.1232,[252]1749.6981,[253]1750.6120,[254]1755.6771,[255]1760.3834,[256]1760.7016,[257]1768.1426,[258]1768.4695,[259]1774.4452,[260]1774.9685,[261]1776.9565,[262]1781.2026,[263]1779.9279,[264]1782.6233,[265]1787.0149,[266]1788.0674,[267]1789.0152,[268]1791.7245,[269]1794.2274,[270]1797.9887,[271]1798.9724,[272]1807.1164,[273]1813.7950,[274]1819.8917,[275]1825.8586,[276]1832.8064,[277]1835.8325,[278]1836.2427,[279]1838.6573,[280]1841.9834,[281]1847.6999,[282]1855.7467,[283]1861.4022,[284]1866.8687,[285]1872.9548,[286]1878.6243,[287]1882.4111,[288]1888.3672,[289]1895.6198,[290]1904.0780,[291]1906.4732,[292]1905.1014,[293]1905.8220,[294]1906.7666,[295]1906.8717,[296]1907.6945,[297]1910.7216,[298]1908.8088,[299]1913.0583,[300]1910.4586,[301]1915.8138,[302]1922.2026,[303]1921.9159,[304]1921.8672,[305]1920.3552,[306]1916.0005,[307]1925.3016,[308]1930.4632,[309]1931.4061,[310]1932.7615,[311]1939.6599,[312]1944.9817,[313]1940.7207,[314]1945.5328,[315]1947.9280,[316]1953.9863,[317]1953.7691,[318]1961.4877,[319]1968.0685,[320]1976.6295,[321]1975.7128,[322]1977.4949,[323]1977.9229,[324]1974.0617,[325]1974.2626,[326]1963.7694,[327]1958.2813,[328]1954.2773,[329]1954.7525,[330]1952.6793,[331]1951.3776,[332]1949.7536,[333]1950.5108,[334]1944.6075,[335]1947.9326,[336]1942.3843,[337]1938.1743,[338]1939.0991,[339]1937.6495,[340]1933.6559,[341]1936.4993,[342]1933.1524,[343]1931.9416,[344]1927.4067,[345]1920.7365,[346]1918.9970,[347]1916.9392,[348]1916.1091,[349]1915.4521,[350]1913.7000,[351]1911.9648,[352]1908.8849,[353]1902.5170,[354]1900.3045,[355]1898.8329,[356]1897.0804,[357]1892.7875,[358]1890.2744,[359]1894.4428,[360]1892.6136,[361]1889.7142,[362]1887.8999,[363]1884.1496,[364]1882.6996,[365]1878.6724,[366]1874.0568,[367]1871.7191,[368]1869.2496,[369]1873.3630,[370]1874.6245,[371]1868.0143,[372]1870.3403,[373]1875.9646,[374]1877.4666,[375]1879.5434,[376]1878.2559,[377]1880.0369,[378]1883.7328,[379]1882.5584,[380]1883.9768,[381]1880.5196,[382]1874.8047,[383]1874.2227,[384]1871.7315,[385]1873.9057,[386]1873.8503,[387]1869.4005,[388]1863.9731,[389]1863.6106,[390]1864.0385,[391]1861.8635,[392]1864.4636,[393]1861.9791,[394]1859.4721,[395]1856.1467,[396]1857.6622,[397]1856.9136,[398]1856.4312,[399]1856.3426,[400]1859.6014,[401]1867.2721,[402]1867.1839,[403]1872.1280,[404]1878.6072,[405]1882.0050,[406]1883.0912,[407]1885.9578,[408]1884.6856,[409]1890.7269,[410]1892.2064,[411]1892.8802,[412]1894.1283,[413]1896.2759,[414]1896.0831,[415]1897.1928,[416]1897.6035,[417]1902.9053,[418]1902.7524,[419]1901.2574,[420]1903.2187,[421]1902.6653,[422]1903.7923,[423]1902.4742,[424]1901.0715,[425]1903.1715,[426]1903.8813,[427]1904.7561,[428]1903.0683,[429]1902.0048,[430]1900.9336,[431]1894.9887,[432]1888.9766,[433]1883.1242,[434]1882.3249,[435]1879.1611,[436]1876.6661,[437]1870.2863,[438]1868.3135,[439]1864.4839,[440]1859.4738,[441]1856.4365,[442]1853.1324,[443]1850.7021,[444]1847.6681,[445]1847.8916,[446]1847.0845,[447]1848.3851,[448]1849.5464,[449]1852.6126,[450]1854.4247,[451]1856.6510,[452]1859.8748,[453]1862.0178,[454]1868.6791,[455]1870.5257,[456]1871.3634,[457]1870.0813,[458]1872.8045,[459]1877.5139,[460]1881.1532,[461]1879.7479,[462]1880.4543,[463]1882.4820,[464]1885.5706,[465]1883.7788,[466]1885.9426,[467]1888.2414,[468]1886.1544,[469]1887.5641,[470]1891.4797,[471]1891.7052,[472]1893.0495,[473]1895.0802,[474]1895.4206,[475]1899.0009,[476]1903.0038,[477]1905.2412,[478]1904.4627,[479]1905.5097,[480]1907.5796,[481]1909.4482,[48
2]1910.4726,[483]1912.2958,[484]1913.2008,[485]1916.0482,[486]1919.6309,[487]1919.1033,[488]1919.8104,[489]1917.8115,[490]1918.6626,[491]1919.0131,[492]1917.6018,[493]1919.8816,[494]1920.9481,[495]1923.2783,[496]1925.1568,[497]1926.8001,[498]1926.8396,[499]1928.5526,[500]1930.6424,[501]1931.0397,[502]1930.8722,[503]1929.9071,[504]1930.3430,[505]1933.9307,[506]1935.0192,[507]1937.1141,[508]1938.0770,[509]1939.0537,[510]1939.3489,[511]1940.0978,[512]1939.3694,[513]1940.1381,[514]1941.7119,[515]1940.5698,[516]1941.6465,[517]1942.4335,[518]1945.5605,[519]1944.2940,[520]1945.1621,[521]1943.7729,[522]1942.9407,[523]1945.0953,[524]1945.4688,[525]1945.4943,[526]1948.0485,[527]1948.6618,[528]1947.2795,[529]1949.5896,[530]1951.3390,[531]1952.8518,[532]1956.3472,[533]1960.6169,[534]1962.5213,[535]1968.0218,[536]1972.8850,[537]1976.3264,[538]1981.6128,[539]1985.7201,[540]1988.9396,[541]1992.4135,[542]1996.7473,[543]1997.5507,[544]1998.3174,[545]2000.2639,[546]2000.9725,[547]2004.8282,[548]2006.8251,[549]2005.2836,[550]2005.7101,[551]2006.3744,[552]2005.6621,[553]2003.8817,[554]2004.9890,[555]2006.4416,[556]2007.5391,[557]2006.8183,[558]2008.1648,[559]2010.2235,[560]2014.1492,[561]2013.9039,[562]2014.6072,[563]2015.8407,[564]2016.1645,[565]2016.9725,[566]2016.4859,[567]2017.0970,[568]2015.8633,[569]2014.3292,[570]2016.0980,[571]2020.0928,[572]2024.1109,[573]2026.0630,[574]2026.0591,[575]2025.3851,[576]2027.8306,[577]2026.3335,[578]2029.9778,[579]2033.4734,[580]2031.7447,[581]2034.2644,[582]2035.6509,[583]2034.1251,[584]2036.7745,[585]2036.1902,[586]2035.4333,[587]2038.6508,[588]2039.6208,[589]2039.0540,[590]2038.8920,[591]2038.5549,[592]2036.3463,[593]2035.2879,[594]2034.7994,[595]2037.7952,[596]2039.6610,[597]2042.1071,[598]2043.9592,[599]2044.3637,[600]2045.4419,[601]2045.8022,[602]2046.4603,[603]2047.8878,[604]2048.7957,[605]2049.7695,[606]2049.2437,[607]2051.3910,[608]2053.1861,[609]2056.4576,[610]2059.1755,[611]2059.1408,[612]2059.9398,[613]2058.8150,[614]2057.3268,[615]2056.3119,[616]2056.9690,[617]2058.8591,[618]2060.2238,[619]2061.2269,[620]2061.0984,[621]2061.2921,[622]2059.6433,[623]2058.8053,[624]2060.7001,[625]2059.5796,[626]2061.2061,[627]2062.1027,[628]2061.4057,[629]2062.6144,[630]2062.8366,[631]2063.2308,[632]2064.3160,[633]2061.0827,[634]2059.0604,[635]2059.0871,[636]2054.3132,[637]2056.4127,[638]2056.2042,[639]2058.4417,[640]2058.8236,[641]2057.4855,[642]2058.4727,[643]2059.2117,[644]2061.1850,[645]2058.9035,[646]2059.8688,[647]2060.9857,[648]2059.0758,[649]2054.1181,[650]2050.8698,[651]2046.1086,[652]2041.0838,[653]2037.2020,[654]2035.5069,[655]2035.6623,[656]2033.0930,[657]2031.6546,[658]2032.0408,[659]2030.5966,[660]2029.3133,[661]2029.1840,[662]2028.5710,[663]2027.8770,[664]2030.4199,[665]2029.8451,[666]2029.8307,[667]2032.0432,[668]2033.6422,[669]2035.6173,[670]2035.6628,[671]2032.9698,[672]2034.0643,[673]2034.2129,[674]2033.9457,[675]2036.3239,[676]2038.4480,[677]2037.8916,[678]2035.3465,[679]2035.2423,[680]2031.7794,[681]2031.0354,[682]2027.5890,[683]2026.2327,[684]2022.8974,[685]2020.8063,[686]2019.5135,[687]2017.8178,[688]2017.2235,[689]2017.2303,[690]2015.8515,[691]2019.1250,[692]2018.2062,[693]2020.5094,[694]2020.4994,[695]2020.2430,[696]2018.9006,[697]2018.9798,[698]2018.5271,[699]2017.3191,[700]2017.9863,[701]2014.5140,[702]2011.2232,[703]2010.7262,[704]2011.7285,[705]2011.4391,[706]2008.4850,[707]2008.3706,[708]2011.6900,[709]2010.0954,[710]2004.6555,[711]2001.0651,[712]1997.6570,[713]1993.0646,[714]1988.6016,[715]1989.7781,[716]1989.8800,[717]1990.7063,[718]1991.9022,[71
9]1991.6017,[720]1991.3379,[721]1991.7122,[722]1989.4367,[723]1986.4391,[724]1987.1952,[725]1985.8314,[726]1985.3919,[727]1982.1427,[728]1981.4881,[729]1980.3402,[730]1979.3191,[731]1976.1036,[732]1973.1086,[733]1972.8398,[734]1968.7741,[735]1967.9692,[736]1966.3613,[737]1963.9664,[738]1961.0233,[739]1959.5416,[740]1954.2974,[741]1951.1424,[742]1949.0272,[743]1946.4402,[744]1942.5573,[745]1939.3276,[746]1937.2891,[747]1936.4473,[748]1933.9464,[749]1931.9657,[750]1929.5170,[751]1925.9627,[752]1926.4740,[753]1925.2313,[754]1923.1115,[755]1922.9533,[756]1920.9531,[757]1920.9353,[758]1921.1925,[759]1922.9224,[760]1922.1858,[761]1921.5295,[762]1921.6022,[763]1922.3301,[764]1923.6524,[765]1922.7126,[766]1923.8394,[767]1923.4004,[768]1921.7564,[769]1919.3972,[770]1918.8724,[771]1917.2546,[772]1917.3502,[773]1917.3850,[774]1917.2090,[775]1916.5124,[776]1916.9417,[777]1915.7837,[778]1914.7634,[779]1916.4190,[780]1916.5128,[781]1914.3692,[782]1912.0189,[783]1913.2693,[784]1912.6498,[785]1910.7000,[786]1911.5879,[787]1911.4244,[788]1911.1238,[789]1909.8232,[790]1908.6341,[791]1906.6997,[792]1905.0843,[793]1904.6099,[794]1902.9704,[795]1899.6915,[796]1898.8146,[797]1898.0550,[798]1895.0856,[799]1894.2160,[800]1894.6360,[801]1895.1059,[802]1891.7052,[803]1888.3597,[804]1889.1217,[805]1889.4266,[806]1889.8904,[807]1888.8055,[808]1888.2437,[809]1886.8227,[810]1886.8268,[811]1888.7743,[812]1889.2507,[813]1890.5442,[814]1889.6729,[815]1886.3974,[816]1881.2042,[817]1882.0142,[818]1881.6436,[819]1881.3426,[820]1880.0399,[821]1878.3006,[822]1880.5318,[823]1879.4345,[824]1879.8946,[825]1877.7738,[826]1876.0159,[827]1874.6861,[828]1875.6165,[829]1876.8696,[830]1876.7836,[831]1876.9070,[832]1877.5803,[833]1879.0532,[834]1879.1956,[835]1878.6709,[836]1880.1618,[837]1879.0756,[838]1878.6386,[839]1876.7023,[840]1877.5090,[841]1876.6857,[842]1876.0797,[843]1874.4680,[844]1874.4877,[845]1873.9658,[846]1874.1403,[847]1874.4280,[848]1873.9099,[849]1875.6087,[850]1873.8317,[851]1875.6610,[852]1874.2188,[853]1874.2951,[854]1873.1291,[855]1872.9284,[856]1870.5352,[857]1869.5864,[858]1870.2887,[859]1869.9062,[860]1869.4957,[861]1867.5115,[862]1868.8438,[863]1872.1975,[864]1873.5727,[865]1872.7889,[866]1873.8940,[867]1873.7320,[868]1875.3097,[869]1875.2636,[870]1875.6700,[871]1876.8867,[872]1879.1195,[873]1881.1757,[874]1882.5146,[875]1883.3268,[876]1884.3845,[877]1884.1367,[878]1885.2983,[879]1885.5128,[880]1887.9242,[881]1887.1750,[882]1886.3725,[883]1886.1020,[884]1885.5553,[885]1885.4521,[886]1886.1028,[887]1885.5370,[888]1885.9044,[889]1886.7618,[890]1886.6259,[891]1888.4515,[892]1890.9136,[893]1890.5523,[894]1892.0027,[895]1891.1076,[896]1893.8191,[897]1895.1572,[898]1895.9907,[899]1897.0435,[900]1894.7853,[901]1893.6660,[902]1892.7871,[903]1893.3561,[904]1894.8853,[905]1895.5107,[906]1897.0692,[907]1898.3178,[908]1899.9108,[909]1900.3944,[910]1901.8499,[911]1903.7115,[912]1901.8680,[913]1902.1745,[914]1901.9238,[915]1902.5767,[916]1902.1199,[917]1903.7089,[918]1903.2121,[919]1901.2033,[920]1899.9365,[921]1901.0054,[922]1897.9872,[923]1897.4266,[924]1897.3855,[925]1894.0317,[926]1894.4242,[927]1892.7330,[928]1893.2174,[929]1892.9819,[930]1893.7762,[931]1895.3311,[932]1897.1737,[933]1897.4956,[934]1897.3368,[935]1895.3227,[936]1896.8943,[937]1896.5619,[938]1897.0583,[939]1895.6940,[940]1895.5792,[941]1895.3577,[942]1895.5941,[943]1893.4172,[944]1892.2771,[945]1891.2871,[946]1892.2105,[947]1892.0092,[948]1893.6617,[949]1894.6017,[950]1897.1196,[951]1897.8814,[952]1897.2633,[953]1897.0268,[954]1896.8211,[955]1898.2085,[95
6]1898.5959,[957]1899.1747,[958]1900.9499,[959]1902.1155,[960]1902.1918,[961]1903.1366,[962]1903.3566,[963]1903.1588,[964]1901.1358,[965]1902.8408,[966]1901.9088,[967]1904.3045,[968]1905.1811,[969]1904.9343,[970]1904.6359,[971]1904.4621,[972]1903.4582,[973]1902.5254,[974]1904.6244,[975]1903.1350,[976]1904.6517,[977]1902.0870,[978]1903.5173,[979]1905.4133,[980]1905.8602,[981]1906.7283,[982]1907.4874,[983]1908.8164,[984]1908.4341,[985]1912.1724,[986]1912.5151,[987]1913.8359,[988]1914.5958,[989]1916.1750,[990]1918.0984,[991]1919.2674,[992]1919.8333,[993]1919.7935,[994]1919.3442,[995]1921.7563,[996]1922.4867,[997]1923.3400,[998]1922.8717,[999]1922.7540,[1000]1922.7381,[1001]1923.4883,[1002]1925.3109,[1003]1925.0212,[1004]1925.6195,[1005]1928.0030,[1006]1928.0994,[1007]1930.0900,[1008]1932.0485,[1009]1933.1367,[1010]1934.2864,[1011]1933.5437,[1012]1933.7366,[1013]1932.8661,[1014]1933.6812,[1015]1934.7485,[1016]1936.8938,[1017]1936.3642,[1018]1936.7711,[1019]1937.1785,[1020]1937.7344,[1021]1939.4993,[1022]1942.0535,[1023]1942.0586,[1024]1939.2490,[1025]1938.6765,[1026]1935.1882,[1027]1934.8422,[1028]1934.7591,[1029]1936.3684,[1030]1936.0177,[1031]1936.1433,[1032]1936.1719,[1033]1937.0471,[1034]1937.3228,[1035]1938.9127,[1036]1936.2757,[1037]1935.3670,[1038]1935.4239,[1039]1935.4193,[1040]1934.2385,[1041]1933.7418,[1042]1933.8332,[1043]1934.7438,[1044]1934.8554,[1045]1935.2124,[1046]1935.3372,[1047]1934.4202,[1048]1933.7441,[1049]1932.9413,[1050]1934.1502,[1051]1935.0689,[1052]1935.6812,[1053]1936.1825,[1054]1937.0674,[1055]1937.0655,[1056]1936.6508,[1057]1935.9366,[1058]1936.4353,[1059]1936.5216,[1060]1933.3964,[1061]1930.8305,[1062]1930.5427,[1063]1931.2715,[1064]1933.1962,[1065]1934.2839,[1066]1934.7621,[1067]1934.1434,[1068]1933.4777,[1069]1933.0105,[1070]1934.2081,[1071]1933.5695,[1072]1932.0345,[1073]1931.8207,[1074]1932.1558,[1075]1930.1991,[1076]1930.1157,[1077]1931.3454,[1078]1934.0196,[1079]1934.7349,[1080]1937.0615,[1081]1937.1474,[1082]1939.0823,[1083]1939.4335,[1084]1939.3547,[1085]1937.9433,[1086]1939.5803,[1087]1939.5621,[1088]1938.3170,[1089]1936.7054,[1090]1937.5227,[1091]1938.8676,[1092]1939.8347,[1093]1939.4700,[1094]1940.0683,[1095]1938.8753,[1096]1938.4999,[1097]1939.5656,[1098]1942.0175,[1099]1943.0201,[1100]1944.3277,[1101]1943.4579,[1102]1943.5762,[1103]1943.7947,[1104]1944.8668,[1105]1945.9049,[1106]1946.3198,[1107]1947.9202,[1108]1947.2044,[1109]1948.9661,[1110]1950.2535,[1111]1951.2420,[1112]1951.6736,[1113]1951.4192,[1114]1951.0494,[1115]1950.1903,[1116]1950.5588,[1117]1951.7410,[1118]1952.8930,[1119]1955.0887,[1120]1954.1935,[1121]1955.7211,[1122]1956.3863,[1123]1957.1161,[1124]1958.2871,[1125]1959.3397,[1126]1959.9522,[1127]1958.8655,[1128]1958.2934,[1129]1960.7998,[1130]1959.2707,[1131]1959.1532,[1132]1959.0798,[1133]1959.2201,[1134]1958.2514,[1135]1957.8920,[1136]1957.3643,[1137]1959.1702,[1138]1958.8975,[1139]1959.1003,[1140]1959.1501,[1141]1959.7008,[1142]1959.9292,[1143]1961.1949,[1144]1960.0993,[1145]1959.2898,[1146]1960.8773,[1147]1959.6893,[1148]1959.8496,[1149]1959.5127,[1150]1961.1435,[1151]1962.2208,[1152]1962.6667,[1153]1962.4299,[1154]1962.4433,[1155]1963.8469,[1156]1964.1703,[1157]1964.3960,[1158]1963.8047,[1159]1963.8976,[1160]1963.1581,[1161]1963.4729,[1162]1963.7110,[1163]1964.1610,[1164]1965.6108,[1165]1965.6076,[1166]1965.5837,[1167]1966.3983,[1168]1968.2855,[1169]1967.9436,[1170]1969.3469,[1171]1970.7696,[1172]1970.9193,[1173]1970.7958,[1174]1970.6482,[1175]1969.7487,[1176]1968.9990,[1177]1967.6548,[1178]1967.9417,[1179]1969.3034,[1180]1968.3293,[1
181]1967.7599,[1182]1966.0844,[1183]1966.9785,[1184]1966.1292,[1185]1966.7072,[1186]1967.4810,[1187]1967.4425,[1188]1966.7392,[1189]1966.2018,[1190]1966.2057,[1191]1965.7729,[1192]1966.6801,[1193]1968.0794,[1194]1965.2956,[1195]1965.3062,[1196]1966.4297,[1197]1966.0683,[1198]1965.1383,[1199]1965.2108,[1200]1964.7887,[1201]1965.5209,[1202]1965.8343,[1203]1966.8361,[1204]1969.1502,[1205]1972.9984,[1206]1971.5092,[1207]1972.2818,[1208]1972.1158,[1209]1972.0128,[1210]1970.9958,[1211]1970.9464,[1212]1970.1333,[1213]1970.6367,[1214]1973.0428,[1215]1977.0888,[1216]1977.3848,[1217]1977.1167,[1218]1976.5210,[1219]1976.2538,[1220]1976.2681,[1221]1976.1697,[1222]1975.5296,[1223]1975.2652,[1224]1976.5803,[1225]1975.9708,[1226]1975.5644,[1227]1975.6898,[1228]1975.6897,[1229]1975.4804,[1230]1974.4075,[1231]1975.5068,[1232]1975.1834,[1233]1974.9897,[1234]1973.4203,[1235]1972.3303,[1236]1972.6492,[1237]1972.6092,[1238]1970.9641,[1239]1971.2549,[1240]1969.4430,[1241]1969.8413,[1242]1969.6397,[1243]1970.6006,[1244]1971.7661,[1245]1970.8360,[1246]1970.6309,[1247]1969.8033,[1248]1968.8782,[1249]1967.3864,[1250]1966.8790,[1251]1968.3667,[1252]1968.8611,[1253]1969.3629,[1254]1969.8136,[1255]1971.6799,[1256]1971.7038,[1257]1973.5017,[1258]1972.5897,[1259]1973.6962,[1260]1973.7400,[1261]1975.3610,[1262]1974.1961,[1263]1974.1371,[1264]1972.7038,[1265]1973.3344,[1266]1974.2882,[1267]1973.3245,[1268]1972.8468,[1269]1973.7344,[1270]1973.4786,[1271]1972.8952,[1272]1973.0128,[1273]1973.1935,[1274]1973.6895,[1275]1973.9576,[1276]1974.1076,[1277]1972.9543,[1278]1973.9078,[1279]1974.4565,[1280]1974.5845,[1281]1974.9953,[1282]1973.6104,[1283]1973.1028,[1284]1973.8968,[1285]1974.6892,[1286]1974.1403,[1287]1975.3340,[1288]1975.9632,[1289]1976.6858,[1290]1977.0790,[1291]1978.0064,[1292]1979.0281,[1293]1979.9720,[1294]1980.6077,[1295]1980.1680,[1296]1978.3252,[1297]1976.0485,[1298]1974.2204,[1299]1971.9573,[1300]1971.9921,[1301]1972.6301,[1302]1971.7476,[1303]1971.2024,[1304]1970.8142,[1305]1971.3864,[1306]1973.2771,[1307]1974.5534,[1308]1976.1113,[1309]1976.7438,[1310]1977.2194,[1311]1977.1616,

llama_print_timings:        load time =  1022.95 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 191611.31 ms / 335616 tokens (    0.57 ms per token,  1751.55 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 224068.04 ms

@byte-6174
Copy link
Contributor

@ggerganov

Shall we merge this for now?

As mentioned by @klosax, since llama2.c outputs PyTorch models (https://huggingface.co/karpathy/tinyllamas/tree/main), an easy way to convert these to .gguf would be to adapt the convert.py script or make a dedicated convert-tinyllama-to-gguf.py script following convert-llama-7b-pth-to-gguf.py as an example.

No new script needed: if we reinstate the original C code that converts from llama2.c to ggml, we can use convert-llama-ggmlv3-to-gguf.py to convert to gguf. See example above.

@klosax
Copy link
Contributor

klosax commented Aug 22, 2023

@klosax I tried getting perplexity with the wiki.text.raw file and I get the following:

Use the tinystories validation data instead: https://github.com/klosax/misc/blob/main/tinystories-valid.txt.1000

@klosax
Copy link
Contributor

klosax commented Aug 22, 2023

No new script needed: if we reinstate the original C code that converts from llama2.c to ggml, we can use convert-llama-ggmlv3-to-gguf.py to convert to gguf. See example above.

It would be much better to have the code output a model file in gguf instead of the old format.
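
Roughly, a sketch of what that could look like using ggml's gguf C API (gguf_init_empty / gguf_set_val_* / gguf_add_tensor / gguf_write_to_file). The helper name write_minimal_gguf and the concrete values are only illustrative; the metadata keys mirror the loader log shown further down in this thread.

// Rough sketch, not the converter's actual code: emit gguf directly
// instead of the legacy ggmlv3 format. Assumes the tensors have already
// been built with ggml and carry their gguf names ("token_embd.weight", ...).
#include "ggml.h"

static void write_minimal_gguf(const char * fname, struct ggml_tensor * tok_embd) {
    struct gguf_context * gguf = gguf_init_empty();

    // metadata keys as seen in the llama_model_loader log
    gguf_set_val_str(gguf, "general.architecture", "llama");
    gguf_set_val_str(gguf, "general.name",         "stories42M");
    gguf_set_val_u32(gguf, "llama.context_length",   2048);
    gguf_set_val_u32(gguf, "llama.embedding_length", 512);
    gguf_set_val_f32(gguf, "llama.attention.layer_norm_rms_epsilon", 1e-5f);

    // one gguf_add_tensor call per converted tensor
    gguf_add_tensor(gguf, tok_embd);

    gguf_write_to_file(gguf, fname, /*only_meta =*/ false);
    gguf_free(gguf);
}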

@ggerganov
Copy link
Owner

Yup, either way works. We might even think about adding tests to the CI that convert/quantize/run the tiny llama models, so we have long-term support. I think in the future such small models will become very useful, and it would be nice to keep things stable and supported.

@byte-6174
Copy link
Contributor

@klosax I tried getting perplexity with the wiki.text.raw file and I get the following:

Use the tinystories validation data instead: https://github.com/klosax/misc/blob/main/tinystories-valid.txt.1000

nope

./perplexity -m stories110M_Q4_0.gguf -f ~/Downloads/tinystories-valid.txt.1000 -c 256 -b 256
main: build = 1015 (226255b)
main: seed  = 1692706450
llama_model_loader: loaded meta data with 16 key-value pairs and 111 tensors from stories110M_Q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [   768, 32000,     1,     1 ]
llama_model_loader: - tensor    1:               output_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    output.weight q6_K     [   768, 32000,     1,     1 ]
llama_model_loader: - tensor    3:           blk.0.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor    5:              blk.0.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor    8:            blk.0.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   10:            blk.0.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   11:              blk.0.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   12:           blk.1.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   14:              blk.1.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   17:            blk.1.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   19:            blk.1.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   20:              blk.1.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   21:           blk.2.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   22:              blk.2.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   23:              blk.2.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   25:         blk.2.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   26:            blk.2.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   28:            blk.2.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   29:              blk.2.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   30:           blk.3.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   31:              blk.3.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   32:              blk.3.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   34:         blk.3.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   35:            blk.3.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   37:            blk.3.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   38:              blk.3.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   39:           blk.4.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   40:              blk.4.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   41:              blk.4.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   43:         blk.4.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   44:            blk.4.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   46:            blk.4.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   47:              blk.4.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   48:           blk.5.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   49:              blk.5.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   50:              blk.5.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   52:         blk.5.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   53:            blk.5.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   55:            blk.5.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   56:              blk.5.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   57:           blk.6.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   58:              blk.6.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   59:              blk.6.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   61:         blk.6.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   62:            blk.6.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   64:            blk.6.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   65:              blk.6.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   66:           blk.7.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   67:              blk.7.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   68:              blk.7.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   70:         blk.7.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   71:            blk.7.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   73:            blk.7.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   74:              blk.7.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   75:           blk.8.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   76:              blk.8.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   77:              blk.8.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   79:         blk.8.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   80:            blk.8.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   82:            blk.8.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   83:              blk.8.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   84:           blk.9.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   85:              blk.9.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   86:              blk.9.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   88:         blk.9.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   89:            blk.9.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   91:            blk.9.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor   92:              blk.9.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor   93:          blk.10.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   94:             blk.10.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   95:             blk.10.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   97:        blk.10.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor   98:           blk.10.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  100:           blk.10.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor  101:             blk.10.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  102:          blk.11.attn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor  103:             blk.11.attn_q.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor  104:             blk.11.attn_k.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.attn_v.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor  106:        blk.11.attn_output.weight q4_0     [   768,   768,     1,     1 ]
llama_model_loader: - tensor  107:           blk.11.ffn_norm.weight f32      [   768,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_gate.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - tensor  109:           blk.11.ffn_down.weight q4_0     [  2048,   768,     1,     1 ]
llama_model_loader: - tensor  110:             blk.11.ffn_up.weight q4_0     [   768,  2048,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                        general.description str
llama_model_loader: - kv   3:                       llama.context_length u32
llama_model_loader: - kv   4:                     llama.embedding_length u32
llama_model_loader: - kv   5:                          llama.block_count u32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32
llama_model_loader: - kv   8:                 llama.attention.head_count u32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:               general.quantization_version u32
llama_model_loader: - type  f32:   25 tensors
llama_model_loader: - type q4_0:   85 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_load_internal: format       = GGUF V1 (latest)
llama_model_load_internal: arch         = llama
llama_model_load_internal: vocab type   = SPM
llama_model_load_internal: n_vocab      = 32000
llama_model_load_internal: n_ctx_train  = 2048
llama_model_load_internal: n_ctx        = 256
llama_model_load_internal: n_embd       = 768
llama_model_load_internal: n_head       = 12
llama_model_load_internal: n_head_kv    = 12
llama_model_load_internal: n_layer      = 12
llama_model_load_internal: n_rot        = 64
llama_model_load_internal: n_gqa        = 1
llama_model_load_internal: f_norm_eps   = 5.0e-06
llama_model_load_internal: n_ff         = 2048
llama_model_load_internal: freq_base    = 10000.0
llama_model_load_internal: freq_scale   = 1
llama_model_load_internal: model type   = 7B
llama_model_load_internal: model ftype  = mostly Q4_0
llama_model_load_internal: model size   = 0.13 B
llama_model_load_internal: general.name = stories110M
llama_model_load_internal: BOS token = 1 ''
llama_model_load_internal: EOS token = 2 ''
llama_model_load_internal: LF token  = 13 '<0x0A>'
llama_model_load_internal: ggml ctx size =    0.03 MB
llama_model_load_internal: mem required  =   78.08 MB (+    9.00 MB per state)
llama_new_context_with_model: kv self size  =    9.00 MB
llama_new_context_with_model: compute buffer total size =   33.41 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 170 chunks, batch_size=256
perplexity: 0.37 seconds per pass - ETA 1.03 minutes
[1]2.2966,[2]3.7819,[3]3.2457,[4]3.5414,[5]3.9615,[6]4.1411,[7]4.3388,[8]4.4466,[9]4.0199,[10]4.0540,[11]4.2580,[12]4.0907,[13]4.1147,[14]4.2165,[15]4.1841,[16]4.0031,[17]4.0654,[18]3.9428,[19]4.0352,[20]4.0794,[21]4.0844,[22]4.0964,[23]4.1748,[24]4.1625,[25]4.2181,[26]4.2878,[27]4.3478,[28]4.2276,[29]4.1960,[30]4.2414,[31]4.2820,[32]4.1806,[33]4.2104,[34]4.2903,[35]4.3012,[36]4.2529,[37]4.1744,[38]4.1491,[39]4.1725,[40]4.1842,[41]4.1446,[42]4.1404,[43]4.0897,[44]4.0310,[45]3.9867,[46]4.0057,[47]3.9587,[48]3.9691,[49]4.0034,[50]4.0073,[51]3.9789,[52]3.9851,[53]3.9490,[54]3.9590,[55]3.9542,[56]3.9499,[57]3.9521,[58]3.9590,[59]3.9912,[60]4.0129,[61]4.0222,[62]4.0267,[63]4.0315,[64]4.0392,[65]4.0710,[66]4.0704,[67]4.0799,[68]4.0515,[69]3.9967,[70]3.9419,[71]3.9469,[72]3.9138,[73]3.8742,[74]3.8283,[75]3.8232,[76]3.8299,[77]3.8357,[78]3.8577,[79]3.8789,[80]3.8434,[81]3.8533,[82]3.8252,[83]3.7923,[84]3.8005,[85]3.7750,[86]3.7475,[87]3.7161,[88]3.7281,[89]3.7140,[90]3.7207,[91]3.7272,[92]3.7068,[93]3.7047,[94]3.7114,[95]3.7217,[96]3.7387,[97]3.7493,[98]3.7601,[99]3.7702,[100]3.7778,[101]3.7530,[102]3.7727,[103]3.7801,[104]3.7548,[105]3.7265,[106]3.7308,[107]3.7401,[108]3.7485,[109]3.7579,[110]3.7339,[111]3.7301,[112]3.7509,[113]3.7643,[114]3.7383,[115]3.7381,[116]3.7209,[117]3.7065,[118]3.6874,[119]3.6903,[120]3.7002,[121]3.6830,[122]3.6877,[123]3.7205,[124]3.7241,[125]3.6975,[126]3.6764,[127]3.6773,[128]3.6889,[129]3.6789,[130]3.6673,[131]3.6721,[132]3.6581,[133]3.6678,[134]3.6716,[135]3.6875,[136]3.6904,[137]3.6995,[138]3.7071,[139]3.6992,[140]3.7002,[141]3.7138,[142]3.6986,[143]3.6829,[144]3.6924,[145]3.6787,[146]3.6660,[147]3.6751,[148]3.6566,[149]3.6682,[150]3.6826,[151]3.6635,[152]3.6442,[153]3.6319,[154]3.6410,[155]3.6244,[156]3.6392,[157]3.6229,[158]3.6412,[159]3.6240,[160]3.6104,[161]3.5949,[162]3.5786,[163]3.5608,[164]3.5594,[165]3.5411,[166]3.5320,[167]3.5237,[168]3.5177,[169]3.5121,[170]3.5209,

llama_print_timings:        load time =   435.89 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 24961.12 ms / 43520 tokens (    0.57 ms per token,  1743.51 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 29204.66 ms

@klosax
Copy link
Contributor

klosax commented Aug 22, 2023

Use the tinystories validation data instead: https://github.com/klosax/misc/blob/main/tinystories-valid.txt.1000

nope

The ppl you got (3.5209) is similar to mine (3.5234) on the 100m-q4_0.
I guess the difference could depend on compile options.

@byte-6174
Copy link
Contributor

OK, and:
4.5931 for 42M
5.6428 for 15M

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 22, 2023

No new script needed, if we reinstate the original c code that converts from llama2.c to ggml, we can use convert-llama-ggmlv3-to-gguf.py to convert to gguf. see example above.

@byte-6174 @ggerganov I like this temporary middle ground; I'll push something along those lines in this PR tonight (w/ readme update).

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 22, 2023

I've reinstated the output code + updated the readme w/ instructions for gguf conversion & a more concrete example (w/ emphasis on using llama2.c/tokenizer.bin, since models/ggml-vocab.bin is gone). Note that --eps moved from main to the gguf converter tool.

I'm getting a weird newline token (<0x0A>) in the resulting generation (not sure which stage is responsible), but it seems orthogonal to this PR at this point:

<s> One day, Lily met a Shoggoth. She was scared and excited at the same time.<0x0A>"What is that?" asked Lily.<0x0A>The Shoggoth said, "I am a magical mineral. I can grant wishes to you."<0x0A>Lily was very curious. She asked the Shoggle what she could wish for.<0x0A>"How about a new toy?" said the Shoggin.<0x0A>Lily thought for a moment and then said, "No, I want something nice." <0x0A>The Shoggal smiled and said, "I have an idea. If you trust me, I will make your wish come true."<0x0A>Lily nodded and said she would trust the Shoggy. He waved his hand and a big rainbow appeared in the sky. Lily was so excited! She hugged the Shoggle and thanked him for granting her wish. <s>...

@byte-6174
Copy link
Contributor

What is the command that is generating this? I didn't notice this with already-converted .bin models from before.

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 22, 2023

What is the command that is generating this? I didn't notice this with already-converted .bin models from before.

Me neither; it might be happening during the gguf conversion? (Edit: or, more likely, the switch from llama.cpp/models/ggml-vocab.bin to llama2.c/tokenizer.bin.)

wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin

make clean && make -j

./convert-llama2c-to-ggml \
  --copy-vocab-from-model ../llama2.c/tokenizer.bin \
  --llama2c-model stories42M.bin \
  --llama2c-output-model stories42M.ggmlv3.bin

python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 \
  --input stories42M.ggmlv3.bin \
  --output stories42M.gguf.bin

./main -m stories42M.gguf.bin -p "One day, Lily met a Shoggoth" -n 500 -c 256

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 22, 2023

@byte-6174 I've now also commented out the code that was reading vocab from a ggmlv3 model, since that also needs switching to gguf.

@byte-6174
Copy link
Contributor

What is the command that is generating this? I didn't notice this with already-converted .bin models from before.

Me neither; it might be happening during the gguf conversion? (Edit: or, more likely, the switch from llama.cpp/models/ggml-vocab.bin to llama2.c/tokenizer.bin.)

wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin

make clean && make -j

./convert-llama2c-to-ggml \
  --copy-vocab-from-model ../llama2.c/tokenizer.bin \
  --llama2c-model stories42M.bin \
  --llama2c-output-model stories42M.ggmlv3.bin

python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 \
  --input stories42M.ggmlv3.bin \
  --output stories42M.gguf.bin

./main -m stories42M.gguf.bin -p "One day, Lily met a Shoggoth" -n 500 -c 256

Is the above sequence behaving as expected, or are there some issues here?

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 22, 2023

Is the above sequence behaving as expected, or are there some issues here?

It's the command that generated the mostly good-looking output from above w/ those pesky <0x0A> tokens instead of newlines.

@byte-6174
Copy link
Contributor

I used the ggml-tokenizer.bin to convert, and the LF is a newline as expected. We can include the old tokenizer back for conversion purposes here. Give it a try and let me know.

@byte-6174
Copy link
Contributor

This is the output:

 Once upon a time there was a boy name Timmy. He was three years old and he loved to play outside. One day, Timmy went to the park with his mom.
At the park, Timmy saw a big tree. He wanted to climb it so he asked his mom if he could. His mom said yes, but she told him to be careful.
Timmy started to climb the tree. He was having so much fun! But then, he slipped and fell down. He felt embarrassed.
His mom came over and hugged him. She said it was okay and that everyone falls sometimes. Timmy smiled and they went

@byte-6174
Copy link
Contributor

This worked without your changes, btw. So I converted to .bin using the older code (i.e. without the untying changes you did that are included in your fork). While converting, I gave ggml-vocab.bin to --copy-vocab-from-model, and then used the Python script to convert to gguf to get the output as expected above.

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 23, 2023

This worked without your changes, btw. So I converted to .bin using the older code (i.e. without the untying changes you did that are included in your fork). While converting, I gave ggml-vocab.bin to --copy-vocab-from-model, and then used the Python script to convert to gguf to get the output as expected above.

The issue isn't in this PR's changes (I did just check, though 😅). It did indeed work before because the converter was using ggml-vocab.bin, and now it has to use tokenizer.bin... and... (drumroll)... they're kind of different (or rather, encoded differently; please read on). You'd get the same results by using tokenizer.bin with the old code.

I've dumped the content of each if you want to inspect:

At index 13 you'll see:

  • A newline token "\n" (single 0x0a byte) in ggml-vocab.bin
  • Our suspicious string "<0x0A>" (6 bytes) in tokenizer.bin

The two vocabularies are identical from index 259 onward; before that, for indices 3 to 258, tokenizer.bin's tokens are a <0xXX> hex string representation of the single-byte token found in ggml-vocab.bin.txt

And that's... by design. If you look at the decode function in llama2.c/run.c, there's some special-casing of these single-byte tokens.

So we'll just need to convert these in the converter, which I'm happy to do in another PR, as this seems out of scope here.
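
For illustration, a minimal sketch (not this PR's actual code) of mapping those <0xXX> strings back to raw single-byte tokens during conversion; convert_llama2c_token is a hypothetical helper.

// Minimal sketch: llama2.c's tokenizer.bin stores single-byte tokens as
// "<0x0A>", "<0xFF>", etc.; turn them back into the raw byte the ggml/gguf
// vocab expects, and pass every other token through unchanged.
#include <cstdio>
#include <string>

static std::string convert_llama2c_token(const std::string & piece) {
    unsigned char byte_val = 0;
    if (piece.size() == 6 && sscanf(piece.c_str(), "<0x%02hhX>", &byte_val) == 1) {
        return std::string(1, (char) byte_val);   // e.g. "<0x0A>" -> "\n"
    }
    return piece;                                 // regular tokens are unchanged
}

int main() {
    const std::string nl = convert_llama2c_token("<0x0A>");
    std::printf("byte token length: %zu, value: 0x%02X\n", nl.size(), (unsigned) (unsigned char) nl[0]); // 1, 0x0A
    std::printf("regular token: %s\n", convert_llama2c_token("hello").c_str());                          // hello
    return 0;
}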

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 23, 2023

We can include back the old tokenizer for conversion purposes here. give it a try and lmk.

@byte-6174 Not sure we can load the old tokenizer as easily in the new gguf world, as the converter was loading it through llama_load_model_from_file, which now demands a gguf model / returns NULL for that legacy ggml-vocab.bin. It should be easy to load any gguf model's vocabulary, though; I've left a TODO in load_vocab to that effect.

(Out of curiosity I tried to convert ggml-vocab.bin to gguf, but it seems it's GGJTv1 and the gguf script only supports GGJTv3 inputs.)
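
FWIW, a rough sketch of that TODO, assuming the gguf C API (gguf_init_from_file / gguf_find_key / gguf_get_arr_str); load_vocab_from_gguf is a hypothetical helper with minimal error handling.

// Rough sketch: read the vocabulary out of any .gguf model, as a replacement
// for the deleted ggml-vocab.bin path.
#include "ggml.h"
#include <cstdio>
#include <string>
#include <vector>

static std::vector<std::string> load_vocab_from_gguf(const char * fname) {
    std::vector<std::string> tokens;

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * gguf = gguf_init_from_file(fname, params);
    if (!gguf) {
        std::fprintf(stderr, "%s: not a gguf file\n", fname);
        return tokens;
    }

    const int key = gguf_find_key(gguf, "tokenizer.ggml.tokens"); // -1 if missing
    if (key >= 0) {
        const int n = gguf_get_arr_n(gguf, key);
        for (int i = 0; i < n; i++) {
            tokens.push_back(gguf_get_arr_str(gguf, key, i));
        }
    }

    gguf_free(gguf);
    return tokens;
}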

@byte-6174
Copy link
Contributor

Got it, thanks for dumping the two tokenizers, that was helpful to see.

What if we keep everything as-is in a way that works with the old ggml-vocab.bin, which allows us to get the bin file like before gguf, and then in the readme ask the user to use the Python script to convert to a gguf-compatible model?
This could be an intermediate solution until we figure out a permanent one.

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 23, 2023

What if we keep everything as-is in a way that works with the old ggml-vocab.bin

Not sure how to do this; is there any GGJTv1 parsing code left in the repo? And it might be a bit awkward to reintroduce a file that was just deleted.

then in the readme ask the user to use the Python script to convert to a gguf-compatible model

Yes to this bit, until we use the gguf output API (readme already updated) 👌

until we figure out a permanent solution.

I think it’s acceptable to have these temporary unwanted hex strings in lieu of single-byte tokens (better than master, which stopped outputting any conversion).

And introduce the token special-casing as a follow-up fix; happy to take a stab at it tomorrow night unless you have spare cycles for it.

@byte-6174
Copy link
Contributor

@ochafik please feel free to give this a go. If you ask me, we can simply restore the original script. We can have separate PRs to improve upon this.

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 23, 2023

@byte-6174 Pushed a commit that special-cases these tokens the same way llama2.c does, and the readme instructions / #2685 (comment) produce nice output again.

@ggerganov This PR is hopefully good to merge :-)

Copy link
Owner

@ggerganov ggerganov left a comment


🦙

@ochafik
Copy link
Collaborator Author

ochafik commented Aug 23, 2023

Thanks @ggerganov & @byte-6174 🦙🦙✌️

As a follow-up I've dumped some draft gguf output code in #2751, but won't have the time to debug it today.

YellowRoseCx added a commit to YellowRoseCx/koboldcpp-rocm that referenced this pull request Aug 25, 2023
commit 3416c986d9d9a31c3cdefd7e7bd4d9438d72ba35
Merge: 5eb17f0 4c4e435
Author: YellowRoseCx <[email protected]>
Date:   Fri Aug 25 13:46:56 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 5eb17f02c8638e003bb91bddf95ccf54d2ad0c12
Author: YellowRoseCx <[email protected]>
Date:   Fri Aug 25 13:38:21 2023 -0500

    ROCm Port update

    * use hipblas based on cublas
    * Update Makefile for the Cuda kernels
    * Expand arch list and make it overrideable
    * Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
    * add hipBLAS to README
    * new build arg LLAMA_CUDA_MMQ_Y
    * fix half2 decomposition
    * Add intrinsics polyfills for AMD
    * AMD assembly optimized __dp4a
    * Allow overriding CC_TURING
    * use "ROCm" instead of "CUDA"
    * ignore all build dirs
    * Add Dockerfiles
    * fix llama-bench
    * fix -nommq help for non CUDA/HIP

    ---------

    Co-Authored-By: YellowRoseCx <[email protected]>
    Co-Authored-By: ardfork <[email protected]>
    Co-Authored-By: funnbot <[email protected]>
    Co-Authored-By: Engininja2 <[email protected]>
    Co-Authored-By: Kerfuffle <[email protected]>
    Co-Authored-By: jammm <[email protected]>
    Co-Authored-By: jdecourval <[email protected]>

commit 4c4e4358ed54c397d3f0f5bc268f1ac59d909f57
Author: Concedo <[email protected]>
Date:   Thu Aug 24 22:12:56 2023 +0800

    fixed linux build error

commit 661bede62fe216632d099678a9dac08de7a68a4e
Author: Concedo <[email protected]>
Date:   Thu Aug 24 21:16:16 2023 +0800

    optimize tokenize method

commit b95a4ccb228ebfac12e5ce4b445f073ca67b99d2
Author: Concedo <[email protected]>
Date:   Thu Aug 24 20:41:49 2023 +0800

    added a token counting endpoint, set mmq as default

commit 81a0ef342ce1e583f6a5b060252565dbd59e1d8d
Author: Concedo <[email protected]>
Date:   Thu Aug 24 16:26:38 2023 +0800

    updated lite, switched to unminified source

commit 598d4d89ab3aaa539ddf36784306071f1411814a
Author: Concedo <[email protected]>
Date:   Thu Aug 24 15:45:33 2023 +0800

    fix for config file loading. from kcpp settings file

commit a3b994962673e681aafd9503781c7470acdcc63f
Merge: b8372d4 2d86b2e
Author: Concedo <[email protected]>
Date:   Thu Aug 24 15:22:17 2023 +0800

    Merge remote-tracking branch 'pop/add_config_arg' into concedo_experimental

commit b8372d44666531f5d17cbe264912fbe5548fd54b
Merge: 8263fd7 6e91a1b
Author: Concedo <[email protected]>
Date:   Thu Aug 24 15:21:24 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	.gitignore
    #	README.md
    #	tests/CMakeLists.txt

commit 6e91a1b0706c2e0e52b9d9be7ee82d3c1e7a33c1
Author: Evan Jones <[email protected]>
Date:   Thu Aug 24 00:07:13 2023 -0400

    llama : fix grammar sometimes generating null char (#2756)

commit 44d5462b5cddc1c5cbcd7647646f7b55b175b01f
Author: Georgi Gerganov <[email protected]>
Date:   Wed Aug 23 23:44:19 2023 +0300

    readme : fix link

commit c7868b075377c8c3fa916ea7c1aca600f44bed55
Author: Georgi Gerganov <[email protected]>
Date:   Wed Aug 23 23:43:00 2023 +0300

    minor : fix trailing whitespace

commit 79da24b58c1ea72340e64f799a4717d372207676
Author: Georgi Gerganov <[email protected]>
Date:   Wed Aug 23 23:41:16 2023 +0300

    readme : update hot topics

commit cf658adc832badaaa2ca119fe86070e5a830f8f6
Author: Georgi Gerganov <[email protected]>
Date:   Wed Aug 23 23:08:04 2023 +0300

    llm : add Falcon support (#2717)

    * llama : refactor GGUF constants into static maps

    * llama : check if model architecture is known

    * llama : refactor llama_model_load_internal()

    * gguf : add KV constant maps

    * llm : read arch-specific KVs

    * convert : add dummy scores + types

    * falcon : load tensor data (CPU only)

    * llama : fix loading progress bar

    * llama : add arch member to llama_model

    * falcon : CPU inference working

    * falcon : support non-40B models

    * falcon : minor

    * llama : minor updates

    ggml-ci

    * convert-falcon-hf-to-gguf.py : fix special token mapping

    * llama.cpp : llama default UNK token = id 0

    * llama.cpp : fix bpe tokenizer

    * llama.cpp : fix the fix of bpe tokenizer

    * ggml : pass eps to ggml_norm

    * metal : implement RoPE (mode = 2) + avoid ggml_repeat

    * ggml : ggml_repeat always creates new tensor

    * falcon : copy-paste self-attention from LLaMA

    * metal : print extra compute pipeline info

    * falcon : minor changes (still chasing the Metal problem)

    * llama.cpp : fix linefeed token

    * metal : fix GELU kernel numerical stability by using precise::tanh

    * metal : temporary workaround for the concurrency optimization bug

    * falcon : add CUDA offloading (#2739)

    * llama : better model naming and size reporting

    * llama : prep new tokenizer support

    * llama : advanced BPE tokenizer based on ggllm.cpp imlpementation

    * llama : remove oboslete comment

    ggml-ci

    * common : remove obsolete BPE API + disable test-tokenizer-1

    * llama : revert BPE special-case in llama_byte_to_token()

    * cuda : add TODOs for RoPE NeoX implementation

    * llama : default special tokens based on vocab type

    * perplexity : add log for start of tokenization

    ---------

    Co-authored-by: klosax <[email protected]>
    Co-authored-by: slaren <[email protected]>

commit a192860cfec89a38d59a943623bf595b1fe4495b
Author: Georgi Gerganov <[email protected]>
Date:   Wed Aug 23 22:37:39 2023 +0300

    minor : fix trailing whitespace

commit 95385241a91a616788a3bb76d12c9b7b2379ca2d
Author: Olivier Chafik <[email protected]>
Date:   Wed Aug 23 20:33:05 2023 +0100

    examples : restore the functionality to import llama2.c models (#2685)

    * Fix import of llama2.c models that don't share weights between embedding layers

    * llama2c: reinstate ggmlv3 conversion output + update readme w/ gguf conv

    * llama2.c: comment out legacy "load from ggml model" logic

    * llama2.c: convert special-cased "<0xXX>" single byte tokens from tokenizer.bin

commit 335acd2ffd7b04501c6d8773ab9fcee6e7bf8639
Author: slaren <[email protected]>
Date:   Wed Aug 23 16:46:54 2023 +0200

    fix convert-lora-to-ggml.py (#2738)

commit 5290c38e6e9b66ee2b543e560e301c1a1a90929c
Author: klosax <[email protected]>
Date:   Wed Aug 23 16:46:03 2023 +0200

    main : insert bos if no tokens (#2727)

    * main.cpp : insert bos if no tokens

    * Update examples/main/main.cpp

    * Update examples/main/main.cpp

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

commit cc34dbda9681418a2b18382446b90cdcec398d82
Author: akawrykow <[email protected]>
Date:   Wed Aug 23 07:31:34 2023 -0700

    gitignore : fix for windows (#2729)

commit 7c2227a1972a4add4b5c118e4914c086513d0382
Author: Cebtenzzre <[email protected]>
Date:   Wed Aug 23 10:29:09 2023 -0400

    chmod : make scripts executable (#2675)

commit f19dca04ea5fbf9a0b2753091d93464585d5c73b
Author: JohnnyB <[email protected]>
Date:   Wed Aug 23 15:28:22 2023 +0100

    devops : RPM Specs (#2723)

    * Create llama-cpp.srpm

    * Rename llama-cpp.srpm to llama-cpp.srpm.spec

    Correcting extension.

    * Tested spec success.

    * Update llama-cpp.srpm.spec

    * Create lamma-cpp-cublas.srpm.spec

    * Create lamma-cpp-clblast.srpm.spec

    * Update lamma-cpp-cublas.srpm.spec

    Added BuildRequires

    * Moved to devops dir

commit 8263fd7bdb247f2c3ff21debb50b22bd9b030339
Author: askmyteapot <[email protected]>
Date:   Thu Aug 24 00:15:48 2023 +1000

    Update llama_v3.cpp (#393)

    Fixing C2065 compiler error.
    Missed '3' on 3 separate identifiers (kB > kB3, MB > MB3)

commit bfdc596d58fbd9bbadd2352705af4373005e1411
Author: Concedo <[email protected]>
Date:   Wed Aug 23 19:19:52 2023 +0800

    gguf reader in file format detection

commit 8207214b6a37a46526cee9e72d4c9092b9d1872f
Author: Kawrakow <[email protected]>
Date:   Wed Aug 23 12:57:12 2023 +0300

    Fix values shown in the quantize tool help (#2735)

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 62959e740e8759d246ac8d09036950efde09981c
Author: Kawrakow <[email protected]>
Date:   Wed Aug 23 12:56:42 2023 +0300

    Strided perplexity (#2714)

    * Implementing strided computation of perplexity

    * Alternative way to output PPL results

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 7f7ddd5002040804e33fcdbde44aa22f8635f57d
Author: IgnacioFDM <[email protected]>
Date:   Wed Aug 23 06:31:09 2023 -0300

    Fix ggml to gguf conversion on Windows (#2733)

    This fixes `RuntimeWarning: overflow encountered in long_scalars`

    Credit: anon (not mine)

commit af170fc2db1186d3002b602d909c52c22de4a076
Merge: 981c913 b8ad1b6
Author: Concedo <[email protected]>
Date:   Wed Aug 23 17:08:09 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	README.md
    #	llama.cpp
    #	scripts/sync-ggml.sh
    #	tests/test-tokenizer-0.cpp

commit 981c9131f0f20c10099735c1e353534b5bfe1e59
Author: Concedo <[email protected]>
Date:   Wed Aug 23 16:07:07 2023 +0800

    gguf for llama is working

commit b8ad1b66b23f9b2e6e4531e9a62753323036a556
Author: Xiao-Yong Jin <[email protected]>
Date:   Wed Aug 23 02:12:12 2023 -0500

    server : allow json array in prompt or content for direct token input (#2306)

    * server: allow json array in prompt or content

    We accept an array of strings and numbers representing tokens,
    in addition to the current string valued prompt or content.

    This allows direct token input, so that any special tokens
    can be processed and used at the frontend during the construction
    of the json data, before sending to the server. And the server
    does not need to know or parse special tokens from textual input.

    With this, we can use EOS and BOS used in llama-2-chat models.

    * server: use tokenizePrompt(json) and default "" if empty prompt

    * server: fix prompt check

    * server: tokenize endpoint no longer adds BOS

commit f5fe98d11bdf9e7797bcfb05c0c3601ffc4b9d26
Author: Evan Jones <[email protected]>
Date:   Tue Aug 22 21:01:57 2023 -0400

    docs : add grammar docs (#2701)

    * docs : add grammar docs

    * tweaks to grammar guide

    * rework GBNF example to be a commented grammar

commit 777f42ba18b29f25c71ff8de3ecf97b8017304c0
Author: Kerfuffle <[email protected]>
Date:   Tue Aug 22 17:39:39 2023 -0600

    Improve handling of special tokens in GGML to GGUF converter (#2725)

    * Improve UNK, BOS, EOS token handling when converting without metadata.

    * Allow importing as a module.

    * Remove some obsolete code and minor cleanups.

    * Set default UNK token mapping from -1 to 0 in llama.cpp

    * Try to handle overflow due to buggy Windows Python with a better error message

commit 46ef5b5fcf4c366e1fb27726b6394adbbf8fd0ea
Author: goerch <[email protected]>
Date:   Tue Aug 22 23:10:42 2023 +0200

    llama : fix whitespace escaping in tokenizer (#2724)

commit c63bb1d16a70c03440671b76954bb767513cead8
Author: Johannes Gäßler <[email protected]>
Date:   Tue Aug 22 22:47:05 2023 +0200

    CUDA: use mul_mat_q kernels by default (#2683)

commit 3b6cfe7c927df178ca3c11643c3ec93e143471c9
Author: Alex Petenchea <[email protected]>
Date:   Tue Aug 22 21:58:16 2023 +0300

    convert.py : clarifying error message (#2718)

commit 800c9635b4a9390126f397870f3a825fc7455bd1
Author: Jiahao Li <[email protected]>
Date:   Wed Aug 23 02:27:06 2023 +0800

    Fix CUDA softmax by subtracting max value before exp (#2665)

commit deb7dfca4b9725cd295d1426db75fe8e0a6d5312
Author: Georgi Gerganov <[email protected]>
Date:   Tue Aug 22 20:05:59 2023 +0300

    gguf : add ftype meta info to the model (#2710)

    * llama : add ftype meta info to the model

    ggml-ci

    * convert.py : add ftype when converting (does not work)

    * convert.py : fix Enum to IntEnum

    ggml-ci

commit bac66994cf356cf488078c056831396eb4ce31d5
Author: Kawrakow <[email protected]>
Date:   Tue Aug 22 19:14:09 2023 +0300

    Quantization imrovements for k_quants (#2707)

    * Improve LLaMA-2 2-, 3- and 4-bit quantization

    * Q3_K_S: use Q5_K for 1st 2 layers of attention.wv and feed_forward.w2
    * Q4_K_S: use Q6_K for 1st 2 layers of attention.wv and feed_forward.w2
    * Q2_K and Q3_K_M: use Q5_K instead of Q4_K for 1st 2 layers of
      attention.wv and feed_forward.w2

    This leads to a slight model sized increase as follows:
    Q2_K  : 2.684G vs 2.670G
    Q3_K_S: 2.775G vs 2.745G
    Q3_K_M: 3.071G vs 3.057G
    Q4_K_S: 3.592G vs 3.563G

    LLaMA-2 PPL for context 512 changes as follows:
    Q2_K  : 6.6691 vs 6.8201
    Q3_K_S: 6.2129 vs 6.2584
    Q3_K_M: 6.0387 vs 6.1371
    Q4_K_S: 5.9138 vs 6.0041

    There are improvements for LLaMA-1 as well, but they are
    way smaller than the above.

    * Minor 4-bit quantization improvement

    For the same model size as previus commit, we get
    PPL = 5.9069 vs 5.9138.

    * Some more fine tuning

    * Adding make_qkx2_quants

    With it, we get PPL = 5.8828 for L2-7B Q4_K_S.

    * Another minor improvement

    * Q2_K improvement

    Smaller model, lower perplexity.
     7B: file size = 2.632G, PPL = 6.3772 vs original 2.670G PPL = 6.8201
    12B: file size = 5.056G, PPL = 5.4577 vs original 5.130G PPL = 5.7178

    It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk,
    which are Q2_K

    * Iterating

    * Revert Q5_K back to make_qkx1_quants

    * Better Q6_K

    * make_qkx2_quants is better for Q5_K after all

    * Fix after rebasing on master

    * Fix for changed tensor names

    ---------

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 39cc83e8c9fafe1494c4996b07f97afed29c9f27
Merge: 2d17c22 6381d4e
Author: Concedo <[email protected]>
Date:   Tue Aug 22 23:12:47 2023 +0800

    incomplete merge, compiles but generates rubbish

commit 519c981f8b65ee6c87c2965539685ced0a17223b
Author: slaren <[email protected]>
Date:   Tue Aug 22 16:03:12 2023 +0200

    embedding : evaluate prompt in batches (#2713)

commit 1123f7fbdfb8012e46f05e903e6f675922916378
Author: slaren <[email protected]>
Date:   Tue Aug 22 15:25:19 2023 +0200

    ggml-cuda : use graph allocator (#2684)

    use a different function for no_alloc to avoid breaking backwards compat, fixes lora

    remove 512 n_batch limit

    fixed 2048 batch size

    cleanup

    Co-authored-by: Johannes Gäßler <[email protected]>

commit ef3f333d3775600d1646a9fa249aca532d15fb89
Author: Georgi Gerganov <[email protected]>
Date:   Tue Aug 22 14:22:08 2023 +0300

    ggml : sync latest (SAM + SD operators, CUDA alibi) (#2709)

    * ggml : sync latest (SAM + SD operators, CUDA alibi)

    ggml-ci

    * ggml : fix tabs

commit 2d17c224376c0fb2d6cfce8726de5a5f7b666bfe
Merge: 36b0c5b dadbed9
Author: Concedo <[email protected]>
Date:   Tue Aug 22 18:20:06 2023 +0800

    functional commit before gguf merge

commit 8e4364f2af9cd5d57240f23e83c0e29bc068bc02
Author: slaren <[email protected]>
Date:   Tue Aug 22 09:56:03 2023 +0200

    llama-bench : minor fixes (#2695)

commit 1e3bc523d8053a77df3ac7126a84d0297ee97ef6
Author: Kylin <[email protected]>
Date:   Tue Aug 22 15:14:23 2023 +0800

    ggml : support CUDA's half type for aarch64(#1455) (#2670)

    * ggml: support CUDA's half type for aarch64(#1455)
    support CUDA's half type for aarch64 in ggml_fp16_t definition

    * ggml: use __CUDACC__ to recognise nvcc compiler

commit 14b1d7e6f720dee41ce5a826376df738096d9033
Author: Shouzheng Liu <[email protected]>
Date:   Tue Aug 22 02:18:40 2023 -0400

    metal : add missing barriers for mul-mat (#2699)

commit 226255b44ef2c2794bfac48d101d35a9c2dbb965
Author: Jhen-Jie Hong <[email protected]>
Date:   Tue Aug 22 08:32:00 2023 +0800

    server : fallback to default if client param is null (#2688)

    * server : fallback to default if client param is null

    * server : do not overwrite 404 if status is 500 from exception_handler

commit 930523c8e1cbbee5449c055daa894917fac6805e
Author: Kerfuffle <[email protected]>
Date:   Mon Aug 21 18:01:34 2023 -0600

    Fix convert-llama-ggmlv3-to-gguf.py vocab conversion (#2698)

    When converting without metadata, the hex value for bytes entries weren't 0 padded to 2 digits.

commit 2d86b2e219ef988878bdea7e33a534aad3a744da
Author: Pontus Mårdnäs <[email protected]>
Date:   Mon Aug 21 23:46:56 2023 +0200

    Add --config argument

commit c8dba409e6d6a754090f08e6a862c5ffdd52e421
Author: Georgi Gerganov <[email protected]>
Date:   Mon Aug 21 23:40:22 2023 +0300

    py : remove obsolete script

commit 6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9
Author: Georgi Gerganov <[email protected]>
Date:   Mon Aug 21 23:07:43 2023 +0300

    gguf : new file format with flexible meta data (beta) (#2398)

    * gguf : first API pass

    * gguf : read header + meta data

    * gguf : read tensor info

    * gguf : initial model loading - not tested

    * gguf : add gguf_get_tensor_name()

    * gguf : do not support passing existing ggml_context to gguf_init

    * gguf : simplify gguf_get_val

    * gguf : gguf.c is now part of ggml.c

    * gguf : read / write sample models

    * gguf : add comments

    * refactor : reduce code duplication and better API (#2415)

    * gguf : expose the gguf_type enum through the API for now

    * gguf : add array support

    * gguf.py : some code style changes

    * convert.py : start a new simplified implementation by removing old stuff

    * convert.py : remove GGML vocab + other obsolete stuff

    * GGUF : write tensor (#2426)

    * WIP: Write tensor

    * GGUF : Support writing tensors in Python

    * refactor : rm unused import and upd todos

    * fix : fix errors upd writing example

    * rm example.gguf

    * gitignore *.gguf

    * undo formatting

    * gguf : add gguf_find_key (#2438)

    * gguf.cpp : find key example

    * ggml.h : add gguf_find_key

    * ggml.c : add gguf_find_key

    * gguf : fix writing tensors

    * gguf : do not hardcode tensor names to read

    * gguf : write sample tensors to read

    * gguf : add tokenization constants

    * quick and dirty conversion example

    * gguf : fix writing gguf arrays

    * gguf : write tensors one by one and code reuse

    * gguf : fix writing gguf arrays

    * gguf : write tensors one by one

    * gguf : write tensors one by one

    * gguf : write tokenizer data

    * gguf : upd gguf conversion script

    * Update convert-llama-h5-to-gguf.py

    * gguf : handle already encoded string

    * ggml.h : get array str and f32

    * ggml.c : get arr str and f32

    * gguf.py : support any type

    * Update convert-llama-h5-to-gguf.py

    * gguf : fix set is not subscriptable

    * gguf : update convert-llama-h5-to-gguf.py

    * constants.py : add layer norm eps

    * gguf.py : add layer norm eps and merges

    * ggml.h : increase GGML_MAX_NAME to 64

    * ggml.c : add gguf_get_arr_n

    * Update convert-llama-h5-to-gguf.py

    * add gptneox gguf example

    * Makefile : add gptneox gguf example

    * Update convert-llama-h5-to-gguf.py

    * add gptneox gguf example

    * Update convert-llama-h5-to-gguf.py

    * Update convert-gptneox-h5-to-gguf.py

    * Update convert-gptneox-h5-to-gguf.py

    * Update convert-llama-h5-to-gguf.py

    * gguf : support custom alignment value

    * gguf : fix typo in function call

    * gguf : mmap tensor data example

    * fix : update convert-llama-h5-to-gguf.py

    * Update convert-llama-h5-to-gguf.py

    * convert-gptneox-h5-to-gguf.py : Special tokens

    * gptneox-main.cpp : special tokens

    * Update gptneox-main.cpp

    * constants.py : special tokens

    * gguf.py : accumulate kv and tensor info data + special tokens

    * convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens

    * gguf : gguf counterpart of llama-util.h

    * gguf-util.h : update note

    * convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens

    * convert-llama-h5-to-gguf.py : special tokens

    * Delete gptneox-common.cpp

    * Delete gptneox-common.h

    * convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer

    * gptneox-main.cpp : gpt2 bpe tokenizer

    * gpt2 bpe tokenizer (handles merges and unicode)

    * Makefile : remove gptneox-common

    * gguf.py : bytesarray for gpt2bpe tokenizer

    * cmpnct_gpt2bpe.hpp : comments

    * gguf.py : use custom alignment if present

    * gguf : minor stuff

    * Update gptneox-main.cpp

    * map tensor names

    * convert-gptneox-h5-to-gguf.py : map tensor names

    * convert-llama-h5-to-gguf.py : map tensor names

    * gptneox-main.cpp : map tensor names

    * gguf : start implementing libllama in GGUF (WIP)

    * gguf : start implementing libllama in GGUF (WIP)

    * rm binary commited by mistake

    * upd .gitignore

    * gguf : calculate n_mult

    * gguf :  inference with 7B model working (WIP)

    * gguf : rm deprecated function

    * gguf : start implementing gguf_file_saver (WIP)

    * gguf : start implementing gguf_file_saver (WIP)

    * gguf : start implementing gguf_file_saver (WIP)

    * gguf : add gguf_get_kv_type

    * gguf : add gguf_get_kv_type

    * gguf : write metadata in gguf_file_saver (WIP)

    * gguf : write metadata in gguf_file_saver (WIP)

    * gguf : write metadata in gguf_file_saver

    * gguf : rm references to old file formats

    * gguf : shorter name for member variable

    * gguf : rm redundant method

    * gguf : get rid of n_mult, read n_ff from file

    * Update gguf_tensor_map.py

    * Update gptneox-main.cpp

    * gguf : rm references to old file magics

    * gguf : start implementing quantization (WIP)

    * gguf : start implementing quantization (WIP)

    * gguf : start implementing quantization (WIP)

    * gguf : start implementing quantization (WIP)

    * gguf : start implementing quantization (WIP)

    * gguf : start implementing quantization (WIP)

    * gguf : quantization is working

    * gguf : proper closing of file

    * gguf.py : no need to convert tensors twice

    * convert-gptneox-h5-to-gguf.py : no need to convert tensors twice

    * convert-llama-h5-to-gguf.py : no need to convert tensors twice

    * convert-gptneox-h5-to-gguf.py : simplify nbytes

    * convert-llama-h5-to-gguf.py : simplify nbytes

    * gptneox-main.cpp : n_layer --> n_block

    * constants.py : n_layer --> n_block

    * gguf.py : n_layer --> n_block

    * convert-gptneox-h5-to-gguf.py : n_layer --> n_block

    * convert-llama-h5-to-gguf.py : n_layer --> n_block

    * gptneox-main.cpp : n_layer --> n_block

    * Update gguf_tensor_map.py

    * convert-gptneox-h5-to-gguf.py : load model in parts to save memory

    * convert-llama-h5-to-gguf.py : load model in parts to save memory

    * convert : write more metadata for LLaMA

    * convert : rm quantization version

    * convert-gptneox-h5-to-gguf.py : add file_type key

    * gptneox-main.cpp : add file_type key

    * fix conflicts

    * gguf : add todos and comments

    * convert-gptneox-h5-to-gguf.py : tensor name map changes

    * Create gguf_namemap.py : tensor name map changes

    * Delete gguf_tensor_map.py

    * gptneox-main.cpp : tensor name map changes

    * convert-llama-h5-to-gguf.py : fixes

    * gguf.py : don't add empty strings

    * simple : minor style changes

    * gguf : use UNIX line ending

    * Create convert-llama-7b-pth-to-gguf.py

    * llama : sync gguf-llama.cpp with latest llama.cpp (#2608)

    * llama : sync gguf-llama.cpp with latest llama.cpp

    * minor : indentation + assert

    * llama : refactor gguf_buffer and gguf_ctx_buffer

    * llama : minor

    * gitignore : add gptneox-main

    * llama : tokenizer fixes (#2549)

    * Merge tokenizer fixes into the gguf branch.

    * Add test vocabularies

    * convert : update convert-new.py with tokenizer fixes (#2614)

    * Merge tokenizer fixes into the gguf branch.

    * Add test vocabularies

    * Adapt convert-new.py (and fix a clang-cl compiler error on windows)

    * llama : sync gguf-llama with llama (#2613)

    * llama : sync gguf-llama with llama

    * tests : fix build + warnings (test-tokenizer-1 still fails)

    * tests : fix wstring_convert

    * convert : fix layer names

    * llama : sync gguf-llama.cpp

    * convert : update HF converter to new tokenizer voodoo magics

    * llama : update tokenizer style

    * convert-llama-h5-to-gguf.py : add token types

    * constants.py : add token types

    * gguf.py : add token types

    * convert-llama-7b-pth-to-gguf.py : add token types

    * gguf-llama.cpp :  fix n_head_kv

    * convert-llama-h5-to-gguf.py : add 70b gqa support

    * gguf.py : add tensor data layout

    * convert-llama-h5-to-gguf.py : add tensor data layout

    * convert-llama-7b-pth-to-gguf.py : add tensor data layout

    * gptneox-main.cpp : add tensor data layout

    * convert-llama-h5-to-gguf.py : clarify the reverse permute

    * llama : refactor model loading code (#2620)

    * llama : style formatting + remove helper methods

    * llama : fix quantization using gguf tool

    * llama : simplify gguf_file_saver

    * llama : fix method names

    * llama : simplify write_header()

    * llama : no need to pass full file loader to the file saver

    just gguf_ctx

    * llama : gguf_file_saver write I32

    * llama : refactor tensor names (#2622)

    * gguf: update tensor names searched in quantization

    * gguf : define tensor names as constants

    * gguf : initial write API (not tested yet)

    * gguf : write to file API (not tested)

    * gguf : initial write API ready + example

    * gguf : fix header write

    * gguf : fixes + simplify example + add ggml_nbytes_pad()

    * gguf : minor

    * llama : replace gguf_file_saver with new gguf write API

    * gguf : streaming support when writing files

    * gguf : remove obsolete write methods

    * gguf : remove obsolete gguf_get_arr_xxx API

    * llama : simplify gguf_file_loader

    * llama : move hparams and vocab from gguf_file_loader to llama_model_loader

    * llama : merge gguf-util.h in llama.cpp

    * llama : reorder definitions in .cpp to match .h

    * llama : minor simplifications

    * llama : refactor llama_model_loader (WIP)

    wip : remove ggml_ctx from llama_model_loader

    wip : merge gguf_file_loader in llama_model_loader

    * llama : fix shape prints

    * llama : fix Windows build + fix norm_rms_eps key

    * llama : throw error on missing KV pairs in model metadata

    * llama : improve printing + log meta data

    * llama : switch print order of meta data

    ---------

    Co-authored-by: M. Yusuf Sarıgöz <[email protected]>

    * gguf : deduplicate (#2629)

    * gguf : better type names

    * dedup : CPU + Metal is working

    * ggml : fix warnings about unused results

    * llama.cpp : fix line feed and compiler warning

    * llama : fix strncpy warning + note token_to_str does not write null

    * llama : restore the original load/save session implementation

    Will migrate this to GGUF in the future

    * convert-llama-h5-to-gguf.py : support alt ctx param name

    * ggml : assert when using ggml_mul with non-F32 src1

    * examples : dedup simple

    ---------

    Co-authored-by: klosax <[email protected]>

    * gguf.py : merge all files in gguf.py

    * convert-new.py : pick #2427 for HF 70B support

    * examples/gguf : no need to keep q option for quantization any more

    * llama.cpp : print actual model size

    * llama.cpp : use ggml_elements()

    * convert-new.py : output gguf (#2635)

    * convert-new.py : output gguf (WIP)

    * convert-new.py : add gguf key-value pairs

    * llama : add hparams.ctx_train + no longer print ftype

    * convert-new.py : minor fixes

    * convert-new.py : vocab-only option should work now

    * llama : fix tokenizer to use llama_char_to_byte

    * tests : add new ggml-vocab-llama.gguf

    * convert-new.py : tensor name mapping

    * convert-new.py : add map for skipping tensor serialization

    * convert-new.py : convert script now works

    * gguf.py : pick some of the refactoring from #2644

    * convert-new.py : minor fixes

    * convert.py : update to support GGUF output

    * Revert "ci : disable CI temporary to not waste energy"

    This reverts commit 7e82d25f40386540c2c15226300ad998ecd871ea.

    * convert.py : n_head_kv optional and .gguf file extension

    * convert.py : better always have n_head_kv and default it to n_head

    * llama : sync with recent PRs on master

    * editorconfig : ignore models folder

    ggml-ci

    * ci : update ".bin" to ".gguf" extension

    ggml-ci

    * llama : fix llama_model_loader memory leak

    * gptneox : move as a WIP example

    * llama : fix lambda capture

    ggml-ci

    * ggml : fix bug in gguf_set_kv

    ggml-ci

    * common.h : .bin --> .gguf

    * quantize-stats.cpp : .bin --> .gguf

    * convert.py : fix HF tensor permuting / unpacking

    ggml-ci

    * llama.cpp : typo

    * llama : throw error if gguf fails to init from file

    ggml-ci

    * llama : fix tensor name grepping during quantization

    ggml-ci

    * gguf.py : write tensors in a single pass (#2644)

    * gguf : single pass for writing tensors + refactoring writer

    * gguf : single pass for writing tensors + refactoring writer

    * gguf : single pass for writing tensors + refactoring writer

    * gguf : style fixes in simple conversion script

    * gguf : refactor gptneox conversion script

    * gguf : rename h5 to hf (for HuggingFace)

    * gguf : refactor pth to gguf conversion script

    * gguf : rm file_type key and method

    * gguf.py : fix vertical alignment

    * gguf.py : indentation

    ---------

    Co-authored-by: Georgi Gerganov <[email protected]>

    * convert-gptneox-hf-to-gguf.py : fixes

    * gguf.py : gptneox mapping

    * convert-llama-hf-to-gguf.py : fixes

    * convert-llama-7b-pth-to-gguf.py : fixes

    * ggml.h : reverse GGUF_MAGIC

    * gguf.py : reverse GGUF_MAGIC

    * test-tokenizer-0.cpp : fix warning

    * llama.cpp : print kv general.name

    * llama.cpp : get special token kv and linefeed token id

    * llama : print number of tensors per type + print arch + style

    * tests : update vocab file with new magic

    * editorconfig : fix whitespaces

    * llama : re-order functions

    * llama : remove C++ API + reorganize common source in /common dir

    * llama : minor API updates

    * llama : avoid hardcoded special tokens

    * llama : fix MPI build

    ggml-ci

    * llama : introduce enum llama_vocab_type + remove hardcoded string constants

    * convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested

    * falcon-main.cpp : falcon inference example

    * convert-falcon-hf-to-gguf.py : remove extra kv

    * convert-gptneox-hf-to-gguf.py : remove extra kv

    * convert-llama-7b-pth-to-gguf.py : remove extra kv

    * convert-llama-hf-to-gguf.py : remove extra kv

    * gguf.py : fix for falcon 40b

    * falcon-main.cpp : fix for falcon 40b

    * convert-falcon-hf-to-gguf.py : update ref

    * convert-falcon-hf-to-gguf.py : add tensor data layout

    * cmpnct_gpt2bpe.hpp : fixes

    * falcon-main.cpp : fixes

    * gptneox-main.cpp : fixes

    * cmpnct_gpt2bpe.hpp : remove non-general stuff

    * Update examples/server/README.md

    Co-authored-by: slaren <[email protected]>

    * cmpnct_gpt2bpe.hpp : cleanup

    * convert-llama-hf-to-gguf.py : special tokens

    * convert-llama-7b-pth-to-gguf.py : special tokens

    * convert-permute-debug.py : permute debug print

    * convert-permute-debug-master.py : permute debug for master

    * convert-permute-debug.py : change permute type of attn_q

    * convert.py : 70b model working (change attn_q permute)

    * Delete convert-permute-debug-master.py

    * Delete convert-permute-debug.py

    * convert-llama-hf-to-gguf.py : fix attn_q permute

    * gguf.py : fix rope scale kv

    * convert-llama-hf-to-gguf.py : rope scale and added tokens

    * convert-llama-7b-pth-to-gguf.py : rope scale and added tokens

    * llama.cpp : use rope scale kv

    * convert-llama-7b-pth-to-gguf.py : rope scale fix

    * convert-llama-hf-to-gguf.py : rope scale fix

    * py : fix whitespace

    * gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682)

    * First pass at converting GGMLv3 LLaMA models to GGUF

    * Cleanups, better output during conversion

    * Fix vocab space conversion logic

    * More vocab conversion fixes

    * Add description to converted GGUF files

    * Improve help text, expand warning

    * Allow specifying name and description for output GGUF

    * Allow overriding vocab and hyperparams from original model metadata

    * Use correct params override var name

    * Fix wrong type size for Q8_K

    Better handling of original style metadata

    * Set default value for gguf add_tensor raw_shape KW arg

    * llama : improve token type support (#2668)

    * Merge tokenizer fixes into the gguf branch.

    * Add test vocabularies

    * Adapt convert-new.py (and fix a clang-cl compiler error on windows)

    * Improved tokenizer test

    But does it work on MacOS?

    * Improve token type support

    - Added @klosax code to convert.py
    - Improved token type support in vocabulary

    * Exclude platform dependent tests

    * More sentencepiece compatibility by eliminating magic numbers

    * Restored accidentally removed comment

    * llama : add API for token type

    ggml-ci

    * tests : use new tokenizer type API (#2692)

    * Merge tokenizer fixes into the gguf branch.

    * Add test vocabularies

    * Adapt convert-new.py (and fix a clang-cl compiler error on windows)

    * Improved tokenizer test

    But does it work on MacOS?

    * Improve token type support

    - Added @klosax code to convert.py
    - Improved token type support in vocabulary

    * Exclude platform dependent tests

    * More sentencepiece compatibility by eliminating magic numbers

    * Restored accidentally removed comment

    * Improve commentary

    * Use token type API in test-tokenizer-1.cpp

    * py : cosmetics

    * readme : add notice about new file format

    ggml-ci

    ---------

    Co-authored-by: M. Yusuf Sarıgöz <[email protected]>
    Co-authored-by: klosax <[email protected]>
    Co-authored-by: goerch <[email protected]>
    Co-authored-by: slaren <[email protected]>
    Co-authored-by: Kerfuffle <[email protected]>

commit dadbed99e65252d79f81101a392d0d6497b86caa
Author: Shouzheng Liu <[email protected]>
Date:   Mon Aug 21 06:59:29 2023 -0400

    metal : fix synchronization in new matrix multiplication kernel (#2686)

commit cb1c0727bd59803b439b6a3af121c99e6393ff3d
Author: Kawrakow <[email protected]>
Date:   Mon Aug 21 11:11:31 2023 +0300

    HellaSwag: split token evaluation into batches if needed (#2681)

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit 9e232f0234073358e7031c1b8d7aa45020469a3b
Author: slaren <[email protected]>
Date:   Sun Aug 20 22:17:53 2023 +0200

    ggml : move all type info to ggml_type_traits (#2663)

commit 5e9ff54a675d163d9f42aad1b5b3e734f17b2701
Author: Kawrakow <[email protected]>
Date:   Sun Aug 20 16:44:46 2023 +0300

    More efficient Hellaswag implementation (#2677)

    Co-authored-by: Iwan Kawrakow <[email protected]>

commit b34f4bd2724733e188ec4f6074042f66a5ed28c9
Author: YellowRoseCx <[email protected]>
Date:   Sat Aug 19 17:12:52 2023 -0500

    Update README.md

commit 1f0bccb27929e261744c979bc75114955da49e98
Author: Georgi Gerganov <[email protected]>
Date:   Sat Aug 19 00:45:36 2023 +0300

    server : better default prompt (#2646)

commit f63564adfaa157ca387071d6b9a06cfaef0ef576
Author: Jhen-Jie Hong <[email protected]>
Date:   Sat Aug 19 05:41:32 2023 +0800

    server : update xxd usage for older versions compatibility (#2649)

    * server : update xxd usage for older versions compatibility

    * remove unused $func

commit 2d8b76a110d76ff6b5728ff0af8477531e4db60e
Author: Adrian <[email protected]>
Date:   Fri Aug 18 12:39:22 2023 -0700

    Add link to clojure bindings to Readme. (#2659)

commit 7af633aec339367e36c867ae709088d6a801aa75
Author: Georgi Gerganov <[email protected]>
Date:   Fri Aug 18 17:48:31 2023 +0300

    readme : incoming BREAKING CHANGE

commit 097e121e2f17ed3541cf02c55ff7e9febc091b19
Author: slaren <[email protected]>
Date:   Fri Aug 18 12:44:58 2023 +0200

    llama : add benchmark example (#2626)

    * llama : add benchmark example

    * add to examples CMakeLists.txt

    * fix msvc build

    * add missing include

    * add Bessel's correction to stdev calculation

    Co-authored-by: Johannes Gäßler <[email protected]>

    * improve markdown formatting

    * add missing include

    * print warning if NDEBUG is not defined

    * remove n_prompt and n_gen from the matrix, use each value separately instead

    * better checks for non-optimized builds

    * llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call

    * fix json formatting

    * add sql output

    * add basic cpu and gpu info (linux/cuda only)

    * markdown: also show values that differ from the default

    * markdown: add build id

    * cleanup

    * improve formatting

    * formatting

    ---------

    Co-authored-by: Johannes Gäßler <[email protected]>

commit eaf98c2649d7da705de255712f0038ac7e47c610
Author: mdrokz <[email protected]>
Date:   Fri Aug 18 15:47:58 2023 +0530

    readme : add link to Rust bindings (#2656)

commit e9b12c332ec6e215fbac4b2ef165353acbcd8319
Author: Georgi Gerganov <[email protected]>
Date:   Fri Aug 18 12:48:55 2023 +0300

    perplexity : more meaningful ETA number - 2 decimal points

commit 604b8bdfa6320bbcb018eebcc1252dfede603c6b
Author: Evan Jones <[email protected]>
Date:   Thu Aug 17 19:54:44 2023 -0400

    Fix unicode in grammars (fixes #2501) (#2553)

    * Fix unicode in grammars (fixes #2501)

    * add more comments

    * fix test-llama-grammar

commit 10151bee2e38b5711335c4a38f6ca93b50223acf
Author: staviq <[email protected]>
Date:   Thu Aug 17 23:34:01 2023 +0000

    server : support for saving templates in browser LocalStorage (#2486)

    * support for templates in browser LocalStorage

    * sync accepted #2409 fix from upstream

    * convert autosave invocation to useEffect

    * Apply suggestions from code review

    Co-authored-by: Jhen-Jie Hong <[email protected]>

    * Regen index.html.cpp, suggested from code review

    ---------

    Co-authored-by: Jhen-Jie Hong <[email protected]>

commit 0992a7b8b18a89e29a205efb48ceb559c9a08203
Author: Johannes Gäßler <[email protected]>
Date:   Thu Aug 17 23:57:59 2023 +0200

    README: fix LLAMA_CUDA_MMV_Y documentation (#2647)

commit 6ddeefad9b634c5c79e6bcf046523493ff1fdf7d
Author: Henri Vasserman <[email protected]>
Date:   Thu Aug 17 23:11:18 2023 +0300

    [Zig] Fixing Zig build and improvements (#2554)

    * Fix zig after console.o was split

    * Better include and flag management

    * Change LTO to option

commit 36b0c5b39816c039a5235733cfcd2b4e32386ff9
Author: Concedo <[email protected]>
Date:   Thu Aug 17 22:45:49 2023 +0800

    fix for incorrect missing backends displayed

commit 8dae7ce68437faf1fa96ec0e7687b8700956ef20
Author: Kerfuffle <[email protected]>
Date:   Thu Aug 17 07:29:44 2023 -0600

    Add --cfg-negative-prompt-file option for examples (#2591)

    Add --cfg-negative-prompt-file option for examples

commit a73ccf1aa34de49f61bfeb7f8a679c3bfdb3abe3
Author: Georgi Gerganov <[email protected]>
Date:   Thu Aug 17 10:47:09 2023 +0300

    llama : replace (permute + reshape + view_1d) with (view_3d) (#2538)

    ggml-ci

commit 7cf54e1f746941279d81d485796777c01f88049c
Author: drbh <[email protected]>
Date:   Thu Aug 17 03:41:01 2023 -0400

    tests : adds simple llama grammar tests (#2618)

    * adds simple llama grammar tests

    * fix lint and add Makefile

    * 0 terminate code_points

    * avoid dangling pointers in candidate cleanup

    * cleanup grammar at end of test

commit a872a2b28eaefc8d464eaa535c94deeb501666f9
Author: Shouzheng Liu <[email protected]>
Date:   Thu Aug 17 03:35:53 2023 -0400

    ggml-alloc : fix discrepancy between measure & eval (#2639)

    The GGML memory allocator consistently places a tensor within the
    optimal-fit memory block, which is the smallest block capable of
    accommodating the tensor's size. During the measurement phase, the final
    block is generously sized, ensuring it never qualifies as the
    optimal-fit block as long as there exists another block capable of
    accommodating the tensor. Nevertheless, in the evaluation phase, the
    last block is constrained in size and could potentially qualify as the
    optimal-fit block. Consequently, there exists the possibility of a
    tensor being allocated to a different region during evaluation, leading
    to more memory fragmentation in our scratch buffer.

    This recent commit guarantees uniform behavior of the allocator across
    both the measurement and evaluation phases, eliminating discrepancies
    between the two.
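
    A minimal sketch, in plain C rather than ggml-alloc's actual code, of the best-fit
    behavior described above: the smallest block that fits wins, so whether the last
    block is generously sized (measure) or tightly sized (eval) can change which block
    a tensor lands in. The struct and function names here are hypothetical.

        // Hypothetical best-fit selection over free blocks (illustration only).
        #include <stddef.h>

        struct free_block { size_t size; void * addr; };

        // Returns the smallest block that still fits `size`. An oversized final block
        // never wins; a tightly sized one can, which moves the tensor to a different
        // region than the one chosen during measurement.
        static struct free_block * best_fit(struct free_block * blocks, int n, size_t size) {
            struct free_block * best = NULL;
            for (int i = 0; i < n; i++) {
                if (blocks[i].size >= size && (best == NULL || blocks[i].size < best->size)) {
                    best = &blocks[i];
                }
            }
            return best;
        }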

commit 0919a0f73d95cfb93a1646a1d1741a0615fe2c5e
Author: Kolen Cheung <[email protected]>
Date:   Wed Aug 16 21:09:49 2023 +0100

    cmake : install ggml-meta.metal if LLAMA_METAL (#2449)

commit ed53db86c3b0e0815331a96d7a379edb5e62472c
Author: Jhen-Jie Hong <[email protected]>
Date:   Thu Aug 17 04:09:03 2023 +0800

    metal : print error of load pipeline state (#2564)

    * metal : print error of load pipeline state

    * metal : return null if load pipeline failed

commit fc8ef549e50087762a0b4f901cd74b2defcc6ae3
Author: Shouzheng Liu <[email protected]>
Date:   Wed Aug 16 16:08:28 2023 -0400

    metal : enable ggml-alloc (#2627)

    * metal: enable ggml-alloc

    Make ggml-alloc work with concurrent dispatch.

    * style-fix

    Co-authored-by: slaren <[email protected]>

    ---------

    Co-authored-by: slaren <[email protected]>
    Co-authored-by: Georgi Gerganov <[email protected]>

commit bf83bff6742c0f1795b4c18695a13a34ac7adf62
Author: Shouzheng Liu <[email protected]>
Date:   Wed Aug 16 16:07:04 2023 -0400

    metal : matrix-matrix multiplication kernel (#2615)

    * metal: matrix-matrix multiplication kernel

    This commit removes MPS and uses custom matrix-matrix multiplication
    kernels for all quantization types. This commit also adds grouped-query
    attention to support llama2 70B.

    * metal: fix performance degradation from gqa

    Integers are slow on the GPU, and 64-bit divides are extremely slow.
    In the context of GQA, we introduce a 64-bit divide that cannot be
    optimized out by the compiler, which results in a decrease of ~8% in
    inference performance. This commit fixes that issue by calculating a
    part of the offset with a 32-bit divide. Naturally, this limits the
    size of a single matrix to ~4GB. However, this limitation should
    suffice for the near future.

    * metal: fix bugs for GQA and perplexity test.

    I mixed up ne02 and nb02 in previous commit.
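
    As a hedged illustration of the 32-bit divide workaround mentioned in the gqa fix
    above (this is not the actual Metal kernel code): only the head index needs the
    divide, so it can be done in 32 bits as long as a single matrix stays under ~4 GB,
    and only the final byte offset is computed in 64 bits. The names rows_per_kv_head,
    nb01 and nb02 are stand-ins for the real ratio/stride values.

        // Hypothetical offset computation that avoids a 64-bit divide.
        #include <stdint.h>

        static uint64_t src_offset(uint32_t row, uint32_t rows_per_kv_head,
                                   uint64_t nb01, uint64_t nb02) {
            uint32_t kv_head     = row / rows_per_kv_head; // 32-bit divide, cheap on the GPU
            uint32_t row_in_head = row % rows_per_kv_head; // 32-bit modulo
            return (uint64_t) kv_head * nb02 + (uint64_t) row_in_head * nb01; // 64-bit only here
        }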

commit 075d079a72c741050a4c31a27530c8af19df70a6
Merge: 469d70b b5ffb28
Author: Concedo <[email protected]>
Date:   Wed Aug 16 10:43:06 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	CMakeLists.txt
    #	Makefile
    #	ggml-cuda.cu
    #	llama-util.h
    #	tests/CMakeLists.txt

commit b5ffb2849d23afe73647f68eec7b68187af09be6
Author: Georgi Gerganov <[email protected]>
Date:   Tue Aug 15 10:04:58 2023 +0300

    scripts : add helper script to get wikitext

commit 469d70be45dfdac4d926c1326b579e88d0f0e040
Author: Concedo <[email protected]>
Date:   Tue Aug 15 13:49:05 2023 +0800

    add support for precompiled binaries, used as a fallback

commit 7d1196108ad330b32845546fb3472c2172a0b6b8
Author: YellowRoseCx <[email protected]>
Date:   Mon Aug 14 23:03:12 2023 -0500

    remove force DMMV

commit 3ebb00935f3f0522b75df49c2769ab1774b91380
Author: Jhen-Jie Hong <[email protected]>
Date:   Tue Aug 15 06:14:14 2023 +0800

    server : add missing /json-schema-to-grammar.mjs (#2616)

    fixes #2611

commit d783f7982e0e823a2626a9956359c0d36c1a7e21
Author: Jhen-Jie Hong <[email protected]>
Date:   Mon Aug 14 21:37:39 2023 +0800

    metal : return null instead of exit(1) (#2573)

commit d75561df207d22790609ee0ad924302f66ac2599
Author: Cheng Shao <[email protected]>
Date:   Mon Aug 14 15:36:42 2023 +0200

    server : add --numa support (#2524)

commit 348acf188c9fbe66396990f2dc83229df367969b
Author: Kamil Tomšík <[email protected]>
Date:   Mon Aug 14 15:35:16 2023 +0200

    llama : add missing enum keyword in function signatures (#2610)

commit 1cd06fa25eb859b14b3427a1d815a48f25fc3c34
Author: Johannes Gäßler <[email protected]>
Date:   Mon Aug 14 10:41:22 2023 +0200

    CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596)

commit 2feb8934eb75ca63f3c42724229cce1df9579c8e
Author: Jhen-Jie Hong <[email protected]>
Date:   Mon Aug 14 16:20:17 2023 +0800

    server : fix default grammar by using empty string in the UI (#2604)

commit 5517d6e69214cdead000a76983b9fe175c3f8329
Author: Jhen-Jie Hong <[email protected]>
Date:   Mon Aug 14 15:16:54 2023 +0800

    server : implement json-schema-to-grammar.mjs & add grammar param in the UI (#2588)

    * server : implement json-schema-to-grammar.mjs by follow python impl

    * server : add grammar support in chat.mjs

    * server : implement grammar param in the UI

    * server : generate .hpp

    * server : remove trailing whitespaces

    * server : generate .hpp

    * server : fix sort of prop pairs

    * server : optimize regex & iteration

commit f31b5397143009d682db90fd2a6cde83f1ef00eb
Author: vxiiduu <[email protected]>
Date:   Mon Aug 14 13:59:16 2023 +1000

    Enhance Windows 7 and below compatibility. (#2592)

    * Enhance Windows 7 compatibility.
    * Clean away unnecessary preprocessor conditional

commit ee77efea2a1e3f7d153976b0934522b6bbaa62e6
Author: drbh <[email protected]>
Date:   Sun Aug 13 10:00:48 2023 -0400

    test : add simple grammar parsing tests (#2594)

    * adds simple grammar parsing tests

    * adds cassert header

commit f64d44a9b9581cd58f7ec40f4fa1c3ca5ca18e1e
Author: Johannes Gäßler <[email protected]>
Date:   Sun Aug 13 00:24:45 2023 +0200

    CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590)

commit cd61aa0d9e16627935c7978adf488a679ddfa745
Author: YellowRoseCx <[email protected]>
Date:   Sat Aug 12 17:24:31 2023 -0500

    restore main_gpu parameter

commit 4a042f326830271a4c31104051b7b08e08ac234e
Author: Henri Vasserman <[email protected]>
Date:   Sat Aug 12 10:51:46 2023 +0300

    gfx1100 support

    ---------

    Co-authored-by: ardfork <[email protected]>
    Co-authored-by: jammm <[email protected]>
    Co-authored-by: jdecourval <[email protected]>

commit 8913bc6fea97d3cb860937b0461f455c6abe3ea1
Author: Henri Vasserman <[email protected]>
Date:   Fri Aug 11 10:16:02 2023 +0300

    Allow overriding CC_TURING

commit e77a4c37a756c002e97173f4122e088fb304e18a
Author: Henri Vasserman <[email protected]>
Date:   Fri Aug 11 10:00:07 2023 +0300

    Merge 'origin/master' into hipblas

commit cc4c4e355cd553b1557d5fba2562e824db93f9b4
Author: Engininja2 <[email protected]>
Date:   Fri Aug 11 09:43:14 2023 +0300

    New __dp4a assembly

    Now compatible with gfx900 and faster as well.

commit 1a03b709848ce68d5bf5966237756167e2cac540
Author: Henri Vasserman <[email protected]>
Date:   Fri Aug 11 09:30:28 2023 +0300

    Undo mess

    ---------

    Co-authored-by: ardfork <[email protected]>

commit 4366ff9ba1b1f12e494118ef9b5198479022fcc5
Author: DannyDaemonic <[email protected]>
Date:   Thu Aug 10 13:11:36 2023 -0700

    Handle `ENABLE_VIRTUAL_TERMINAL_PROCESSING` more gracefully on earlier versions of Windows.

commit 811ff855a24323cafddc95c1b8aca711fef05f76
Author: Christian Demsar <[email protected]>
Date:   Thu Aug 10 10:28:27 2023 -0400

    Add --n-predict -2 for stopping generation on full context (#2565)

commit 37c9717aaa6815b6a5be21aaab970212f20fe6bf
Author: Martin Krasser <[email protected]>
Date:   Thu Aug 10 12:16:38 2023 +0200

    Fix grammar-based sampling issue in server (#2566)

commit 9483288e0318a4dcc2e08eb817dfdd09c6552533
Merge: dae9dff b19edd5
Author: Concedo <[email protected]>
Date:   Sat Aug 12 16:04:11 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	Makefile

commit b19edd54d51cef5e3616c18b1d0d8626895b2cba
Author: byte-6174 <[email protected]>
Date:   Fri Aug 11 19:17:25 2023 -0400

    Adding support for llama2.c models (#2559)

commit 53dc399472d5bd35ee739b865e843b1996bd3814
Author: Equim <[email protected]>
Date:   Sat Aug 12 06:35:14 2023 +0800

    server: fixed wrong variable name in timing json (#2579)

    * server: fixed wrong variable name in timing json

    * remove redundant entry

commit dae9dffa6aa53923cfbb09ac5de7e08f34920733
Author: Concedo <[email protected]>
Date:   Fri Aug 11 14:54:27 2023 +0800

    rename koboldcpp.dll to koboldcpp_default.dll

commit 9ca4abed893685692f90413e4d43153af12342d9
Author: DannyDaemonic <[email protected]>
Date:   Thu Aug 10 13:11:36 2023 -0700

    Handle `ENABLE_VIRTUAL_TERMINAL_PROCESSING` more gracefully on earlier versions of Windows.

commit d18ecd5b9e5dde58ae08a3eef1637406159ddaca
Author: YellowRoseCx <[email protected]>
Date:   Thu Aug 10 13:19:41 2023 -0500

    make mmq gen faster for amd

commit 243894a952147a4fac5b6aee748861a0df6cc2c6
Author: Henri Vasserman <[email protected]>
Date:   Thu Aug 10 12:14:40 2023 +0300

    ws fix

commit ac2f14da445ea87d73539adbd29d19ff2c9eba58
Author: Engininja2 <[email protected]>
Date:   Thu Aug 10 12:11:27 2023 +0300

    AMD assembly optimized __dp4a

    Doesn't seem to work for gfx900, so commented out.

commit 9dba0c985f140ddded8cbb671f139e81fff82eed
Author: Henri Vasserman <[email protected]>
Date:   Thu Aug 10 12:09:28 2023 +0300

    Fix merge

    ---------

    Co-authored-by: ardfork <[email protected]>
    Co-authored-by: Kerfuffle <[email protected]>

commit e59fcb2bc129881f4a269fee748fb38bce0a64de
Author: Christian Demsar <[email protected]>
Date:   Thu Aug 10 10:28:27 2023 -0400

    Add --n-predict -2 for stopping generation on full context (#2565)

commit 886f4eed7948f494e3da1d48d4f6f844e2f9a2c2
Author: Concedo <[email protected]>
Date:   Thu Aug 10 22:01:33 2023 +0800

    updated lite, up ver, remove bell

commit 1638757767072a4957f52b9e3594f0b67610631b
Author: Martin Krasser <[email protected]>
Date:   Thu Aug 10 12:16:38 2023 +0200

    Fix grammar-based sampling issue in server (#2566)

commit c5f5209d37b09325377e36f39eab0b0f0c0d006e
Author: Concedo <[email protected]>
Date:   Thu Aug 10 16:30:02 2023 +0800

    globalize args

commit f570b5cb1070591527a82d94bba408927b37778d
Author: YellowRoseCx <[email protected]>
Date:   Wed Aug 9 22:11:20 2023 -0500

    Revert "revert cuda changes as they are bugggy"

    This reverts commit 1541bf879772aeeed8ff646bfc52185c2a88b79b.

commit 1541bf879772aeeed8ff646bfc52185c2a88b79b
Author: Concedo <[email protected]>
Date:   Wed Aug 9 22:36:41 2023 +0800

    revert cuda changes as they are buggy

commit bacc20203efb1839aa313858a04d75255bb4b7f4
Author: YellowRoseCx <[email protected]>
Date:   Wed Aug 9 20:37:17 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit b7cb4cfd109986bd66e8fd382d1e2516eaddfebb
Author: YellowRoseCx <[email protected]>
Date:   Wed Aug 9 20:00:52 2023 -0500

    additional fixes

commit fadae727baa3735ad3e0667384d6e05ca056b3ef
Merge: 518eb2a 8f8ab6c
Author: YellowRoseCx <[email protected]>
Date:   Wed Aug 9 18:45:50 2023 -0500

    Merge branch 'hipblas' into develop4Main

commit 518eb2af9225f8300a108c4244c7eb0a2217c3bc
Merge: bda0215 cae6a84
Author: YellowRoseCx <[email protected]>
Date:   Wed Aug 9 18:32:10 2023 -0500

    Merge remote-tracking branch 'upstream/concedo' into develop2Main

commit bda0215b413bafc49890aa23fc35f96a191fb3e0
Author: YellowRoseCx <[email protected]>
Date:   Wed Aug 9 18:17:54 2023 -0500

    update makefile to multisystem path

commit 8f8ab6c4c049df501e9a5ed8fef3aa0fc0691421
Author: YellowRoseCx <[email protected]>
Date:   Wed Aug 9 18:05:03 2023 -0500

    hipLDFLAG Path change Unix to multisystem in Makefile

    changed the hardcoded linux distro hipblas LD path from -L/opt/rocm/lib to use the defined ROCM_PATH variable to be flexible with ROCm on non-Linux OS

commit 610ba4cfc460ed65c4adc32d3365a216690384d5
Merge: 4024f91 25d43e0
Author: Henri Vasserman <[email protected]>
Date:   Wed Aug 9 23:54:58 2023 +0300

    Merge 'origin/master' into hipblas

commit 916a9acdd0a411426690400ebe2bb7ce840a6bba
Author: Sam Spilsbury <[email protected]>
Date:   Wed Aug 9 23:47:42 2023 +0300

    ggml-alloc: Don't try to re-use buffers of external tensors (#2562)

    * ggml-alloc: Don't try to re-use buffers of external tensors

    They might be weights that came from another context, so we
    have no control over them (and they might be re-used elsewhere
    so writing to them would be a bad idea).

    * ggml-alloc: >= when checking for out-of-bounds

    Co-authored-by: slaren <[email protected]>

    ---------

    Co-authored-by: slaren <[email protected]>
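
    A small sketch of the guard this commit describes, under the assumption that a
    tensor with a non-NULL data pointer already owns (or borrows) its buffer; the
    helper name is hypothetical, only struct ggml_tensor and its data field come
    from ggml.

        // Hypothetical check: should the allocator place this tensor itself?
        #include <stdbool.h>
        #include <stddef.h>
        #include "ggml.h"

        static bool my_allocator_may_place(const struct ggml_tensor * t) {
            // A tensor that already points at data (e.g. weights mapped by another
            // context) is not ours to manage: it may be re-used elsewhere, so writing
            // to or recycling its buffer would be unsafe.
            return t->data == NULL;
        }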

commit ea04a4ca1940d92becc0ee26523aa2c4a18cf938
Author: grahameth <[email protected]>
Date:   Wed Aug 9 22:46:40 2023 +0200

    add log_callback to llama_context_params for custom logging. (#2234)

    * add log_callback to llama_context_params for custom logging.

    * Fix macro expansion on gcc

    * Add struct llama_state for global variables and move log_callback there

    * Turn log level into enum and some minor changes.

    * Remove model_for_logging parameter (not needed anymore)

    * Convert remaining fprintf(stderr, ...) calls to use new macros.

    * Fix enum and initialize g_state

    * Fix log calls after merge

    * Fix missing static

    * Add back all the new lines in the logging strings

    * Add comment for llama_log_callback and replace remaining printf calls

    ---------

    Co-authored-by: grahameth <->
    Co-authored-by: Helmut <[email protected]>

commit a07e6dd3ad1a622f08c3187799879d4f1c49bad4
Author: Concedo <[email protected]>
Date:   Wed Aug 9 22:36:41 2023 +0800

    revert cuda changes as they are buggy

commit f8376c7e610f68d07e079ff91f6988fb7a8399e2
Author: Concedo <[email protected]>
Date:   Wed Aug 9 21:23:33 2023 +0800

    up ver, fixed compile (+1 squashed commits)

    Squashed commits:

    [ca51aa9e] up ver

commit ba09f1c807956c59d8c64988626e95459f627ced
Merge: 3a7853d 25d43e0
Author: Concedo <[email protected]>
Date:   Wed Aug 9 21:18:34 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	README.md
    #	ggml-cuda.cu

commit 3a7853d259c242d4977e9f4dc7627a799d5812b4
Author: Concedo <[email protected]>
Date:   Wed Aug 9 21:07:57 2023 +0800

    handle stablecode-completion-alpha-3b

commit 25d43e0eb578b6e73046d9d6644a3a14d460600d
Author: Johannes Gäßler <[email protected]>
Date:   Wed Aug 9 09:42:34 2023 +0200

    CUDA: tuned mul_mat_q kernels (#2546)

commit 90058d96b0c6ab77802e153c23fad66d2f21a438
Author: Concedo <[email protected]>
Date:   Wed Aug 9 15:28:07 2023 +0800

    sleep longer before exit

commit 19cf2a8663938c424407544c13749f371104517b
Author: Concedo <[email protected]>
Date:   Wed Aug 9 12:42:59 2023 +0800

    add idle field and up ver

commit 4b8a354895e078d3f0cafdf53430d72d3af8bb99
Author: Concedo <[email protected]>
Date:   Wed Aug 9 12:25:21 2023 +0800

    cudatoolkit version

commit 159ad9269d95bc07720c79debc23b5c466357b53
Author: Concedo <[email protected]>
Date:   Wed Aug 9 11:50:12 2023 +0800

    up ver, set the cuda pool malloc lookahead back to 5% instead of 2% (+1 squashed commits)

    Squashed commits:

    [e0f65278] up ver, set the cuda pool malloc lookahead back to 5% instead of 2%

commit 4024f91a665d83b6de8658d45ec9d004c5d90c79
Author: Henri Vasserman <[email protected]>
Date:   Wed Aug 9 01:56:44 2023 +0300

    Add intrinsics polyfills for AMD

    ---------

    Co-authored-by: ardfork <[email protected]>
    Co-authored-by: funnbot <[email protected]>
    Co-authored-by: Engininja2 <[email protected]>

commit ab6212864ce8e9af200bcedb3e0126ee49aa8d0a
Merge: d91456a f5bfea0
Author: Henri Vasserman <[email protected]>
Date:   Wed Aug 9 00:37:01 2023 +0300

    Merge 'origin/master' into hipblas

commit 926d90fbabe836d16a5326eb99bdcb89ca0fc042
Merge: 793cfd1 f5bfea0
Author: Concedo <[email protected]>
Date:   Wed Aug 9 01:09:04 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	Makefile

commit 793cfd136cc721884f79d09036b748e4f176cdb4
Author: Concedo <[email protected]>
Date:   Wed Aug 9 01:05:00 2023 +0800

    fixed 70B detection again, try fix horde issues, fixed lite unicode issue, fixed cmake for cuda

commit f5bfea0580e417f99850d5456ca541d871a3e48c
Author: Martin Krasser <[email protected]>
Date:   Tue Aug 8 15:29:19 2023 +0200

    Allow passing grammar to completion endpoint (#2532)

    * Allow passing grammar to completion endpoint

commit acfc5478ff3446ca3b54553967a3dea09b7c771a
Author: Johannes Gäßler <[email protected]>
Date:   Tue Aug 8 14:38:16 2023 +0200

    CUDA: tighter VRAM scratch size for 65b/70b (#2551)

commit 7ed8d1fe7f8cbe6a6763e6b46759795ac8d21e12
Author: chaihahaha <[email protected]>
Date:   Tue Aug 8 20:07:02 2023 +0800

    llm.vim : multiline autocompletion, get rid of "^@" (#2543)

commit e7f94d6fdc83b41ba449b4b8c80821673dd12ffc
Author: Georgi Gerganov <[email protected]>
Date:   Tue Aug 8 15:05:30 2023 +0300

    vim : bring back simple llm.vim example

commit 2d7baaf50f3277e65cf71071f61ea34823d14c30
Author: AustinMroz <[email protected]>
Date:   Tue Aug 8 06:44:48 2023 -0500

    vim : streaming and more (#2495)

    * Update Vim plugin

    * Remove getbufoneline usage, Add input bind example.

    getbufoneline() appears to be a recently added function and has been
    replaced with getbufline for compatibility.

    An additional example that explains how to add a keybind that works in
    insert mode was added.

commit f3c3b4b1672d860800639c87d3b5d17564692469
Author: klosax <[email protected]>
Date:   Mon Aug 7 19:07:19 2023 +0200

    Add --rope-scale parameter (#2544)

    * common.cpp : Add --rope-scale parameter
    * README.md : Add info about using linear rope scaling

commit 3554080502cb050ccc3ae11d7a67df866ac3bd07
Author: Concedo <[email protected]>
Date:   Tue Aug 8 00:41:02 2023 +0800

    fixed blasbatchmul multiplier

commit 28ad80b6e4d38dde9e395fc5d4ebf19dc4aa4b66
Merge: 3c7d938 93356bd
Author: Concedo <[email protected]>
Date:   Tue Aug 8 00:34:10 2023 +0800

    Merge branch 'master' into concedo_experimental

commit 3c7d938d95fd51780be37f10cdddb2f26a770adf
Author: Concedo <[email protected]>
Date:   Tue Aug 8 00:32:51 2023 +0800

    update lite, resize scratch buffers for blasbatch 2048

commit 93356bdb7a324a8f6570f99d02af392cd4c45796
Author: Georgi Gerganov <[email protected]>
Date:   Mon Aug 7 14:25:58 2023 +0300

    ggml : mul mat tweaks (#2372)

    * ggml : mul mat wip

    ggml-ci

    * ggml : alternative thread distribution for mul_mat

    ggml-ci

    * ggml : mul_mat block tiling attempt

    * ggml : mul_mat threads yield

    ggml-ci

commit 60baff7c8584ec369e53469cad5f92e102b1efe4
Author: Georgi Gerganov <[email protected]>
Date:   Mon Aug 7 14:24:42 2023 +0300

    ggml : pad result of ggml_nbytes()

commit 9082b5dfbfae01243a0b822dcd2812877e63bf1b
Author: Georgi Gerganov <[email protected]>
Date:   Mon Aug 7 13:55:18 2023 +0300

    ggml : change params pointer (style change) (#2539)

    ggml-ci

commit 99d29c0094476c4962023036ecd61a3309d0e16b
Author: Georgi Gerganov <[email protected]>
Date:   Mon Aug 7 13:20:09 2023 +0300

    ggml : sync (custom ops) (#2537)

    ggml-ci

commit 9133e456d2d52b05c6c7f92cd94a0d2564ddb2f7
Merge: cae6a84 3d9a551
Author: Concedo <[email protected]>
Date:   Mon Aug 7 17:33:42 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	Makefile
    #	build.zig

commit cae6a847ada88e415b0beda09d70d79b51762618
Author: Concedo <[email protected]>
Date:   Mon Aug 7 16:40:13 2023 +0800

    cuda free only for non mmq (+2 squashed commits)

    Squashed commit:

    [3aca763a] only cuda free for non mmq

    [e69a8c9f] revert to pool alloc to try again

commit 3d9a55181603e85a26378a850a14068034e5002d
Author: Johannes Gäßler <[email protected]>
Date:   Mon Aug 7 10:09:40 2023 +0200

    Fixed mmap prefetch for GPU offloading (#2529)

commit f6f9896ac3d2ff207e18f87dab85d126ceef5236
Author: Georgi Gerganov <[email protected]>
Date:   Mon Aug 7 10:52:57 2023 +0300

    metal : fix out-of-bounds access + inc concurrency nodes (#2416)

    * metal : fix out-of-bounds access + style changes

    * metal : increase concurrency nodes to 2*GGML_MAX_NODES

commit 9f16a4c4efc5cca845e027c1dbad615612b9248c
Author: Concedo <[email protected]>
Date:   Mon Aug 7 15:16:37 2023 +0800

    switch to upstream implementation of pool malloc

commit 34a14b28ff7f3c98730339bacee035091b2a812a
Author: GiviMAD <[email protected]>
Date:   Sun Aug 6 23:21:46 2023 -0700

    [Makefile] Move ARM CFLAGS before compilation (#2536)

commit 7297128db8159c7b12db4c28a4532b993025c2e5
Author: Henri Vasserman <[email protected]>
Date:   Mon Aug 7 08:35:53 2023 +0300

    [Zig] Rewrite build for Zig 0.11 (#2514)

    * zig build fixes

    * Disable LTO on Windows.

commit 6659652c9fd1853dcb2d1882efc8f14b159d5d43
Author: Concedo <[email protected]>
Date:   Mon Aug 7 11:05:06 2023 +0800

    lower actual temp used when temp=0

commit 0e41b94f40e1d10893d6ac29c727482573ef1652
Author: Concedo <[email protected]>
Date:   Mon Aug 7 10:43:06 2023 +0800

    improve detection for 70B.

commit fb44d72a78a81790d238ffd2453cf66d02eed688
Merge: 559c0e2 d9024df
Author: Concedo <[email protected]>
Date:   Mon Aug 7 10:17:43 2023 +0800

    Merge remote-tracking branch 'johannes/cuda-fix-mmap-prefetch' into concedo_experimental

commit 559c0e2d1f621402d410944b5291da647243ab33
Author: Concedo <[email protected]>
Date:   Mon Aug 7 10:15:20 2023 +0800

    updated lite again, fix for wi

commit d9024df759b25d030fc8266d399c565fe7be9a04
Author: JohannesGaessler <[email protected]>
Date:   Sun Aug 6 10:18:05 2023 +0200

    Fixed mmap prefetch for GPU offloading

commit d442888626f11335e0c9e3b8555d2429b3262580
Merge: 198cc82 86c3219
Author: Concedo <[email protected]>
Date:   Sun Aug 6 22:47:33 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	Makefile

commit 198cc826fcb9…
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
…anov#2685)

* Fix import of llama2.c models that don't share weights between embedding layers

* llama2c: reinstate ggmlv3 conversion output + update readme w/ gguf conv

* llama2.c: comment out legacy "load from ggml model" logic

* llama2.c: convert special-cased "<0xXX>" single byte tokens from tokenizer.bin