Install the latest WasmEdge with its plugins:

```bash
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s
```
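You can sanity-check the installation afterwards. The installer typically places its binaries under `$HOME/.wasmedge` (the exact version string printed will vary):

```bash
# Make the installed binaries available in the current shell
source $HOME/.wasmedge/env

# Print the installed WasmEdge version
wasmedge --version
```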
Download the wasm file:

```bash
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-simple.wasm
```
Download the Llama 2 chat model in GGUF format:

```bash
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
```
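The Q5_K_M quantized 7B model is a multi-gigabyte download, so it may take a while. As a quick check that the file arrived intact (a truncated download will be much smaller):

```bash
# The GGUF file should be several GB in size
ls -lh llama-2-7b-chat.Q5_K_M.gguf
```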
Run the WASM app with wasmedge, using the named model feature to preload the large model file:

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf llama-simple.wasm \
  --prompt 'Robert Oppenheimer most important achievement is ' --ctx-size 4096
```
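The `--nn-preload` argument takes the form `alias:backend:target:model-path`: here `default` is the model alias, `GGML` is the WASI-NN backend, and `AUTO` lets the runtime pick the execution target. As a sketch (the alias `my-llama` and the prompt are just illustrations), you could preload the model under a different alias and select it with the app's `--model-alias` option:

```bash
wasmedge --dir .:. \
  --nn-preload my-llama:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
  llama-simple.wasm \
  --prompt 'The capital of France is ' \
  --model-alias my-llama
```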
The CLI options of the `llama-simple` wasm app:

```console
~/llama-utils/simple$ wasmedge llama-simple.wasm -h
Usage: llama-simple.wasm [OPTIONS] --prompt <PROMPT>

Options:
  -p, --prompt <PROMPT>
          Sets the prompt string, including system message if required.
  -m, --model-alias <ALIAS>
          Sets the model alias [default: default]
  -c, --ctx-size <CTX_SIZE>
          Sets the prompt context size [default: 4096]
  -n, --n-predict <N_PRDICT>
          Number of tokens to predict [default: 1024]
  -g, --n-gpu-layers <N_GPU_LAYERS>
          Number of layers to run on the GPU [default: 100]
      --no-mmap
          Disable memory mapping for file access of chat models
  -b, --batch-size <BATCH_SIZE>
          Batch size for prompt processing [default: 4096]
  -r, --reverse-prompt <REVERSE_PROMPT>
          Halt generation at PROMPT, return control.
      --log-enable
          Enable trace logs
  -h, --help
          Print help
  -V, --version
          Print version
```
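For example (the values here are just illustrations), you can combine these options to cap the generation length and turn on trace logs:

```bash
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf llama-simple.wasm \
  --prompt 'Robert Oppenheimer most important achievement is ' \
  --n-predict 64 \
  --log-enable
```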
After running the command, it takes some time for the model to load and for generation to finish. Once the execution is complete, output like the following is produced:
```
...................................................................................................
[2023-10-08 23:13:10.272] [info] [WASI-NN] GGML backend: set n_ctx to 4096
llama_new_context_with_model: kv self size  = 2048.00 MB
llama_new_context_with_model: compute buffer total size = 297.47 MB
llama_new_context_with_model: max tensor size = 102.54 MB
[2023-10-08 23:13:10.472] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
[2023-10-08 23:13:10.472] [info] [WASI-NN] GGML backend: set n_predict to 128
[2023-10-08 23:13:16.014] [info] [WASI-NN] GGML backend: llama_get_kv_cache_token_count 128
llama_print_timings:        load time = 1431.58 ms
llama_print_timings:      sample time =    3.53 ms /   118 runs   (    0.03 ms per token, 33446.71 tokens per second)
llama_print_timings: prompt eval time = 1230.69 ms /    11 tokens (  111.88 ms per token,     8.94 tokens per second)
llama_print_timings:        eval time = 4295.81 ms /   117 runs   (   36.72 ms per token,    27.24 tokens per second)
llama_print_timings:       total time = 5742.71 ms

Robert Oppenheimer most important achievement is
1945 Manhattan Project.
Robert Oppenheimer was born in New York City on April 22, 1904. He was the son of Julius Oppenheimer, a wealthy German-Jewish textile merchant, and Ella Friedman Oppenheimer.
Robert Oppenheimer was a brilliant student. He attended the Ethical Culture School in New York City and graduated from the Ethical Culture Fieldston School in 1921. He then attended Harvard University, where he received his bachelor's degree
```
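The `llama_print_timings` lines report per-stage throughput. For example, the eval stage generated 117 tokens in 4295.81 ms, i.e. 4295.81 / 117 ≈ 36.72 ms per token, which is 1000 / 36.72 ≈ 27.24 tokens per second.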
Alternatively, you can build the wasm app from source yourself. Compile the application to WebAssembly:

```bash
cargo build --target wasm32-wasi --release
```

The output wasm file will be in `target/wasm32-wasi/release/`.
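For reference, here is a minimal Rust sketch of what a llama-simple-style app can look like, assuming the `wasmedge-wasi-nn` crate. Treat it as an illustration of the WASI-NN flow (load the preloaded model by alias, set the prompt as input, compute, read the output) rather than the actual app source:

```rust
// Minimal sketch; assumes the wasmedge-wasi-nn crate and the GGML backend.
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    // Take the prompt from the first CLI argument for brevity.
    let prompt = std::env::args().nth(1).unwrap_or_else(|| "Hello".to_string());

    // Load the model that `--nn-preload default:GGML:AUTO:...` registered
    // under the "default" alias.
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded model");
    let mut ctx = graph
        .init_execution_context()
        .expect("failed to create an execution context");

    // The GGML backend takes the raw prompt bytes as input tensor 0.
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the prompt as input");
    ctx.compute().expect("inference failed");

    // Read back the generated bytes and print them as UTF-8 text.
    let mut output = vec![0u8; 4096];
    let size = ctx.get_output(0, &mut output).expect("failed to read output");
    println!("{}", String::from_utf8_lossy(&output[..size]));
}
```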