Arrcttacsrks commited on
Commit
eb6de2e
·
verified ·
1 Parent(s): e871917

Upload llama.cpp/README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. llama.cpp/README.md +486 -0
llama.cpp/README.md ADDED
@@ -0,0 +1,486 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # llama.cpp
2
+
3
+ ![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)
4
+
5
+ [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
6
+ [![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)
7
+ [![Conan Center](https://shields.io/conan/v/llama-cpp)](https://conan.io/center/llama-cpp)
8
+
9
+ [Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)
10
+
11
+ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others) in pure C/C++
12
+
13
+ ## Recent API changes
14
+
15
+ - [Changelog for `libllama` API](https://github.com/ggerganov/llama.cpp/issues/9289)
16
+ - [Changelog for `llama-server` REST API](https://github.com/ggerganov/llama.cpp/issues/9291)
17
+
18
+ ## Hot topics
19
+
20
+ - **Introducing GGUF-my-LoRA** https://github.com/ggerganov/llama.cpp/discussions/10123
21
+ - Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggerganov/llama.cpp/discussions/9669
22
+ - Hugging Face GGUF editor: [discussion](https://github.com/ggerganov/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)
23
+
24
+ ----
25
+
26
+ ## Description
27
+
28
+ The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
29
+ variety of hardware - locally and in the cloud.
30
+
31
+ - Plain C/C++ implementation without any dependencies
32
+ - Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
33
+ - AVX, AVX2, AVX512 and AMX support for x86 architectures
34
+ - 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
35
+ - Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
36
+ - Vulkan and SYCL backend support
37
+ - CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
38
+
39
+ Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
40
+ improved significantly thanks to many contributions. It is the main playground for developing new features for the
41
+ [ggml](https://github.com/ggerganov/ggml) library.
42
+
43
+ **Supported models:**
44
+
45
+ Typically finetunes of the base models below are supported as well.
46
+
47
+ - [X] LLaMA 🦙
48
+ - [x] LLaMA 2 🦙🦙
49
+ - [x] LLaMA 3 🦙🦙🦙
50
+ - [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
51
+ - [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
52
+ - [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
53
+ - [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
54
+ - [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
55
+ - [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
56
+ - [X] [BERT](https://github.com/ggerganov/llama.cpp/pull/5423)
57
+ - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
58
+ - [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
59
+ - [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
60
+ - [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
61
+ - [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
62
+ - [X] [MPT](https://github.com/ggerganov/llama.cpp/pull/3417)
63
+ - [X] [Bloom](https://github.com/ggerganov/llama.cpp/pull/3553)
64
+ - [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
65
+ - [X] [StableLM models](https://huggingface.co/stabilityai)
66
+ - [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
67
+ - [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
68
+ - [x] [PLaMo-13B](https://github.com/ggerganov/llama.cpp/pull/3557)
69
+ - [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
70
+ - [x] [GPT-2](https://huggingface.co/gpt2)
71
+ - [x] [Orion 14B](https://github.com/ggerganov/llama.cpp/pull/5118)
72
+ - [x] [InternLM2](https://huggingface.co/models?search=internlm2)
73
+ - [x] [CodeShell](https://github.com/WisdomShell/codeshell)
74
+ - [x] [Gemma](https://ai.google.dev/gemma)
75
+ - [x] [Mamba](https://github.com/state-spaces/mamba)
76
+ - [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
77
+ - [x] [Xverse](https://huggingface.co/models?search=xverse)
78
+ - [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
79
+ - [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
80
+ - [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
81
+ - [x] [OLMo](https://allenai.org/olmo)
82
+ - [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
83
+ - [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
84
+ - [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
85
+ - [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
86
+ - [x] [Smaug](https://huggingface.co/models?search=Smaug)
87
+ - [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
88
+ - [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
89
+ - [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
90
+ - [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
91
+ - [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b)
92
+ - [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
93
+ - [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
94
+ - [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
95
+ - [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
96
+ - [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
97
+ - [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
98
+
99
+ (instructions for supporting more models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md))
100
+
101
+ **Multimodal models:**
102
+
103
+ - [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
104
+ - [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
105
+ - [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
106
+ - [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
107
+ - [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
108
+ - [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
109
+ - [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
110
+ - [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
111
+ - [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
112
+
113
+ **Bindings:**
114
+
115
+ - Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
116
+ - Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
117
+ - Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
118
+ - JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
119
+ - JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
120
+ - JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
121
+ - Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
122
+ - Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
123
+ - Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
124
+ - Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
125
+ - Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
126
+ - C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
127
+ - C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
128
+ - Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
129
+ - Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
130
+ - React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
131
+ - Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
132
+ - Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
133
+ - Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
134
+ - PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
135
+ - Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
136
+ - Swift [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
137
+ - Swift [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)
138
+
139
+ **UI:**
140
+
141
+ Unless otherwise noted these projects are open-source with permissive licensing:
142
+
143
+ - [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
144
+ - [iohub/collama](https://github.com/iohub/coLLaMA)
145
+ - [janhq/jan](https://github.com/janhq/jan) (AGPL)
146
+ - [nat/openplayground](https://github.com/nat/openplayground)
147
+ - [Faraday](https://faraday.dev/) (proprietary)
148
+ - [LMStudio](https://lmstudio.ai/) (proprietary)
149
+ - [Layla](https://play.google.com/store/apps/details?id=com.laylalite) (proprietary)
150
+ - [ramalama](https://github.com/containers/ramalama) (MIT)
151
+ - [LocalAI](https://github.com/mudler/LocalAI) (MIT)
152
+ - [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
153
+ - [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile)
154
+ - [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all)
155
+ - [ollama/ollama](https://github.com/ollama/ollama)
156
+ - [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
157
+ - [psugihara/FreeChat](https://github.com/psugihara/FreeChat)
158
+ - [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
159
+ - [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal)
160
+ - [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
161
+ - [RAGNA Desktop](https://ragna.app/) (proprietary)
162
+ - [RecurseChat](https://recurse.chat/) (proprietary)
163
+ - [semperai/amica](https://github.com/semperai/amica)
164
+ - [withcatai/catai](https://github.com/withcatai/catai)
165
+ - [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
166
+ - [Msty](https://msty.app) (proprietary)
167
+ - [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
168
+ - [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file)(Apachev2.0 or later)
169
+ - [Dot](https://github.com/alexpinel/Dot) (GPL)
170
+ - [MindMac](https://mindmac.app) (proprietary)
171
+ - [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
172
+ - [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
173
+ - [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
174
+ - [AIKit](https://github.com/sozercan/aikit) (MIT)
175
+ - [LARS - The LLM & Advanced Referencing Solution](https://github.com/abgulati/LARS) (AGPL)
176
+ - [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
177
+ - [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
178
+ - [PocketPal AI - An iOS and Android App](https://github.com/a-ghorbani/pocketpal-ai) (MIT)
179
+
180
+ *(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*
181
+
182
+ **Tools:**
183
+
184
+ - [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
185
+ - [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
186
+ - [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
187
+ - [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
188
+ - [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with prebuild Mobile and Web platform wrappers and a model example)
189
+
190
+ **Infrastructure:**
191
+
192
+ - [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
193
+ - [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
194
+ - [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
195
+
196
+ **Games:**
197
+ - [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.
198
+
199
+ ## Demo
200
+
201
+ <details>
202
+ <summary>Typical run using LLaMA v2 13B on M2 Ultra</summary>
203
+
204
+ ```
205
+ $ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
206
+ I llama.cpp build info:
207
+ I UNAME_S: Darwin
208
+ I UNAME_P: arm
209
+ I UNAME_M: arm64
210
+ I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
211
+ I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
212
+ I LDFLAGS: -framework Accelerate
213
+ I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
214
+ I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
215
+
216
+ make: Nothing to be done for `default'.
217
+ main: build = 1041 (cf658ad)
218
+ main: seed = 1692823051
219
+ llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
220
+ llama_model_loader: - type f32: 81 tensors
221
+ llama_model_loader: - type q4_0: 281 tensors
222
+ llama_model_loader: - type q6_K: 1 tensors
223
+ llm_load_print_meta: format = GGUF V1 (latest)
224
+ llm_load_print_meta: arch = llama
225
+ llm_load_print_meta: vocab type = SPM
226
+ llm_load_print_meta: n_vocab = 32000
227
+ llm_load_print_meta: n_merges = 0
228
+ llm_load_print_meta: n_ctx_train = 4096
229
+ llm_load_print_meta: n_ctx = 512
230
+ llm_load_print_meta: n_embd = 5120
231
+ llm_load_print_meta: n_head = 40
232
+ llm_load_print_meta: n_head_kv = 40
233
+ llm_load_print_meta: n_layer = 40
234
+ llm_load_print_meta: n_rot = 128
235
+ llm_load_print_meta: n_gqa = 1
236
+ llm_load_print_meta: f_norm_eps = 1.0e-05
237
+ llm_load_print_meta: f_norm_rms_eps = 1.0e-05
238
+ llm_load_print_meta: n_ff = 13824
239
+ llm_load_print_meta: freq_base = 10000.0
240
+ llm_load_print_meta: freq_scale = 1
241
+ llm_load_print_meta: model type = 13B
242
+ llm_load_print_meta: model ftype = mostly Q4_0
243
+ llm_load_print_meta: model size = 13.02 B
244
+ llm_load_print_meta: general.name = LLaMA v2
245
+ llm_load_print_meta: BOS token = 1 '<s>'
246
+ llm_load_print_meta: EOS token = 2 '</s>'
247
+ llm_load_print_meta: UNK token = 0 '<unk>'
248
+ llm_load_print_meta: LF token = 13 '<0x0A>'
249
+ llm_load_tensors: ggml ctx size = 0.11 MB
250
+ llm_load_tensors: mem required = 7024.01 MB (+ 400.00 MB per state)
251
+ ...................................................................................................
252
+ llama_new_context_with_model: kv self size = 400.00 MB
253
+ llama_new_context_with_model: compute buffer total size = 75.41 MB
254
+
255
+ system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
256
+ sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
257
+ generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
258
+
259
+
260
+ Building a website can be done in 10 simple steps:
261
+ Step 1: Find the right website platform.
262
+ Step 2: Choose your domain name and hosting plan.
263
+ Step 3: Design your website layout.
264
+ Step 4: Write your website content and add images.
265
+ Step 5: Install security features to protect your site from hackers or spammers
266
+ Step 6: Test your website on multiple browsers, mobile devices, operating systems etc…
267
+ Step 7: Test it again with people who are not related to you personally – friends or family members will work just fine!
268
+ Step 8: Start marketing and promoting the website via social media channels or paid ads
269
+ Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc…
270
+ Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
271
+ How does a Website Work?
272
+ A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit – whether it’s an image or text file (like PDFs). In order for someone else’s browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
273
+ The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols – this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
274
+ How to
275
+ llama_print_timings: load time = 576.45 ms
276
+ llama_print_timings: sample time = 283.10 ms / 400 runs ( 0.71 ms per token, 1412.91 tokens per second)
277
+ llama_print_timings: prompt eval time = 599.83 ms / 19 tokens ( 31.57 ms per token, 31.68 tokens per second)
278
+ llama_print_timings: eval time = 24513.59 ms / 399 runs ( 61.44 ms per token, 16.28 tokens per second)
279
+ llama_print_timings: total time = 25431.49 ms
280
+ ```
281
+
282
+ </details>
283
+
284
+ <details>
285
+ <summary>Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook</summary>
286
+
287
+ And here is another demo of running both LLaMA-7B and [whisper.cpp](https://github.com/ggerganov/whisper.cpp) on a single M1 Pro MacBook:
288
+
289
+ https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4
290
+
291
+ </details>
292
+
293
+ ## Usage
294
+
295
+ Here are the end-to-end binary build and model conversion steps for most supported models.
296
+
297
+ ### Basic usage
298
+
299
+ Firstly, you need to get the binary. There are different methods that you can follow:
300
+ - Method 1: Clone this repository and build locally, see [how to build](./docs/build.md)
301
+ - Method 2: If you are using MacOS or Linux, you can install llama.cpp via [brew, flox or nix](./docs/install.md)
302
+ - Method 3: Use a Docker image, see [documentation for Docker](./docs/docker.md)
303
+ - Method 4: Download pre-built binary from [releases](https://github.com/ggerganov/llama.cpp/releases)
304
+
305
+ You can run a basic completion using this command:
306
+
307
+ ```bash
308
+ llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
309
+
310
+ # Output:
311
+ # I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga – it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
312
+ ```
313
+
314
+ See [this page](./examples/main/README.md) for a full list of parameters.
315
+
316
+ ### Conversation mode
317
+
318
+ If you want a more ChatGPT-like experience, you can run in conversation mode by passing `-cnv` as a parameter:
319
+
320
+ ```bash
321
+ llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
322
+
323
+ # Output:
324
+ # > hi, who are you?
325
+ # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
326
+ #
327
+ # > what is 1+1?
328
+ # Easy peasy! The answer to 1+1 is... 2!
329
+ ```
330
+
331
+ By default, the chat template will be taken from the input model. If you want to use another chat template, pass `--chat-template NAME` as a parameter. See the list of [supported templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template)
332
+
333
+ ```bash
334
+ ./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
335
+ ```
336
+
337
+ You can also use your own template via in-prefix, in-suffix and reverse-prompt parameters:
338
+
339
+ ```bash
340
+ ./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
341
+ ```
342
+
343
+ ### Web server
344
+
345
+ [llama.cpp web server](./examples/server/README.md) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
346
+
347
+ Example usage:
348
+
349
+ ```bash
350
+ ./llama-server -m your_model.gguf --port 8080
351
+
352
+ # Basic web UI can be accessed via browser: http://localhost:8080
353
+ # Chat completion endpoint: http://localhost:8080/v1/chat/completions
354
+ ```
355
+
356
+ ### Interactive mode
357
+
358
+ > [!NOTE]
359
+ > If you prefer basic usage, please consider using conversation mode instead of interactive mode
360
+
361
+ In this mode, you can always interrupt generation by pressing Ctrl+C and entering one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt that makes LLaMA emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.
362
+
363
+ Here is an example of a few-shot interaction, invoked with the command
364
+
365
+ ```bash
366
+ # default arguments using a 7B model
367
+ ./examples/chat.sh
368
+
369
+ # advanced chat with a 13B model
370
+ ./examples/chat-13B.sh
371
+
372
+ # custom arguments using a 13B model
373
+ ./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
374
+ ```
375
+
376
+ Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `llama-cli` example program.
377
+
378
+ ![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png)
379
+
380
+ ### Persistent Interaction
381
+
382
+ The prompt, user inputs, and model generations can be saved and resumed across calls to `./llama-cli` by leveraging `--prompt-cache` and `--prompt-cache-all`. The `./examples/chat-persistent.sh` script demonstrates this with support for long-running, resumable chat sessions. To use this example, you must provide a file to cache the initial chat prompt and a directory to save the chat session, and may optionally provide the same variables as `chat-13B.sh`. The same prompt cache can be reused for new chat sessions. Note that both prompt cache and chat directory are tied to the initial prompt (`PROMPT_TEMPLATE`) and the model file.
383
+
384
+ ```bash
385
+ # Start a new chat
386
+ PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
387
+
388
+ # Resume that chat
389
+ PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
390
+
391
+ # Start a different chat with the same prompt/model
392
+ PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh
393
+
394
+ # Different prompt cache for different prompt/model
395
+ PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
396
+ CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh
397
+ ```
398
+
399
+ ### Constrained output with grammars
400
+
401
+ `llama.cpp` supports grammars to constrain model output. For example, you can force the model to output JSON only:
402
+
403
+ ```bash
404
+ ./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
405
+ ```
406
+
407
+ The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).
408
+
409
+ For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
410
+
411
+ ## Build
412
+
413
+ Please refer to [Build llama.cpp locally](./docs/build.md)
414
+
415
+ ## Supported backends
416
+
417
+ | Backend | Target devices |
418
+ | --- | --- |
419
+ | [Metal](./docs/build.md#metal-build) | Apple Silicon |
420
+ | [BLAS](./docs/build.md#blas-build) | All |
421
+ | [BLIS](./docs/backend/BLIS.md) | All |
422
+ | [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
423
+ | [MUSA](./docs/build.md#musa) | Moore Threads MTT GPU |
424
+ | [CUDA](./docs/build.md#cuda) | Nvidia GPU |
425
+ | [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
426
+ | [Vulkan](./docs/build.md#vulkan) | GPU |
427
+ | [CANN](./docs/build.md#cann) | Ascend NPU |
428
+
429
+ ## Tools
430
+
431
+ ### Prepare and Quantize
432
+
433
+ > [!NOTE]
434
+ > You can use the [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space on Hugging Face to quantise your model weights without any setup too. It is synced from `llama.cpp` main every 6 hours.
435
+
436
+ To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.
437
+
438
+ Note: `convert.py` has been moved to `examples/convert_legacy_llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
439
+ It does not support LLaMA 3, you can use `convert_hf_to_gguf.py` with LLaMA 3 downloaded from Hugging Face.
440
+
441
+ To learn more about quantizing model, [read this documentation](./examples/quantize/README.md)
442
+
443
+ ### Perplexity (measuring model quality)
444
+
445
+ You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
446
+ For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
447
+
448
+ To learn more how to measure perplexity using llama.cpp, [read this documentation](./examples/perplexity/README.md)
449
+
450
+ ## Contributing
451
+
452
+ - Contributors can open PRs
453
+ - Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
454
+ - Collaborators will be invited based on contributions
455
+ - Any help with managing issues, PRs and projects is very appreciated!
456
+ - See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
457
+ - Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
458
+ - Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
459
+ - A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)
460
+
461
+ ## Other documentations
462
+
463
+ - [main (cli)](./examples/main/README.md)
464
+ - [server](./examples/server/README.md)
465
+ - [jeopardy](./examples/jeopardy/README.md)
466
+ - [GBNF grammars](./grammars/README.md)
467
+
468
+ **Development documentations**
469
+
470
+ - [How to build](./docs/build.md)
471
+ - [Running on Docker](./docs/docker.md)
472
+ - [Build on Android](./docs/android.md)
473
+ - [Performance troubleshooting](./docs/development/token_generation_performance_tips.md)
474
+ - [GGML tips & tricks](https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks)
475
+
476
+ **Seminal papers and background on the models**
477
+
478
+ If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
479
+ - LLaMA:
480
+ - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
481
+ - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
482
+ - GPT-3
483
+ - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
484
+ - GPT-3.5 / InstructGPT / ChatGPT:
485
+ - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
486
+ - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)