# Test that the Open LLM is running

First, start the server using only the CPU:

```bash
export model_path="TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf"
python -m llama_cpp.server --model $model_path
```

Or with GPU support (recommended):

```bash
python -m llama_cpp.server --model TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf --n_gpu_layers 1
```

If you can offload more layers to the GPU, set `--n_gpu_layers` to a higher number.

To find the number of available layers, run the above command and look for `llm_load_tensors: offloaded 1/41 layers to GPU` in the output.
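For example, the log above reports 41 layers in total, so you could offload all of them (assuming the full model fits in your GPU memory):

```bash
python -m llama_cpp.server --model TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf --n_gpu_layers 41
```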

## Test API call

Set the environment variables:

```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="sk-xxx"
export MODEL_NAME="CodeLlama"
```

Then ping the model from Python using the `OpenAI` API:

```bash
python examples/open_llms/openai_api_interface.py
```

If you're not using `CodeLlama`, make sure to change the `MODEL_NAME` parameter accordingly.
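If you are curious what such a call looks like, here is a minimal sketch using the `openai` Python package (v1+); the actual `examples/open_llms/openai_api_interface.py` script may differ:

```python
# Minimal illustrative sketch (not the repo script): query the local server
# through its OpenAI-compatible API.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["OPENAI_API_BASE"],  # http://localhost:8000/v1
    api_key=os.environ["OPENAI_API_KEY"],    # any non-empty string works for a local server
)

response = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Who are you?"}],
    max_tokens=60,
)
print(response.choices[0].message.content)
```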

Or using `curl`:

```bash
curl --request POST \
     --url http://localhost:8000/v1/chat/completions \
     --header "Content-Type: application/json" \
     --data '{ "model": "CodeLlama", "prompt": "Who are you?", "max_tokens": 60}'
```

If this works, also make sure that the `langchain` interface works, since that's how `gpte` interacts with LLMs.

## Langchain test

```bash
export MODEL_NAME="CodeLlama"
python examples/open_llms/langchain_interface.py
```
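As a rough idea of what that script does, here is a minimal sketch, assuming the `langchain_openai` package; the actual `examples/open_llms/langchain_interface.py` may differ:

```python
# Minimal illustrative sketch (not the repo script): the same request routed
# through langchain, which is how gpte talks to the model.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model=os.environ["MODEL_NAME"],
    openai_api_base=os.environ["OPENAI_API_BASE"],
    openai_api_key=os.environ["OPENAI_API_KEY"],
    max_tokens=60,
)
print(llm.invoke("Who are you?").content)
```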

That's it 🤓 time to go back to [running the example](/docs/open_models.md#running-the-example) and give `gpte` a try.