How to Set Up and Run Ollama on a GPU-Powered VM (vast.ai)
In this tutorial, we'll walk you through setting up and using Ollama for private model inference on a GPU-equipped VM, either on your local machine or on a rented VM from Vast.ai or Runpod.io. Ollama allows you to run models privately, ensuring data security, and faster inference times thanks to the power of GPUs. By leveraging a GPU-powered VM, you can significantly improve the performance and efficiency of your model inference tasks.
Outline
- Set up a VM with GPU on Vast.ai
- Start Jupyter Terminal
- Install Ollama
- Run Ollama Serve
- Test Ollama with a model
- (Optional) Use your own Hugging Face model
Setting Up a VM with GPU on Vast.ai
1. Create a VM with GPU:
   - Visit Vast.ai to create your VM.
   - Choose a VM with at least 30 GB of storage to accommodate the models. This ensures you have enough space for installation and model storage.
   - Select a VM that costs less than $0.30 per hour to keep the setup cost-effective.
2. Start Jupyter Terminal:
   - Once your VM is up and running, start Jupyter and open a terminal within it. This is the easiest way to get started.
   - Alternatively, you can connect to the VM over SSH (for example from VS Code), but you will need to create an SSH key and add it to your Vast.ai account first (see the sketch below).
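If you prefer the SSH route, here is a minimal sketch; the host and port below are placeholders, and the real values are shown on your instance card in the Vast.ai console.

```bash
# Generate an SSH key locally (skip if you already have one)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# Add the public key (~/.ssh/id_ed25519.pub) to your Vast.ai account, then
# connect using the host and port shown for your instance (placeholders here)
ssh -p 12345 root@ssh1.vast.ai
```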
Downloading and Running Ollama
1. Install Ollama: Open the terminal in Jupyter and run the following command to install Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
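To confirm the installation succeeded before moving on, you can print the CLI version as a quick sanity check:

```bash
# Should print something like "ollama version is 0.x.x"
ollama --version
```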
2. Run Ollama Serve: After installation, start the Ollama service in the background by running:
```bash
ollama serve &
```
Make sure there are no GPU errors in the startup output. If Ollama cannot use the GPU, it falls back to the CPU and responses will be noticeably slow when you interact with the model.
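You can also verify that the server is up before loading a model. This is a minimal check using Ollama's HTTP API, assuming it is listening on the default port 11434:

```bash
# The root endpoint replies with "Ollama is running" when the server is up
curl http://localhost:11434

# List the models currently available to the server (empty at this point)
curl http://localhost:11434/api/tags
```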
3. Test Ollama with a Model: Test the setup by running a sample model such as Mistral:
```bash
ollama run mistral
```
You can now start chatting with the model to ensure everything is working correctly.
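Beyond the interactive chat, you can pass a prompt directly on the command line or call the REST API, which is handy for scripting. A small sketch (the prompt text is arbitrary):

```bash
# One-shot prompt from the command line
ollama run mistral "Explain in one sentence why GPUs speed up LLM inference."

# Same thing through the HTTP API (stream disabled for a single JSON response)
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain in one sentence why GPUs speed up LLM inference.",
  "stream": false
}'
```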
Optional: Check GPU Usage
- Check GPU Utilization: While the model is generating a response (the previous step), check whether the GPU is being utilized by running the following command:

```bash
nvidia-smi
```
- Ensure that the GPU memory usage is greater than 0 and that utilization rises while the model is generating. This indicates that the GPU is being used for the inference process.
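If you want to watch utilization continuously while the model is generating, you can loop nvidia-smi with its standard query flags:

```bash
# Print GPU utilization and memory usage every second (Ctrl+C to stop)
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```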
Using Your Own Hugging Face Model with Ollama
1. Install the Hugging Face CLI: If you want to use your own model from Hugging Face, first install the Hugging Face CLI. In this example we will use a fine-tuned German Mistral model in GGUF format: the file em_german_mistral_v01.Q4_K_M.gguf from the repository TheBloke/em_german_mistral_v01-GGUF.
2. Download Your Model: Download your desired model from Hugging Face. For example, to download the fine-tuned Mistral model:
```bash
pip3 install huggingface-hub

# Download the fine-tuned Mistral model (GGUF) used in this example
huggingface-cli download TheBloke/em_german_mistral_v01-GGUF em_german_mistral_v01.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
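Before moving on, it is worth checking that the GGUF file actually landed in the current directory and has a plausible size (a Q4_K_M quantization of a 7B Mistral model is roughly 4 GB):

```bash
# The file should be listed with a size of several gigabytes
ls -lh em_german_mistral_v01.Q4_K_M.gguf
```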
3. Create a Modelfile: Create a model config file named **Modelfile** with the following content:
```
FROM em_german_mistral_v01.Q4_K_M.gguf

# set the temperature to 0 [higher is more creative, lower is more coherent]
PARAMETER temperature 0

# optionally, set a system message:
# SYSTEM """
# You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
# """
```
4. Instruct Ollama to Create the Model: Create the custom model using Ollama with the command:
```bash
ollama create mymodel -f Modelfile
```
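After the create command finishes, you can confirm that the model was registered:

```bash
# "mymodel" should now appear alongside any models pulled earlier (e.g. mistral)
ollama list
```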
5. Run Your Custom Model: Run your custom model using:
```bash
ollama run mymodel
```
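As with the stock Mistral model, the custom model can also be queried non-interactively or through the HTTP API; for example (the prompt is in German, since this example model was fine-tuned on German data):

```bash
# One-shot prompt from the command line
ollama run mymodel "Erkläre in einem Satz, was Ollama macht."

# Same request through the API
curl http://localhost:11434/api/generate -d '{"model": "mymodel", "prompt": "Was ist Ollama?", "stream": false}'
```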
By following these steps, you can effectively utilize Ollama for private model inference on a VM with GPU, ensuring secure and efficient operations for your machine learning projects.
Happy prompting!