Triangle104/granite-3.2-8b-instruct-Q8_0-GGUF

This model was converted to GGUF format from ibm-granite/granite-3.2-8b-instruct using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Model Summary:

Granite-3.2-8B-Instruct is an 8-billion-parameter, long-context AI model fine-tuned for thinking capabilities. Built on top of Granite-3.1-8B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The thinking capability can be switched on or off, so that it is applied only when required.

- Developers: Granite Team, IBM
- Website: Granite Docs
- Release Date: February 26th, 2025
- License: Apache 2.0

Supported Languages:

English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune this Granite model for languages beyond these 12.

Intended Use:

This model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications.

Capabilities

- Thinking
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks including long document/meeting summarization, long document QA, etc.

Generation:

This is a simple example of how to use the Granite-3.2-8B-Instruct model.

Install the following libraries:

pip install torch torchvision torchaudio
pip install accelerate
pip install transformers

Then, copy the snippet from the section that is relevant for your use case.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_path = "ibm-granite/granite-3.2-8b-instruct"
device = "cuda"

# Load the model in bfloat16 on the GPU, together with its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

conv = [{"role": "user", "content": "You have 10 liters of a 30% acid solution. How many liters of a 70% acid solution must be added to achieve a 50% acid mixture?"}]

# thinking=True enables the reasoning mode; set it to False for a direct answer.
input_ids = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    thinking=True,
    return_dict=True,
    add_generation_prompt=True,
).to(device)

set_seed(42)
output = model.generate(
    **input_ids,
    max_new_tokens=8192,
)

# Decode only the newly generated tokens, skipping the prompt.
prediction = tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:], skip_special_tokens=True)
print(prediction)
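The `thinking` argument passed to `apply_chat_template` in the snippet above is what switches the reasoning mode on or off. As a minimal sketch, the toggle can be wrapped in a small function. `build_inputs` is a hypothetical helper name, not part of the model card or of transformers, and it assumes a tokenizer whose chat template accepts a `thinking` keyword, as the Granite tokenizer does in the snippet above.

```python
def build_inputs(tokenizer, question, thinking=False, device=None):
    """Tokenize a single-turn question with the thinking mode on or off.

    Hypothetical helper: it forwards the same keyword arguments that the
    snippet above passes to tokenizer.apply_chat_template.
    """
    conv = [{"role": "user", "content": question}]
    inputs = tokenizer.apply_chat_template(
        conv,
        return_tensors="pt",
        thinking=thinking,  # extra keyword consumed by the chat template
        return_dict=True,
        add_generation_prompt=True,
    )
    # Move the tensors to a device only when one is given.
    return inputs.to(device) if device is not None else inputs
```

Calling `build_inputs(tokenizer, question, thinking=True, device=device)` should produce the same inputs as the snippet above, while `thinking=False` requests a direct answer without the thought process.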

Example Outputs

thinking=True

Here is my thought process:

We know we start with 10 liters of a 30% acid solution, so there's 3 liters of pure acid in it (10 liters * 0.3).
We need to determine how much of the 70% acid solution to add such that the final mixture is 50% acid.
Let's say x liters of the 70% acid solution are added. This will contribute 0.7x liters of pure acid.
The total volume of the new mixture will be 10 + x liters, and the total amount of pure acid will be 3 + 0.7x liters.
We need to set up an equation where (3 + 0.7x) / (10 + x) = 0.5 (since we want a 50% acid mixture).
Solving this equation for x will give us the volume of 70% solution needed.

Here is my response:

To solve this, we need to set up an equation based on the concentration of acid before and after mixing.

Initial Acid Content:
- We start with 10 liters of a 30% acid solution.
- Pure acid in this = 10 * 0.3 = 3 liters.
Variable for Addition:
- Let x be the liters of 70% acid solution to be added.
- Pure acid in x liters of 70% solution = 0.7x liters.
Final Mixture Concentration:
- Total volume of mixture = 10 + x liters.
- Total pure acid in the mixture = 3 (from initial solution) + 0.7x (from added solution).
Setting up the equation for 50% acid concentration:
- We require that the ratio of pure acid to total volume is 0.5 (50%).
- Therefore, the equation is:
\[ \frac{3 + 0.7x}{10 + x} = 0.5 \]
Solving the equation:

Multiply both sides by (10 + x) to clear the fraction:

\[ 3 + 0.7x = 0.5(10 + x) \]

Expanding the right side:

\[ 3 + 0.7x = 5 + 0.5x \]

Subtract 0.5x from both sides:

\[ 3 + 0.2x = 5 \]

Subtract 3 from both sides:

\[ 0.2x = 2 \]

Divide by 0.2:

\[ x = 10 \]

So, 10 liters of the 70% acid solution must be added to the initial 10 liters of 30% solution to get a 50% acid mixture.

Let's verify:

Initial acid: 3 liters
Acid from added solution: 0.7 * 10 = 7 liters
Total acid: 3 + 7 = 10 liters
Total volume: 10 + 10 = 20 liters
Acid percentage: \(\frac{10}{20} = 0.5 = 50\%\)

This confirms our solution.

Final answer: 10 liters of 70% acid solution should be added.
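In the thinking=True output above, the reasoning and the answer are introduced by the lines "Here is my thought process:" and "Here is my response:". The following is a minimal sketch for separating the two parts of the decoded text. It assumes those marker lines appear verbatim, as they do in this example; other prompts or model versions may format the output differently. `split_thinking` is a hypothetical helper name.

```python
THOUGHT_MARKER = "Here is my thought process:"
RESPONSE_MARKER = "Here is my response:"


def split_thinking(prediction):
    """Return (thought, response) from decoded model output.

    If the response marker is absent (for example with thinking=False),
    the thought is an empty string and the whole text is the response.
    """
    head, sep, tail = prediction.partition(RESPONSE_MARKER)
    if not sep:
        return "", prediction.strip()
    # Drop the leading thought marker, if present, and trim whitespace.
    thought = head.replace(THOUGHT_MARKER, "", 1).strip()
    return thought, tail.strip()
```

This makes it possible to show only the final response to an end user while keeping the thought process for logging or debugging.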

thinking=False

Let's denote the amount of 70% acid solution to be added as x liters.

We start with 10 liters of a 30% acid solution, which means there are 0.3 * 10 = 3 liters of pure acid in it.

We want to end up with a 50% acid solution after adding x liters of 70% acid solution. The total volume of the mixture will be 10 + x liters, and we want half of this volume to be pure acid.

So, the amount of pure acid in the final mixture should be 0.5 * (10 + x) liters.

The amount of pure acid in the final mixture comes from two sources: the original 30% solution and the added 70% solution. So, we can set up the following equation:

3 (from the original solution) + 0.7x (from the added solution) = 0.5 * (10 + x)

Now, let's solve for x:

3 + 0.7x = 5 + 0.5x
0.7x - 0.5x = 5 - 3
0.2x = 2
x = 2 / 0.2
x = 10

So, you need to add 10 liters of a 70% acid solution to the 10 liters of a 30% acid solution to get a 50% acid mixture.
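Both example outputs arrive at x = 10. As an independent check of that arithmetic, here is a small sketch that solves the general mixing equation (c1·v1 + c2·x) / (v1 + x) = target for x, which rearranges to x = v1·(target − c1) / (c2 − target). The function name `volume_to_add` and its parameter names are illustrative and do not come from the model card.

```python
from fractions import Fraction


def volume_to_add(v1, c1, c2, target):
    """Volume at concentration c2 to add to v1 units at c1 to reach target."""
    # Exact rational arithmetic avoids floating-point rounding in the check.
    v1, c1, c2, target = (Fraction(str(a)) for a in (v1, c1, c2, target))
    if c2 == target:
        raise ValueError("added solution is already at the target concentration")
    x = v1 * (target - c1) / (c2 - target)
    if x < 0:
        raise ValueError("target is not between the two concentrations")
    return x


x = volume_to_add(10, 0.3, 0.7, 0.5)
print(x)                                      # 10
print((3 + Fraction("0.7") * x) / (10 + x))   # 1/2
```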

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo Triangle104/granite-3.2-8b-instruct-Q8_0-GGUF --hf-file granite-3.2-8b-instruct-q8_0.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo Triangle104/granite-3.2-8b-instruct-Q8_0-GGUF --hf-file granite-3.2-8b-instruct-q8_0.gguf -c 2048
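Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the server listens on llama-server's default address (127.0.0.1, port 8080) and exposes its OpenAI-compatible `/v1/chat/completions` endpoint. `build_chat_request` and `ask` are hypothetical helper names; adjust the URL if you start the server with a different host or port.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default address.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_chat_request(prompt, max_tokens=256, url=SERVER_URL):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from the command above running, `print(ask("Summarize this meeting transcript: ..."))` should print the model's reply. Note that `-c 2048` caps the context at 2048 tokens, which the prompt and the reply share.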

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag, along with any other hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the CLI binary or the server binary.

./llama-cli --hf-repo Triangle104/granite-3.2-8b-instruct-Q8_0-GGUF --hf-file granite-3.2-8b-instruct-q8_0.gguf -p "The meaning to life and the universe is"

./llama-server --hf-repo Triangle104/granite-3.2-8b-instruct-Q8_0-GGUF --hf-file granite-3.2-8b-instruct-q8_0.gguf -c 2048
