---
base_model: ibm-granite/granite-3.1-8b-instruct
tags:
- text-generation
- transformers
- gguf
- english
- granite
- text-generation-inference
- inference-endpoints
- conversational
- 4-bit
- 5-bit
- 8-bit
- ruslanmv
license: apache-2.0
language:
- en
---

# Granite-3.1-8B-Reasoning-GGUF (Quantized for Efficient Inference)

## Model Overview

This is a **GGUF quantized version** of **ruslanmv/granite-3.1-8b-Reasoning**, fine-tuned from **ibm-granite/granite-3.1-8b-instruct**. The **GGUF format** enables efficient inference on **CPUs and GPUs** and is provided at several **K-bit quantization levels** (4-bit, 5-bit, and 8-bit).

- **Developed by:** [ruslanmv](https://huggingface.co./ruslanmv)
- **License:** Apache 2.0
- **Base Model:** [ibm-granite/granite-3.1-8b-instruct](https://huggingface.co./ibm-granite/granite-3.1-8b-instruct)
- **Fine-tuned for:** Logical reasoning, structured problem-solving, long-context tasks
- **Quantized GGUF versions available:**
  - **4-bit:** `Q4_K_M`
  - **5-bit:** `Q5_K_M`
  - **8-bit:** `Q8_0`
- **Supported Languages:** English
- **Architecture:** **Granite**
- **Model Size:** **8.17B parameters**

---

## Why Use the GGUF Quantized Version?

The **GGUF format** is designed for optimized **CPU and GPU inference**, making it ideal for:

✅ **Lower memory usage** for efficient deployment
✅ **Faster inference** on consumer hardware
✅ **Compatibility with popular inference engines** such as **llama.cpp, ctransformers, and KoboldCpp**
✅ **Strong performance on logical reasoning and analytical tasks**

---

## Installation & Usage

### Install dependencies for **llama.cpp**:

```bash
pip install llama-cpp-python
```

### Running the Model with **llama.cpp**:

```python
from llama_cpp import Llama

# Path to the downloaded GGUF file (4-bit quantization shown here)
model_path = "path/to/ruslanmv/granite-3.1-8b-Reasoning-GGUF.Q4_K_M.gguf"

llm = Llama(model_path=model_path)

input_text = "Can you explain the difference between inductive and deductive reasoning?"
output = llm(input_text, max_tokens=400)

print(output["choices"][0]["text"])
```

### Alternatively, using **ctransformers**:

```bash
pip install ctransformers
```

```python
from ctransformers import AutoModelForCausalLM

# Path to the downloaded GGUF file; gpu_layers controls how many layers are offloaded to the GPU
model_path = "path/to/ruslanmv/granite-3.1-8b-Reasoning-GGUF.Q4_K_M.gguf"
model = AutoModelForCausalLM.from_pretrained(model_path, model_type="llama", gpu_layers=50)

input_text = "What are the key principles of logical reasoning?"
output = model(input_text, max_new_tokens=400)

print(output)
```

---

## Intended Use

Granite-3.1-8B-Reasoning-GGUF is designed for **efficient inference** while maintaining strong **reasoning capabilities**, making it well suited for:

- **Logical and analytical problem-solving**
- **Text-based reasoning tasks**
- **Mathematical and symbolic reasoning**
- **Advanced instruction-following**

This model is particularly useful for **CPU-based deployments**, **low-memory environments**, and users who need **optimized text generation without high-end GPUs**.

---

## License & Acknowledgments

This model is released under the **Apache 2.0** license. It is fine-tuned from IBM's **Granite-3.1-8B-Instruct** model and **quantized to GGUF** for efficiency. Special thanks to the **IBM Granite Team** for developing the base model. For more details, visit the [IBM Granite Documentation](https://huggingface.co./ibm-granite).
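---

## Downloading the GGUF File from the Hub

If you prefer to fetch the quantized weights programmatically rather than hard-coding a local path, the snippet below is a minimal sketch using `huggingface_hub` (install with `pip install huggingface_hub`) together with **llama-cpp-python**. The GGUF filename used here is an assumption; check the repository's file listing for the actual name of the quantization you want.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the 4-bit GGUF file from the Hub (cached locally after the first call).
# NOTE: the filename below is an assumption -- verify it in the repository's file listing.
model_path = hf_hub_download(
    repo_id="ruslanmv/granite-3.1-8b-Reasoning-GGUF",
    filename="granite-3.1-8b-Reasoning.Q4_K_M.gguf",
)

# n_ctx sets the context window; adjust it to fit your hardware.
llm = Llama(model_path=model_path, n_ctx=4096)

# create_chat_completion applies the chat template stored in the GGUF metadata,
# which is convenient for conversational use.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the steps of a proof by contradiction."}],
    max_tokens=400,
)
print(response["choices"][0]["message"]["content"])
```

This is only one way to obtain and run the model; any workflow that produces a local `.gguf` path will work with the examples shown earlier.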
---

### Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{ruslanmv2025granite,
  title={Fine-Tuning and GGUF Quantization of Granite-3.1-8B for Advanced Reasoning},
  author={Ruslan M.V.},
  year={2025},
  url={https://huggingface.co./ruslanmv/granite-3.1-8b-Reasoning-GGUF}
}
```