---
base_model: ibm-granite/granite-3.1-8b-instruct
tags:
- text-generation
- transformers
- gguf
- english
- granite
- text-generation-inference
- inference-endpoints
- conversational
- 4-bit
- 5-bit
- 8-bit
- ruslanmv
license: apache-2.0
language:
- en
---

# Granite-3.1-8B-Reasoning-GGUF (Quantized for Efficient Inference)

## Model Overview

This is a **GGUF quantized version** of **ruslanmv/granite-3.1-8b-Reasoning**, fine-tuned from **ibm-granite/granite-3.1-8b-instruct**. The **GGUF format** enables efficient inference on **CPUs and GPUs** and is provided at several **K-bit quantization levels** (4-bit, 5-bit, and 8-bit).

- **Developed by:** [ruslanmv](https://huggingface.co./ruslanmv)
- **License:** Apache 2.0
- **Base Model:** [ibm-granite/granite-3.1-8b-instruct](https://huggingface.co./ibm-granite/granite-3.1-8b-instruct)
- **Fine-tuned for:** Logical reasoning, structured problem-solving, long-context tasks
- **Quantized GGUF versions available:**
  - **4-bit:** `Q4_K_M`
  - **5-bit:** `Q5_K_M`
  - **8-bit:** `Q8_0`
- **Supported Languages:** English
- **Architecture:** **Granite**
- **Model Size:** **8.17B parameters**

---

## Why Use the GGUF Quantized Version?

The **GGUF format** is designed for optimized **CPU and GPU inference**, making it ideal for:

✅ **Lower memory usage** for efficient deployment
✅ **Faster inference** on consumer hardware
✅ **Compatibility with popular inference engines** such as **llama.cpp, ctransformers, and KoboldCpp**
✅ **Strong performance on logical reasoning and analytical tasks**

---

## Installation & Usage

### Install dependencies for **llama.cpp**:

```bash
pip install llama-cpp-python
```

### Running the Model with **llama.cpp**:

```python
from llama_cpp import Llama

# Path to the downloaded GGUF file (4-bit quantization shown here)
model_path = "path/to/ruslanmv/granite-3.1-8b-Reasoning-GGUF.Q4_K_M.gguf"

llm = Llama(model_path=model_path)

input_text = "Can you explain the difference between inductive and deductive reasoning?"
output = llm(input_text, max_tokens=400)

print(output["choices"][0]["text"])
```

### Alternatively, using **ctransformers**:

```bash
pip install ctransformers
```

```python
from ctransformers import AutoModelForCausalLM

# Path to the downloaded GGUF file; gpu_layers controls how many layers are offloaded to the GPU
model_path = "path/to/ruslanmv/granite-3.1-8b-Reasoning-GGUF.Q4_K_M.gguf"
model = AutoModelForCausalLM.from_pretrained(model_path, model_type="llama", gpu_layers=50)

input_text = "What are the key principles of logical reasoning?"
output = model(input_text, max_new_tokens=400)

print(output)
```

---

## Intended Use

Granite-3.1-8B-Reasoning-GGUF is designed for **efficient inference** while maintaining strong **reasoning capabilities**, making it well suited for:

- **Logical and analytical problem-solving**
- **Text-based reasoning tasks**
- **Mathematical and symbolic reasoning**
- **Advanced instruction-following**

This model is particularly useful for **CPU-based deployments**, **low-memory environments**, and users who need **optimized text generation without high-end GPUs**.

---

## License & Acknowledgments

This model is released under the **Apache 2.0** license. It is fine-tuned from IBM's **Granite-3.1-8B-Instruct** model and **quantized to GGUF** for efficiency. Special thanks to the **IBM Granite Team** for developing the base model. For more details, visit the [IBM Granite Documentation](https://huggingface.co./ibm-granite).
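---

## Downloading the GGUF File from the Hub

If you prefer to fetch the quantized weights programmatically rather than hard-coding a local path, the snippet below is a minimal sketch using `huggingface_hub` (install with `pip install huggingface_hub`) together with **llama-cpp-python**. The GGUF filename used here is an assumption; check the repository's file listing for the actual name of the quantization you want.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the 4-bit GGUF file from the Hub (cached locally after the first call).
# NOTE: the filename below is an assumption -- verify it in the repository's file listing.
model_path = hf_hub_download(
    repo_id="ruslanmv/granite-3.1-8b-Reasoning-GGUF",
    filename="granite-3.1-8b-Reasoning.Q4_K_M.gguf",
)

# n_ctx sets the context window; adjust it to fit your hardware.
llm = Llama(model_path=model_path, n_ctx=4096)

# create_chat_completion applies the chat template stored in the GGUF metadata,
# which is convenient for conversational use.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the steps of a proof by contradiction."}],
    max_tokens=400,
)
print(response["choices"][0]["message"]["content"])
```

This is only one way to obtain and run the model; any workflow that produces a local `.gguf` path will work with the examples shown earlier.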
---

### Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{ruslanmv2025granite,
  title={Fine-Tuning and GGUF Quantization of Granite-3.1-8B for Advanced Reasoning},
  author={Ruslan M.V.},
  year={2025},
  url={https://huggingface.co./ruslanmv/granite-3.1-8b-Reasoning-GGUF}
}
```