---
base_model: teknium/OpenHermes-2.5-Mistral-7B
inference: false
model_type: mistral
prompt_template: |
  <|im_start|>system
  {system_message}<|im_end|>
  <|im_start|>user
  {prompt}<|im_end|>
  <|im_start|>assistant
quantized_by: mgoin
tags:
- deepsparse
---

# OpenHermes 2.5 Mistral 7B - DeepSparse

This repo contains model files for [Teknium's OpenHermes 2.5 Mistral 7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) optimized for [DeepSparse](https://github.com/neuralmagic/deepsparse), a CPU inference runtime for sparse models.

This model was quantized and pruned with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).

## Inference

Install [DeepSparse LLM](https://github.com/neuralmagic/deepsparse) for fast inference on CPUs: 
```
pip install deepsparse-nightly[llm]
```

Run in a [Python pipeline](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md):
```python
from deepsparse import TextGeneration

# Build a ChatML-formatted prompt (see the prompt template section below)
system_message = ""
prompt = "Who inspires you the most?"
formatted_prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant"

# Load the sparse-quantized model from the Hugging Face Hub and generate a reply
model = TextGeneration(model="hf:mgoin/OpenHermes-2.5-Mistral-7B-pruned50-quant-ds")
print(model(formatted_prompt, max_new_tokens=100).generations[0].text)
"""
That's a difficult question as there are many people who inspire me. However, one person who inspires me the most is my mother. She has shown me the importance of hard work, resilience, and perseverance. She has shown me how to overcome obstacles and how to be a strong and independent woman.
"""
```

## Prompt template: ChatML

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant

```
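
As a minimal sketch, the template above can be filled in with a small helper instead of a hand-written f-string. The function name `format_chatml` is hypothetical and not part of DeepSparse; the trailing newline after `assistant` follows the template as printed above, while the snippet in the Inference section omits it.

```python
def format_chatml(prompt: str, system_message: str = "") -> str:
    """Build a single-turn ChatML prompt (hypothetical helper, not a DeepSparse API)."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    # Print the formatted prompt so it can be compared against the template
    print(format_chatml("Who inspires you the most?"))
```

The returned string can be passed to the `TextGeneration` pipeline in place of `formatted_prompt`.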

## Sparsification

For details on how this model was sparsified, see the `recipe.yaml` in this repo and follow the instructions below.

```bash
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py teknium/OpenHermes-2.5-Mistral-7B open_platypus --recipe recipe.yaml --save True
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment --sequence_length 4096
cp deployment/model.onnx deployment/model-orig.onnx
```

Run this kv-cache injection afterwards:
```python
import os
import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "deployment/model-orig.onnx"
output_file = "deployment/model.onnx"

# Load only the graph; external weight data is not read into memory
model = onnx.load(input_file, load_external_data=False)
# Apply the KV-cache injection to the exported graph
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
```

## Slack

For further support and discussion of these models and AI in general, join us at [Neural Magic's Slack server](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).