
UPDATE: The official version is out; use it instead: https://huggingface.co./mistralai/Mistral-7B-v0.1





mistral-7B-v0.1-hf

Hugging Face-compatible version of Mistral AI's 7B model: https://twitter.com/MistralAI/status/1706877320844509405

Usage

Load in bfloat16 (16GB VRAM or higher)

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline, TextStreamer

tokenizer = LlamaTokenizer.from_pretrained("kittn/mistral-7B-v0.1-hf")

# load the bf16 weights directly onto GPU 0
model = LlamaForCausalLM.from_pretrained(
    "kittn/mistral-7B-v0.1-hf",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# TextStreamer prints tokens to stdout as they are generated
pipe("Hi, my name", streamer=TextStreamer(tokenizer), max_new_tokens=128)

Load in bitsandbytes nf4 (6GB VRAM or higher, maybe less with double_quant)

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline, TextStreamer, BitsAndBytesConfig

tokenizer = LlamaTokenizer.from_pretrained("kittn/mistral-7B-v0.1-hf")

# quantize the weights to 4-bit NF4 on the fly while loading onto GPU 0
model = LlamaForCausalLM.from_pretrained(
    "kittn/mistral-7B-v0.1-hf",
    device_map={"": 0},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=False,  # set to True to save more VRAM at the cost of some speed/accuracy
    ),
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

pipe("Hi, my name", streamer=TextStreamer(tokenizer), max_new_tokens=128)

Load in bitsandbytes int8 (8GB VRAM or higher). Quite slow; not recommended.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline, TextStreamer, BitsAndBytesConfig

tokenizer = LlamaTokenizer.from_pretrained("kittn/mistral-7B-v0.1-hf")

# 8-bit LLM.int8() quantization; saves VRAM compared to bf16 but inference is noticeably slower
model = LlamaForCausalLM.from_pretrained(
    "kittn/mistral-7B-v0.1-hf",
    device_map={"": 0},
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
    ),
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

pipe("Hi, my name", streamer=TextStreamer(tokenizer), max_new_tokens=128)

Notes

  • The original Hugging Face conversion script casts the model from bf16 to fp16 before saving it; this one doesn't, so the weights stay in bf16 (see the snippet after this list)
  • The tokenizer is created with legacy=False; more about this here
  • Saved in safetensors format
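
A minimal way to sanity-check the first two notes from Python (just a sketch, not part of the card's original usage; the printed dtype comes from the checkpoint's config, and tokenizer behaviour may vary slightly across transformers versions):

from transformers import AutoConfig, LlamaTokenizer

# the checkpoint was saved without the bf16 -> fp16 cast,
# so the config should report bfloat16 as the weight dtype
config = AutoConfig.from_pretrained("kittn/mistral-7B-v0.1-hf")
print(config.torch_dtype)  # expected: torch.bfloat16

# passing legacy=False explicitly mirrors how the bundled tokenizer was created
tokenizer = LlamaTokenizer.from_pretrained("kittn/mistral-7B-v0.1-hf", legacy=False)
print(tokenizer.tokenize("Hello world"))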

Conversion script [link]

Unlike meta-llama/Llama-2-7b, this model uses grouped-query attention (GQA). This breaks some assumptions in the original conversion script, requiring a few changes.
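
The gist of the change, as a rough sketch (the permute helper below mirrors the one in the original Llama conversion script; the exact names and wiring in this repo's script may differ): Mistral 7B has 32 query heads but only 8 key/value heads, so the rotary permute applied to the K projection must use the KV head count and the narrower KV width instead of assuming the query shape.

import torch

# Mistral-7B-v0.1 attention shapes (from the model config)
n_heads = 32                    # query heads
n_kv_heads = 8                  # key/value heads (GQA)
dim = 4096                      # hidden size
head_dim = dim // n_heads       # 128
kv_dim = n_kv_heads * head_dim  # 1024

# same interleaved-to-half-split rotary permute as the Llama conversion script,
# parameterised by head count and output width so it also fits the K projection
def permute(w, heads, dim1, dim2):
    return w.view(heads, dim1 // heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

wq = torch.empty(dim, dim)      # query projection: same shape as Llama-2-7b
wk = torch.empty(kv_dim, dim)   # key projection: only 8 heads wide

q_hf = permute(wq, n_heads, dim, dim)
k_hf = permute(wk, n_kv_heads, kv_dim, dim)  # the original script assumed n_heads / dim here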

Conversion script: link

Original conversion script: link
