Llama3-8b-simorgh / README.md

model-index:

  • name: xmanii/llama-3-8b-instruct-bnb-4bit-persian
    description: |

    Model Information

    Developed by: xmanii
    License: Apache-2.0
    Finetuned from model: unsloth/llama-3-8b-instruct-bnb-4bit

    Model Description

    This Llama 3 model was fine-tuned on a Persian dataset of Alpaca-style chat conversations containing approximately 8,000 rows. Training ran on two H100 GPUs and finished in just under one hour. We used Unsloth together with Hugging Face's TRL library, which made training about 2x faster.

    Open-Source Contribution

    This model is open source, and we invite the community to use it and build on it. It was fine-tuned to improve conversational ability in Persian, and we hope it contributes to the advancement of Persian natural language processing.

    Using Adapters with Unsloth

    To run the model with adapters, you can use the following code:

    import torch
    from unsloth import FastLanguageModel
    from unsloth.chat_templates import get_chat_template
    
    model_save_path = "path to the download folder"  # local folder where the Hugging Face repo was downloaded
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_save_path,
        max_seq_length=4096,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
    
    tokenizer = get_chat_template(
        tokenizer,
        chat_template="llama-3",  # use the llama-3 template
        mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},  # map the from/value, human/gpt keys onto role/content, user/assistant
    )
    
    messages = [{"from": "human", "value": "your prompt"}]  # put your prompt in "value"; "human" marks the user turn
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Must add for generation
        return_tensors="pt",
    ).to("cuda")
    
    outputs = model.generate(input_ids=inputs, max_new_tokens=2048, use_cache=True)
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(response)
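
To make the chat-template step above more concrete, here is a minimal pure-Python sketch of roughly what the `llama-3` template and the `from`/`value` mapping produce for a list of messages. It is an approximation for illustration, not Unsloth's or the tokenizer's actual implementation: the helper `render_llama3_prompt` and the `ROLE_MAP` table are hypothetical names, and the real template may differ in details such as whitespace handling. In practice `tokenizer.apply_chat_template` does this work and also converts the text to token IDs.

```python
# Illustrative sketch only: approximates the text the "llama-3" chat template
# produces. The real rendering is done by tokenizer.apply_chat_template.

# Hypothetical table mirroring the mapping passed to get_chat_template above:
# the "human"/"gpt" speaker names become the "user"/"assistant" roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render from/value messages into a Llama 3 style prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        role = ROLE_MAP[message["from"]]
        # Each turn is a role header, a blank line, the text, then an end-of-turn token.
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n\n")
        parts.append(f"{message['value']}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_llama3_prompt([{"from": "human", "value": "your prompt"}]))
```

The sketch also shows why `add_generation_prompt=True` matters for generation: without the trailing assistant header, the prompt ends on the user's turn and the model has no cue to start its reply.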
    

    Future Work

    We are working on quantizing the model and making it available through Ollama.