Model Card for DhruvParth/Mistral-7B-Instruct-v2.0-PairRM-DPO

This model is a fine-tuned version of the Mistral-7B model, utilizing Direct Preference Optimization (DPO) to better align the model's responses with human preferences, specifically in a causal language modeling context.

Model Details

Model Description

  • Developed by: Dhruv Parthasarathy
  • Model type: Fine-tuned language model
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: Mistral-7B-Instruct-v2.0

Model Sources

Uses

This model is tailored for scenarios requiring alignment with human preferences in automated responses, suitable for applications in personalized chatbots, customer support, and other interactive services.

Training Details

Notebook

The fine-tuning process and the experiments were documented in a Jupyter Notebook, available here.

Training Configuration

LoRA Configuration

LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'v_proj', 'q_proj', 'dense']
)

BitsAndBytes Configuration

BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

Training Device Setup

device_map = {"": 0}

Training Arguments

DPOConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=50,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=5,
)

DPO Trainer Setup

DPOTrainer(
    model,
    args=training_args,
    train_dataset=updated_train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=512,
    max_length=1024,
)

Evaluation

Details on the model's performance, evaluation protocols, and results will be provided as they become available.

Citation

If you use this model or dataset, please cite it as follows:

BibTeX:

@misc{dhruvparth_mistral7b_dpo_2024,
  author = {Dhruv Parthasarathy},
  title = {Fine-tuning LLMs with Direct Preference Optimization},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://huggingface.co./DhruvParth/Mistral-7B-Instruct-v2.0-PairRM-DPO}
}

APA: Dhruv Parthasarathy. (2024). Fine-tuning LLMs with Direct Preference Optimization. GitHub repository, https://huggingface.co./DhruvParth/Mistral-7B-Instruct-v2.0-PairRM-DPO

For any queries or discussions regarding the project, please open an issue in the GitHub repository, post your comment in the community section, reach out to me via LinkedIn (https://www.linkedin.com/in/parthadhruv/) or contact me directly at [email protected].

Downloads last month
8
Safetensors
Model size
7.24B params
Tensor type
FP16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train DhruvParth/Mistral-7B-Instruct-v0.2-DPO-v0.1