metadata

language:
  - en
  - de
license: apache-2.0
library_name: transformers
datasets:
  - FreedomIntelligence/sharegpt-deutsch
  - mayflowergmbh/oasst_de
  - mayflowergmbh/dolly_15k_de
  - mayflowergmbh/openschnabeltier_de
  - mayflowergmbh/ultrachat_de
  - WizardLM/WizardLM_evol_instruct_V2_196k
  - mayflowergmbh/evol_instruct_de
  - mayflowergmbh/alpaca-gpt4_de
  - mayflowergmbh/dolphin_de
  - mayflowergmbh/airoboros_de
pipeline_tag: text-generation
model-index:
  - name: ende-chat-0.0.7
    results: []

Model Card for EnDe-chat-0.0.7

Preliminary LoRA finetune of Mistral-7B on high-quality German and English text.

This version uses an extended tokenizer so that the model can handle longer input.

This is an experiment to improve the German capabilities of Mistral with continued finetuning. The finetuning data also includes English text to retain the model's English capabilities, so that it can be used for translation and for answering German questions about English documents and vice versa.

Unfortunately, the compute available for this experiment (2xV100) was not at all sufficient for the amount of training data we would have liked to include.

After continued pretraining, this model has received instruction finetuning.

Model Details
- Model Description
Uses
- Out-of-Scope Use
Bias, Risks, and Limitations
- Recommendations
Training Details
- Training Data
- Training Procedure
Evaluation
Examples

Model Details

Model Description

LoRA finetune of Mistral-7B on high-quality German and English text.

Developed by: Erich Schubert
Model type: Language model
Language(s) (NLP): deu, eng
License: apache-2.0
Parent Model: mistralai/Mistral-7B-v0.1
Resources for more information: n/a

Uses

Model finetuned for chat in German and English.
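
A minimal sketch of how the model might be loaded for chat with `transformers` (the `library_name` declared in the metadata). Several things here are assumptions rather than facts from this card: the repository id `kno10/ende-chat-0.0.7`, the "Human:/Assistant:" prompt layout (assumed to approximate the LLaMA-Factory `default` template used for the chat training), and the generation settings.

```python
def build_prompt(user_message: str) -> str:
    """Format a single-turn chat prompt.

    ASSUMPTION: this "Human:/Assistant:" layout approximates the LLaMA-Factory
    `default` template; check it against the LLaMA-Factory version actually used.
    """
    return f"Human: {user_message.strip()}\nAssistant:"


def chat(
    user_message: str,
    model_id: str = "kno10/ende-chat-0.0.7",  # ASSUMPTION: Hub repository id
    max_new_tokens: int = 256,
) -> str:
    """Generate one reply. Downloads the model weights on first use."""
    # Imported lazily so build_prompt works without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # the card reports fp16 training on V100s
        device_map="auto",          # requires the `accelerate` package
    )

    inputs = tokenizer(build_prompt(user_message), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling `chat("Was ist die Hauptstadt von Bayern?")` would then return the model's reply as a string. If the prompt layout does not match the template used in training, output quality will likely suffer, so it is worth verifying first.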

Out-of-Scope Use

The model has received only basic chat/instruction finetuning and no alignment; it is intended as a chat foundation model.

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Recommendations

Further finetuning necessary!

Training Details

Training Data

Pretrained on proprietary text collected from the internet, with a focus on quality German and English text.

Typical benchmarking data should not be present in this data set.

This is less certain for the instruction finetuning data sets, but far less data and compute went into instruction tuning than into pretraining.

Training Procedure

Initial LoRA finetuning with LLaMA-Factory using a mixture of English and German data, with a focus on data quality.

Unfortunately, I could have used 100x as much GPU power as I had available for this experiment, so I had to subsample the data heavily. As is, this is largely a proof of concept to see if we can improve model quality with better data.

This version then received basic chat/instruction training with the following LLaMA-Factory options:

    --stage sft \
    --model_name_or_path ende-0.0.7 \
    --finetuning_type lora \
    --template default \
    --dataset_dir data \
    --dataset sharegpt-deutsch,oasst_de,dolly_15k_de,openschnabeltier_de,ultrachat_de,evol_instruct,evol_instruct_de,alpaca-gpt4_de,dolphin_de,airoboros_de \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 1.0 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --neftune_noise_alpha 0 \
    --lora_target all \
    --lora_rank 8 \
    --lora_dropout 0 \
    --fp16 True
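
As a rough sanity check on what these flags imply, the following sketch computes the effective batch size and the number of trainable LoRA parameters. It assumes that both V100 GPUs mentioned earlier were used in data-parallel mode, that `--lora_target all` adapts the seven linear projections in each transformer layer (and not the embeddings or output head), and that the layer shapes match the published Mistral-7B-v0.1 configuration. None of this is confirmed by the card.

```python
# Values taken from the training flags above.
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
LORA_RANK = 8

# ASSUMPTION: both V100s were used data-parallel for this stage.
NUM_GPUS = 2

# ASSUMPTION: shapes from the published Mistral-7B-v0.1 config.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32

# (in_features, out_features) of each adapted linear layer per block.
# ASSUMPTION: "all" means exactly these seven projections.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def effective_batch_size(per_device: int, accum_steps: int, num_gpus: int) -> int:
    """Sequences contributing to one optimizer step."""
    return per_device * accum_steps * num_gpus


def lora_trainable_params(rank: int, shapes: dict, num_layers: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_block * num_layers


print(effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, NUM_GPUS))  # 64
print(lora_trainable_params(LORA_RANK, LINEAR_SHAPES, NUM_LAYERS))  # 20971520
```

Under these assumptions each optimizer step sees 64 sequences of at most 1024 tokens, and the adapter trains about 21 million parameters, well under 1% of the roughly 7 billion parameters of the base model.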

Unfortunately, most of this fine-tuning data is just automatically translated from English. I do not think this leads to particularly high-quality data.

Evaluation

Not fully evaluated, as it has not been completely trained.

Also, I believe that our benchmarks tend to be misleading. In particular, the Hugging Face leaderboard is flooded with overfitted models of little to no value. Real-world performance may be task-specific and needs to be evaluated carefully on a case-by-case basis. I hope some will find this model useful!

You are welcome to contribute evaluation scores!
