---
license: cc-by-sa-3.0
language:
- de
---
# xLSTM Model trained on German Wikipedia
Research & development of an xLSTM model trained on German Wikipedia.
The Flair team is currently working on the integration of xLSTM (both LM training and fine-tuning models for downstream tasks).
For pretraining this xLSTM model, we use this [fork](https://github.com/HallerPatrick/helibrunna) (from [Patrick Haller](https://huggingface.co./PatrickHaller)) of the awesome [Helibrunna](https://github.com/AI-Guru/helibrunna) library from [Tristan](https://huggingface.co./TristanBehrens).
Initially, we integrated xLSTM model training into Flair - for more information about this, please refer to the archived [flair-old](https://huggingface.co./stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch of this repository.
# Changelog
- 28.08.2024: Model training is now done with [Helibrunna](https://github.com/AI-Guru/helibrunna) fork - find it [here](https://github.com/HallerPatrick/helibrunna).
- 10.06.2024: Initial version. xLSTM was trained with the Flair library, see this [old](https://huggingface.co./stefan-it/xlstm-german-wikipedia/blob/flair-old/README.md) branch.
# Training
The current model was trained with commit `f66cc55` from the [`main` branch](https://github.com/HallerPatrick/helibrunna) of the forked Helibrunna repo.
The `xlstm` [library](https://github.com/NX-AI/xlstm) needs to be installed manually; also make sure that Ninja is installed (`pip3 install Ninja`).
The German Wikipedia dump from [this repository](https://huggingface.co./datasets/gwlms/dewiki-20230701-flair-corpus) is used.
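For a quick look at the training data, the dump can be loaded with the `datasets` library. A minimal sketch, assuming the dataset id from the training configuration below (`stefan-it/dewiki-20230701`) is accessible:
```python
from datasets import load_dataset

# Load the German Wikipedia dump that is used for pretraining
# (dataset id taken from the training configuration below)
dataset = load_dataset("stefan-it/dewiki-20230701", split="train")

# Inspect size and a first example
print(dataset)
print(dataset[0])
```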
The following training configuration is used (a sketch of the resulting learning rate schedule follows after the config):
```yaml
description: "Train a wikipedia xLSTM"

training:
  model_name: "german_wikipedia"
  batch_size: 10
  lr: 6e-4
  lr_warmup_steps: 4584
  lr_decay_until_steps: "auto"
  lr_decay_factor: 0.001
  weight_decay: 0.1
  amp_precision: bfloat16
  weight_precision: float32
  enable_mixed_precision: true
  num_epochs: 1
  output_dir: "./output"
  save_every_step: 2000
  log_every_step: 10
  generate_every_step: 5000
  wandb_project: "xlstm"
  gradient_clipping: "auto"
  # wandb_project: "lovecraftxlstm"

model:
  num_blocks: 24
  embedding_dim: 768
  mlstm_block:
    mlstm:
      num_heads: 4
  slstm_block: {}
  slstm_at: []
  context_length: 512

dataset:
  output_path: "./output/german-wikipedia-dataset"
  hugging_face_id: ["stefan-it/dewiki-20230701"]
  split: "train" # Also subsetting is possible: "train[:100000]"
  shuffle: False
  seed: 42

tokenizer:
  type: "pretrained"
  pretrained_class: "LlamaTokenizer"
  pretrained_id: "meta-llama/Llama-2-7b-hf"
```
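For illustration, here is a minimal sketch of the learning rate schedule implied by the parameters above, assuming linear warmup followed by cosine decay down to `lr * lr_decay_factor` (the concrete decay horizon is hypothetical and stands in for the `"auto"` value):
```python
import math

def lr_at_step(step: int, base_lr: float = 6e-4, warmup_steps: int = 4584,
               decay_until_steps: int = 50_000, decay_factor: float = 0.001) -> float:
    """Linear warmup to base_lr, then cosine decay to base_lr * decay_factor."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to base_lr * decay_factor
    progress = min(1.0, (step - warmup_steps) / (decay_until_steps - warmup_steps))
    min_lr = base_lr * decay_factor
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```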
# Usage
The model can be used to generate text:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "stefan-it/xlstm-german-wikipedia"

# Load the checkpoint and its tokenizer from the Hugging Face Model Hub
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Encode a German prompt and sample a continuation
input_ids = tokenizer.encode("Heute ist schönes Wetter in", return_tensors="pt")
output = model.generate(input_ids, max_length=100, temperature=0.7, do_sample=True)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
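Alternatively, the same steps can be wrapped in a `text-generation` pipeline (a sketch, assuming the checkpoint loads through `AutoModelForCausalLM` as above):
```python
from transformers import pipeline

# Build a text-generation pipeline on top of the same checkpoint
generator = pipeline("text-generation", model="stefan-it/xlstm-german-wikipedia")

result = generator("Heute ist schönes Wetter in",
                   max_length=100, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```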
# Caveats
Notice: this model integration is heavily under development, and the search for good hyper-parameters is still ongoing. Downstream experiments are also coming soon.
Unfortunately, NaNs occur during training (after 7h 33m 14s of training on a single RTX 4090):
![Training Loss](training-loss.png)
This is very likely caused by the currently missing gradient norm clipping, which will be added soon via `Accelerator.clip_grad_norm_` (see the sketch below).
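A minimal sketch of what that could look like in an `accelerate`-based training loop (illustrative names, not the actual Helibrunna training code):
```python
from accelerate import Accelerator

def train_epoch(model, optimizer, dataloader, max_grad_norm: float = 1.0):
    accelerator = Accelerator(mixed_precision="bf16")
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for batch in dataloader:
        loss = model(**batch).loss
        accelerator.backward(loss)

        # Clip the global gradient norm before the optimizer step to tame
        # the exploding gradients that otherwise lead to NaN losses
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)

        optimizer.step()
        optimizer.zero_grad()
```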
The uploaded model checkpoint was taken after 80k training steps.