|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# Model Card for DistMerge_Llama-3.1-8B-Instruct
|
|
|
|
|
|
|
**DistMerge_Llama-3.1-8B-Instruct** is a customized variant of **VAGOsolutions/Llama-3.1-SauerkrautLM-8B-Instruct**, which is itself a Spectrum fine-tuned version of **Llama-3.1-8B-Instruct**. The customization is achieved by learning the distribution of all normalization-layer weights from both the original Llama model and its fine-tuned counterpart. A layer-conditional, diffusion-based weight-generation model, which samples from the learned distributions to optimize the merging process, is used to generate the normalization layers of **bedio/DistMerge_Llama-3.1-8B-Instruct**.
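
The merged checkpoint can be loaded like any other 🤗 transformers causal language model. Below is a minimal usage sketch, assuming the repository id above, a recent transformers release with Llama 3.1 support, and `accelerate` installed for `device_map="auto"`:

```python
# Minimal usage sketch (assumptions: repo id bedio/DistMerge_Llama-3.1-8B-Instruct,
# recent transformers, accelerate installed for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bedio/DistMerge_Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What does a normalization layer do in a transformer?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```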
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
We trained a diffusion model to learn the distribution of the normalization-layer weights, enabling the generation of weights that improve model performance.
|
|
|
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
|
|
|
- **Developed by:** DeepAuto.ai
- **Shared by [optional]:** DeepAuto.ai
- **Model type:** DistMerge_Llama-3.1-8B-Instruct is a customized model created by generating diverse normalization-layer weights for the Llama-3.1-SauerkrautLM-8B-Instruct model
- **Language(s) (NLP):** German and English (the languages the base model was fine-tuned on); we use only the checkpoint provided on Hugging Face
- **License:** llama3.1
- **Contact:** DeepAuto.ai
|
|
|
|
|
|
|
### Training Procedure |
|
|
|
We employed a latent diffusion process on pretrained model weights, unlocking the ability to generate diverse, previously unseen neural networks. Remarkably, even within the constraints of one-shot learning, our approach consistently produces a wide range of weight variations, each offering distinct performance characteristics. These generated weights not only open opportunities for weight averaging and model merging but also have the potential to significantly enhance model performance. Moreover, they enable the creation of task-specific weights, tailored to optimize performance for specialized applications.
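
To make the idea concrete, the following is a minimal, illustrative sketch of a layer-conditional denoising diffusion model over latent weight vectors. The MLP denoiser, the number of layers, the embedding sizes, and the noise schedule are all assumptions for illustration; only the 1024-dimensional latent size is taken from the preprocessing notes below. This is not the released training code.

```python
# Illustrative sketch only: a layer-conditional DDPM over VAE latents of
# normalization-layer weights. Architecture and schedule are assumptions.
import torch
import torch.nn as nn

LATENT_DIM, N_LAYERS, T = 1024, 64, 1000  # latent size, number of norm layers, diffusion steps

class LayerConditionalDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_emb = nn.Embedding(N_LAYERS, 128)  # which norm layer the latent belongs to
        self.time_emb = nn.Embedding(T, 128)          # diffusion timestep embedding
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 256, 2048), nn.SiLU(),
            nn.Linear(2048, LATENT_DIM),
        )

    def forward(self, z_t, t, layer_idx):
        cond = torch.cat([self.time_emb(t), self.layer_emb(layer_idx)], dim=-1)
        return self.net(torch.cat([z_t, cond], dim=-1))  # predicts the added noise

betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, z0, layer_idx):
    """Standard epsilon-prediction DDPM loss on latents of norm-layer weights."""
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    a = alphas_bar[t].unsqueeze(-1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps
    return nn.functional.mse_loss(model(z_t, t, layer_idx), eps)
```

Sampling then runs the reverse process conditioned on a layer index, so each normalization layer's weights can be drawn independently and merged back into the checkpoint.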
|
|
|
#### Preprocessing [optional] |
|
|
|
- We selected a set of layers and combined their pretrained weights, then trained a Variational Autoencoder (VAE) to encode these weights along the layer dimension (see the sketch after this list).
- We conditionally trained a diffusion model on this set of weights, allowing layer-specific weights to be sampled individually.
- All selected layers were encoded into a 1024-dimensional space. This model contains only the sampled weights for the layer-normalization layers.
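
A minimal sketch of the kind of VAE described above, assuming each normalization layer contributes a 4096-dimensional RMSNorm weight vector (the hidden size of Llama-3.1-8B) compressed into the 1024-dimensional latent space; the architecture and loss weighting are illustrative, not the released code.

```python
# Illustrative sketch: a VAE that compresses per-layer norm weight vectors
# (assumed 4096-d, the Llama-3.1-8B hidden size) into a 1024-d latent space.
import torch
import torch.nn as nn

WEIGHT_DIM, LATENT_DIM = 4096, 1024

class NormWeightVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(WEIGHT_DIM, 2048), nn.SiLU())
        self.to_mu = nn.Linear(2048, LATENT_DIM)
        self.to_logvar = nn.Linear(2048, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 2048), nn.SiLU(), nn.Linear(2048, WEIGHT_DIM)
        )

    def forward(self, w):
        h = self.encoder(w)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(recon, w, mu, logvar, beta=1e-4):
    """Reconstruction term plus KL regularizer; beta is an assumed weighting."""
    rec = nn.functional.mse_loss(recon, w)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl
```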
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** The pretrained weights used for training are originally in bfloat16.

|
#### Speeds, Sizes, Times [optional] |
|
|
|
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
|
|
|
[More Information Needed] |
|
|
|
## Evaluation |
|
We evaluate the reconstruction and sampling performance on the Winogrande task using the lm-eval (lm-evaluation-harness) tools.
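
For reference, a hedged example of how such an evaluation can be run with the lm-evaluation-harness Python API; the exact arguments and versions used for the reported evaluation may differ.

```python
# Example only: evaluating the merged checkpoint on Winogrande with lm-evaluation-harness.
# Assumes `pip install lm-eval` and enough GPU memory for an 8B model in bfloat16.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bedio/DistMerge_Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["winogrande"],
    batch_size=8,
)
print(results["results"]["winogrande"])
```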
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
[More Information Needed] |
|
|
|
The primary objective of this weight-generation process was to demonstrate that, by learning only the distribution of a few layers' weights (the normalization layers in this case) in an 8-billion-parameter model, it is possible to significantly enhance the model's capabilities. Notably, this is achieved with a fraction of the computational resources and without the need for fine-tuning, showcasing the efficiency and potential of this approach.