VictorSanh committed on
Commit 1e6e357 · 1 Parent(s): a454db2

comments about layer norms

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -109,7 +109,7 @@ More information needed

  # Training Details

- We closely follow the training procedure laid out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks.
+ We closely follow the training procedure laid out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.

  The model is trained on the following data mixture of openly accessible English data:

@@ -129,9 +129,9 @@ The model is trained on the following data mixture of openly accessible English
  **OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).


- For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions.
+ For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.

- Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we apply layer normalization to the projected queries and keys of both the Perceiver and cross-attention blocks, which improved training stability in our early experiments. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.
+ Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we apply layer normalization to the projected queries and keys of both the Perceiver and cross-attention blocks, which improved training stability in our early experiments. We use the [RMSNorm](https://huggingface.co/papers/1910.07467) implementation for trainable Layer Norms.

  The training objective is the standard next token prediction.
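
To make the query/key normalization described in this change concrete, here is a minimal pure-Python sketch of RMSNorm applied to already-projected queries and keys before the attention scores are computed. It is an illustration only, not the model's code: the function names `rms_norm` and `qk_norm_attention_weights`, the `eps` value, the unit gains, and the toy vectors are all assumptions made for this example.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then scale by a trainable gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]


def qk_norm_attention_weights(queries, keys, q_gain, k_gain):
    """Softmax attention weights computed from normalized queries and keys.

    queries, keys: lists of already-projected vectors of equal length d.
    q_gain, k_gain: hypothetical trainable gain vectors of length d.
    """
    d = len(q_gain)
    q_normed = [rms_norm(q, q_gain) for q in queries]
    k_normed = [rms_norm(k, k_gain) for k in keys]
    weights = []
    for q in q_normed:
        # Scaled dot-product scores against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_normed]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy input: one query and two keys with deliberately large magnitudes.
gain = [1.0, 1.0, 1.0, 1.0]  # gains start at 1 in this sketch
queries = [[100.0, 0.0, 0.0, 0.0]]
keys = [[50.0, 0.0, 0.0, 0.0], [0.0, 50.0, 0.0, 0.0]]
attn = qk_norm_attention_weights(queries, keys, gain, gain)
```

Because both queries and keys are rescaled to unit root mean square before the dot product, the attention scores stay bounded regardless of the magnitude of the projections, which is one plausible reading of why this normalization helps training stability.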