pablo-rf committed on
Commit 2633116
1 Parent(s): 6070c18

Update README.md

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -84,13 +84,13 @@ widget:
  example_title: O neno
  ---

- # FLOR-1.3B-GL
+ # Carballo-bloom-1.3B

  ## Table of Contents
  <details>
  <summary>Click to expand</summary>

- - [FLOR-1.3B-GL](#flor-13b-gl)
+ - [Carballo-bloom-1.3B](#carballo-bloom-13)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
@@ -113,12 +113,12 @@ widget:

  ## Model description

- **FLOR-1.3B-GL** is a 1.3B-parameter transformer-based causal language model for Galician.
+ **Carballo-bloom-1.3B** is a 1.3B-parameter transformer-based causal language model for Galician.
  It is the result of continual pretraining of [FLOR-1.3B](https://huggingface.co/projecte-aina/FLOR-1.3B) (developed by the [AINA Project](https://projecteaina.cat/) and based on [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7)) with the Galician corpus [CorpusNos](https://zenodo.org/records/10687642).

  ## Intended uses and limitations

- The **FLOR-1.3B-GL** model is ready to use only for causal language modeling.
+ The **Carballo-bloom-1.3B** model is ready to use only for causal language modeling.
  It can perform text-generation tasks and be fine-tuned for specific scenarios.

  ## How to use
@@ -128,7 +128,7 @@ from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

  input_text = "Hoxe fai un bo día. O sol "

- model_id = "proxectonos/FLOR-1.3B-GL"
+ model_id = "proxectonos/Carballo-bloom-1.3B"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id)
  generator = pipeline(
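The hunk above shows only the head of the README's usage snippet. For reference, here is a minimal runnable sketch of the complete example; the `pipeline` task name and the generation arguments (`do_sample`, `max_new_tokens`) are assumptions, since the diff cuts off at `generator = pipeline(`.

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

input_text = "Hoxe fai un bo día. O sol "

model_id = "proxectonos/Carballo-bloom-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The diff truncates after `generator = pipeline(`; the task name and
# generation arguments below are assumed, not taken from the README.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
outputs = generator(input_text, do_sample=True, max_new_tokens=25)
print(outputs[0]["generated_text"])  # the prompt plus a Galician continuation
```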
@@ -157,10 +157,10 @@ It was trained using HuggingFace Transformers and PyTorch, using the [Causal Mod

  ### Language adaptation and training

- The language adaptation technique used to train FLOR-1.3B-GL is based on the one used to train FLOR-1.3B, which its authors explain in this [Medium post](https://medium.com/@mpamies247/flor-6-3b-a-chinchilla-compliant-model-for-catalan-spanish-and-english-7cdb389a9aac). In summary, we proceeded as follows:
+ The language adaptation technique used to train Carballo-bloom-1.3B is based on the one used to train FLOR-1.3B, which its authors explain in this [Medium post](https://medium.com/@mpamies247/flor-6-3b-a-chinchilla-compliant-model-for-catalan-spanish-and-english-7cdb389a9aac). In summary, we proceeded as follows:
  1) We trained our own BPE tokenizer for Galician and replaced the original FLOR-1.3B tokenizer and vocabulary with it.
  2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
- 3) The embeddings of tokens not present in FLOR-1.3B-GL's original vocabulary were initialized as the average of all embeddings.
+ 3) The embeddings of tokens not present in Carballo-bloom-1.3B's original vocabulary were initialized as the average of all embeddings.
  4) The model was initialized with the weights from FLOR-1.3B and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
  5) The model was then trained on a Galician corpus.
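To make steps 2-3 above concrete, here is a minimal sketch of the matching-token embedding copy and the average-embedding fallback, assuming standard `transformers` APIs. The tokenizer path is a hypothetical placeholder; this illustrates the technique, not the project's actual training code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "path/to/galician-bpe-tokenizer" is a hypothetical placeholder for the
# BPE tokenizer trained in step 1; the real path is not shown in the diff.
source_id = "projecte-aina/FLOR-1.3B"
source_tokenizer = AutoTokenizer.from_pretrained(source_id)
target_tokenizer = AutoTokenizer.from_pretrained("path/to/galician-bpe-tokenizer")

model = AutoModelForCausalLM.from_pretrained(source_id)
old_embeddings = model.get_input_embeddings().weight.data

# Step 3: default every row of the new embedding matrix to the average
# of all original embeddings.
new_embeddings = old_embeddings.mean(dim=0).repeat(len(target_tokenizer), 1)

# Step 2: for tokens present in both vocabularies, copy the original row.
source_vocab = source_tokenizer.get_vocab()
for token, new_idx in target_tokenizer.get_vocab().items():
    old_idx = source_vocab.get(token)
    if old_idx is not None:
        new_embeddings[new_idx] = old_embeddings[old_idx]

# Step 4: install the adapted embeddings (BLOOM ties input and output
# embeddings, so resizing also adjusts the LM head).
model.resize_token_embeddings(len(target_tokenizer))
model.get_input_embeddings().weight.data.copy_(new_embeddings)
```

With the adapted embeddings in place, step 5 is ordinary causal-language-modeling training on the Galician corpus.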
 
 