---
license: apache-2.0
datasets:
- oscar-corpus/OSCAR-2109
language:
- es
- en
pipeline_tag: text-generation
library_name: transformers
---
# B-GPT_es_en_simultaneous
This is a bilingual GPT-2 style model. For the first half of training, the model was trained only on Spanish data; in the second half, it was trained on a 50%-50% mix of Spanish and English data. By the end of training, 75% of the data seen by the model was Spanish and 25% was English. The tokenizer was trained on the same overall data proportions as the language model at its final step.
## Model details:
All models are trained with a [CLS] (same as [BOS]) token prepended and a [SEP] (same as [EOS]) token separating sequences.
For best results, make sure that [CLS] is prepended to your input sequence (see the sample usage below)!
Details for this model specifically:
* Architecture: gpt2
* Parameters: 124,770,816
* Maximum sequence length: 512 tokens
* Training tokens: 12B
* Vocabulary size: 50000
* Compute cost: ~9 NVIDIA A6000 GPU hours
* CO2 Emission: 1.17 kg
Training dataset: [OSCAR 2021/09](https://huggingface.co./datasets/oscar-corpus/OSCAR-2109)
Checkpoints are taken at training steps: 0, 10000, 20000, 30000, 40000, 50000, 64000, 64010, 64020, 64030, 64040, 64050, 64060, 64070, 64080, 64090, 64100, 64110, 64120, 64130, 64140, 64150, 64160, 64170, 64180, 64190, 64200, 64300, 64400, 64500, 64600, 64700, 64800, 64900, 65000, 66000, 67000, 68000, 69000, 70000, 80000, 90000, 100000, 110000, 120000, 128000.
## Use This Model
Load the model:
Note: if you do not specify a revision, the final checkpoint of the model is loaded. See the list of checkpoints above; each checkpoint's training step is the name of its revision.
```
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_es_en_simultaneous")
model = AutoModel.from_pretrained("catherinearnett/B-GPT_es_en_simultaneous", revision="128000")
```
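Because each checkpoint is published as a git revision named after its training step, intermediate checkpoints can be loaded the same way. A minimal sketch for comparing a few checkpoints across training (the chosen steps are illustrative; see the full checkpoint list above):

```
from transformers import AutoModel

# Load a few intermediate checkpoints; the revision name is the training step.
for step in ["10000", "64000", "128000"]:
    model = AutoModel.from_pretrained(
        "catherinearnett/B-GPT_es_en_simultaneous", revision=step
    )
    print(f"step {step}: {sum(p.numel() for p in model.parameters())} parameters")
```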
Text Generation:
```
from transformers import pipeline
pipe = pipeline("text-generation", model="catherinearnett/B-GPT_es_en_simultaneous")
pipe("I am a")
```
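As noted above, the model expects a [CLS] token at the start of its input, which the pipeline call does not add explicitly. The sketch below is one way to apply that convention manually; it assumes the tokenizer exposes the token as `cls_token` and uses `AutoModelForCausalLM` for generation, neither of which is shown in the original card:

```
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_es_en_simultaneous")
model = AutoModelForCausalLM.from_pretrained(
    "catherinearnett/B-GPT_es_en_simultaneous", revision="128000"
)

# Prepend [CLS] (the BOS token) to the prompt before encoding
# (assumes tokenizer.cls_token is set for this tokenizer).
prompt = "I am a"
inputs = tokenizer(tokenizer.cls_token + prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```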
## Citation
If you use this model, please cite:
```
```