---
datasets:
- imdb
- cornell_movie_dialogue
- polarity_movie_data
- 25mlens_movie_data
language:
- English
thumbnail: null
tags:
- roberta
- roberta-base
- masked-language-modeling
- masked-lm
license: cc-by-4.0
---

# roberta-base for MLM
Objective: Adapt RoBERTa-base to the movie domain by continuing masked-language-model (MLM) pretraining on plain text from several movie datasets. The resulting Movie RoBERTa is intended as a base model for downstream movie-domain applications.
```python
from transformers import pipeline

model_name = "thatdramebaazguy/movie-roberta-base"
fill_mask = pipeline(task="fill-mask", model=model_name, tokenizer=model_name, revision="v1.0")
```
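The pipeline can then be queried with RoBERTa's `<mask>` token; the sentence below is only an illustration:

```python
# Ask the model to fill in the masked token in a movie-domain sentence.
for prediction in fill_mask("The <mask> was directed by Christopher Nolan."):
    print(prediction["token_str"], prediction["score"])
```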
## Overview
**Language model:** roberta-base
**Language:** English
**Downstream task:** Fill-Mask
**Training data:** imdb, polarity movie data, cornell_movie_dialogue, 25mlens movie names
**Eval data:** imdb, polarity movie data, cornell_movie_dialogue, 25mlens movie names
**Infrastructure:** 4x Tesla V100
**Code:** see the usage example above
## Hyperparameters
```
Num examples = 4767233
Num epochs = 2
Instantaneous batch size per device = 20
Total train batch size (w. parallel, distributed & accumulation) = 80
Gradient accumulation steps = 1
Total optimization steps = 119182
learning_rate = 5e-05
n_gpu = 4
```
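The card does not include the training script itself; the following is a hypothetical `TrainingArguments` sketch consistent with the numbers above (119,182 steps ≈ 2 epochs over 4,767,233 examples at a total batch size of 80):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the run configuration listed above;
# the actual training script is not part of this card.
training_args = TrainingArguments(
    output_dir="movie-roberta-base",  # assumed output path
    num_train_epochs=2,
    per_device_train_batch_size=20,   # 4 GPUs x 20 = total batch size 80
    gradient_accumulation_steps=1,
    learning_rate=5e-05,
)
```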
## Performance

```
eval_loss = 1.6153
eval_samples = 20573
perplexity = 5.0296
```
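As a sanity check, the reported perplexity is consistent with the evaluation loss, since perplexity is the exponential of the mean cross-entropy loss:

```python
import math

# exp(mean cross-entropy loss) gives the model's perplexity
print(math.exp(1.6153))  # ≈ 5.03, matching the reported 5.0296
```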
Some of my work: