tags: null
model-index:
  - name: bertweet-covid--vaccine-tweets-finetuned
    results: []

bertweet-covid19-base-uncased-pretraining-covid-vaccine-tweets

This model is a fine-tuned version of justinqbui/bertweet-covid19-base-uncased-pretraining-covid-vaccine-tweets. It was fine-tuned on three sources: a Google Fact Check dataset (~3,000 examples), web-scraped PolitiFact COVID data (~1,200 examples), and ~1,200 tweets pulled from the CDC that contain the words "covid" or "vaccine". It achieves the following results on the evaluation set (a randomly shuffled 20% of the dataset, held out as a test set):

  • Validation Loss: 0.246620
  • Accuracy: 0.902417 (about 90.2%)
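The held-out split described above can be sketched as follows. This is a minimal illustration, not the author's script: the function name `shuffled_split` is hypothetical, and reusing the training seed (42) for the shuffle is an assumption.

```python
import random


def shuffled_split(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples and hold out a fraction as a test set.

    Hypothetical helper; the seed value is an assumption for reproducibility.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    train, test = shuffled_split(range(100))
    print(len(train), len(test))
```

Fixing the seed keeps the split identical across runs, so evaluation numbers stay comparable.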

To use the model, use the inference API.

Alternatively, to run it locally:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("justinqbui/bertweet-pretraining-covid-vaccine-tweets-finetuned")
model = AutoModelForSequenceClassification.from_pretrained("justinqbui/bertweet-pretraining-covid-vaccine-tweets-finetuned")
```
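Once the tokenizer and model are loaded, a single prediction might look like the sketch below. The helper names `softmax` and `classify` are hypothetical. The label names come from whatever `model.config.id2label` contains, which this card does not document, so treat the returned keys as unknown until you inspect them.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text, model_id="justinqbui/bertweet-pretraining-covid-vaccine-tweets-finetuned"):
    """Return a {label: probability} dict for one tweet.

    Imports are local so the helper above can be used without torch installed.
    Calling this function downloads the model from the Hub.
    """
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    probs = softmax(logits)
    # id2label is read from the model config; its contents are not documented here.
    return {model.config.id2label[i]: p for i, p in enumerate(probs)}
```

For example, `classify("Vaccines are available at your local pharmacy.")` returns one probability per label.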

Model description

This model is a fine-tuned version of the pretrained model justinqbui/bertweet-covid19-base-uncased-pretraining-covid-vaccine-tweets. See that model's card for details on how the pre-training was done.

This model was fine-tuned on a dataset of ~5,500 examples. A web scraper was used to scrape PolitiFact, and a script was used to pull from the Google Fact Check API. Because ~80% of both of these datasets were either false or misleading, I pulled ~1,200 COVID-related tweets from the CDC and labelled them as true. ~30% of the combined dataset is considered true and the rest false or misleading. Please see the published datasets above for more detailed information.

The tokenizer requires the emoji library to be installed.

```
!pip install nltk emoji
```
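Outside a notebook, it can help to confirm these packages are importable before loading the tokenizer. A minimal sketch, where `missing_packages` is a hypothetical helper:

```python
import importlib.util


def missing_packages(names=("emoji", "nltk")):
    """Return the subset of package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install first: pip install " + " ".join(missing))
    else:
        print("All tokenizer dependencies are available.")
```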

Intended uses & limitations

This model is intended to flag COVID-related tweets whose content is potentially false or misleading. It is not a definitive fact checker and has many limitations. For example, it only sees text, so if a post has a satirical image attached, the model cannot take the image into account. If a user links a website, the tokenizer replaces the link with a special token, so the contents of the linked page are lost entirely. If a tweet is a reply, the model cannot see the parent tweets and will lack that context.
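To make the link limitation concrete: BERTweet-style tweet normalization replaces URLs with the placeholder `HTTPURL` and user mentions with `@USER`. The regex sketch below only approximates that behaviour and is not the tokenizer's own code. It also assumes this model's tokenizer follows the same convention, and `normalize_tweet` is a hypothetical name.

```python
import re

# Approximate patterns; the real tokenizer works token by token.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")


def normalize_tweet(text):
    """Replace links and mentions with BERTweet-style placeholder tokens."""
    text = URL_RE.sub("HTTPURL", text)
    text = MENTION_RE.sub("@USER", text)
    return text


if __name__ == "__main__":
    print(normalize_tweet("@cdcgov explains boosters at https://example.com/boosters"))
```

After this step, two tweets that link to entirely different pages look identical to the model wherever the link appeared.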

The quality of this model's dataset depends on the crowd-sourced annotations being accurate.

Training and evaluation data

This model was fine-tuned on a Google Fact Check dataset (~3,000 examples), web-scraped PolitiFact COVID data (~1,200 examples), and ~1,200 tweets pulled from the CDC that contain the words "covid" or "vaccine".

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-5
  • train_batch_size: 128
  • eval_batch_size: 128
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3.0
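These settings could be expressed with the Hugging Face `TrainingArguments` class, as sketched below. That the author trained with `Trainer` is an assumption, as are the output directory and the absence of warmup. `linear_lr` is a hypothetical helper that illustrates the linear decay schedule.

```python
# Hyperparameters as listed in this card, using TrainingArguments keyword names.
HPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 128,
    "per_device_eval_batch_size": 128,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3.0,
}


def build_training_args(output_dir="./finetune-output"):
    """Build TrainingArguments; output_dir is a placeholder path (assumption)."""
    from transformers import TrainingArguments  # local import: optional dependency

    return TrainingArguments(output_dir=output_dir, **HPARAMS)


def linear_lr(step, total_steps, base_lr=HPARAMS["learning_rate"]):
    """Learning rate under linear decay to zero, assuming no warmup steps."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining
```

With a linear schedule and no warmup, the learning rate starts at 2e-5 and falls to zero by the final optimizer step.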

Training results

| Training Loss | Epoch | Validation Loss | Accuracy |
|---------------|-------|-----------------|----------|
| 0.435500      | 1.0   | 0.401900        | 0.906893 |
| 0.309700      | 2.0   | 0.265500        | 0.907789 |
| 0.266200      | 3.0   | 0.216500        | 0.911370 |
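The Accuracy column is presumably the fraction of evaluation examples whose highest-scoring class matches the reference label. The author's metric code is not shown, so the sketch below is an assumed definition with a hypothetical function name.

```python
def accuracy(logits, labels):
    """Fraction of rows whose argmax over class scores equals the label."""
    if len(logits) != len(labels) or not labels:
        raise ValueError("logits and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        if predicted == label:
            correct += 1
    return correct / len(labels)


if __name__ == "__main__":
    scores = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
    print(accuracy(scores, [1, 0, 0]))
```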

Framework versions

  • Transformers 4.13.0
  • Pytorch 1.10.0+cu111
  • Datasets 1.16.1
  • Tokenizers 0.10.3