yguo262 committed
Commit 950d211 · Parent: 54f1db1

Update README.md

Files changed (1): README.md (+4 −4)
README.md CHANGED

````diff
@@ -1,10 +1,10 @@
 # SocBERT model
 Pretrained model on 20GB English tweets and 72GB Reddit comments using a masked language modeling (MLM) objective.
-The tweets are from [Archive](https://archive.org/details/twitterstream) and collected from Twitter Streaming API.
-The Reddit comments are ramdonly sampled from all subreddits from 2015-2019.
-The model was trained from scratch following the model architecture of RoBERTa-base.
+The tweets are from Archive and collected from Twitter Streaming API.
+The Reddit comments are ramdonly sampled from all subreddits from 2015-2019.
+SocBERT-base was pretrained on 819M sequence blocks for 100K steps.
+SocBERT-final was pretrained on 929M (819M+110M) sequence blocks for 112K (100K+12K) steps.
 We benchmarked SocBERT, on 40 text classification tasks with social media data.
-The model was pre-trained on 160M sequence blocks for 950K steps of which the maximum sequence length is 128.
 
 The experiment results can be found in our paper:
 ```
````
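The README above pretrains with a masked language modeling (MLM) objective. As a rough illustration only, here is a minimal sketch of what MLM masking does; the `mask_tokens` helper and the 15% mask rate follow the standard BERT recipe and are assumptions for illustration, not code from this repository:

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder token the model must fill in

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace ~mask_prob of tokens with [MASK].

    Returns (masked_tokens, labels): labels hold the original token at
    each masked position and None elsewhere, so the model is only
    scored on the positions it had to reconstruct.
    """
    rng = random.Random(seed)  # seeded for a reproducible sketch
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Toy usage on whitespace-split words:
words = "the model was pretrained on english tweets and reddit comments".split()
masked, labels = mask_tokens(words)
```

In real pretraining the masking operates on subword ids rather than words, and the BERT recipe additionally replaces some selected positions with random tokens or leaves them unchanged (the 80/10/10 split); this sketch omits those details.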