maxpe
/

twitter-roberta-base_semeval18_emodetection


maxpe committed on Aug 12, 2021

Commit

fc4d00d

·

1 Parent(s): 11c2b2b

added README

Files changed (1)

README.md +91 -0

README.md ADDED

	@@ -0,0 +1,91 @@

+# Twitter-roBERTa-base
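+This is a Twitter-roBERTa-base model fine-tuned on ~7000 tweets annotated for 11 emotion categories in [SemEval-2018 Task 1: Affect in Tweets, Subtask 5: Emotion Classification](https://competitions.codalab.org/competitions/17751).
+Save the example script below as `predict_11emoclasses.py` and run it on a plain-text file with one tweet per line (here `testfile`). The scores are written to `testfile_11emo`.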
+```bash
+python3 predict_11emoclasses.py testfile
+```
+```python
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+Created on Wed Aug  4 17:56:24 2021
+@author: maxpe
+"""
+import transformers
+from datasets import load_dataset
+from transformers import AutoTokenizer
+import torch
+from tqdm import tqdm
+from torch import cuda
+import pandas as pd
+import sys
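+# use the GPU when one is available, otherwise fall back to the CPU
+device = 'cuda' if cuda.is_available() else 'cpu'
+# path to the input text file (one tweet per line), passed as the first argument
+file = sys.argv[1]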
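+class RobertaClass(torch.nn.Module):
+    """Twitter-roBERTa encoder followed by dropout and an 11-way linear head."""
+    def __init__(self):
+        super().__init__()
+        self.l1 = transformers.RobertaModel.from_pretrained("cardiffnlp/twitter-roberta-base")
+        self.l2 = torch.nn.Dropout(0.3)
+        self.l3 = torch.nn.Linear(768, 11)  # 768 hidden units -> 11 emotion logits
+    def forward(self, ids, mask):
+        # return_dict=False makes the encoder return a (last_hidden_state, pooler_output) tuple
+        _, output_1 = self.l1(ids, attention_mask=mask, return_dict=False)
+        output_2 = self.l2(output_1)
+        output = self.l3(output_2)
+        return output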
+model=transformers.AutoModel.from_pretrained("maxpe/twitter-roberta-base_semeval18_emodetection")
+model.config=transformers.RobertaConfig.from_pretrained("cardiffnlp/twitter-roberta-base")
+model.eval() # set model to eval mode
+model = torch.nn.DataParallel(model)
+model.to(device)
+tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base",model_max_length=512)
+# the 'text' loader yields one example per line of the input file
+dataset = load_dataset('text', data_files={'test': file})
+# every example is truncated or padded to model_max_length (512 tokens)
+dataset = dataset.map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length'), batched=True)
+dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
+# reduce this if you run out of memory
+BATCH_SIZE=32
+dataloader = torch.utils.data.DataLoader(dataset['test'], batch_size=BATCH_SIZE)
+# create (or empty) the output file; batches are appended to it below
+open(file+"_11emo","w").close()
+with torch.no_grad():
+    # swap the two "for" lines below to show a tqdm progress bar
+    # for _, data in tqdm(enumerate(dataloader, 0),total=len(dataloader)):
+    for _, data in enumerate(dataloader, 0):
+        outputs = model(data['input_ids'],data['attention_mask'])
+        # sigmoid gives an independent score in [0, 1] for each of the 11 emotions
+        fin_outputs=torch.sigmoid(outputs).tolist()
+        # append this batch as tab-separated rows, one row per input line
+        pd.DataFrame(fin_outputs).to_csv(file+"_11emo",index=False,header=False,sep="\t",mode='a')
+```
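
The script above writes one row of 11 tab-separated sigmoid scores per input line. A minimal sketch of turning those scores into emotion labels could look like the following. It rests on two assumptions that the README does not confirm: that the columns follow the alphabetical label order of the SemEval-2018 Task 1 E-c data, and that 0.5 is a sensible decision threshold. The helper `read_predictions` and the sample rows are hypothetical and only serve as illustration.

```python
import csv
import io

# ASSUMPTION: column order follows the alphabetical label order of the
# SemEval-2018 Task 1 E-c data. The README does not state the order.
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]


def read_predictions(lines, threshold=0.5):
    """Return, for each row of tab-separated scores, the labels at or above threshold.

    `lines` is any iterable of text lines, e.g. an open file object.
    The threshold of 0.5 is an assumed default, not a value from the README.
    """
    results = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        scores = [float(value) for value in row]
        if len(scores) != len(EMOTIONS):
            raise ValueError(f"expected {len(EMOTIONS)} scores, got {len(scores)}")
        results.append([name for name, score in zip(EMOTIONS, scores)
                        if score >= threshold])
    return results


# Two made-up rows standing in for the contents of "testfile_11emo".
sample = io.StringIO(
    "0.91\t0.05\t0.80\t0.10\t0.02\t0.01\t0.03\t0.20\t0.40\t0.04\t0.01\n"
    "0.02\t0.30\t0.01\t0.03\t0.95\t0.70\t0.85\t0.01\t0.02\t0.10\t0.20\n"
)
labels = read_predictions(sample)
print(labels)
```

For a real run, pass the output file instead of the sample, for example `with open("testfile_11emo", newline="") as f: labels = read_predictions(f)`. Because the task is multi-label, a row may yield several labels or none at all.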