Disclaimer: This model is still under testing and may change in the future. We will try to keep backwards compatibility. For any questions, reach us at [email protected]

MediaWatch News Topics (Greek)

Fine-tuned model for multi-label text classification (SequenceClassification), based on roberta-el-news, using Hugging Face's Transformers library. The model classifies news in real time on up to 33 topics: AFFAIRS, AGRICULTURE, ARTS_AND_CULTURE, BREAKING_NEWS, BUSINESS, COVID, ECONOMY, EDUCATION, ELECTIONS, ENTERTAINMENT, ENVIRONMENT, FOOD, HEALTH, INTERNATIONAL, LAW_AND_ORDER, MILITARY, NON_PAPER, OPINION, POLITICS, REFUGEE, REGIONAL, RELIGION, SCIENCE, SOCIAL_MEDIA, SOCIETY, SPORTS, TECH, TOURISM, TRANSPORT, TRAVEL, WEATHER, CRIME, JUSTICE.

How to use

You can use this model directly with a pipeline for text-classification:

from transformers import pipeline

pipe = pipeline(
    task="text-classification", 
    model="cvcio/mediawatch-el-topics", 
    tokenizer="cvcio/roberta-el-news" # or cvcio/mediawatch-el-topics
)

topics = pipe(
    "Η βιασύνη αρκετών χωρών να άρουν τους περιορισμούς κατά του κορονοϊού, "+
    "αν όχι να κηρύξουν το τέλος της πανδημίας, με το σκεπτικό ότι έφτασε "+
    "πλέον η ώρα να συμβιώσουμε με την Covid-19, έχει κάνει μερικούς πιο "+
    "επιφυλακτικούς επιστήμονες να προειδοποιούν ότι πρόκειται μάλλον "+
    "για «ενδημική αυταπάτη» και ότι είναι πρόωρη τέτοια υπερβολική "+
    "χαλάρωση. Καθώς τα κρούσματα της Covid-19, μετά το αιφνιδιαστικό "+
    "μαζικό κύμα της παραλλαγής Όμικρον, εμφανίζουν τάση υποχώρησης σε "+
    "Ευρώπη και Βόρεια Αμερική, όπου περισσεύει η κόπωση μεταξύ των "+
    "πολιτών μετά από δύο χρόνια πανδημίας, ειδικοί και μη αδημονούν να "+
    "«ξεμπερδέψουν» με τον κορονοϊό.",
    padding=True,
    truncation=True,
    max_length=512,
    return_all_scores=True  # return a score for every label (deprecated in newer Transformers releases)
)

print(topics)

# outputs 
[
  [
    {'label': 'AFFAIRS', 'score': 0.0018806682201102376}, 
    {'label': 'AGRICULTURE', 'score': 0.00014653144171461463}, 
    {'label': 'ARTS_AND_CULTURE', 'score': 0.0012948638759553432}, 
    {'label': 'BREAKING_NEWS', 'score': 0.0001729220530251041}, 
    {'label': 'BUSINESS', 'score': 0.0028276608791202307}, 
    {'label': 'COVID', 'score': 0.4407998025417328}, 
    {'label': 'ECONOMY', 'score': 0.039826102554798126}, 
    {'label': 'EDUCATION', 'score': 0.0019098613411188126}, 
    {'label': 'ELECTIONS', 'score': 0.0003333651984576136}, 
    {'label': 'ENTERTAINMENT', 'score': 0.004249618388712406}, 
    {'label': 'ENVIRONMENT', 'score': 0.0015828514005988836}, 
    {'label': 'FOOD', 'score': 0.0018390495097264647}, 
    {'label': 'HEALTH', 'score': 0.1204477995634079}, 
    {'label': 'INTERNATIONAL', 'score': 0.25892165303230286}, 
    {'label': 'LAW_AND_ORDER', 'score': 0.07646272331476212}, 
    {'label': 'MILITARY', 'score': 0.00033025629818439484}, 
    {'label': 'NON_PAPER', 'score': 0.011991199105978012}, 
    {'label': 'OPINION', 'score': 0.16166265308856964}, 
    {'label': 'POLITICS', 'score': 0.0008890336030162871}, 
    {'label': 'REFUGEE', 'score': 0.0011504743015393615}, 
    {'label': 'REGIONAL', 'score': 0.0008734092116355896}, 
    {'label': 'RELIGION', 'score': 0.0009001944563351572}, 
    {'label': 'SCIENCE', 'score': 0.05075162276625633}, 
    {'label': 'SOCIAL_MEDIA', 'score': 0.00039615994319319725}, 
    {'label': 'SOCIETY', 'score': 0.0043518817983567715}, 
    {'label': 'SPORTS', 'score': 0.002416545059531927}, 
    {'label': 'TECH', 'score': 0.0007818648009561002}, 
    {'label': 'TOURISM', 'score': 0.011870541609823704}, 
    {'label': 'TRANSPORT', 'score': 0.0009422845905646682}, 
    {'label': 'TRAVEL', 'score': 0.03004464879631996}, 
    {'label': 'WEATHER', 'score': 0.00040286066359840333}, 
    {'label': 'CRIME', 'score': 0.0005416403291746974}, 
    {'label': 'JUSTICE', 'score': 0.000990519649349153}
  ]
]
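Because the task is multi-label, more than one topic can apply to a single text. A minimal sketch of turning scores like the ones above into a list of predicted topics follows. The 0.1 cut-off and the helper name `select_topics` are illustrative assumptions, not part of the model; a suitable threshold would normally be tuned on validation data.

```python
# A subset of the scores printed above, in the same nested format the pipeline returns.
topics = [
    [
        {"label": "COVID", "score": 0.4407998025417328},
        {"label": "ECONOMY", "score": 0.039826102554798126},
        {"label": "HEALTH", "score": 0.1204477995634079},
        {"label": "INTERNATIONAL", "score": 0.25892165303230286},
        {"label": "LAW_AND_ORDER", "score": 0.07646272331476212},
        {"label": "OPINION", "score": 0.16166265308856964},
        {"label": "SCIENCE", "score": 0.05075162276625633},
    ]
]


def select_topics(scores, threshold=0.1):
    """Return (label, score) pairs at or above the threshold, highest score first.

    The 0.1 default is an illustrative assumption, not a recommended value.
    """
    kept = [(s["label"], s["score"]) for s in scores if s["score"] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


selected = select_topics(topics[0])
print([label for label, _ in selected])
# ['COVID', 'INTERNATIONAL', 'OPINION', 'HEALTH']
```

Raising the threshold trades recall for precision: with `threshold=0.3`, only COVID would be kept for this text.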

Labels

All labels, except NON_PAPER, were retrieved from the source articles during the data collection step, without any preprocessing, on the assumption that journalists and newsrooms assign correct tags to their articles. We disregarded all articles with more than 6 tags to reduce bias and tag manipulation.
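The tag-count rule described above can be sketched as follows. The article structure (a dict with a `tags` list) and the function name `filter_articles` are hypothetical; the actual data collection code is not part of this card.

```python
MAX_TAGS = 6  # articles with more than 6 tags are disregarded, as described above


def filter_articles(articles, max_tags=MAX_TAGS):
    """Keep only articles whose tag count does not exceed max_tags."""
    return [a for a in articles if len(a["tags"]) <= max_tags]


# Hypothetical sample records, for illustration only.
articles = [
    {"id": 1, "tags": ["COVID", "HEALTH"]},
    {"id": 2, "tags": ["POLITICS", "ECONOMY", "SOCIETY", "REGIONAL", "SPORTS", "TECH", "FOOD"]},
    {"id": 3, "tags": ["SPORTS", "REGIONAL", "SOCIETY", "TRAVEL", "TOURISM", "FOOD"]},
]

kept = filter_articles(articles)
print([a["id"] for a in kept])
# [1, 3]
```

Article 2 carries seven tags and is dropped; article 3 carries exactly six and is kept.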

| label | roc_auc | samples |
| --- | --- | --- |
| AFFAIRS | 0.9872 | 6,314 |
| AGRICULTURE | 0.9799 | 1,254 |
| ARTS_AND_CULTURE | 0.9838 | 15,968 |
| BREAKING_NEWS | 0.9675 | 827 |
| BUSINESS | 0.9811 | 6,507 |
| COVID | 0.9620 | 50,000 |
| CRIME | 0.9885 | 34,421 |
| ECONOMY | 0.9765 | 45,474 |
| EDUCATION | 0.9865 | 10,111 |
| ELECTIONS | 0.9940 | 7,571 |
| ENTERTAINMENT | 0.9925 | 23,323 |
| ENVIRONMENT | 0.9847 | 23,060 |
| FOOD | 0.9934 | 3,712 |
| HEALTH | 0.9723 | 16,852 |
| INTERNATIONAL | 0.9624 | 50,000 |
| JUSTICE | 0.9862 | 4,860 |
| LAW_AND_ORDER | 0.9177 | 50,000 |
| MILITARY | 0.9838 | 6,536 |
| NON_PAPER | 0.9595 | 4,589 |
| OPINION | 0.9624 | 6,296 |
| POLITICS | 0.9773 | 50,000 |
| REFUGEE | 0.9949 | 4,536 |
| REGIONAL | 0.9520 | 50,000 |
| RELIGION | 0.9922 | 11,533 |
| SCIENCE | 0.9837 | 1,998 |
| SOCIAL_MEDIA | 0.991 | 6,212 |
| SOCIETY | 0.9439 | 50,000 |
| SPORTS | 0.9939 | 31,396 |
| TECH | 0.9923 | 8,225 |
| TOURISM | 0.9900 | 8,081 |
| TRANSPORT | 0.9879 | 3,211 |
| TRAVEL | 0.9832 | 4,638 |
| WEATHER | 0.9950 | 19,931 |
| loss | 0.0533 | - |
| roc_auc | 0.9855 | - |
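The roc_auc figures above are areas under the ROC curve, which can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below illustrates that definition and a micro-averaged variant that pools every (text, label) decision before computing a single value. It is a pure-Python illustration on made-up toy data, not the evaluation code used for this model; in practice a library routine such as scikit-learn's `roc_auc_score` with `average="micro"` would typically be used instead. The function names are assumptions.

```python
def binary_roc_auc(y_true, y_score):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    # Quadratic pairwise comparison: fine for a sketch, too slow for large datasets.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_roc_auc(y_true_rows, y_score_rows):
    """Micro-average: pool every (sample, label) decision, then compute one AUC."""
    flat_true = [t for row in y_true_rows for t in row]
    flat_score = [s for row in y_score_rows for s in row]
    return binary_roc_auc(flat_true, flat_score)


# Toy data: 3 texts, 3 labels (values are made up for illustration).
y_true = [[1, 0, 0], [0, 1, 1], [1, 0, 1]]
y_score = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.7, 0.6, 0.5]]

print(round(micro_roc_auc(y_true, y_score), 4))
# 0.9
```

In the toy data, 18 of the 20 positive/negative pairs are ranked correctly, which gives 0.9.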

Pretraining

The model was pretrained using an NVIDIA A10 GPU for 15 epochs (approximately 59K steps, 8 hours of training) with a batch size of 128. The optimizer used is Adam with a learning rate of 1e-5 and a weight decay of 0.01. We used roc_auc_micro to evaluate the results.
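As a rough consistency check on the figures above, the step count relates to the dataset size as steps = epochs × ceil(examples / batch size). The sketch below inverts that relation to estimate the number of training examples seen per epoch. It assumes one optimizer step per batch of 128 with no gradient accumulation, and takes "59K" as exactly 59,000; neither is stated in this card, so the result is only an order-of-magnitude estimate.

```python
import math

EPOCHS = 15            # from the text above
BATCH_SIZE = 128       # from the text above
TOTAL_STEPS = 59_000   # "approximately 59K steps"; the exact count is not given


def estimated_examples(total_steps, epochs, batch_size):
    """Approximate examples per epoch, assuming one step per full batch."""
    return round(total_steps / epochs * batch_size)


def total_steps_for(num_examples, epochs, batch_size):
    """Forward relation: steps = epochs * ceil(examples / batch_size)."""
    return epochs * math.ceil(num_examples / batch_size)


approx = estimated_examples(TOTAL_STEPS, EPOCHS, BATCH_SIZE)
print(approx)
print(total_steps_for(approx, EPOCHS, BATCH_SIZE))
```

Under these assumptions the estimate comes out at roughly half a million examples per epoch.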

Framework versions

  • Transformers 4.13.0
  • PyTorch 1.9.0+cu111
  • Datasets 1.16.1
  • Tokenizers 0.10.3

Authors

Dimitris Papaevagelou - @andefined

About Us

Civic Information Office is a non-profit organization based in Athens, Greece, focusing on creating technology and research products for the public interest.

Model size: 125M params (Safetensors; tensor types: I64, F32)