metadata
license: cc-by-sa-4.0
language:
  - da
metrics:
  - accuracy
pipeline_tag: text-classification
tags:
  - partypress
  - political science
  - parties
  - press releases
widget:
  - text: >-
      I Politiken i dag beskrives det, hvordan en dansk tolk har overværet
      tortur af en irakisk fange i britisk varetægt. Forsvarsministeren må med
      det samme be- eller afkræfte, om denne historie er sand, siger Frank Aaen,
      forsvarsordfører for Enhedslisten. - Det har ikke alene historisk
      interesse. Vi er jo i krig i Afghanistan, og vi skal være sikre på, at
      danske soldater reagerer, hvis de ser, at fanger udsættes for tortur. -
      Derfor vil vi også hurtigt have at vide, om denne hændelse er rapporteret
      op i systemet og har medført en klar melding om, at der er en pligt til at
      stoppe eller som minimum rapportere om eksempler på brug af tortur.
      Spørgsmål stillet af Frank Aaen til forsvarsministeren:Vil
      forsvarsministeren be- eller afkræfte, om at en dansk tolk har overværet
      tortur af en irakisk fange i britisk varetægt, jævnfør Politiken den 30/10
      2010?

PARTYPRESS monolingual Denmark

Fine-tuned model, based on Maltehb/danish-bert-botxo. Used in Erfort et al. (2023), building on the PARTYPRESS database. For the downstream task of classifying press releases from political parties into 23 unique policy areas, we achieve a performance comparable to that of expert human coders.

Model description

The PARTYPRESS monolingual model builds on Maltehb/danish-bert-botxo but adds a supervised component: it was fine-tuned on texts labeled by humans. The labels indicate 23 different political issue categories derived from the Comparative Agendas Project (CAP):

Code Issue
1 Macroeconomics
2 Civil Rights
3 Health
4 Agriculture
5 Labor
6 Education
7 Environment
8 Energy
9 Immigration
10 Transportation
12 Law and Crime
13 Social Welfare
14 Housing
15 Domestic Commerce
16 Defense
17 Technology
18 Foreign Trade
19.1 International Affairs
19.2 European Union
20 Government Operations
23 Culture
98 Non-thematic
99 Other
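For downstream analysis it can help to have the code-to-issue table above in machine-readable form. The following is a minimal sketch: the names `CAP_ISSUES` and `issue_name` are illustrative, and storing the codes as strings (so that 19.1 and 19.2 stay exact) is an assumption, not necessarily the format in which the model reports its labels.

```python
# Code-to-issue lookup copied from the table above.
# Keys are strings so that "19.1" and "19.2" are stored exactly.
CAP_ISSUES = {
    "1": "Macroeconomics",
    "2": "Civil Rights",
    "3": "Health",
    "4": "Agriculture",
    "5": "Labor",
    "6": "Education",
    "7": "Environment",
    "8": "Energy",
    "9": "Immigration",
    "10": "Transportation",
    "12": "Law and Crime",
    "13": "Social Welfare",
    "14": "Housing",
    "15": "Domestic Commerce",
    "16": "Defense",
    "17": "Technology",
    "18": "Foreign Trade",
    "19.1": "International Affairs",
    "19.2": "European Union",
    "20": "Government Operations",
    "23": "Culture",
    "98": "Non-thematic",
    "99": "Other",
}


def issue_name(code):
    """Return the issue name for a code given as an int (3) or a string ("19.1")."""
    return CAP_ISSUES[str(code)]
```

Note that the scheme has no codes 11, 21, or 22, so a lookup for those raises a `KeyError`.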

Model variations

There are several monolingual models for different countries, and a multilingual model. The multilingual model can be easily extended to other languages, country contexts, or time periods by fine-tuning it with minimal additional labeled texts.

Intended uses & limitations

The main use of the model is for text classification of press releases from political parties. It may also be useful for other political texts.

The classification can then be used to measure which issues parties are discussing in their communication.
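As an illustration of such a measurement, the sketch below computes the share of press releases per issue for each party. The function name `issue_shares`, the party names, and the predictions are all made up for the example; in practice the issue labels would come from the classifier.

```python
from collections import Counter, defaultdict


def issue_shares(records):
    """Compute, per party, the share of press releases assigned to each issue.

    `records` is an iterable of (party, issue) pairs.
    """
    counts = defaultdict(Counter)
    for party, issue in records:
        counts[party][issue] += 1

    shares = {}
    for party, issue_counts in counts.items():
        total = sum(issue_counts.values())
        shares[party] = {issue: n / total for issue, n in issue_counts.items()}
    return shares


# Made-up predictions for two hypothetical parties.
example = [
    ("Party A", "Defense"),
    ("Party A", "Defense"),
    ("Party A", "Health"),
    ("Party B", "Immigration"),
]
shares = issue_shares(example)
```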

How to use

This model can be used directly with a pipeline for text classification:

>>> from transformers import pipeline
>>> # Pad inputs and truncate them to at most 512 tokens
>>> tokenizer_kwargs = {"padding": True, "truncation": True, "max_length": 512}
>>> partypress = pipeline("text-classification", model="cornelius/partypress-monolingual-denmark", tokenizer="cornelius/partypress-monolingual-denmark", **tokenizer_kwargs)
>>> partypress("Your text here.")
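When called with a list of texts, a text-classification pipeline returns one dictionary per text with a `label` and a `score` key. The helper below is a minimal sketch for classifying several press releases at once and discarding low-confidence predictions. The name `top_labels` and the `min_score` threshold are illustrative assumptions, not part of the model or the transformers library.

```python
def top_labels(classifier, texts, min_score=0.0):
    """Classify `texts` and return one (label, score) tuple per text.

    `classifier` is any callable that maps a list of strings to a list of
    {"label": ..., "score": ...} dictionaries, such as the pipeline above.
    Predictions scoring below `min_score` are returned as None.
    """
    results = classifier(list(texts))
    output = []
    for result in results:
        if result["score"] >= min_score:
            output.append((result["label"], result["score"]))
        else:
            output.append(None)
    return output


# Usage with the pipeline created above (requires downloading the model):
# top_labels(partypress, ["First press release.", "Second press release."], min_score=0.5)
```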

Limitations and bias

The model was trained with data from parties in Denmark. For use in other countries, the model may be further fine-tuned. Without further fine-tuning, the performance of the model may be lower.

The model's predictions may be biased. We discuss some biases by country, party, and over time in the release paper for the PARTYPRESS database. For example, the performance is highest for press releases from Ireland (75%) and lowest for Poland (55%).

Training data

The PARTYPRESS monolingual model was fine-tuned with about 3,000 press releases from parties in Denmark. The press releases were labeled by two expert human coders.

For the training data of the underlying model, please refer to Maltehb/danish-bert-botxo.

Training procedure

Preprocessing

For the preprocessing, please refer to Maltehb/danish-bert-botxo.

Pretraining

For the pretraining, please refer to Maltehb/danish-bert-botxo.

Fine-tuning

We fine-tuned the model using about 3,000 labeled press releases from political parties in Denmark.

Training Hyperparameters

We trained for four epochs with a batch size of 12 for training and 2 for testing. All other hyperparameters were left at the defaults of the transformers library.
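To give a rough sense of scale, the number of optimizer steps implied by these settings can be estimated as follows. This is a back-of-the-envelope sketch: the figure of 3,000 training texts is the approximate number quoted above, and the exact count per training run (for example within a cross-validation fold) is not stated here.

```python
import math

n_texts = 3000      # approximate number of labeled press releases (assumed exact here)
batch_size = 12     # training batch size
epochs = 4          # number of training epochs

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(n_texts / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 250 1000
```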

Framework versions

  • Transformers 4.28.0
  • TensorFlow 2.12.0
  • Datasets 2.12.0
  • Tokenizers 0.13.3

Evaluation results

Fine-tuned on our downstream task and evaluated with five-fold cross-validation, this model achieves results comparable to the performance of our expert human coders. For the detailed results, please refer to Erfort et al. (2023).
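In five-fold cross-validation, the labeled data is split into five parts, and each part serves once as the test set while the other four are used for training. The sketch below shows one generic way to produce such splits in plain Python. It is not the authors' evaluation code, and the function name `k_fold_indices` and the fixed shuffling seed are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Example: five splits over ten samples.
splits = list(k_fold_indices(10, k=5))
```

Each sample appears in exactly one test fold, so averaging a metric over the five folds uses every labeled text for evaluation once.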

BibTeX entry and citation info

@article{erfort_partypress_2023,
  author    = {Cornelius Erfort and
               Lukas F. Stoetzer and
               Heike Klüver},
  title     = {The PARTYPRESS Database: A New Comparative Database of Parties’ Press Releases},
  journal   = {Research and Politics},
  volume    = {forthcoming},
  year      = {2023},
}

Further resources

GitHub: cornelius-erfort/partypress

Research and Politics Dataverse: Replication Data for: The PARTYPRESS Database: A New Comparative Database of Parties’ Press Releases

Acknowledgements

Research for this contribution is part of the Cluster of Excellence "Contestations of the Liberal Script" (EXC 2055, Project-ID: 390715649), funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy. Cornelius Erfort is moreover grateful for generous funding provided by the DFG through the Research Training Group DYNAMICS (GRK 2458/1).

Contact

Cornelius Erfort

Humboldt-Universität zu Berlin

corneliuserfort.de