language:
  - ur
license: mit
tags:
  - generated_from_trainer
datasets:
  - imdb_urdu_reviews
widget:
  - text: >-
      میں نے یہ فلم دیکھنے کے لئے بہت احتیاط کی تھی، لیکن اس کی کہانی اور
      اداکاری نے میری توقعات کو پورا کیا۔ بالکل شاندار فلم!
    example_title: Positive Example 1
  - text: >-
      اس فلم کی کہانی بہت بے معنی اور بے چارہ ہے۔ میں نے اپنا وقت اور پیسہ برباد
      کر دیا۔ براہ کرم اس سے بچیں!
    example_title: Negative Example 1
  - text: >-
      یہ ناقابل فہم فلم ہے۔ کوئی بھی اسے دیکھ کر توڑ دل ہو جائے گا۔ بلکل بے
      فائدہ!
    example_title: Negative Example 2
  - text: >-
      میں نے ہمیشہ کی طرح اس فلم کو بھی بہت مزہ دیا۔ اداکاری، کہانی، اور
      ڈائریکشن سب بہترین تھی۔ دل کھول کر تصویر دیکھنے کا موقع!
    example_title: Positive Example 2
  - text: >-
      اس فلم میں اتنی بے وقوفی دکھائی گئی ہے کہ آپ بھی اپنے دماغ کو چیک کریں گے۔
      بلکل بکواس!
    example_title: Negative Example 3
base_model: urduhack/roberta-urdu-small
model-index:
  - name: UrduClassification
    results: []

UrduClassification

This model is a fine-tuned version of urduhack/roberta-urdu-small on the imdb_urdu_reviews dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4703
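
For intuition about what this number means, here is a small sketch. It assumes the reported loss is the mean cross-entropy in nats over the evaluation examples, which is the usual loss for single-label sequence classification; under that assumption, `exp(-loss)` is the geometric-mean probability the model assigns to the correct label. The function names are illustrative and not part of any library.

```python
import math


def mean_correct_label_probability(cross_entropy_loss: float) -> float:
    """Geometric-mean probability of the true label implied by a mean
    cross-entropy loss measured in nats."""
    return math.exp(-cross_entropy_loss)


def cross_entropy(logits: list[float], label: int) -> float:
    """Cross-entropy (in nats) of one example, computed from raw class logits."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


# Loss reported on this card.
print(round(mean_correct_label_probability(0.4703), 3))
# Two equal logits are a coin flip, so the loss is ln(2).
print(round(cross_entropy([0.0, 0.0], label=0), 4))
```

For comparison, a two-class model that always outputs equal probabilities has a loss of ln(2), roughly 0.693, so a lower value indicates the model is doing better than chance on average.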

Model Details

  • Model Name: Urdu Sentiment Classification
  • Model Architecture: RobertaForSequenceClassification
  • Base Model: urduhack/roberta-urdu-small
  • Dataset: IMDB Urdu Reviews
  • Task: Sentiment Classification (Positive/Negative)
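
Given these details, inference could look roughly like the sketch below. Two things are assumptions rather than facts from this card: the hub repository id `mwz/UrduClassification`, and the score threshold in the helper, which is purely illustrative. The card also does not document which label name corresponds to positive or negative, so it is worth checking `model.config.id2label` before interpreting the output.

```python
from typing import Dict, List

# Assumption: hub repository id; this card does not state it explicitly.
MODEL_ID = "mwz/UrduClassification"


def classify(texts: List[str], model_id: str = MODEL_ID) -> List[Dict]:
    """Return one {"label", "score"} dict per input text."""
    # Imported here so the helper below stays usable without transformers.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # Truncate to the 256-token limit used during fine-tuning.
    return classifier(texts, truncation=True, max_length=256)


def confident_predictions(predictions: List[Dict], threshold: float = 0.8) -> List[Dict]:
    """Keep only predictions whose score reaches the (illustrative) threshold."""
    return [p for p in predictions if p["score"] >= threshold]
```

Calling `classify(["یہ فلم بہت اچھی تھی۔"])` downloads the checkpoint on first use and returns a list with one prediction per input text.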

Training Procedure

The model was fine-tuned using the transformers library and the Trainer class from Hugging Face. The training process involved the following steps:

  1. Tokenization: The input Urdu text was tokenized using the RobertaTokenizerFast from the "urduhack/roberta-urdu-small" pre-trained model. The texts were padded and truncated to a maximum length of 256 tokens.

  2. Model Architecture: The "urduhack/roberta-urdu-small" pre-trained model was loaded as the base model for sequence classification using the RobertaForSequenceClassification class.

  3. Training Arguments: The run was configured for 3 epochs with a batch size of 16 for both training and evaluation, a learning rate of 5e-05 on a linear schedule with 500 warmup steps, and a seed of 42, together with the evaluation and logging strategies.

  4. Training: The model was trained on the training dataset using the Trainer class. The Adam optimizer minimized the cross-entropy loss between predicted and actual sentiment labels.

  5. Evaluation: After each epoch, the model was evaluated on the validation dataset to monitor its performance. The evaluation results, including training loss and validation loss, were logged for analysis.

  6. Fine-Tuning: Training updated the model parameters to optimize performance on the IMDb Urdu movie review sentiment analysis task.
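
The steps above can be sketched with the `transformers` API as follows. This is a reconstruction under stated assumptions, not the original training script: the `text` column name and the output directory are hypothetical, `logging_strategy="epoch"` is a guess, and the datasets passed to `build_trainer` are expected to be tokenized already and to carry a `labels` column.

```python
BASE_MODEL = "urduhack/roberta-urdu-small"
MAX_LENGTH = 256


def tokenize_batch(batch, tokenizer, text_column="text"):
    """Pad and truncate every text to MAX_LENGTH tokens (step 1).

    `text_column` is a hypothetical column name; adjust it to the dataset.
    """
    return tokenizer(
        batch[text_column],
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
    )


def build_trainer(train_dataset, eval_dataset, output_dir="urdu-classification"):
    """Assemble a Trainer for tokenized, labelled datasets (steps 2 to 5)."""
    # Imported lazily because these pull in PyTorch.
    from transformers import (
        RobertaForSequenceClassification,
        RobertaTokenizerFast,
        Trainer,
        TrainingArguments,
    )

    tokenizer = RobertaTokenizerFast.from_pretrained(BASE_MODEL)
    model = RobertaForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)
    args = TrainingArguments(
        output_dir=output_dir,            # hypothetical path
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        learning_rate=5e-5,
        lr_scheduler_type="linear",
        warmup_steps=500,
        seed=42,
        evaluation_strategy="epoch",      # named `eval_strategy` in newer releases
        logging_strategy="epoch",         # assumption; not stated in this card
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
    )
```

With a `datasets.Dataset`, `tokenize_batch` would typically be applied through `dataset.map(..., batched=True)` before calling `build_trainer(...).train()`.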

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 3
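
To make the schedule concrete, the sketch below computes the learning rate implied by a linear scheduler with 500 warmup steps and a peak of 5e-05. The total of 7500 optimizer steps is taken from the training results (3 epochs of 2500 steps); the function name is illustrative, and the formula is the standard linear warmup followed by linear decay to zero, written out in plain Python rather than taken from the training code.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=7500):
    """Learning rate at optimizer step `step` (0-indexed) under linear warmup
    followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 4000, 7500):
    print(step, linear_schedule_lr(step))
```

The rate climbs from zero to its peak over the first 500 steps, is back at half the peak by step 4000, and reaches zero at the final step.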

Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.4078        | 1.0   | 2500 | 0.3954          |
| 0.2633        | 2.0   | 5000 | 0.4007          |
| 0.1205        | 3.0   | 7500 | 0.4703          |
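
A short sketch of how these rows can be read: it picks the epoch with the lowest validation loss, shows the widening gap between validation and training loss, and estimates the number of training examples per epoch. The estimate assumes a single device and no gradient accumulation, which this card does not confirm; the helper names are illustrative.

```python
# (epoch, training loss, validation loss) rows from the training results.
RESULTS = [
    (1, 0.4078, 0.3954),
    (2, 0.2633, 0.4007),
    (3, 0.1205, 0.4703),
]
STEPS_PER_EPOCH = 2500
TRAIN_BATCH_SIZE = 16


def best_epoch(results):
    """Epoch with the lowest validation loss."""
    return min(results, key=lambda row: row[2])[0]


def generalization_gaps(results):
    """Validation loss minus training loss, per epoch."""
    return {epoch: round(val - train, 4) for epoch, train, val in results}


# Assumes one device and no gradient accumulation, so that each optimizer
# step consumes exactly one batch of 16 examples.
examples_per_epoch = STEPS_PER_EPOCH * TRAIN_BATCH_SIZE

print(best_epoch(RESULTS))
print(generalization_gaps(RESULTS))
print(examples_per_epoch)
```

Training loss keeps falling while validation loss rises after the first epoch, a pattern that usually indicates overfitting; selecting the checkpoint by validation loss would favor epoch 1.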

Evaluation Results

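The model was evaluated on the sentiment classification task; the evaluation dataset is not disclosed. The results reported after 3 epochs of fine-tuning are as follows:

  • Evaluation Loss: 0.3954 (this equals the validation loss after epoch 1 in the training results; the validation loss after epoch 3 was 0.4703)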
  • Evaluation Runtime: 51.60 seconds
  • Average Samples per Second: 96.89
  • Average Steps per Second: 6.06
  • Epoch: 3.0
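
The throughput figures above can be cross-checked with a little arithmetic. The sketch multiplies runtime by throughput to estimate the size of the evaluation set and the number of evaluation steps, then compares the latter with the batch count implied by a batch size of 16. It assumes a single device; the variable names are illustrative.

```python
import math

EVAL_RUNTIME_S = 51.60
SAMPLES_PER_SECOND = 96.89
STEPS_PER_SECOND = 6.06
EVAL_BATCH_SIZE = 16

# Throughput multiplied by runtime recovers the totals.
approx_samples = round(EVAL_RUNTIME_S * SAMPLES_PER_SECOND)
approx_steps = round(EVAL_RUNTIME_S * STEPS_PER_SECOND)

# Cross-check: batches needed for that many samples, assuming a single device.
expected_steps = math.ceil(approx_samples / EVAL_BATCH_SIZE)

print(approx_samples, approx_steps, expected_steps)
```

Both routes give the same step count, which suggests an evaluation set of about 5,000 examples processed in batches of 16.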

Framework versions

  • Transformers 4.30.2
  • PyTorch 2.0.0
  • Datasets 2.1.0
  • Tokenizers 0.13.3