bert-base-multilingual-cased-finetuned-yiddish-experiment-3

This model is a fine-tuned version of bert-base-multilingual-cased. The auto-generated card records the dataset as "None"; the training file is named under Training and evaluation data. It achieves the following results on the evaluation set:

  • Loss: 1.4254

Model description

More information needed

Intended uses & limitations

Intended for use with a chatbot to correct raw Yiddish machine transcriptions generated by Transkribus.

Training and evaluation data

Training dataset = Gavin model fine tuning_lines.csv

Training procedure

Experiment 3 fine-tunes the pre-trained mBERT (multilingual BERT) model to improve raw handwritten text recognition (HTR) output. The fine-tuning dataset consists of raw HTR outputs paired with their human-corrected ground truth, as provided in the line.csv file.

Key Parameters and Rationale:

1. Model Selection: The use of bert-base-multilingual-cased leverages the multilingual capabilities of BERT to accommodate the linguistic diversity likely present in the handwritten text dataset. This choice aligns well with the need to handle potentially mixed-language inputs or varying character distributions.

2. Data Handling:

  • The dataset is loaded and structured into columns for raw HTR text and its hand-corrected counterpart.
  • Tokenization is performed using the mBERT tokenizer, with a maximum sequence length of 64 tokens. This length balances capturing sufficient context against memory use.
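
The data-handling step above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the column names `raw_htr` and `corrected` are hypothetical (the card does not document the CSV headers), and the sample rows are placeholders rather than real transcriptions.

```python
import csv
import io

# Hypothetical column names; the real headers of the training CSV are not documented.
RAW_COL = "raw_htr"
CORRECTED_COL = "corrected"


def load_pairs(csv_file):
    """Return (raw HTR line, hand-corrected line) tuples, skipping rows with an empty field."""
    pairs = []
    for row in csv.DictReader(csv_file):
        raw = (row.get(RAW_COL) or "").strip()
        corrected = (row.get(CORRECTED_COL) or "").strip()
        if raw and corrected:
            pairs.append((raw, corrected))
    return pairs


# Placeholder rows stand in for real transcription lines.
sample = io.StringIO(
    "raw_htr,corrected\n"
    "raw line one,corrected line one\n"
    "raw line two,\n"
)
pairs = load_pairs(sample)
print(len(pairs))
```

Each pair would then go through the mBERT tokenizer with truncation at 64 tokens (in the Hugging Face tokenizer API, `truncation=True` and `max_length=64`).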

3. Training Configuration:

  • Batch Size and Gradient Accumulation: A batch size of 4 with a single gradient accumulation step (an effective batch size of 4) is chosen, likely due to the memory limitations of the L4 GPU, so that each update processes a small chunk of data that fits in memory.

  • Learning Rate and Weight Decay: A low learning rate of 5e-6 allows for gradual updates to the pre-trained weights, preserving the pre-trained linguistic knowledge while adapting to the new task. Weight decay is set to 0 to avoid penalizing model parameters unnecessarily for this specific task.

  • Gradient Clipping: A maximum gradient norm of 1 guards against exploding gradients, which could destabilize training given the small batch size and the sensitivity of fine-tuning to the learning rate.

  • Warm-Up Steps: 300 warm-up steps allow the optimizer to start with smaller updates, reducing initial instability.

  • Epochs and Logging: The model is trained for 10 epochs with evaluation loss logged every 100 steps, providing a balance between sufficient training time and monitoring.

Compute Setup:

The process was executed on an L4 GPU, which is optimized for such NLP workloads, providing efficient computation and faster training iterations.
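
To make the warm-up and linear scheduling described in the training configuration concrete, here is a small sketch of the learning rate over the run. It assumes the usual linear warm-up followed by linear decay to zero, and a total of 4230 optimizer steps; that total is not stated in the card and is inferred from the step and epoch columns of the training results (about 423 steps per epoch over 10 epochs).

```python
BASE_LR = 5e-6        # learning_rate from the model card
WARMUP_STEPS = 300    # lr_scheduler_warmup_steps from the model card
TOTAL_STEPS = 4230    # assumption: 423 steps per epoch x 10 epochs, inferred from the results log


def lr_at(step):
    """Learning rate at an optimizer step under linear warm-up, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 150, 300, 2265, 4230):
    print(step, lr_at(step))
```

Under these assumptions the rate reaches its peak of 5e-6 at step 300 and has fallen to half of that by step 2265, midway through the decay phase.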

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 300
  • num_epochs: 10
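
For readers who want to reproduce the setup, the settings listed above and in the training procedure could be gathered as keyword arguments along these lines. This is a sketch, not the original training script: the key names follow the fields of the Hugging Face `TrainingArguments` class, the per-device batch sizes assume a single GPU, and `eval_strategy`, `eval_steps`, and `logging_steps` are inferred from the statement that loss is logged every 100 steps.

```python
# Keyword arguments mirroring the hyperparameters reported in this card.
# Assumption: one GPU, so per-device batch size equals the reported batch size.
training_kwargs = {
    "learning_rate": 5e-6,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.0,
    "max_grad_norm": 1.0,
    "warmup_steps": 300,
    "lr_scheduler_type": "linear",
    "optim": "adamw_torch",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "num_train_epochs": 10,
    "eval_strategy": "steps",   # inferred: evaluation every 100 steps
    "eval_steps": 100,
    "logging_steps": 100,
    "seed": 42,
}

# Effective batch size = per-device batch size x accumulation steps (single device assumed).
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)
```

With the transformers library installed, the dictionary could be passed as `TrainingArguments(output_dir="out", **training_kwargs)`, where `"out"` is a placeholder path.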

Training results

Training Loss | Epoch  | Step | Validation Loss
11.143        | 0.2364 | 100  | 7.6591
4.1737        | 0.4728 | 200  | 2.2642
2.0579        | 0.7092 | 300  | 1.7710
1.6963        | 0.9456 | 400  | 1.6712
1.5705        | 1.1820 | 500  | 1.6379
1.5353        | 1.4184 | 600  | 1.6003
1.5213        | 1.6548 | 700  | 1.5273
1.4387        | 1.8913 | 800  | 1.5415
1.3973        | 2.1277 | 900  | 1.5530
1.4266        | 2.3641 | 1000 | 1.5328
1.3365        | 2.6005 | 1100 | 1.5154
1.4423        | 2.8369 | 1200 | 1.4662
1.3948        | 3.0733 | 1300 | 1.5041
1.3244        | 3.3097 | 1400 | 1.4530
1.3645        | 3.5461 | 1500 | 1.4656
1.329         | 3.7825 | 1600 | 1.4542
1.3326        | 4.0189 | 1700 | 1.5293
1.2768        | 4.2553 | 1800 | 1.4575
1.3125        | 4.4917 | 1900 | 1.4638
1.2925        | 4.7281 | 2000 | 1.4867
1.281         | 4.9645 | 2100 | 1.4827
1.2966        | 5.2009 | 2200 | 1.4359
1.28          | 5.4374 | 2300 | 1.4761
1.2436        | 5.6738 | 2400 | 1.5006
1.2787        | 5.9102 | 2500 | 1.4511
1.2344        | 6.1466 | 2600 | 1.4430
1.199         | 6.3830 | 2700 | 1.4254
1.2899        | 6.6194 | 2800 | 1.4339
1.2637        | 6.8558 | 2900 | 1.4609
1.2186        | 7.0922 | 3000 | 1.4300
1.181         | 7.3286 | 3100 | 1.4407
1.2815        | 7.5650 | 3200 | 1.4471
1.2161        | 7.8014 | 3300 | 1.4413
1.1562        | 8.0378 | 3400 | 1.4695
1.1668        | 8.2742 | 3500 | 1.4940
1.2557        | 8.5106 | 3600 | 1.4430
1.1985        | 8.7470 | 3700 | 1.4562
1.2051        | 8.9835 | 3800 | 1.4412
1.1588        | 9.2199 | 3900 | 1.4421
1.2002        | 9.4563 | 4000 | 1.4477
1.2339        | 9.6927 | 4100 | 1.4573
1.1918        | 9.9291 | 4200 | 1.4463
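
As a reading aid for the results above, the short sketch below picks the evaluation point with the lowest validation loss and estimates the number of optimizer steps per epoch. The rows are a subset copied from the table; the interpretation that the reported loss of 1.4254 comes from the step-2700 evaluation is an inference from the matching value, not something the card states.

```python
# (step, epoch, validation_loss) rows copied from the results table (subset).
rows = [
    (100, 0.2364, 7.6591),
    (700, 1.6548, 1.5273),
    (1400, 3.3097, 1.4530),
    (2200, 5.2009, 1.4359),
    (2700, 6.3830, 1.4254),
    (3000, 7.0922, 1.4300),
    (4200, 9.9291, 1.4463),
]

# Evaluation point with the lowest validation loss.
best_step, best_epoch, best_loss = min(rows, key=lambda r: r[2])
print(best_step, best_loss)

# Steps per epoch, estimated from the first logged row (100 steps at epoch 0.2364).
steps_per_epoch = round(rows[0][0] / rows[0][1])
print(steps_per_epoch)
```

The estimate of 423 steps per epoch at a batch size of 4 implies roughly 1,690 training lines. Validation loss stays between about 1.43 and 1.50 from epoch 3 onward while training loss keeps falling, which suggests the later epochs add little on the evaluation set.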

Framework versions

  • Transformers 4.47.0
  • Pytorch 2.5.1+cu121
  • Datasets 3.1.0
  • Tokenizers 0.21.0

Model size

  • 178M params (Safetensors, tensor type F32)