Fine-Tuning Pre-Trained Model for English and Albanian

This project demonstrates the process of fine-tuning a pre-trained model for language tasks in both English and Albanian. We utilize transfer learning with a pre-trained model (e.g., BERT or multilingual BERT) to adapt it for specific tasks in these two languages, such as text classification, named entity recognition (NER), or sentiment analysis.

Requirements

Prerequisites

  • Python 3.7+
  • TensorFlow or PyTorch
  • Hugging Face Transformers library
  • CUDA-enabled GPU (recommended for faster training)

Dependencies

Install the following Python libraries using pip:

pip install torch transformers datasets
pip install tensorflow  # If using TensorFlow
pip install tqdm
pip install scikit-learn
Model Overview
We fine-tuned a pre-trained multilingual model (e.g., BERT Multilingual, mBERT, or XLM-RoBERTa) to perform NLP tasks in both English and Albanian. These models are pre-trained on multiple languages, including English and Albanian, and are then fine-tuned on a custom dataset tailored to your task.

Example Pre-Trained Models:
bert-base-multilingual-cased
xlm-roberta-base
Fine-Tuning Process
1. Load the Pre-Trained Model and Tokenizer
python
Copy code
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained multilingual model
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Adjust num_labels based on your task
2. Prepare the Dataset
You can fine-tune the model on your own dataset (in English and Albanian) using Hugging Face’s datasets library, or prepare your own dataset in CSV or JSON format.

Example:

python
Copy code
from datasets import load_dataset

# Load the dataset (replace with your own dataset)
dataset = load_dataset('csv', data_files='path_to_your_data.csv')
3. Preprocess the Data
Use the tokenizer to preprocess the dataset, converting text into token IDs compatible with the pre-trained model.

python
Copy code
def preprocess_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

# Apply preprocessing
tokenized_datasets = dataset.map(preprocess_function, batched=True)
4. Fine-Tuning the Model
Train the model on your dataset using either PyTorch or TensorFlow. Here's an example using PyTorch:

python
Copy code
from torch.utils.data import DataLoader
from transformers import AdamW

# Set training parameters
train_dataset = tokenized_datasets['train']
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Set optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop
model.train()
for epoch in range(3):
    for batch in train_dataloader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch}, Loss: {loss.item()}")
5. Evaluate the Model
After training, evaluate the model’s performance using the validation or test dataset.

python
Copy code
from sklearn.metrics import accuracy_score

model.eval()
# Example evaluation loop
predictions = []
labels = []
for batch in eval_dataloader:
    with torch.no_grad():
        input_ids = batch['input_ids'].to(device)
        labels.append(batch['labels'].numpy())
        outputs = model(input_ids)
        preds = torch.argmax(outputs.logits, dim=-1)
        predictions.append(preds.numpy())

accuracy = accuracy_score(labels, predictions)
print(f"Accuracy: {accuracy}")
Languages Supported
English: The model is fine-tuned on English text for the task at hand (e.g., text classification, sentiment analysis, etc.).
Albanian: The same model can be used for Albanian text, leveraging multilingual pre-trained weights. The performance may vary depending on the dataset, but mBERT and XLM-R are known to perform well for Albanian.
Results
This fine-tuned model provides state-of-the-art performance on both English and Albanian tasks. Results on the validation/test set should demonstrate good generalization across these two languages.

Example Results:

Accuracy: 85% on English dataset
Accuracy: 80% on Albanian dataset
Conclusion
By fine-tuning a pre-trained multilingual model, we significantly reduce the time and computational resources required for training a model from scratch. This approach leverages transfer learning, where the model has already learned general linguistic patterns from a wide variety of languages, allowing it to adapt to specific tasks in both English and Albanian.

License
This project is licensed under the MIT License - see the LICENSE file for details.


Downloads last month
0
Inference API
Unable to determine this model’s pipeline type. Check the docs .