XLM-RoBERTa Token Classification for Named Entity Recognition (NER)

Model Description

This model is a fine-tuned version of XLM-RoBERTa (xlm-roberta-base) for Named Entity Recognition (NER). It was trained on the German portion of the PAN-X subset of the XTREME dataset. The model identifies the following entity types:

PER: Person names

ORG: Organization names

LOC: Location names
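As a rough sketch of how these three types become a label set: PAN-X tags tokens in the IOB2 scheme, so each type gets a `B-` (begin) and an `I-` (inside) tag, plus `O` for tokens outside any entity. The tag order and the names `build_tag_set`, `index2tag`, and `tag2index` are assumptions for illustration; the authoritative mapping is the `id2label` entry stored in the model's config.

```python
ENTITY_TYPES = ["PER", "ORG", "LOC"]


def build_tag_set(entity_types):
    """Return the IOB2 tag list: "O" plus a B-/I- pair per entity type."""
    tags = ["O"]
    for entity_type in entity_types:
        tags += [f"B-{entity_type}", f"I-{entity_type}"]
    return tags


TAGS = build_tag_set(ENTITY_TYPES)

# Integer id <-> tag string lookups (order is illustrative, see note above)
index2tag = dict(enumerate(TAGS))
tag2index = {tag: idx for idx, tag in index2tag.items()}

print(TAGS)
# → ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```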

Uses

This model is suitable for multilingual NER tasks, especially in scenarios where extracting and classifying person, organization, and location names in text across different languages is required.

Applications:

Information extraction

Multilingual NER tasks

Automated text analysis for businesses

Training Details

Base Model: xlm-roberta-base

Training Dataset: The model is trained on the PAN-X subset of the XTREME dataset, which includes labeled NER data for multiple languages.

Training Framework: Hugging Face transformers library with PyTorch backend.

Data Preprocessing: Tokenization was performed using the XLM-RoBERTa tokenizer, with NER labels aligned to the resulting subword tokens.

Training Procedure

Here's a brief overview of the training procedure for the XLM-RoBERTa model for NER:

Setup Environment:

Clone the repository and set up dependencies.

Import necessary libraries and modules.

Load Data:

Load the PAN-X subset from the XTREME dataset.

Shuffle and sample data subsets for training and evaluation.
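The loading step might look roughly like the following. This is a sketch under assumptions: the dataset identifier `"xtreme"` with config `"PAN-X.de"` follows the naming used on the Hugging Face Hub, and the helper names, `fraction`, and `seed` are hypothetical, not taken from the original training code.

```python
def subset_size(num_rows, fraction):
    """Number of rows to keep for a given fraction; always at least one."""
    return max(1, int(num_rows * fraction))


def load_panx_subset(lang="de", fraction=1.0, seed=0):
    """Load one PAN-X language, then shuffle and downsample every split."""
    # Imported lazily so `subset_size` works without `datasets` installed.
    from datasets import load_dataset

    panx = load_dataset("xtreme", name=f"PAN-X.{lang}")
    for split in panx:
        n = subset_size(panx[split].num_rows, fraction)
        # Shuffle first so the sample is not biased by the original order
        panx[split] = panx[split].shuffle(seed=seed).select(range(n))
    return panx
```

Calling `load_panx_subset("de", fraction=0.1)` would keep about 10% of each split, which can be handy for quick experiments.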

Data Preparation:

Convert raw dataset into a format suitable for token classification.

Define a mapping for entity tags and apply tokenization.

Align NER tags with tokenized inputs.
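A minimal sketch of the alignment step, assuming a fast tokenizer whose `word_ids()` output maps each subword to its source word (`None` for special tokens). It labels only the first subword of each word and masks the rest with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The function name `align_labels` is hypothetical, and the training code may have used a different strategy.

```python
IGNORE_INDEX = -100  # skipped by PyTorch's cross-entropy loss by default


def align_labels(word_ids, word_labels):
    """Spread word-level label ids over subword tokens.

    Only the first subword of each word keeps the word's label; special
    tokens and continuation subwords are masked with IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:            # special token such as <s> or </s>
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first subword of a new word
            aligned.append(word_labels[word_id])
        else:                          # continuation subword
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Made-up example: three words split into 2, 1 and 3 subwords
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [5, 6, 0]
print(align_labels(word_ids, word_labels))
# → [-100, 5, -100, 6, 0, -100, -100, -100]
```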

Define Model:

Initialize the XLM-RoBERTa model for token classification.

Configure the model with the number of labels based on the dataset.
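One way to express this step is sketched below. `label_config` and `build_model` are hypothetical helper names; `AutoModelForTokenClassification.from_pretrained` accepts `num_labels`, `id2label`, and `label2id` as configuration overrides, which is what the sketch relies on.

```python
def label_config(tags):
    """Derive the label-related config entries from a list of tag strings."""
    id2label = dict(enumerate(tags))
    label2id = {tag: idx for idx, tag in id2label.items()}
    return {"num_labels": len(tags), "id2label": id2label, "label2id": label2id}


def build_model(tags, checkpoint="xlm-roberta-base"):
    """Load the pretrained encoder with a fresh token-classification head."""
    # Imported lazily so `label_config` works without `transformers` installed.
    from transformers import AutoModelForTokenClassification

    return AutoModelForTokenClassification.from_pretrained(
        checkpoint, **label_config(tags)
    )
```

Storing the tag names in the config means the saved model can report readable labels such as `B-LOC` at inference time instead of raw ids.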

Setup Training Arguments:

Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.

Configure logging and checkpointing.
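The sketch below shows how such arguments could be assembled. Every value in `DEFAULT_HPARAMS` is a placeholder chosen for illustration, not a hyperparameter reported for this model, and the helper names are hypothetical. Recent `transformers` releases name the evaluation setting `eval_strategy`; older ones call it `evaluation_strategy`.

```python
# Placeholder values for illustration only, not the settings used for this model
DEFAULT_HPARAMS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 24,
    "per_device_eval_batch_size": 24,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",   # `evaluation_strategy` in older releases
    "save_strategy": "epoch",   # write a checkpoint after every epoch
    "logging_steps": 100,       # log the training loss every 100 steps
}


def training_kwargs(output_dir, **overrides):
    """Merge the defaults with caller overrides into one keyword dict."""
    return {"output_dir": output_dir, **DEFAULT_HPARAMS, **overrides}


def make_training_args(output_dir, **overrides):
    """Build a TrainingArguments object from the merged settings."""
    # Imported lazily so `training_kwargs` works without `transformers`.
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs(output_dir, **overrides))
```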

Initialize Trainer:

Create a Trainer instance with the model, training arguments, datasets, and data collator.

Specify evaluation metrics to monitor performance.
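A possible shape for this step, assuming the model, arguments, tokenizer, and tokenized datasets already exist. `build_trainer` is a hypothetical wrapper; `DataCollatorForTokenClassification` pads both the inputs and the label sequences of each batch.

```python
def build_trainer(model, args, tokenizer, train_dataset, eval_dataset,
                  compute_metrics=None):
    """Wire model, arguments, data, and collator into a Trainer."""
    # Imported lazily so this module can be loaded without `transformers`.
    from transformers import DataCollatorForTokenClassification, Trainer

    # Pads input ids and labels to the longest sequence in each batch
    data_collator = DataCollatorForTokenClassification(tokenizer)

    return Trainer(
        model=model,
        args=args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
```

With the returned object, `trainer.train()` runs fine-tuning and `trainer.evaluate()` reports the metrics computed by `compute_metrics` on the evaluation set.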

Train the Model:

Start the training process using the Trainer.

Monitor training progress and metrics.

Evaluation and Results:

Evaluate the model on the validation set.

Compute metrics like F1 score for performance assessment.
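For NER, F1 is usually computed at the entity level: a predicted span counts only if its boundaries and type both match a gold span exactly. The sketch below assumes spans are given as `(sentence_index, start, end, type)` tuples; the function name and the example spans are made up for illustration, and the card does not state which library computed the reported metric.

```python
def span_f1(gold_spans, pred_spans):
    """Exact-match entity-level F1 over two collections of span tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: two of three gold spans found, plus one spurious prediction
gold = {(0, 4, 6, "LOC"), (0, 8, 9, "LOC"), (0, 9, 11, "LOC")}
pred = {(0, 4, 6, "LOC"), (0, 9, 11, "LOC"), (0, 0, 1, "PER")}
print(round(span_f1(gold, pred), 3))
# → 0.667
```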

Save and Push Model:

Save the fine-tuned model locally or push to a model hub for sharing and further use.

Evaluation Metrics

The model's performance is evaluated using the F1 score for NER. The predictions are aligned with gold-standard labels, ignoring sub-token predictions where appropriate.
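The alignment described here can be sketched as follows, assuming masked positions carry the label id `-100` and that logits and labels are available as nested Python lists (call `.tolist()` on arrays or tensors first). The function name `align_predictions` and the toy values are hypothetical.

```python
IGNORE_INDEX = -100  # label id marking special tokens and continuation subwords


def align_predictions(logits, label_ids, index2tag):
    """Convert per-token logits and label ids into tag sequences.

    Positions whose gold label is IGNORE_INDEX are dropped from both the
    predicted and the gold sequence, so only scored tokens remain.
    """
    pred_tags, gold_tags = [], []
    for seq_logits, seq_labels in zip(logits, label_ids):
        seq_pred, seq_gold = [], []
        for token_logits, label in zip(seq_logits, seq_labels):
            if label == IGNORE_INDEX:
                continue
            # argmax over the class dimension
            best = max(range(len(token_logits)), key=token_logits.__getitem__)
            seq_pred.append(index2tag[best])
            seq_gold.append(index2tag[label])
        pred_tags.append(seq_pred)
        gold_tags.append(seq_gold)
    return pred_tags, gold_tags


# Toy example: one sequence of four tokens, the first one masked
toy_index2tag = {0: "O", 1: "B-LOC", 2: "I-LOC"}
toy_logits = [[[0.1, 0.2, 0.0], [2.0, 0.1, 0.3], [0.0, 3.0, 0.1], [0.2, 0.1, 1.5]]]
toy_labels = [[-100, 0, 1, 2]]
print(align_predictions(toy_logits, toy_labels, toy_index2tag))
# → ([['O', 'B-LOC', 'I-LOC']], [['O', 'B-LOC', 'I-LOC']])
```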

Evaluation

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import pandas as pd

model_checkpoint = "MassMin/Multilingual-NER-tagging"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint).to(device)

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, framework="pt", device=0 if torch.cuda.is_available() else -1)

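def tag_text_with_pipeline(text, ner_pipeline):
    """Run the NER pipeline on `text` and return one DataFrame row per predicted token."""
    results = ner_pipeline(text)

    # With no predictions, the column selection below would raise a KeyError,
    # so return an empty frame with the expected columns instead.
    if not results:
        return pd.DataFrame(columns=['Tokens', 'Tags', 'Score'])

    # Keep the token text, predicted tag, and confidence score
    df = pd.DataFrame(results)[['word', 'entity', 'score']]
    df.columns = ['Tokens', 'Tags', 'Score']  # Rename columns for clarity
    return df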

text = "2000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)

#### Testing Data

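| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |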
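To read entities off the token and tag rows above, consecutive `B-`/`I-` tags of the same type can be grouped into spans. The following is a minimal sketch of that decoding; `bio_to_spans` is a hypothetical helper, not part of the model or the `transformers` library.

```python
def bio_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Close the span that is currently open, if any
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A B- tag (or an I- tag with no matching open span) starts a new span
            flush()
            current_tokens, current_type = [token], entity_type
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" ends any open span
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


tokens = "2.000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern .".split()
tags = "O O O O B-LOC I-LOC O O B-LOC B-LOC I-LOC O".split()
print(bio_to_spans(tokens, tags))
# → [('Danziger Bucht', 'LOC'), ('polnischen', 'LOC'), ('Woiwodschaft Pommern', 'LOC')]
```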