SpanMarker
This is a SpanMarker model that can be used for Named Entity Recognition. It was trained on the Legal NER Indian Justice dataset.
Official repository of the model: Github Link
Model Details
Model Description
- Model Type: SpanMarker
- Maximum Sequence Length: 128 tokens
- Maximum Entity Length: 6 words
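To illustrate what the 6-word entity limit means in practice, here is a minimal sketch of span enumeration. It assumes that candidate entities are contiguous word spans no longer than the maximum entity length, which is how a span-based approach such as SpanMarker is generally described. The helper `candidate_spans` is hypothetical and is not part of the SpanMarker API.

```python
def candidate_spans(words, max_entity_length=6):
    """Return every contiguous word span of at most `max_entity_length` words.

    Hypothetical helper for illustration only; not part of SpanMarker.
    Spans are (start, end) word indices with `end` exclusive.
    """
    spans = []
    for start in range(len(words)):
        # A span may not run past the sentence or exceed the length limit.
        last_end = min(start + max_entity_length, len(words))
        for end in range(start + 1, last_end + 1):
            spans.append((start, end))
    return spans


words = "The petition was filed through Sh. Vijay Pahwa".split()
spans = candidate_spans(words)
print(len(words), "words ->", len(spans), "candidate spans")
```

Under this assumption, a mention longer than six words would not appear among the candidates, so it could not be predicted as a single entity.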
Model Sources
Repository: SpanMarker on GitHub
Thesis: SpanMarker For Named Entity Recognition
Uses
Direct Use for Inference
from span_marker import SpanMarkerModel
from span_marker.tokenizer import SpanMarkerTokenizer
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("lambdavi/span-marker-luke-legal")
tokenizer = SpanMarkerTokenizer.from_pretrained("roberta-base", config=model.config)
model.set_tokenizer(tokenizer)
# Run inference
entities = model.predict("The petition was filed through Sh. Vijay Pahwa, General Power of Attorney and it was asserted in the petition under Section 13-B of the Rent Act that 1 of 23 50% share of the demised premises had been purchased by the landlord from Sh. Vinod Malhotra vide sale deed No.4226 registered on 20.12.2007 with Sub Registrar, Chandigarh.")
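`model.predict` returns a list of dictionaries, one per detected entity. The sketch below shows one way to filter and print them, assuming each dictionary carries `span`, `label`, and `score` keys as described in the SpanMarker documentation. The helper `format_entities` is hypothetical, and the sample entities, labels, and scores are made up for illustration; they are not real output from this model.

```python
def format_entities(entities, min_score=0.0):
    """Format entity dicts as aligned text lines, dropping low-confidence ones.

    Hypothetical helper; assumes "span", "label" and "score" keys.
    """
    lines = []
    for entity in entities:
        if entity["score"] >= min_score:
            lines.append(
                f'{entity["label"]:<14} {entity["score"]:.2f}  {entity["span"]}'
            )
    return lines


# Made-up stand-in for the `entities` list returned by model.predict(...)
sample_entities = [
    {"span": "Vijay Pahwa", "label": "OTHER_PERSON", "score": 0.97},
    {"span": "Rent Act", "label": "STATUTE", "score": 0.58},
]

for line in format_entities(sample_entities, min_score=0.6):
    print(line)
```

In real use, pass the `entities` list from the inference snippet in place of `sample_entities`.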
Downstream Use
You can finetune this model on your own dataset.
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer
from span_marker.tokenizer import SpanMarkerTokenizer
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("lambdavi/span-marker-luke-legal")
tokenizer = SpanMarkerTokenizer.from_pretrained("roberta-base", config=model.config)
model.set_tokenizer(tokenizer)
# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003
# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("lambdavi/span-marker-luke-legal-finetuned")
Training Details
Training Set Metrics
Training set | Min | Median | Max |
---|---|---|---|
Sentence length | 3 | 44.5113 | 2795 |
Entities per sentence | 0 | 2.7232 | 68 |
Training Hyperparameters
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.06
- num_epochs: 5
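Two of the values above follow from the others, and the short calculation below makes the relationship explicit. It is a sketch under two assumptions: training ran on a single device, and the warmup length is the warmup ratio times the total number of optimizer steps, rounded up, which is how the Transformers `Trainer` is understood to derive it. The total of 9185 steps is taken from the last row of the training results table.

```python
import math

train_batch_size = 8
gradient_accumulation_steps = 2
num_devices = 1  # assumption: single-device training
lr_scheduler_warmup_ratio = 0.06
total_steps = 9185  # final step reported in the training results

# Effective batch size seen by each optimizer update
total_train_batch_size = (
    train_batch_size * gradient_accumulation_steps * num_devices
)

# Number of optimizer steps spent ramping the learning rate up linearly
warmup_steps = math.ceil(total_steps * lr_scheduler_warmup_ratio)

print("total_train_batch_size:", total_train_batch_size)
print("warmup_steps:", warmup_steps)
```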
Training Results
Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
---|---|---|---|---|---|---|
0.9997 | 1837 | 0.0137 | 0.7773 | 0.7994 | 0.7882 | 0.9577 |
2.0 | 3675 | 0.0090 | 0.8751 | 0.8348 | 0.8545 | 0.9697 |
2.9997 | 5512 | 0.0077 | 0.8777 | 0.8959 | 0.8867 | 0.9770 |
4.0 | 7350 | 0.0061 | 0.8941 | 0.9083 | 0.9011 | 0.9811 |
4.9986 | 9185 | 0.0064 | 0.9090 | 0.9110 | 0.9100 | 0.9824 |
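The F1 column is the harmonic mean of the precision and recall columns, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes it from the rounded values in the table; small differences in the last digit would be expected from rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per epoch, copied from the table
rows = [
    (0.7773, 0.7994, 0.7882),
    (0.8751, 0.8348, 0.8545),
    (0.8777, 0.8959, 0.8867),
    (0.8941, 0.9083, 0.9011),
    (0.9090, 0.9110, 0.9100),
]

for precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"reported={reported:.4f}  computed={computed:.4f}")
```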
Metric | Value |
---|---|
f1-exact | 0.9237 |
f1-strict | 0.9100 |
f1-partial | 0.9365 |
f1-type-match | 0.9277 |
Framework Versions
- Python: 3.10.12
- SpanMarker: 1.5.0
- Transformers: 4.36.0
- PyTorch: 2.0.0
- Datasets: 2.17.1
- Tokenizers: 0.15.0
Citation
BibTeX
@software{Aarsen_SpanMarker,
author = {Aarsen, Tom},
license = {Apache-2.0},
title = {{SpanMarker for Named Entity Recognition}},
url = {https://github.com/tomaarsen/SpanMarkerNER}
}
Evaluation results
- F1 on legal_ner (self-reported): 0.910
- Precision on legal_ner (self-reported): 0.909
- Recall on legal_ner (self-reported): 0.911