---
license: afl-3.0
language:
- ja
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
---
# SMM4H-2024 Task 2 Japanese NER

## Overview

This is a named entity recognition (NER) model created by fine-tuning [daisaku-s/medtxt_ner_roberta](https://huggingface.co./daisaku-s/medtxt_ner_roberta) on the [SMM4H 2024 Task 2a](https://healthlanguageprocessing.org/smm4h-2024/) corpus.

Tag set (IOB2 format):
* DRUG
* DISORDER
* FUNCTION
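
In IOB2, the first token of each entity mention is labeled `B-<type>`, any continuation tokens `I-<type>`, and all other tokens `O`. A hypothetical illustration (the sentence and its word-level segmentation are made up for clarity; the model actually operates on its tokenizer's subwords):

```python
# "After taking aspirin, the headache went away." (illustration only)
tokens = ["アスピリン", "を", "飲ん", "だら", "頭痛", "が", "治っ", "た"]
tags   = ["B-DRUG", "O", "O", "O", "B-DISORDER", "O", "O", "O"]
```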

## Usage

```python
import torch
from transformers import AutoTokenizer, BertForTokenClassification

model_name = "yseop/SMM4H2024_Task2a_ja"
text = "サンプルテキスト"  # "sample text"

# Load the fine-tuned model and its tokenizer.
model = BertForTokenClassification.from_pretrained(model_name).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
idx2tag = model.config.id2label

# Tokenize the input and run a single forward pass.
vecs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
with torch.inference_mode():
    ner_logits = model(input_ids=vecs["input_ids"],
                       attention_mask=vecs["attention_mask"])

# Pick the highest-scoring tag per token; drop [CLS] and [SEP].
idx = torch.argmax(ner_logits.logits, dim=2)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(vecs["input_ids"][0])[1:-1]
pred_tags = [idx2tag[i] for i in idx][1:-1]

for tok, tag in zip(tokens, pred_tags):
    print(tok, tag)
```
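
To recover entity spans from the per-token output, the IOB2 sequence can be decoded by grouping each `B-` tag with the `I-` tags that follow it. A minimal hand-rolled sketch using the `tokens` and `pred_tags` variables from the snippet above (depending on the tokenizer, you may also need to strip subword markers before joining):

```python
def decode_iob2(tokens, tags):
    """Group IOB2 tags into (entity_type, surface_string) spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                          # close any open span
                spans.append((label, "".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and tag[2:] == label:
            current.append(tok)                  # continue the open span
        else:                                    # "O" or an inconsistent "I-"
            if current:
                spans.append((label, "".join(current)))
            current, label = [], None
    if current:
        spans.append((label, "".join(current)))
    return spans

print(decode_iob2(tokens, pred_tags))  # e.g. [("DRUG", "アスピリン"), ...]
```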

## Results

| Entity   |  TP |  FP |  FN | Precision | Recall |     F1 |
|----------|----:|----:|----:|----------:|-------:|-------:|
| DISORDER | 588 | 409 | 330 |    0.5898 | 0.6405 | 0.6141 |
| DRUG     | 307 | 143 | 169 |    0.6822 | 0.6450 | 0.6631 |
| FUNCTION |  69 | 160 | 170 |    0.3013 | 0.2887 | 0.2949 |
| all      | 964 | 712 | 669 |    0.5752 | 0.5903 | 0.5827 |
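
The precision, recall, and F1 values follow directly from the TP/FP/FN counts, and the `all` row matches the micro average over the three entity types. A minimal sketch of the arithmetic, with the counts taken from the table above:

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw span counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Micro average: sum the counts over DISORDER, DRUG, and FUNCTION first.
tp = 588 + 307 + 69    # = 964
fp = 409 + 143 + 160   # = 712
fn = 330 + 169 + 170   # = 669
print(prf(tp, fp, fn))  # ≈ (0.5752, 0.5903, 0.5827)
```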