File size: 4,464 Bytes
4868c82
d677166
 
528b70e
d677166
0b2c0aa
d677166
 
 
528b70e
 
 
 
 
 
 
 
 
d677166
 
4868c82
cc370a7
 
 
cfe4b13
4868c82
 
 
 
 
 
 
 
cc370a7
 
 
 
 
4868c82
 
 
 
cc370a7
4868c82
 
 
 
 
 
cc370a7
4868c82
 
 
cc370a7
 
 
 
 
 
4868c82
 
 
 
 
 
cc370a7
 
 
 
 
 
 
 
44d9104
4868c82
 
44d9104
4868c82
 
 
 
cc370a7
4868c82
cc370a7
4868c82
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95

---
language: fr
pipeline_tag: "token-classification"
widget:
 - text: "je voudrais réserver une chambre à paris pour demain et lundi"
 - text: "d'accord pour l'hôtel à quatre vingt dix euros la nuit"
 - text: "deux nuits s'il vous plait"
 - text: "dans un hôtel avec piscine à marseille"
tags:
- bert
- flaubert 
- natural language understanding
- NLU
- spoken language understanding
- SLU
- understanding
- MEDIA
---

# vpelloin/MEDIA_NLU-flaubert_base_uncased
This is a Natural Language Understanding (NLU) model for the French [MEDIA benchmark](https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/).
It maps each input words into outputs concepts tags (76 available).

This model is trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co./flaubert/flaubert_base_uncased) as its inital checkpoint. It obtained 12.40% CER (*lower is better*) in the MEDIA test set, in [our Interspeech 2023 publication](http://doi.org/10.21437/Interspeech.2022-352), using Kaldi ASR transcriptions.

## Available MEDIA NLU models:
- [`vpelloin/MEDIA_NLU-flaubert_base_cased`](https://huggingface.co./vpelloin/MEDIA_NLU-flaubert_base_cased): MEDIA NLU model trained using [`flaubert/flaubert_base_cased`](https://huggingface.co./flaubert/flaubert_base_cased). Obtains 13.20% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_base_uncased`](https://huggingface.co./vpelloin/MEDIA_NLU-flaubert_base_uncased): MEDIA NLU model trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co./flaubert/flaubert_base_uncased). Obtains 12.40% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_ft`](https://huggingface.co./vpelloin/MEDIA_NLU-flaubert_oral_ft): MEDIA NLU model trained using [`nherve/flaubert-oral-ft`](https://huggingface.co./nherve/flaubert-oral-ft). Obtains 11.98% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_mixed`](https://huggingface.co./vpelloin/MEDIA_NLU-flaubert_oral_mixed): MEDIA NLU model trained using [`nherve/flaubert-oral-mixed`](https://huggingface.co./nherve/flaubert-oral-mixed). Obtains 12.47% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr`](https://huggingface.co./vpelloin/MEDIA_NLU-flaubert_oral_asr): MEDIA NLU model trained using [`nherve/flaubert-oral-asr`](https://huggingface.co./nherve/flaubert-oral-asr). Obtains 12.43% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr_nb`](https://huggingface.co./vpelloin/MEDIA_NLU-flaubert_oral_asr_nb): MEDIA NLU model trained using [`nherve/flaubert-oral-asr_nb`](https://huggingface.co./nherve/flaubert-oral-asr_nb). Obtains 12.24% CER on MEDIA test.

## Usage with Pipeline
```python
from transformers import pipeline

generator = pipeline(
    model="vpelloin/MEDIA_NLU-flaubert_base_uncased",
    task="token-classification"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
 ]

for sentence in sentences:
    print([(tok['word'], tok['entity']) for tok in generator(sentence)])
```
## Usage with AutoTokenizer/AutoModel
```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)
tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
 ]
inputs = tokenizer(sentences, padding=True, return_tensors='pt')
outputs = model(**inputs).logits
print([
    [model.config.id2label[i] for i in b]
    for b in outputs.argmax(dim=-1).tolist()
])
```

## Reference

If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the [following paper](http://doi.org/10.21437/Interspeech.2022-352):
```
@inproceedings{pelloin22_interspeech,
  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine LAURENT and Laurent Besacier},
  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3453--3457},
  doi={10.21437/Interspeech.2022-352}
}
```