---
language: fr
pipeline_tag: "token-classification"
widget:
- text: "je voudrais réserver une chambre à paris pour demain et lundi"
- text: "d'accord pour l'hôtel à quatre vingt dix euros la nuit"
- text: "deux nuits s'il vous plait"
- text: "dans un hôtel avec piscine à marseille"
tags:
- bert
- flaubert
- natural language understanding
- NLU
- spoken language understanding
- SLU
- understanding
- MEDIA
---
# vpelloin/MEDIA_NLU-flaubert_base_uncased
This is a Natural Language Understanding (NLU) model for the French [MEDIA benchmark](https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/).
It maps each input word to an output concept tag (76 tags available).
This model was fine-tuned from the [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased) checkpoint. It obtains a 12.40% Concept Error Rate (CER, *lower is better*) on the MEDIA test set, as reported in [our Interspeech 2022 publication](http://doi.org/10.21437/Interspeech.2022-352), using Kaldi ASR transcriptions.
## Available MEDIA NLU models:
- [`vpelloin/MEDIA_NLU-flaubert_base_cased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_cased): MEDIA NLU model trained using [`flaubert/flaubert_base_cased`](https://huggingface.co/flaubert/flaubert_base_cased). Obtains 13.20% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_base_uncased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_uncased): MEDIA NLU model trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased). Obtains 12.40% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_ft`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_ft): MEDIA NLU model trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft). Obtains 11.98% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_mixed`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_mixed): MEDIA NLU model trained using [`nherve/flaubert-oral-mixed`](https://huggingface.co/nherve/flaubert-oral-mixed). Obtains 12.47% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr): MEDIA NLU model trained using [`nherve/flaubert-oral-asr`](https://huggingface.co/nherve/flaubert-oral-asr). Obtains 12.43% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr_nb`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr_nb): MEDIA NLU model trained using [`nherve/flaubert-oral-asr_nb`](https://huggingface.co/nherve/flaubert-oral-asr_nb). Obtains 12.24% CER on MEDIA test.
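
To help read the CER figures listed above, here is a minimal sketch of how such an error rate can be computed. It assumes the common definition of Concept Error Rate: the edit distance (substitutions, deletions, insertions) between reference and predicted concept sequences, divided by the number of reference concepts. The exact scoring protocol used in the paper may differ, and the function names and toy concept labels below are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of concept labels."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def concept_error_rate(references, hypotheses):
    """Corpus-level rate: total edits / total number of reference concepts."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_ref = sum(len(r) for r in references)
    return total_edits / total_ref if total_ref else 0.0


# Toy example with made-up concept labels (not the real MEDIA tag set)
refs = [["command", "room", "city", "date"], ["number", "unit"]]
hyps = [["command", "room", "date"], ["number", "price"]]
cer = concept_error_rate(refs, hyps)  # 2 edits over 6 reference concepts
print(f"{100 * cer:.2f}%")
```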
## Usage with Pipeline
```python
from transformers import pipeline

# Build a token-classification pipeline from the fine-tuned checkpoint
generator = pipeline(
    model="vpelloin/MEDIA_NLU-flaubert_base_uncased",
    task="token-classification"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

# Print (token, predicted concept tag) pairs for each sentence
for sentence in sentences:
    print([(tok['word'], tok['entity']) for tok in generator(sentence)])
```
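
The pipeline returns one dictionary per token. A minimal sketch for merging consecutive tokens that share the same tag is given below; it relies only on the `word` and `entity` keys used above. The helper name `group_by_tag` and the toy predictions are hypothetical, and treating a run of identical tags as a single span is an assumption that may not match the model's tagging scheme.

```python
from itertools import groupby


def group_by_tag(predictions):
    """Merge consecutive token predictions that share the same 'entity' tag."""
    return [
        (tag, [tok["word"] for tok in run])
        for tag, run in groupby(predictions, key=lambda tok: tok["entity"])
    ]


# Toy predictions shaped like the pipeline output (labels are made up)
toy = [
    {"word": "une", "entity": "number"},
    {"word": "chambre", "entity": "room"},
    {"word": "à", "entity": "city"},
    {"word": "paris", "entity": "city"},
]
groups = group_by_tag(toy)
print(groups)
```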
## Usage with AutoTokenizer/AutoModel
```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)

tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
]

# Tokenize the batch, padding every sentence to the longest one
inputs = tokenizer(sentences, padding=True, return_tensors='pt')
# Per-token logits, shape (batch, sequence_length, num_labels)
outputs = model(**inputs).logits

# Keep the highest-scoring label at each position
# (padding and special-token positions are included)
print([
    [model.config.id2label[i] for i in b]
    for b in outputs.argmax(dim=-1).tolist()
])
```
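
Because the batch is padded, the label lists printed above also contain predictions for padding positions. A minimal sketch for pairing tokens with their labels and dropping padded positions follows. It works on plain Python lists; the helper name `pair_tokens_with_labels`, the toy token strings, and the toy `id2label` mapping are hypothetical.

```python
def pair_tokens_with_labels(tokens, label_ids, attention_mask, id2label):
    """Pair each non-padded token with its predicted label."""
    return [
        (token, id2label[label_id])
        for token, label_id, keep in zip(tokens, label_ids, attention_mask)
        if keep  # attention_mask is 0 on padding positions
    ]


# Toy inputs standing in for one row of the batch (values are made up).
# With the real objects, row k could be built as:
#   tokens         = tokenizer.convert_ids_to_tokens(inputs["input_ids"][k].tolist())
#   label_ids      = outputs.argmax(dim=-1)[k].tolist()
#   attention_mask = inputs["attention_mask"][k].tolist()
#   id2label       = model.config.id2label
tokens = ["deux", "nuits", "<pad>", "<pad>"]
label_ids = [1, 2, 0, 0]
attention_mask = [1, 1, 0, 0]
id2label = {0: "null", 1: "number", 2: "unit"}

pairs = pair_tokens_with_labels(tokens, label_ids, attention_mask, id2label)
print(pairs)
```

Special tokens added by the tokenizer are not padding, so they remain in the output and may need separate filtering.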
## Reference
If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the [following paper](http://doi.org/10.21437/Interspeech.2022-352):
```
@inproceedings{pelloin22_interspeech,
author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine LAURENT and Laurent Besacier},
title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
year=2022,
booktitle={Proc. Interspeech 2022},
pages={3453--3457},
doi={10.21437/Interspeech.2022-352}
}
```