a-mannion committed on
Commit
7c5007b
·
verified ·
1 Parent(s): 1ebbf8e

Update README.md

Files changed (1)
  1. README.md +129 -3
README.md CHANGED
@@ -1,3 +1,129 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language:
4
+ - fr
5
+ library_name: transformers
6
+ tags:
7
+ - linformer
8
+ - legal
9
+ - medical
10
+ - RoBERTa
11
+ - pytorch
12
+ ---
13
+
14
+ # Jargon-general-base
15
+
16
+ [Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder LM for French, combining the LinFormer attention mechanism with the RoBERTa model architecture.
17
+
18
+ Jargon is available in several versions with different context sizes and types of pre-training corpora.
19
+
20
+ <!-- Provide a quick summary of what the model is/does. -->
21
+
22
+ <!-- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
23
+ -->
24
+
25
+ | **Model** | **Initialised from...** |**Training Data**|
26
+ |-------------------------------------------------------------------------------------|:-----------------------:|:----------------:|
27
+ | [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch |8.5GB Web Corpus|
28
+ | [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base |5.4GB Medical Corpus|
29
+ | jargon-general-legal | jargon-general-base |18GB Legal Corpus|
30
+ | [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base |Medical+Legal Corpora|
31
+ | jargon-legal | scratch |18GB Legal Corpus|
32
+ | jargon-legal-4096 | scratch |18GB Legal Corpus|
33
+ | [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch |5.4GB Medical Corpus|
34
+ | [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch |5.4GB Medical Corpus|
35
+ | [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch |[NACHOS](https://drbert.univ-avignon.fr/)|
36
+ | [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch |[NACHOS](https://drbert.univ-avignon.fr/)|
37
+
38
+
39
+ ## Evaluation
40
+
41
+ The Jargon models were evaluated on a range of specialized downstream tasks.
42
+
43
+ For more information, see the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).
44
+
45
+
46
+ ## Using Jargon models with Hugging Face Transformers
47
+
48
+ You can get started with `jargon-general-base` using the code snippet below:
49
+
50
+ ```python
51
+ from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
52
+
53
+ tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-base", trust_remote_code=True)
54
+ model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-general-base", trust_remote_code=True)
55
+
56
+ jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
57
+ output = jargon_maskfiller("Il est allé au <mask> hier")
58
+ ```
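
For a single masked input, the `fill-mask` pipeline returns a list of candidate dictionaries with `score`, `token`, `token_str`, and `sequence` keys. The sketch below shows one way to rank and print them. `summarize_predictions` is a hypothetical helper that is not part of this model card, and the sample values are made-up placeholders, not real Jargon output.

```python
def summarize_predictions(predictions, k=3):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:k]]


# Placeholder data in the shape the pipeline returns; NOT real model output.
example_output = [
    {"score": 0.10, "token": 11, "token_str": " marché",
     "sequence": "Il est allé au marché hier"},
    {"score": 0.25, "token": 12, "token_str": " cinéma",
     "sequence": "Il est allé au cinéma hier"},
]

for token, score in summarize_predictions(example_output):
    print(f"{token}\t{score}")
```

In practice you would pass the `output` variable from the snippet above instead of `example_output`.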
59
+
60
+ You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.
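
As a rough sketch of that choice, the loader below maps a task name to one of the Auto classes listed above. The task names, the `TASK_TO_AUTO_CLASS` table, and the `load_jargon` helper are all hypothetical and introduced only for illustration. `transformers` is imported lazily so that the mapping can be inspected without loading the library.

```python
import importlib

# Hypothetical task names mapped to the Auto classes mentioned in this README.
TASK_TO_AUTO_CLASS = {
    "features": "AutoModel",
    "masked-lm": "AutoModelForMaskedLM",
    "sequence-classification": "AutoModelForSequenceClassification",
    "token-classification": "AutoModelForTokenClassification",
}


def load_jargon(task, model_id="PantagrueLLM/jargon-general-base", **kwargs):
    """Load a Jargon checkpoint with the Auto class matching `task`."""
    class_name = TASK_TO_AUTO_CLASS[task]  # raises KeyError for unknown tasks
    transformers = importlib.import_module("transformers")  # lazy import
    auto_class = getattr(transformers, class_name)
    # trust_remote_code=True mirrors the snippet above (custom model code).
    return auto_class.from_pretrained(model_id, trust_remote_code=True, **kwargs)


# Example (downloads weights; the label count is an assumption for illustration):
# model = load_jargon("sequence-classification", num_labels=2)
```

Classification heads loaded this way are newly initialised, so they need fine-tuning on the downstream task before their predictions are meaningful.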
61
+
62
+ - **Language(s):** French
63
+ - **License:** MIT
64
+ - **Developed by:** Vincent Segonne
65
+ - **Funded by:**
66
+   - GENCI-IDRIS (Grant 2022 A0131013801)
67
+   - French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
68
+   - MIAI@Grenoble Alpes ANR-19-P3IA-0003
69
+   - PROPICTO ANR-20-CE93-0005
70
+   - Lawbot ANR-20-CE38-0013
71
+   - Swiss National Science Foundation (grant PROPICTO N°197864)
72
+ - **Authors:**
73
+   - Vincent Segonne
74
+   - Aidan Mannion
75
+   - Laura Cristina Alonzo Canul
76
+   - Alexandre Audibert
77
+   - Xingyu Liu
78
+   - Cécile Macaire
79
+   - Adrien Pupier
80
+   - Yongxin Zhou
81
+   - Mathilde Aguiar
82
+   - Felix Herron
83
+   - Magali Norré
84
+   - Massih-Reza Amini
85
+   - Pierrette Bouillon
86
+   - Iris Eshkol-Taravella
87
+   - Emmanuelle Esperança-Rodier
88
+   - Thomas François
89
+   - Lorraine Goeuriot
90
+   - Jérôme Goulian
91
+   - Mathieu Lafourcade
92
+   - Benjamin Lecouteux
93
+   - François Portet
94
+   - Fabien Ringeval
95
+   - Vincent Vandeghinste
96
+   - Maximin Coavoux
97
+   - Marco Dinarelli
98
+   - Didier Schwab
99
+
100
+
101
+
102
+ ## Citation
103
+
104
+ If you use this model for your own research work, please cite as follows:
105
+
106
+ ```bibtex
107
+ @inproceedings{segonne:hal-04535557,
108
+ TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
109
+ AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
110
+ URL = {https://hal.science/hal-04535557},
111
+ BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
112
+ ADDRESS = {Turin, Italy},
113
+ YEAR = {2024},
114
+ MONTH = May,
115
+ KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
116
+ PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
117
+ HAL_ID = {hal-04535557},
118
+ HAL_VERSION = {v1},
119
+ }
120
+ ```
121
+
122
+
123
+
124
+ <!-- - **Finetuned from model [optional]:** [More Information Needed] -->
125
+ <!--
126
+ ### Model Sources [optional]
127
+
128
+
129
+ <!-- Provide the basic links for the model. -->