Update README.md
README.md
CHANGED
@@ -6,6 +6,8 @@ tags:
- NLP
- legal field
- python
+ - word2vec
+ - doc2vec
---

@@ -30,16 +32,10 @@ If you use our library in your academic work, please cite us in the following wa

0. [Accessing the Language Models](#0)
1. [ Introduction / Installing package](#1)
- 2. [ Functions](#2)
- 1. [ Text Cleaning Functions](#2.1)
- 2. [ Other Functions](#2.2)
- 3. [ Model Languages](#3)
- 1. [ Phraser ](#3.1)
- 2. [ Word2Vec/Doc2Vec ](#3.2)
- 3. [ FastText ](#3.3)
- 4. [ BERTikal ](#3.4)
- 4. [ Demonstrations / Tutorials](#4)
- 5. [ References](#5)
+ 2. [ Language Models (Details / How to use)](#2)
+ 1. [ Word2Vec/Doc2Vec ](#2.1)
+ 3. [ Demonstrations / Tutorials](#3)
+ 4. [ References](#4)

--------------

@@ -49,9 +45,6 @@ If you use our library in your academic work, please cite us in the following wa

All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).

- Some models can be downloaded directly using our function `get_premodel` (more details in section [Other Functions](#2.2)).

Please contact *[email protected]* if you have any problem accessing the language models.

--------------
@@ -61,159 +54,30 @@ Please contact *[email protected]* if you have any problem accessing the

*LegalNLP* is promising given the scarcity of Natural Language Processing resources focused on the Brazilian legal language. It is worth mentioning that our library was made for Python, one of the most well-known programming languages for machine learning.


- You can install our package by running the following command in the terminal
- ```sh
- $ pip install legalnlp
- ```

- You can load all our functions by running the following command

- ```python
- from legalnlp.clean_functions import *
- from legalnlp.get_premodel import *
- ```

- --------------

- <a name="2"></a>
- ## 2\. Functions

- <a name="2.1"></a>
- ### 2.1\. Text Cleaning Functions

- <a name="2.1.1"></a>
- #### 2.1.1\. `clean(text, lower=True, return_masked=False)`
- Function for cleaning texts to be used (optionally) in conjunction with the Doc2Vec, Word2Vec, and FastText models. We use RegEx to mask/extract information such as email addresses, URLs, dates, numbers, monetary values, etc.

- **input**:

- - *text*, **str**;

- - *lower*, **bool**, default=**True**. If lower==True, the function lowercases the whole text. Note that all the models (except BERT) were trained with lowercased texts;

- - *return_masked*, **bool**, default=**False**. If return_masked==False, the function outputs a clean text. Otherwise, it returns a dictionary containing the clean text and the information extracted by RegEx;

- **output**:

- - Clean text or dictionary, depending on the *return_masked* parameter;
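
For reference, a minimal sketch of calling `clean` (the input string is ours and purely illustrative; the exact masked output depends on the RegEx rules described above):

```python
from legalnlp.clean_functions import clean

raw = "Intime-se. Valor da causa: R$ 1.500,00. Ver https://www.tjrj.jus.br em 01/02/2021."

# Default: returns a single lowercased string with emails, URLs, dates,
# numbers, and monetary values masked/extracted as described above
print(clean(raw))

# return_masked=True instead returns a dictionary with the clean text
# plus the information extracted by RegEx
masked = clean(raw, return_masked=True)
```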

- <a name="2.1.2"></a>
- #### 2.1.2\. `clean_bert(text)`

- Function for cleaning texts to be used (optionally) in conjunction with the BERT model.

- **input:**

- - *text*, **str**.

- **output:**

- - **str** with the clean text.
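
A corresponding sketch for `clean_bert`, assuming it lives in the same `clean_functions` module that the package-level import above pulls in (the example text is ours):

```python
from legalnlp.clean_functions import clean_bert

# Returns a single cleaned string, ready to be fed to the BERT model
txt_bert = clean_bert("EMENTA: Apelação Cível. Direito do Consumidor.")
print(txt_bert)
```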

- <a name="2.2"></a>
- ### 2.2\. Other Functions

- #### 2.2.1\. `get_premodel(model)`

- Function to download a pre-trained model to the same folder as the file being executed.

- **input:**

- - *model*, **str**. Must contain the name of the pre-trained model that one wants to use. The options are:
-     - **model = "bert"**: downloads a .zip file containing the BERTikal model and unzips it.
-     - **model = "wdoc"**: downloads the Word2Vec and Doc2Vec pre-trained models in a .zip file and unzips it. It has two files: one with a size-100 Doc2Vec Distributed Memory (DM)/Word2Vec Continuous Bag-of-Words (CBOW) embeddings model, and another with a size-100 Doc2Vec Distributed Bag-of-Words (DBOW)/Word2Vec Skip-Gram (SG) embeddings model.
-     - **model = "fasttext"**: downloads a .zip file containing the size-100 FastText CBOW/SG models and unzips it.
-     - **model = "phraser"**: downloads the Phraser pre-trained models in a .zip file and unzips it. It has two files, phraser1 and phraser2. We explain how to use them in Section [Phraser](#3.1).
-     - **model = "w2vnilc"**: downloads the size-100 Word2Vec CBOW embeddings model trained by the "Núcleo Interinstitucional de Linguística Computacional - USP" in a .zip file and unzips it. [Click here for more details](http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc).
-     - **model = "neuralmindbase"**: downloads a .zip file containing the base BERT model (PyTorch) trained by NeuralMind and unzips it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).
-     - **model = "neuralmindlarge"**: downloads a .zip file containing the large BERT model (PyTorch) trained by NeuralMind and unzips it. For more information about the BERT models made by NeuralMind, go to [their GitHub repo](https://github.com/neuralmind-ai/portuguese-bert).

- **output:**

- - True if a model was downloaded and False otherwise.
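
A minimal usage sketch, relying only on the import shown earlier and the return convention described above (True on success, False otherwise):

```python
from legalnlp.get_premodel import get_premodel

# Downloads and unzips the two Phraser models into the current folder
if get_premodel("phraser"):
    print("Phraser models ready")
else:
    print("Download failed")
```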

- #### 2.2.2\. `extract_features_bert(path_model, path_tokenizer, data, gpu=True)`

- Function for extracting features with the BERT model (this function is not accessible through the package installation, but you can find it [here](https://github.com/felipemaiapolo/legalnlp/blob/main/demo/BERT/extract_features_bert.ipynb)).

- **Input:**

- - *path_model*, **str**. Must contain the path of the pre-trained model;

- - *path_tokenizer*, **str**. Must contain the path of the tokenizer;

- - *data*, **list**. Must contain a list of texts from which features will be extracted;

- - *gpu*, **bool**, default=**True**. If gpu==False, the GPU will not be used when applying the model (we recommend doing the feature extraction using Google Colab).

- **Output:**

- - **DataFrame** with the features extracted by the BERT model.
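
Because the function is defined in the linked notebook rather than in the installed package, a call would look roughly like this once its definition has been copied into your session (the paths and texts below are placeholders):

```python
# extract_features_bert is defined in the notebook linked above
texts = ["direito do consumidor origem : bangu regional",
         "apelação cível . direito do consumidor"]

df = extract_features_bert(path_model='model_bertikal/',
                           path_tokenizer='model_bertikal/',
                           data=texts,
                           gpu=False)
df.head()  # one row of BERT features per input text
```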

- <a name="3"></a>
- ## 3\. Model Languages

- <a name="3.1"></a>
- ### 3.1\. Phraser

- Phraser is a statistical method proposed in the natural language processing literature [1] for identifying which words, when they appear together, can be considered single tokens. Applying this method identifies the relevance of the occurrence of a bigram against the occurrence of the words that make it up separately. Thus, we can identify that a bigram like "São Paulo" should be treated as a single token, for example. If the method is applied a second time in sequence, we can check which trigrams and quadrigrams are relevant. Since the two applications should be done with different Phraser models, it can be the case that the second application identifies bigrams that were not identified by the first model.

- This model is compatible with the `clean` function, but it is not necessary to use it beforehand. Remember to at least make all letters lowercase. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim_3.8.3/models/phrases.html) for more details. Preferably use Gensim version 3.8.3.

- #### Using *Phraser*

- Installing Gensim

- ```python
- !pip install gensim=='3.8.3'
- ```

- Loading the two Phraser models

- ```python
- from gensim.models.phrases import Phraser

- #Loading two Phraser models
- phraser1=Phraser.load('models_phraser/phraser1')
- phraser2=Phraser.load('models_phraser/phraser2')
- ```

- Applying Phraser once and twice to check the output

- ```python
- txt='direito do consumidor origem : bangu regional xxix juizado especial civel ação : [processo] - - recte : fundo de investimento em direitos creditórios'
- tokens=txt.split()

- print('Clean Text: "'+' '.join(tokens)+'"')
- print('\nApplying Phraser 1x: "'+' '.join(phraser1[tokens])+'"')
- print('\nApplying Phraser 2x: "'+' '.join(phraser2[phraser1[tokens]])+'"')
- ```

-     Clean Text: "direito do consumidor origem : bangu regional xxix juizado especial civel ação : [processo] - - recte : fundo de investimento em direitos creditórios"

-     Applying Phraser 1x: "direito do consumidor origem : bangu regional xxix juizado_especial civel_ação : [processo] - - recte : fundo de investimento em direitos_creditórios"

-     Applying Phraser 2x: "direito do consumidor origem : bangu_regional xxix juizado_especial_civel_ação : [processo] - - recte : fundo de investimento em direitos_creditórios"

+ You first need to install the `huggingface_hub` library by running the following command in the terminal
+ ```sh
+ $ pip install huggingface_hub
+ ```

+ Import `hf_hub_download`:

+ ```python
+ from huggingface_hub import hf_hub_download
+ ```

+ And then you can download our Word2Vec(SG)/Doc2Vec(DBOW) and Word2Vec(CBOW)/Doc2Vec(DM) models with the following commands:

+ ```python
+ w2v_sg_d2v_dbow = hf_hub_download(repo_id = "Projeto/LegalNLP", filename = "w2v_d2v_dbow_size_100_window_15_epochs_20")
+ w2v_cbow_d2v_dm = hf_hub_download(repo_id = "Projeto/LegalNLP", filename = "w2v_d2v_dm_size_100_window_15_epochs_20")
+ ```

+ --------------

+ <a name="2"></a>
+ ## 2\. Model Languages

<a name="3.2"></a>
### 3.2\. Word2Vec/Doc2Vec
@@ -226,7 +90,7 @@ the meaning of the various textual elements, based on the contexts in which thes

elements appear. Doc2Vec methods are extensions/modifications of Word2Vec
for generating whole text representations.

+ Remember to at least make all letters lowercase. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim_3.8.3/models/doc2vec.html) for more details. Preferably use Gensim version 3.8.3.

Below we have a summary table with some important information about the trained models:
@@ -239,8 +103,7 @@ Below we have a summary table with some important information about the trained

| ```w2v_d2v_dbow*``` | Distributed Bag-of-Words (DBOW) | Skip-Gram (SG) | 100, 200, 300 | 15

+ Here we make available both models with size 100 and window 15.

#### Using *Word2Vec*

@@ -251,14 +114,14 @@ Installing Gensim

!pip install gensim=='3.8.3'
```

- Loading W2V
+ Loading W2V:

```python
from gensim.models import KeyedVectors

#Loading a W2V model
- w2v=KeyedVectors.load(
+ w2v=KeyedVectors.load(w2v_cbow_d2v_dm)
w2v=w2v.wv
```
Viewing the first 10 entries of 'juiz' vector
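
(The hunk cuts off here; the lookup this line refers to is a plain indexing call on the loaded vectors, mirroring the FastText example further down:)

```python
w2v['juiz'][:10]  # first 10 coordinates of the 'juiz' embedding
```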
@@ -307,14 +170,14 @@ Installing Gensim

!pip install gensim=='3.8.3'
```

- Loading D2V
+ Loading D2V

```python
from gensim.models import Doc2Vec

#Loading a D2V model
- d2v=Doc2Vec.load(
+ d2v=Doc2Vec.load(w2v_cbow_d2v_dm)
```

Inferring vector for a text
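
(The hunk cuts off here as well; for context, inference with a loaded Doc2Vec model is a single `infer_vector` call on a tokenized text, matching the `txt_vec[:10]` snippet visible in the next hunk header:)

```python
txt = 'direito do consumidor origem : bangu regional'
txt_vec = d2v.infer_vector(txt.split())
txt_vec[:10]  # first 10 coordinates of the inferred document vector
```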
@@ -338,109 +201,6 @@ txt_vec[:10]



- <a name="3.3"></a>
- ### 3.3\. FastText

- The FastText [4] methods, like Word2Vec, form a class of models for creating vector representations (embeddings) for tokens. Unlike Word2Vec, which disregards the morphology of the tokens and allocates a different vector to each one of them, the FastText methods consider that each token is formed by character n-grams or substrings. In this way, the representation of tokens that do not appear in the training set can be inferred from the representation of their substrings. Also, rare tokens can have more robust representations than those returned by the Word2Vec methods.

- The models are compatible with the `clean` function, but it is not necessary to use it. Remember to at least make all letters lowercase. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim/models/fasttext.html) for more details. Preferably use Gensim version 4.0.1.

- Below we have a summary table with some important information about the trained models:

- | Filenames | FastText | Sizes | Windows |
- |:-------------------:|:--------------:|:--------------:|:--------------:|
- | ```fasttext_cbow*``` | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
- | ```fasttext_sg*``` | Skip-Gram (SG) | 100, 200, 300 | 15 |

- #### Using *FastText*

- Installing Gensim

- ```python
- !pip install gensim=='4.0.1'
- ```

- Loading FastText (all the files for the specific model should be in the same folder)

- ```python
- from gensim.models import FastText

- #Loading a FastText model
- fast=FastText.load('models_fasttext/fasttext_sg_size_100_window_15_epochs_20')
- fast=fast.wv
- ```

- Viewing the first 10 entries of 'juiz' vector

- ```python
- fast['juiz'][:10]
- ```

-     array([ 0.46769685,  0.62529474,  0.08549586,  0.09621219, -0.09998254,
-            -0.07897531,  0.32838237, -0.33229044, -0.05959201, -0.5865443 ],
-           dtype=float32)

- Viewing the first 10 vector entries of a token that was not in our vocabulary

- ```python
- fast['juizasjashdkjhaskda'][:10]
- ```

-     array([ 0.02795791,  0.1361525 ,  0.1340836 , -0.36824217, -0.11549155,
-            -0.11167661,  0.32045627, -0.33701468, -0.05198409, -0.05513595],
-           dtype=float32)

- <a name="3.4"></a>
- ### 3.4\. BERTikal

- We call BERTikal our BERT-Base model (cased) [5] for Brazilian legal language. BERT models are based on neural network architectures called Transformers. BERT models are trained on large sets of texts using the self-supervised paradigm, which basically consists of solving unsupervised problems using supervised techniques. A pre-trained BERT model is capable of generating representations for entire texts and can be adapted for a supervised task, e.g., text classification or question answering, using the fine-tuning mechanism.

- BERTikal was trained using version 4.2.2 of the Python package [Transformers](https://huggingface.co/transformers/), and the checkpoint we make available is compatible with [PyTorch](https://pytorch.org/) 1.9.0. Although we pin the versions of both packages, more recent versions can be used in applications of the model, as long as there are no relevant version conflicts.

- Our model was trained from the checkpoint made available in [NeuralMind's GitHub repository](https://github.com/neuralmind-ai/portuguese-bert) by the authors of recent research [6].

- #### Using *BERTikal*

- Installing Torch and Transformers

- ```python
- !pip install torch=='1.8.1' transformers=='4.2.2'
- ```

- Loading BERT (all the files for the specific model should be in the same folder)

- ```python
- from transformers import BertModel, BertTokenizer

- bert_tokenizer = BertTokenizer.from_pretrained('model_bertikal/', do_lower_case=False)
- bert_model = BertModel.from_pretrained('model_bertikal/')
- ```
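
As a sketch of the feature-extraction use described above (the `[CLS]` pooling choice here is ours, not something the README prescribes):

```python
import torch

# Encode one text and take BERT's [CLS] embedding as a whole-text feature
enc = bert_tokenizer("direito do consumidor origem : bangu regional",
                     return_tensors='pt', truncation=True, max_length=512)
with torch.no_grad():
    out = bert_model(**enc)

cls_vec = out.last_hidden_state[0, 0]
print(cls_vec.shape)  # torch.Size([768]) for a BERT-Base checkpoint
```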

--------------

<a name="4"></a>