umarbutler committed on
Commit
db148eb
·
verified ·
1 Parent(s): 4d96efa
Files changed (11)
  1. .gitattributes +1 -0
  2. LICENCE.md +21 -0
  3. README.md +206 -0
  4. config.json +26 -0
  5. logo.png +3 -0
  6. merges.txt +0 -0
  7. model.safetensors +3 -0
  8. special_tokens_map.json +15 -0
  9. tokenizer.json +0 -0
  10. tokenizer_config.json +57 -0
  11. vocab.json +0 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ logo.png filter=lfs diff=lfs merge=lfs -text
LICENCE.md ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 Umar Butler
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,209 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
+ language:
3
+ - en
4
  license: mit
5
+ library_name: transformers
6
+ base_model: roberta-base
7
+ tags:
8
+ - law
9
+ - legal
10
+ - australia
11
+ - generated_from_trainer
12
+ - fill-mask
13
+ - sentence-similarity
14
+ - feature-extraction
15
+ datasets:
16
+ - umarbutler/open-australian-legal-corpus
17
+ widget:
18
+ - text: "Section <mask> of the Constitution grants the Australian Parliament the power to make laws for the peace, order, and good government of the Commonwealth."
19
+ - text: "The most learned and eminent jurist in Australia's history is <mask> CJ."
20
+ - text: "A <mask> of trade is valid to the extent to which it is not against public policy, whether it is in severable terms or not."
21
+ - text: "In Mabo v <mask> (No 2) (1992) 175 CLR 1, the Court found that the doctrine of terra nullius was not applicable to Australia at the time of British settlement of New South Wales."
22
+ metrics:
23
+ - perplexity
24
+ model-index:
25
+ - name: emubert
26
+ results:
27
+ - task:
28
+ type: fill-mask
29
+ name: Fill mask
30
+ dataset:
31
+ type: umarbutler/open-australian-legal-qa
32
+ name: Open Australian Legal QA
33
+ split: train
34
+ revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
35
+ metrics:
36
+ - type: perplexity
37
+ value: 2.05
38
+ name: perplexity
39
+ source:
40
+ name: EmuBert Creator
41
+ url: https://github.com/umarbutler/emubert-creator
42
  ---
43
+
44
+ # EmuBert
45
+ <img src="https://huggingface.co/umarbutler/emubert/raw/main/logo.png" width="100" align="left" />
46
+ EmuBert is the largest open-source masked language model for Australian law.
47
+
48
+ Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition** and **question answering**. It can also be used as-is for **semantic similarity**, **vector search** and general **sentence embedding**.
49
+
50
+ To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).
51
+
52
+ ## Usage 👩‍💻
53
+ Those interested in finetuning EmuBert can consult Hugging Face's [documentation](https://huggingface.co/docs/transformers/en/model_doc/roberta) for [Roberta](https://huggingface.co/roberta-base)-like models, which provides tutorials, scripts and other resources for the most common natural language processing tasks.
54
+
55
+ It is also possible to generate embeddings from the model, which can be used directly for tasks like semantic similarity and clustering or for training downstream models. This can be done either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (i.e., `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the code snippet below which, although more complicated, is also orders of magnitude faster:
56
+ ```python
57
+ import torch
58
+ import itertools
59
+
60
+ from typing import Iterable, Generator
61
+ from contextlib import nullcontext
62
+ from transformers import AutoModel, AutoTokenizer
63
+
64
+ BATCH_SIZE = 8
65
+
66
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
67
+ model = AutoModel.from_pretrained('umarbutler/emubert').to(device)
68
+ model = model.to_bettertransformer() # Optional: convert the model into a BetterTransformer
69
+                                      # to speed it up (requires the Optimum library).
70
+ tokeniser = AutoTokenizer.from_pretrained('umarbutler/emubert')
71
+
72
+ texts = [
73
+     'The Parliament shall, subject to this Constitution, '
74
+     'have power to make laws for the peace, order, and good '
75
+     'government of the Commonwealth.',
76
+
77
+     'The executive power of the Commonwealth is vested in the Queen '
78
+     'and is exercisable by the Governor-General as the Queen’s representative, '
79
+     'and extends to the execution and maintenance of this Constitution, '
80
+     'and of the laws of the Commonwealth.',
81
+ ]
82
+
83
+ def batch_generator(iterable: Iterable, batch_size: int) -> Generator[list, None, None]:
84
+     """Generate batches of the specified size from the provided iterable."""
85
+
86
+     iterator = iter(iterable)
87
+
88
+     for first in iterator:
89
+         yield list(itertools.chain([first], itertools.islice(iterator, batch_size - 1)))
90
+
91
+ with torch.inference_mode(), \
92
+ ( # Optional: use mixed precision to speed up inference.
93
+     torch.cuda.amp.autocast()
94
+     if torch.cuda.is_available()
95
+     else nullcontext()
96
+ ):
97
+     embeddings = []
98
+
99
+     for batch in batch_generator(texts, BATCH_SIZE):
100
+         inputs = tokeniser(batch, return_tensors='pt', padding=True, truncation=True).to(device)
101
+         token_embeddings = model(**inputs).last_hidden_state
102
+
103
+         # Perform mean pooling, ignoring padding.
104
+         mask = inputs['attention_mask'].unsqueeze(-1).expand(token_embeddings.size()).float()
105
+         summed = torch.sum(mask * token_embeddings, 1)
106
+         summed_mask = torch.clamp(mask.sum(1), min=1e-9)
107
+         embeddings.extend(summed / summed_mask)
+ ```
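The mean-pooling step at the end of the snippet can be checked by hand on a toy example. The sketch below reimplements it in plain Python for a single sequence; `masked_mean` is a hypothetical helper rather than part of any library, and the vectors are made-up values.

```python
def masked_mean(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim

    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value

    # Mirror the clamp in the snippet so an all-padding row cannot divide by zero.
    count = max(sum(attention_mask), 1e-9)
    return [value / count for value in summed]


# Two real tokens followed by one padding token (illustrative values only).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
```

Because the third position is masked out, its large values have no effect on `pooled`, which is the behaviour the attention mask is meant to guarantee for padded batches.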
109
+
110
+ ## Creation 🧪
111
+ 202,260 Australian laws, regulations and decisions were first collected from [version 4.2.1](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus/tree/fe0cd918dbe0a1fb5afe09cfa682ec3dbc1b94ca) of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus). A breakdown of the Corpus' composition by source and document type is provided below:
112
+ | Source | Primary Legislation | Secondary Legislation | Bills | Decisions | **Total** |
113
+ |:--------------------------------|----------------------:|------------------------:|--------:|------------:|--------:|
114
+ | Federal Register of Legislation | 3,872 | 19,587 | 0 | 0 |**23,459**|
115
+ | Federal Court of Australia | 0 | 0 | 0 | 46,733 |**46,733**|
116
+ | High Court of Australia | 0 | 0 | 0 | 9,433 |**9,433**|
117
+ | NSW Caselaw | 0 | 0 | 0 | 111,882 |**111,882**|
118
+ | NSW Legislation | 1,428 | 800 | 0 | 0 |**2,228**|
119
+ | Queensland Legislation | 564 | 426 | 2,247 | 0 |**3,237**|
120
+ | Western Australian Legislation | 812 | 760 | 0 | 0 |**1,572**|
121
+ | South Australian Legislation | 557 | 471 | 154 | 0 |**1,182**|
122
+ | Tasmanian Legislation | 858 | 1,676 | 0 | 0 |**2,534**|
123
+ | **Total** |**8,091**|**23,720**|**2,401**|**168,048**|**202,260**|
124
+
125
+ Next, 62 documents that, when stripped of leading and trailing whitespace characters, were empty, were filtered out, leaving behind 202,198 documents. The following cleaning procedures were then applied to those documents:
126
+ 1. Non-breaking spaces were replaced with regular spaces;
127
+ 1. Carriage returns followed by newlines were replaced with newlines;
128
+ 1. Whitespace was removed from lines consisting entirely of whitespace;
129
+ 1. Newlines and whitespace preceding newlines were removed from the end of texts;
130
+ 1. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
131
+ 1. Spaces and tabs were removed from the end of lines.
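The six cleaning steps above could be approximated with the standard library as follows. This is a sketch under assumptions, not the code from the linked repository: `clean` is a hypothetical name, and the regular expressions are one possible reading of each step.

```python
import re


def clean(text: str) -> str:
    """Apply the six cleaning steps in the order listed above."""
    # 1. Replace non-breaking spaces with regular spaces.
    text = text.replace('\u00a0', ' ')
    # 2. Replace carriage returns followed by newlines with newlines.
    text = text.replace('\r\n', '\n')
    # 3. Empty out lines that consist entirely of (non-newline) whitespace.
    text = re.sub(r'^[^\S\n]+$', '', text, flags=re.MULTILINE)
    # 4. Drop trailing newlines, along with any whitespace preceding them.
    text = re.sub(r'(?:[^\S\n]*\n)+\Z', '', text)
    # 5. Drop leading newlines, along with any whitespace following them.
    text = re.sub(r'\A(?:\n[^\S\n]*)+', '', text)
    # 6. Strip spaces and tabs from the end of every line.
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text


raw = '\n\n  Section\u00a01\t \r\n   \r\nText  \n\n'
cleaned = clean(raw)
```

Here `cleaned` is `'Section 1\n\nText'`: the blank line between the two lines of text survives, while the surrounding whitespace is removed.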
132
+
133
+ After cleaning, the Corpus was split into a training set of 182,198 documents (90%) and validation and test sets of 10,000 documents each (5% each). Documents with fewer than 128 characters (23) and those with duplicate XXH3 128-bit hashes (29) were removed from the training split, resulting in a final set of 182,146 documents.
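The length filter and deduplication described above might look like the sketch below. Two assumptions are made for illustration: `filter_training_docs` is a hypothetical name, and a 16-byte BLAKE2b digest from the standard library stands in for the XXH3 128-bit hash, which would need a third-party package. The order of the two filters is also a guess.

```python
import hashlib

MIN_CHARS = 128  # Minimum document length stated in the text.


def filter_training_docs(docs: list[str]) -> list[str]:
    """Drop documents that are too short or that duplicate an earlier document."""
    seen: set[bytes] = set()
    kept: list[str] = []

    for doc in docs:
        if len(doc) < MIN_CHARS:
            continue

        # Stand-in for XXH3 128-bit: any 128-bit content hash illustrates the idea.
        digest = hashlib.blake2b(doc.encode('utf-8'), digest_size=16).digest()

        if digest in seen:
            continue

        seen.add(digest)
        kept.append(doc)

    return kept


docs = ['a' * 200, 'a' * 200, 'too short', 'b' * 128]
kept = filter_training_docs(docs)

# The counts reported above are consistent with each other.
assert 182_198 - 23 - 29 == 182_146
```

In this toy input, the second document is dropped as a duplicate and the third as too short, leaving two documents.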
134
+
135
+ These documents were subsequently used to train a [Roberta](https://huggingface.co/roberta-base)-like tokeniser, after which each dataset was packed into blocks exactly 512 tokens long. Each document was enclosed in beginning- (`<s>`) and end-of-sequence (`</s>`) tokens and would often span multiple blocks. End-of-sequence tokens were dropped wherever they would have been placed at the beginning of a block, as they would be unnecessary there.
136
+
137
+ The final block of the training set would have been dropped if it did not fill the context window, as EmuBert's architecture does not support padding during training, whereas the final blocks of the validation and test sets were padded where necessary.
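The packing scheme described in the two paragraphs above can be sketched as follows. The special token ids match those in this commit's `config.json`; the function name `pack` and its `pad_last` flag are hypothetical, and the real implementation in the linked repository may differ in detail.

```python
BOS, PAD, EOS = 0, 1, 2  # <s>, <pad> and </s> ids from config.json.


def pack(docs: list[list[int]], block_size: int = 512, pad_last: bool = False) -> list[list[int]]:
    """Pack tokenised documents into fixed-size blocks.

    With pad_last=False (training), an incomplete final block is dropped;
    with pad_last=True (validation and test), it is padded instead.
    """
    blocks: list[list[int]] = []
    current: list[int] = []

    for doc in docs:
        for token in [BOS, *doc, EOS]:
            # An </s> that would open a new block is dropped.
            if token == EOS and not current:
                continue

            current.append(token)

            if len(current) == block_size:
                blocks.append(current)
                current = []

    if current and pad_last:
        blocks.append(current + [PAD] * (block_size - len(current)))

    return blocks


# Toy example with a block size of 4 instead of 512.
train_blocks = pack([[10, 11], [20, 21, 22]], block_size=4)
eval_blocks = pack([[10]], block_size=4, pad_last=True)
```

In the toy example, the second document fills its block exactly, so its `</s>` would have started a new block and is dropped.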
138
+
139
+ The final training set comprised 2,885,839 blocks totalling 1,477,549,568 tokens, the validation set comprised 155,563 blocks totalling 79,648,256 tokens, and the test set comprised 155,696 blocks totalling 79,716,352 tokens.
140
+
141
+ Instead of training EmuBert from scratch, [Roberta](https://huggingface.co/roberta-base)'s weights were all reused, except for its token embeddings which were either replaced with the average token embedding or, if a token was shared between Roberta and EmuBert's vocabularies, moved to its new position in EmuBert's vocabulary.
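The token embedding transfer described above can be illustrated on plain lists. This sketch assumes that "the average token embedding" means the mean of all of Roberta's token embeddings and that new token ids run from 0 to n-1; `remap_embeddings` is a hypothetical name.

```python
def remap_embeddings(
    old_vocab: dict[str, int],
    new_vocab: dict[str, int],
    old_embeddings: list[list[float]],
) -> list[list[float]]:
    """Build an embedding matrix for new_vocab from old_embeddings."""
    dim = len(old_embeddings[0])
    mean = [sum(row[i] for row in old_embeddings) / len(old_embeddings) for i in range(dim)]

    # Start every new token at the average embedding...
    new_embeddings = [list(mean) for _ in new_vocab]

    # ...then move the embeddings of shared tokens to their new positions.
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            new_embeddings[new_id] = list(old_embeddings[old_vocab[token]])

    return new_embeddings


old_vocab = {'the': 0, 'court': 1}
old_embeddings = [[1.0, 1.0], [3.0, 5.0]]
new_vocab = {'court': 0, 'appellant': 1, 'the': 2}
new_embeddings = remap_embeddings(old_vocab, new_vocab, old_embeddings)
```

Here `'appellant'` is new to the vocabulary and so receives the average vector `[2.0, 3.0]`, while `'court'` and `'the'` keep their original vectors at new indices.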
142
+
143
+ In order to reduce training time, [Better Transformer](https://huggingface.co/docs/optimum/en/bettertransformer/overview) was used to enable fast path execution and scaled dot-product attention, alongside automatic mixed 16-bit precision and [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/reference/optim/adamw#bitsandbytes.optim.AdamW8bit)' 8-bit implementation of AdamW, all of which have been shown to have little to no detrimental effect on performance.
144
+
145
+ As with Roberta, 15% of tokens were uniformly sampled dynamically for each batch, with 80% being masked, 10% being replaced with random tokens and 10% being left unchanged.
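A minimal sketch of this masking scheme is given below. The `<mask>` id and vocabulary size come from this commit's `tokenizer_config.json` and `config.json`. Skipping special tokens and using `-100` as the ignored label are assumptions borrowed from common PyTorch practice, and `mask_tokens` is a hypothetical name.

```python
import random

MASK_ID = 4                  # <mask> id from tokenizer_config.json.
VOCAB_SIZE = 50265           # vocab_size from config.json.
SPECIAL_IDS = {0, 1, 2, 3, 4}
IGNORE = -100                # Assumed label for positions excluded from the loss.


def mask_tokens(tokens: list[int], rng: random.Random, rate: float = 0.15) -> tuple[list[int], list[int]]:
    """Return (inputs, labels) after sampling roughly `rate` of the tokens for prediction."""
    inputs = list(tokens)
    labels = [IGNORE] * len(tokens)

    for i, token in enumerate(tokens):
        if token in SPECIAL_IDS or rng.random() >= rate:
            continue

        labels[i] = token
        roll = rng.random()

        if roll < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with <mask>.
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token.
        # Remaining 10%: leave the token unchanged.

    return inputs, labels


rng = random.Random(0)
tokens = [0] + list(range(5, 10005)) + [2]
inputs, labels = mask_tokens(tokens, rng)
```

Calling `mask_tokens` afresh for every batch is what makes the masking dynamic: the same block receives a different mask each time it is seen.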
146
+
147
+ The hyperparameters used to train EmuBert are as follows:
148
+
149
+ | Hyperparameter | EmuBert | Roberta |
150
+ | ----------------- | ----------- | ------- |
151
+ | Optimiser | AdamW 8-bit | Adam |
152
+ | Scheduler | Cosine | Linear |
153
+ | Precision | 16-bit | 16-bit |
154
+ | Batch size | 8 | 8,000 |
155
+ | Steps | 1,000,000 | 500,000 |
156
+ | Warmup steps | 48,000 | 24,000 |
157
+ | Learning rate | 1e-5 | 6e-4 |
158
+ | Weight decay | 0.01 | 0.01 |
159
+ | Adam epsilon | 1e-6 | 1e-6 |
160
+ | Adam beta1 | 0.9 | 0.9 |
161
+ | Adam beta2 | 0.98 | 0.98 |
162
+ | Gradient clipping | 1 | 0 |
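To give a feel for the schedule implied by the EmuBert column, the sketch below computes the learning rate at a given step. The table only says "Cosine", so the linear warmup and the half-cosine decay to zero are assumptions, and `learning_rate` is a hypothetical helper.

```python
import math

PEAK_LR = 1e-5
WARMUP_STEPS = 48_000
TOTAL_STEPS = 1_000_000


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS

    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


halfway_through_warmup = learning_rate(24_000)
at_peak = learning_rate(48_000)
at_end = learning_rate(1_000_000)
```

Under these assumptions the rate climbs to 1e-5 at step 48,000 and falls back to zero by step 1,000,000. If no gradient accumulation was used, a batch size of 8 over 1,000,000 steps would correspond to about 2.77 passes over the 2,885,839 training blocks.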
163
+
164
+ Upon completion, the model achieved a training loss of 1.229, a validation loss of 1.147 and a test loss of 1.126.
165
+
166
+ The code used to create EmuBert may be found [here](https://github.com/umarbutler/emubert-creator).
167
+
168
+ ## Benchmarks 📊
169
+ EmuBert achieves a [(pseudo-)perplexity](https://doi.org/10.18653/v1/2020.acl-main.240) of 2.05 against [version 2.0.0](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa/tree/b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae) of the [Open Australian Legal QA](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa) dataset, outperforming every other masked language model it was benchmarked against, as shown below:
170
+
171
+ | Model | Perplexity |
172
+ | -------------- | ---------- |
173
+ | **EmuBert** | **2.05** |
174
+ | Roberta | 2.38 |
175
+ | Albert v2 | 3.49 |
176
+ | Bert (cased) | 2.18 |
177
+ | Bert (uncased) | 2.41 |
178
+
179
+ ## Limitations 🚧
180
+ Although informal testing has not revealed any racial, sexual, gender or other social biases, given that Roberta's weights were reused, it is possible that there may be some biases that have been transferred over to EmuBert. It is also possible that there are social biases present in the Corpus that may have been introduced via training. More rigorous testing is necessary to determine the true extent of any biases present in EmuBert.
181
+
182
+ One might also reasonably expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).
183
+
184
+ Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law as licensing restrictions had prevented their inclusion in the training data. With that said, such knowledge should not be necessary to produce high-quality embeddings on general Australian legal texts. Furthermore, such knowledge should be easily teachable through finetuning.
185
+
186
+ ## Licence 📜
187
+ To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).
188
+
189
+ ## Citation 🔖
190
+ If you've relied on the model for your work, please cite:
191
+ ```bibtex
192
+ @misc{butler-2024-emubert,
193
+ author = {Butler, Umar},
194
+ year = {2024},
195
+ title = {EmuBert},
196
+ publisher = {Hugging Face},
197
+ version = {1.0.0},
198
+ url = {https://huggingface.co/umarbutler/emubert}
199
+ }
200
+ ```
201
+
202
+ ## Acknowledgements 🙏
203
+ In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.
204
+
205
+ The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.
206
+
207
+ The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of Roberta, which the model was built atop.
208
+
209
+ Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.
config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "architectures": [
3
+ "RobertaForMaskedLM"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "bos_token_id": 0,
7
+ "classifier_dropout": null,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-05,
15
+ "max_position_embeddings": 514,
16
+ "model_type": "roberta",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 1,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.36.2",
23
+ "type_vocab_size": 1,
24
+ "use_cache": true,
25
+ "vocab_size": 50265
26
+ }
logo.png ADDED

Git LFS Details

  • SHA256: df1df525e97298d1342318098919ef74b17764321e05ca85c0ce92d7268f52f8
  • Pointer size: 132 Bytes
  • Size of remote file: 1.5 MB
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f23a8465ee6a7b52a02712eb27938c583efab89cf16c267cd2c0eb2f47e21f7a
3
+ size 498813948
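As a rough sanity check, the size of `model.safetensors` can be compared with the parameter count implied by `config.json`. The sketch below assumes the standard `RobertaForMaskedLM` layout (no pooler, decoder weights tied to the token embeddings) and 4 bytes per float32 parameter. The layout is an assumption rather than something read from the file itself.

```python
HIDDEN, LAYERS, INTERMEDIATE = 768, 12, 3072  # From config.json.
VOCAB, POSITIONS, TYPES = 50265, 514, 1       # From config.json.
FILE_SIZE = 498_813_948                       # From the LFS pointer above.


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # Scale and shift vectors.

embeddings = (VOCAB + POSITIONS + TYPES) * HIDDEN + layer_norm
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm  # Q, K, V and output projections.
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm
lm_head = linear(HIDDEN, HIDDEN) + layer_norm + VOCAB  # Dense, norm and decoder bias.

total_params = embeddings + LAYERS * (attention + feed_forward) + lm_head
weight_bytes = 4 * total_params
overhead = FILE_SIZE - weight_bytes
```

Under these assumptions `total_params` comes to 124,697,433, or 498,789,732 bytes of weights, which leaves 24,216 bytes unaccounted for. That remainder would be consistent with file metadata, though this is not verified here.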
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "bos_token": "<s>",
3
+ "cls_token": "<s>",
4
+ "eos_token": "</s>",
5
+ "mask_token": {
6
+ "content": "<mask>",
7
+ "lstrip": true,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "pad_token": "<pad>",
13
+ "sep_token": "</s>",
14
+ "unk_token": "<unk>"
15
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<s>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<pad>",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "</s>",
22
+ "lstrip": false,
23
+ "normalized": true,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<unk>",
30
+ "lstrip": false,
31
+ "normalized": true,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "4": {
37
+ "content": "<mask>",
38
+ "lstrip": true,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ }
44
+ },
45
+ "bos_token": "<s>",
46
+ "clean_up_tokenization_spaces": true,
47
+ "cls_token": "<s>",
48
+ "eos_token": "</s>",
49
+ "errors": "replace",
50
+ "mask_token": "<mask>",
51
+ "model_max_length": 512,
52
+ "pad_token": "<pad>",
53
+ "sep_token": "</s>",
54
+ "tokenizer_class": "RobertaTokenizer",
55
+ "trim_offsets": true,
56
+ "unk_token": "<unk>"
57
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff