umarbutler committed
Fixing typos.

README.md CHANGED
@@ -67,9 +67,9 @@ Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions
 To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).
 
 ## Usage 👩‍💻
-Those interested in finetuning EmuBert can check out Hugging Face's documentation for [Roberta](https://huggingface.co/roberta-base)-like models [here](https://huggingface.co/docs/transformers/en/model_doc/roberta) which very helpfully provides tutorials, scripts and other resources for the most common natural language processing tasks.
+Those interested in finetuning EmuBert can check out Hugging Face's documentation for [Roberta](https://huggingface.co/roberta-base)-like models [here](https://huggingface.co/docs/transformers/en/model_doc/roberta), which very helpfully provides tutorials, scripts and other resources for the most common natural language processing tasks.
 
-It is also possible to generate embeddings directly from the model which can be used for tasks like semantic similarity and clustering, although they are unlikely to
+It is also possible to generate embeddings directly from the model which can be used for tasks like semantic similarity and clustering, although they are unlikely to perform as well as those generated by specially trained sentence embedding models **unless** EmuBert has been finetuned. Embeddings may be generated either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
 ```python
 import math
 import torch
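As a quick orientation for the Usage section this hunk touches, the model can also be exercised as a plain masked language model before any finetuning. The following sketch is not part of the README; the example sentence is invented, and `<mask>` is the Roberta-style mask token the model inherits:

```python
# A minimal sketch (not from the README): querying EmuBert as a masked
# language model via the `fill-mask` pipeline. The example sentence is
# illustrative only.
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='umarbutler/emubert')

# Roberta-like models use `<mask>` as their mask token.
for prediction in fill_mask('The defendant was found <mask> of the offence charged.'):
    print(prediction['token_str'], round(prediction['score'], 4))
```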
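The faster embedding snippet referenced by the new line 72 is cut off above after its `import math` and `import torch` lines. As a rough stand-in, embeddings can be generated along the following lines; the masked mean pooling and the example texts are assumptions here, and the README's full snippet may differ:

```python
# A hedged sketch of generating embeddings from EmuBert with vanilla
# `transformers`. The pooling strategy (masked mean over the last hidden
# state) is an assumption, not necessarily what the README's snippet does.
import torch
from transformers import AutoModel, AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('umarbutler/emubert')
model = AutoModel.from_pretrained('umarbutler/emubert')
model.eval()

texts = [  # Illustrative inputs only.
    'The Parliament shall, subject to this Constitution, have power to make laws.',
    'The court allowed the appeal and set aside the orders below.',
]

with torch.inference_mode():
    inputs = tokeniser(texts, padding=True, truncation=True, return_tensors='pt')
    hidden = model(**inputs).last_hidden_state  # (batch, tokens, dim)
    mask = inputs['attention_mask'].unsqueeze(-1)  # Ignore padding tokens.
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Example downstream use: cosine similarity for semantic search.
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())
```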
@@ -180,7 +180,7 @@ The hyperparameters used to train EmuBert are as follows:
 | Adam beta2 | 0.98 | 0.98 |
 | Gradient clipping | 1 | 0 |
 
-Upon completion, the model achieved a training loss of 1.229, validation loss of 1.147 and a test loss of 1.126.
+Upon completion, the model achieved a training loss of 1.229, a validation loss of 1.147 and a test loss of 1.126.
 
 The code used to create EmuBert may be found [here](https://github.com/umarbutler/emubert-creator).
 
@@ -198,7 +198,7 @@ EmuBert achieves a [(pseudo-)perplexity](https://doi.org/10.18653/v1/2020.acl-ma
 | Legalbert (pile-of-law) | 4.41 |
 
 ## Limitations 🚧
-It is worth noting that EmuBert may lack
+It is worth noting that EmuBert may lack sufficiently detailed knowledge of Victorian, Northern Territory and Australian Capital Territory law as licensing restrictions had prevented their inclusion in the training data. With that said, such knowledge should not be necessary to produce high-quality embeddings on general Australian legal texts, regardless of jurisdiction. Furthermore, finer jurisdictional knowledge should also be easily teachable through finetuning.
 
 One might also reasonably expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).
 
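For readers unfamiliar with the (pseudo-)perplexity figure this hunk's header refers to, the metric masks each token in turn, scores the true token, and exponentiates the average negative log-likelihood. A minimal sketch follows, assuming the formulation of the paper linked above; the evaluation code actually used for the table may differ in details such as batching and truncation:

```python
# A hedged sketch of (pseudo-)perplexity for a masked language model:
# mask each token one at a time, average the negative log-likelihood of
# the true tokens, then exponentiate.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained('umarbutler/emubert')
model = AutoModelForMaskedLM.from_pretrained('umarbutler/emubert')
model.eval()

def pseudo_perplexity(text: str) -> float:
    ids = tokeniser(text, return_tensors='pt')['input_ids'][0]
    total_nll = 0.0
    with torch.inference_mode():
        for i in range(1, len(ids) - 1):  # Skip the <s> and </s> specials.
            masked = ids.clone()
            masked[i] = tokeniser.mask_token_id
            log_probs = torch.log_softmax(model(masked.unsqueeze(0)).logits[0, i], dim=-1)
            total_nll -= log_probs[ids[i]].item()
    return math.exp(total_nll / (len(ids) - 2))

print(pseudo_perplexity('The defendant was convicted under section 18 of the Act.'))  # Illustrative input.
```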
@@ -233,8 +233,7 @@ If you've relied on the model for your work, please cite:
 ```
 
 ## Acknowledgements 🙏
-In the spirit of reconciliation, the author acknowledges the
-Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.
+In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.
 
 The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.
 