umarbutler committed
Commit 34481af · verified · 1 Parent(s): 47c02e2

Fixing typos.

Files changed (1):
  1. README.md +5 -6
README.md CHANGED
@@ -67,9 +67,9 @@ Trained on 180,000 laws, regulations and decisions across six Australian jurisdi
  To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).
 
  ## Usage 👩‍💻
- Those interested in finetuning EmuBert can check out Hugging Face's documentation for [Roberta](https://huggingface.co/roberta-base)-like models [here](https://huggingface.co/docs/transformers/en/model_doc/roberta) which very helpfully provides tutorials, scripts and other resources for the most common natural language processing tasks.
+ Those interested in finetuning EmuBert can check out Hugging Face's documentation for [Roberta](https://huggingface.co/roberta-base)-like models [here](https://huggingface.co/docs/transformers/en/model_doc/roberta), which very helpfully provides tutorials, scripts and other resources for the most common natural language processing tasks.
 
- It is also possible to generate embeddings directly from the model which can be used for tasks like semantic similarity and clustering, although they are unlikely to peform as well as those generated by specially trained sentence embedding models **unless** EmuBert has been finetuned. Embeddings may be generated either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
+ It is also possible to generate embeddings directly from the model which can be used for tasks like semantic similarity and clustering, although they are unlikely to perform as well as those generated by specially trained sentence embedding models **unless** EmuBert has been finetuned. Embeddings may be generated either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
  ```python
  import math
  import torch
@@ -180,7 +180,7 @@ The hyperparameters used to train EmuBert are as follows:
  | Adam beta2 | 0.98 | 0.98 |
  | Gradient clipping | 1 | 0 |
 
- Upon completion, the model achieved a training loss of 1.229, validation loss of 1.147 and a test loss of 1.126.
+ Upon completion, the model achieved a training loss of 1.229, a validation loss of 1.147 and a test loss of 1.126.
 
  The code used to create EmuBert may be found [here](https://github.com/umarbutler/emubert-creator).
 
@@ -198,7 +198,7 @@ EmuBert achieves a [(pseudo-)perplexity](https://doi.org/10.18653/v1/2020.acl-ma
  | Legalbert (pile-of-law) | 4.41 |
 
  ## Limitations 🚧
- It is worth noting that EmuBert may lack sufficently detailed knowledge of Victorian, Northern Territory and Australian Capital Territory law as licensing restrictions had prevented their inclusion in the training data. With that said, such knowledge should not be necessary to produce high-quality embeddings on general Australian legal texts, regardless of jurisdiction. Furthermore, finer jurisdictional knowledge should also be easily teachable through finetuning.
+ It is worth noting that EmuBert may lack sufficiently detailed knowledge of Victorian, Northern Territory and Australian Capital Territory law as licensing restrictions had prevented their inclusion in the training data. With that said, such knowledge should not be necessary to produce high-quality embeddings on general Australian legal texts, regardless of jurisdiction. Furthermore, finer jurisdictional knowledge should also be easily teachable through finetuning.
 
  One might also reasonably expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).
 
@@ -233,8 +233,7 @@ If you've relied on the model for your work, please cite:
  ```
 
  ## Acknowledgements 🙏
- In the spirit of reconciliation, the author acknowledges the
- Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.
+ In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.
 
  The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.
 
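For context on the usage paragraph edited above, the following is a minimal sketch of the Sentence Transformers route it mentions, assuming the `umarbutler/emubert` checkpoint loads as described in the README; the example sentences and the cosine-similarity comparison are illustrative additions, not part of the model card.

```python
# A sketch of generating EmuBert embeddings via Sentence Transformers,
# as mentioned in the README's usage section. The sentences below are
# made-up examples for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('umarbutler/emubert')

sentences = [
    'The applicant seeks judicial review of the decision.',
    'The respondent opposes the application for review.',
]

# Encode both sentences into dense vectors.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compare them with cosine similarity, e.g. for semantic similarity tasks.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f'Cosine similarity: {similarity.item():.3f}')
```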
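The benchmark hunk above cites the (pseudo-)perplexity metric of Salazar et al. (2020). As a rough illustration of how that metric is computed for a masked language model, here is a sketch: mask each token in turn, score the true token, and exponentiate the average negative log-likelihood. The checkpoint name and sample sentence are assumptions, and a practical scorer would batch the masked copies rather than loop over them.

```python
# A sketch of (pseudo-)perplexity for a masked language model.
# The model name and example text are illustrative assumptions.
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = 'umarbutler/emubert'  # assumed; any RoBERTa-like checkpoint works
tokeniser = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

text = 'The defendant was convicted under section 18 of the Crimes Act.'
input_ids = tokeniser(text, return_tensors='pt')['input_ids'][0]

total_nll = 0.0
n_scored = 0

with torch.inference_mode():
    # Score every position except the special tokens at the start and end.
    for i in range(1, len(input_ids) - 1):
        masked = input_ids.clone()
        masked[i] = tokeniser.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total_nll -= log_probs[input_ids[i]].item()
        n_scored += 1

# Pseudo-perplexity is the exponentiated mean negative log-likelihood.
pseudo_perplexity = math.exp(total_nll / n_scored)
print(f'Pseudo-perplexity: {pseudo_perplexity:.2f}')
```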