Teja-Gollapudi
committed on
Commit • 29ab34a
1 Parent(s): 249ba6c
Update README.md
README.md CHANGED
@@ -24,7 +24,8 @@ license: "apache-2.0"
 #### Motivation
 Traditional BERT models struggle with VMware-specific words (Tanzu, vSphere, etc.), technical terms, and compound words. (<a href="https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99">Weaknesses of WordPiece Tokenization</a>)
 
-We have pretrained our vBERT model to address the aforementioned issues using our
+We have pretrained our vBERT model to address the aforementioned issues using our <a href="https://medium.com/vmware-data-ml-blog/pretraining-a-custom-bert-model-6e37df97dfc4">BERT Pretraining Library</a>.
+<br> We have replaced the first 1k unused tokens of BERT's vocabulary with VMware-specific terms to create a modified vocabulary. We then pretrained the 'bert-base-uncased' model for an additional 78K steps (71k with MSL_128 and 7k with MSL_512), approximately 5 epochs, on VMware domain data.
 
 #### Intended Use
 The model functions as a VMware-specific Language Model.
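The second added line describes swapping domain terms into BERT's reserved `[unused*]` vocabulary slots. A minimal sketch of how such a replacement could work on a `vocab.txt`-style token list is below; this is an illustration of the general technique, not the authors' actual pipeline, and the domain terms shown are hypothetical examples:

```python
# Sketch: BERT's vocab.txt reserves ~1k placeholder slots named
# [unused0], [unused1], ... Overwriting those slots with domain terms
# keeps every other token's id stable, so the pretrained embeddings
# for the rest of the vocabulary still line up.
def replace_unused_tokens(vocab, domain_terms):
    """Return a copy of `vocab` (a list of token strings, where the list
    index is the token id) with the first len(domain_terms) '[unusedN]'
    slots overwritten by the given domain terms."""
    out = list(vocab)
    terms = iter(domain_terms)
    for i, tok in enumerate(out):
        if tok.startswith("[unused"):
            term = next(terms, None)
            if term is None:
                break  # no more domain terms to place
            out[i] = term
    return out

# Toy vocabulary standing in for bert-base-uncased's vocab.txt lines.
toy_vocab = ["[PAD]", "[unused0]", "[unused1]", "[CLS]", "[SEP]", "the"]
print(replace_unused_tokens(toy_vocab, ["tanzu", "vsphere"]))
# ['[PAD]', 'tanzu', 'vsphere', '[CLS]', '[SEP]', 'the']
```

In practice the rewritten token list would be written back to `vocab.txt` before loading the tokenizer, so the new terms tokenize as single tokens instead of WordPiece fragments.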