umarbutler
committed on
Added a progress bar to the code example.
README.md CHANGED
@@ -66,9 +66,11 @@ Those interested in finetuning EmuBert can check out Hugging Face's documentatio
 
 It is also possible to generate embeddings from the model which can be directly used for tasks like semantic similarity and clustering or for the training of downstream models. This can be done either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
 ```python
+import math
 import torch
 import itertools
 
+from tqdm import tqdm
 from typing import Iterable, Generator
 from contextlib import nullcontext
 from transformers import AutoModel, AutoTokenizer
@@ -108,7 +110,7 @@ with torch.inference_mode(), \
 ):
     embeddings = []
 
-    for batch in batch_generator(texts, BATCH_SIZE):
+    for batch in tqdm(batch_generator(texts, BATCH_SIZE), total = math.ceil(len(texts) / BATCH_SIZE)):
         inputs = tokeniser(batch, return_tensors='pt', padding=True, truncation=True).to(device)
         token_embeddings = model(**inputs).last_hidden_state
 
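For context on what the added progress bar does, here is a minimal, self-contained sketch of the pattern this commit introduces: wrapping the batch generator in `tqdm` with an explicit `total`, since `tqdm` cannot infer the length of a generator. The `batch_generator` body and the `BATCH_SIZE`/`texts` values below are illustrative assumptions only; the diff does not show them, and the real definitions live elsewhere in the README snippet.

```python
import math
import itertools
from typing import Iterable, Generator

from tqdm import tqdm

BATCH_SIZE = 32  # assumed value for illustration; the README defines its own constant


def batch_generator(iterable: Iterable, batch_size: int) -> Generator[list, None, None]:
    """Assumed implementation: yield successive lists of up to `batch_size` items."""
    iterator = iter(iterable)

    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch


texts = ['This is a test sentence.'] * 100  # placeholder inputs

# `total` must be passed explicitly because tqdm cannot take len() of a generator;
# math.ceil(len(texts) / BATCH_SIZE) is the number of batches the loop will produce.
for batch in tqdm(batch_generator(texts, BATCH_SIZE), total=math.ceil(len(texts) / BATCH_SIZE)):
    pass  # tokenise and embed the batch here, as in the full README snippet
```

Omitting `total` would still run, but the bar would show only a running iteration count rather than a percentage and ETA.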