umarbutler committed
Commit d9fff54 · verified · 1 Parent(s): fa90c00

Added a progress bar to the code example.

Files changed (1):
  1. README.md +3 -1
README.md CHANGED
@@ -66,9 +66,11 @@ Those interested in finetuning EmuBert can check out Hugging Face's documentatio
 
 It is also possible to generate embeddings from the model which can be directly used for tasks like semantic similarity and clustering or for the training of downstream models. This can be done either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
 ```python
+import math
 import torch
 import itertools
 
+from tqdm import tqdm
 from typing import Iterable, Generator
 from contextlib import nullcontext
 from transformers import AutoModel, AutoTokenizer
@@ -108,7 +110,7 @@ with torch.inference_mode(), \
 ):
     embeddings = []
 
-    for batch in batch_generator(texts, BATCH_SIZE):
+    for batch in tqdm(batch_generator(texts, BATCH_SIZE), total = math.ceil(len(texts) / BATCH_SIZE)):
         inputs = tokeniser(batch, return_tensors='pt', padding=True, truncation=True).to(device)
         token_embeddings = model(**inputs).last_hidden_state
 
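
For context, the change wraps the snippet's batch loop in `tqdm` so that embedding a large corpus reports progress. Below is a minimal, self-contained sketch of the pattern; `batch_generator`, `texts`, and `BATCH_SIZE` mirror names from the README's snippet, but this implementation of `batch_generator`, the constant's value, and the placeholder corpus are assumptions for illustration, with the tokenisation and model forward pass stubbed out:

```python
import math
import itertools

from typing import Iterable, Generator
from tqdm import tqdm

BATCH_SIZE = 32  # Assumed value; the README's snippet defines its own constant.

def batch_generator(items: Iterable, batch_size: int) -> Generator[list, None, None]:
    """Yield successive lists of at most `batch_size` items (assumed helper,
    consistent with how the README's snippet calls it)."""
    iterator = iter(items)

    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch

texts = [f'Example document {i}.' for i in range(1000)]  # Placeholder corpus.

# A generator has no len(), so tqdm cannot infer the bar's length on its own;
# the commit therefore passes the number of batches explicitly via `total`.
for batch in tqdm(batch_generator(texts, BATCH_SIZE), total = math.ceil(len(texts) / BATCH_SIZE)):
    ...  # Tokenise the batch and run the model forward pass here.
```

Passing `total` is the key design choice: `math.ceil(len(texts) / BATCH_SIZE)` counts a partial final batch, so the bar reaches exactly 100% even when `len(texts)` is not a multiple of `BATCH_SIZE`. (The Sentence Transformers route the README mentions can report its own progress via `encode`'s `show_progress_bar` argument.)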