umarbutler committed
Commit d9fff54 · verified · 1 Parent(s): fa90c00

Added a progress bar to the code example.

Files changed (1):
  1. README.md +3 -1
README.md CHANGED
@@ -66,9 +66,11 @@ Those interested in finetuning EmuBert can check out Hugging Face's documentatio
 
 It is also possible to generate embeddings from the model which can be directly used for tasks like semantic similarity and clustering or for the training of downstream models. This can be done either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
 ```python
+import math
 import torch
 import itertools
 
+from tqdm import tqdm
 from typing import Iterable, Generator
 from contextlib import nullcontext
 from transformers import AutoModel, AutoTokenizer
@@ -108,7 +110,7 @@ with torch.inference_mode(), \
 ):
     embeddings = []
 
-    for batch in batch_generator(texts, BATCH_SIZE):
+    for batch in tqdm(batch_generator(texts, BATCH_SIZE), total = math.ceil(len(texts) / BATCH_SIZE)):
         inputs = tokeniser(batch, return_tensors='pt', padding=True, truncation=True).to(device)
         token_embeddings = model(**inputs).last_hidden_state
 
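
For context, the change wraps the snippet's batch loop in `tqdm` so that embedding a large corpus reports progress. Below is a minimal, self-contained sketch of the pattern; `batch_generator`, `texts`, and `BATCH_SIZE` mirror names from the README's snippet, but this implementation of `batch_generator`, the constant's value, and the placeholder corpus are assumptions for illustration, with the tokenisation and model forward pass stubbed out:

```python
import math
import itertools

from typing import Iterable, Generator
from tqdm import tqdm

BATCH_SIZE = 32  # Assumed value; the README's snippet defines its own constant.

def batch_generator(items: Iterable, batch_size: int) -> Generator[list, None, None]:
    """Yield successive lists of at most `batch_size` items (assumed helper,
    consistent with how the README's snippet calls it)."""
    iterator = iter(items)

    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch

texts = [f'Example document {i}.' for i in range(1000)]  # Placeholder corpus.

# A generator has no len(), so tqdm cannot infer the bar's length on its own;
# the commit therefore passes the number of batches explicitly via `total`.
for batch in tqdm(batch_generator(texts, BATCH_SIZE), total = math.ceil(len(texts) / BATCH_SIZE)):
    ...  # Tokenise the batch and run the model forward pass here.
```

Passing `total` is the key design choice: `math.ceil(len(texts) / BATCH_SIZE)` counts a partial final batch, so the bar reaches exactly 100% even when `len(texts)` is not a multiple of `BATCH_SIZE`. (The Sentence Transformers route the README mentions can report its own progress via `encode`'s `show_progress_bar` argument.)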