Jack Morris committed on
Commit
9792ab1
·
1 Parent(s): 6ab272b

reorg readme

  1. README.md +105 -104
README.md CHANGED
@@ -8662,6 +8662,111 @@ Our new model that naturally integrates "context tokens" into the embedding proc
8662
 
8663
  Our embedding model needs to be used in *two stages*. The first stage is to gather some dataset information by embedding a subset of the corpus using our "first-stage" model. The second stage is to actually embed queries and documents, conditioning on the corpus information from the first stage. Note that we can do the first stage part offline and only use the second-stage weights at inference time.
8664
 
8665
  ## With Sentence Transformers
8666
 
8667
  <details open="">
@@ -8832,110 +8937,6 @@ Top Document: Foster's Home for Imaginary Friends McCracken conceived the series
8832
 
8833
  </details>
8834
 
8835
- </details>
8836
-
8837
- ## With Transformers
8838
-
8839
- <details>
8840
- <summary>Click to learn how to use cde-small-v1 with Transformers</summary>
8841
-
8842
- ### Loading the model
8843
-
8844
- Our model can be loaded using `transformers` out-of-the-box with "trust remote code" enabled. We use the default BERT uncased tokenizer:
8845
- ```python
8846
- import transformers
8847
-
8848
- model = transformers.AutoModel.from_pretrained("jxm/cde-small-v1", trust_remote_code=True)
8849
- tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
8850
- ```
8851
-
8852
- #### Note on prefixes
8853
-
8854
- *Nota bene*: Like all state-of-the-art embedding models, our model was trained with task-specific prefixes. To do retrieval, you can prepend the following strings to queries & documents:
8855
-
8856
- ```python
8857
- query_prefix = "search_query: "
8858
- document_prefix = "search_document: "
8859
- ```
8860
-
8861
- ### First stage
8862
-
8863
- ```python
8864
- minicorpus_size = model.config.transductive_corpus_size
8865
- minicorpus_docs = [ ... ] # Put some strings here that are representative of your corpus, for example by calling random.sample(corpus, k=minicorpus_size)
8866
- assert len(minicorpus_docs) == minicorpus_size # You must use exactly this many documents in the minicorpus. You can oversample if your corpus is smaller.
8867
- minicorpus_docs = tokenizer(
8868
-     [document_prefix + doc for doc in minicorpus_docs],
8869
-     truncation=True,
8870
-     padding=True,
8871
-     max_length=512,
8872
-     return_tensors="pt"
8873
- ).to(model.device)
8874
- import torch
8875
- from tqdm.autonotebook import tqdm
8876
-
8877
- batch_size = 32
8878
-
8879
- dataset_embeddings = []
8880
- for i in tqdm(range(0, len(minicorpus_docs["input_ids"]), batch_size)):
8881
-     minicorpus_docs_batch = {k: v[i:i+batch_size] for k,v in minicorpus_docs.items()}
8882
-     with torch.no_grad():
8883
-         dataset_embeddings.append(
8884
-             model.first_stage_model(**minicorpus_docs_batch)
8885
-         )
8886
-
8887
- dataset_embeddings = torch.cat(dataset_embeddings)
8888
- ```
8889
-
8890
- ### Running the second stage
8891
-
8892
- Now that we have obtained "dataset embeddings" we can embed documents and queries like normal. Remember to use the document prefix for documents:
8893
- ```python
8894
- docs = tokenizer(
8895
-     [document_prefix + doc for doc in docs],
8896
-     truncation=True,
8897
-     padding=True,
8898
-     max_length=512,
8899
-     return_tensors="pt"
8900
- ).to(model.device)
8901
-
8902
- with torch.no_grad():
8903
-     doc_embeddings = model.second_stage_model(
8904
-         input_ids=docs["input_ids"],
8905
-         attention_mask=docs["attention_mask"],
8906
-         dataset_embeddings=dataset_embeddings,
8907
-     )
8908
- doc_embeddings /= doc_embeddings.norm(p=2, dim=1, keepdim=True)
8909
- ```
8910
-
8911
- and the query prefix for queries:
8912
- ```python
8913
- queries = queries.select(range(16))["text"]
8914
- queries = tokenizer(
8915
-     [query_prefix + query for query in queries],
8916
-     truncation=True,
8917
-     padding=True,
8918
-     max_length=512,
8919
-     return_tensors="pt"
8920
- ).to(model.device)
8921
-
8922
- with torch.no_grad():
8923
-     query_embeddings = model.second_stage_model(
8924
-         input_ids=queries["input_ids"],
8925
-         attention_mask=queries["attention_mask"],
8926
-         dataset_embeddings=dataset_embeddings,
8927
-     )
8928
- query_embeddings /= query_embeddings.norm(p=2, dim=1, keepdim=True)
8929
- ```
8930
-
8931
- These embeddings can be compared using a dot product, since they're normalized.
8932
-
8933
- </details>
8934
-
8935
- ### What if I don't know what my corpus will be ahead of time?
8936
-
8937
- If you can't obtain corpus information ahead of time, you still have to pass *something* as the dataset embeddings. Our model will still work in this case, but not quite as well: without corpus information, its performance drops from 65.0 to 63.8 on MTEB. We provide [some random strings](https://huggingface.co/jxm/cde-small-v1/resolve/main/random_strings.txt) that worked well for us and can be used as a substitute for corpus sampling.
8938
-
8939
  ### Colab demo
8940
 
8941
  We've set up a short demo in a Colab notebook showing how you might use our model:
 
8662
 
8663
  Our embedding model needs to be used in *two stages*. The first stage is to gather some dataset information by embedding a subset of the corpus using our "first-stage" model. The second stage is to actually embed queries and documents, conditioning on the corpus information from the first stage. Note that we can do the first stage part offline and only use the second-stage weights at inference time.
8664
 
8665
+ </details>
8666
+
8667
+ ## With Transformers
8668
+
8669
+ <details>
8670
+ <summary>Click to learn how to use cde-small-v1 with Transformers</summary>
8671
+
8672
+ ### Loading the model
8673
+
8674
+ Our model can be loaded using `transformers` out-of-the-box with "trust remote code" enabled. We use the default BERT uncased tokenizer:
8675
+ ```python
8676
+ import transformers
8677
+
8678
+ model = transformers.AutoModel.from_pretrained("jxm/cde-small-v1", trust_remote_code=True)
8679
+ tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
8680
+ ```
8681
+
8682
+ #### Note on prefixes
8683
+
8684
+ *Nota bene*: Like all state-of-the-art embedding models, our model was trained with task-specific prefixes. To do retrieval, you can prepend the following strings to queries & documents:
8685
+
8686
+ ```python
8687
+ query_prefix = "search_query: "
8688
+ document_prefix = "search_document: "
8689
+ ```
8690
+
8691
+ ### First stage
8692
+
8693
+ ```python
8694
+ minicorpus_size = model.config.transductive_corpus_size
8695
+ minicorpus_docs = [ ... ] # Put some strings here that are representative of your corpus, for example by calling random.sample(corpus, k=minicorpus_size)
8696
+ assert len(minicorpus_docs) == minicorpus_size # You must use exactly this many documents in the minicorpus. You can oversample if your corpus is smaller.
8697
+ minicorpus_docs = tokenizer(
8698
+     [document_prefix + doc for doc in minicorpus_docs],
8699
+     truncation=True,
8700
+     padding=True,
8701
+     max_length=512,
8702
+     return_tensors="pt"
8703
+ ).to(model.device)
8704
+ import torch
8705
+ from tqdm.autonotebook import tqdm
8706
+
8707
+ batch_size = 32
8708
+
8709
+ dataset_embeddings = []
8710
+ for i in tqdm(range(0, len(minicorpus_docs["input_ids"]), batch_size)):
8711
+     minicorpus_docs_batch = {k: v[i:i+batch_size] for k,v in minicorpus_docs.items()}
8712
+     with torch.no_grad():
8713
+         dataset_embeddings.append(
8714
+             model.first_stage_model(**minicorpus_docs_batch)
8715
+         )
8716
+
8717
+ dataset_embeddings = torch.cat(dataset_embeddings)
8718
+ ```
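The comment in the first-stage snippet says the minicorpus must contain exactly `transductive_corpus_size` documents and that you can oversample a smaller corpus. One way that selection step might look is sketched below. The helper name `build_minicorpus`, the toy corpus, and the toy size of 8 are illustrative assumptions, not part of the model's API; in practice the size would come from `model.config.transductive_corpus_size`.

```python
import random


def build_minicorpus(corpus, minicorpus_size, seed=0):
    """Return exactly `minicorpus_size` documents drawn from `corpus`.

    Hypothetical helper: samples without replacement when the corpus is
    large enough, and oversamples (with replacement) when it is smaller.
    """
    if not corpus:
        raise ValueError("corpus must contain at least one document")
    rng = random.Random(seed)
    if len(corpus) >= minicorpus_size:
        # Enough documents: draw distinct ones.
        return rng.sample(corpus, k=minicorpus_size)
    # Too few documents: repeat some so the length still matches.
    return rng.choices(corpus, k=minicorpus_size)


# Toy values for illustration only.
small_corpus = ["doc a", "doc b", "doc c"]
minicorpus_docs = build_minicorpus(small_corpus, minicorpus_size=8)
assert len(minicorpus_docs) == 8
```

The resulting list can then be tokenized with the document prefix as in the snippet above.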
8719
+
8720
+ ### Running the second stage
8721
+
8722
+ Now that we have obtained "dataset embeddings" we can embed documents and queries like normal. Remember to use the document prefix for documents:
8723
+ ```python
8724
+ docs = tokenizer(
8725
+     [document_prefix + doc for doc in docs],
8726
+     truncation=True,
8727
+     padding=True,
8728
+     max_length=512,
8729
+     return_tensors="pt"
8730
+ ).to(model.device)
8731
+
8732
+ with torch.no_grad():
8733
+     doc_embeddings = model.second_stage_model(
8734
+         input_ids=docs["input_ids"],
8735
+         attention_mask=docs["attention_mask"],
8736
+         dataset_embeddings=dataset_embeddings,
8737
+     )
8738
+ doc_embeddings /= doc_embeddings.norm(p=2, dim=1, keepdim=True)
8739
+ ```
8740
+
8741
+ and the query prefix for queries:
8742
+ ```python
8743
+ queries = queries.select(range(16))["text"]
8744
+ queries = tokenizer(
8745
+     [query_prefix + query for query in queries],
8746
+     truncation=True,
8747
+     padding=True,
8748
+     max_length=512,
8749
+     return_tensors="pt"
8750
+ ).to(model.device)
8751
+
8752
+ with torch.no_grad():
8753
+     query_embeddings = model.second_stage_model(
8754
+         input_ids=queries["input_ids"],
8755
+         attention_mask=queries["attention_mask"],
8756
+         dataset_embeddings=dataset_embeddings,
8757
+     )
8758
+ query_embeddings /= query_embeddings.norm(p=2, dim=1, keepdim=True)
8759
+ ```
8760
+
8761
+ These embeddings can be compared using a dot product, since they're normalized.
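To make the comparison step concrete: for L2-normalized vectors, the dot product equals cosine similarity, so ranking documents by dot product against a query is enough. The sketch below is dependency-free and uses made-up two-dimensional vectors in place of real model outputs. The helper names (`l2_normalize`, `dot`, `rank_documents`) are illustrative assumptions, not functions provided by the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query_embedding, doc_embeddings):
    """Return document indices sorted from most to least similar."""
    scores = [dot(query_embedding, d) for d in doc_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for query_embeddings / doc_embeddings.
toy_query = l2_normalize([1.0, 0.0])
toy_docs = [l2_normalize(v) for v in ([0.0, 2.0], [3.0, 1.0], [1.0, 1.0])]

ranking = rank_documents(toy_query, toy_docs)
print(ranking)  # [1, 2, 0]
```

With the tensors from the snippets above, the same scores could presumably be computed in one step as a matrix product of the query and document embedding matrices.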
8762
+
8763
+ </details>
8764
+
8765
+ ### What if I don't know what my corpus will be ahead of time?
8766
+
8767
+ If you can't obtain corpus information ahead of time, you still have to pass *something* as the dataset embeddings. Our model will still work in this case, but not quite as well: without corpus information, its performance drops from 65.0 to 63.8 on MTEB. We provide [some random strings](https://huggingface.co/jxm/cde-small-v1/resolve/main/random_strings.txt) that worked well for us and can be used as a substitute for corpus sampling.
8768
+
8769
+
8770
  ## With Sentence Transformers
8771
 
8772
  <details open="">
 
8937
 
8938
  </details>
8939
 
8940
  ### Colab demo
8941
 
8942
  We've set up a short demo in a Colab notebook showing how you might use our model: