MrLight committed on
Commit
f5c38c6
1 Parent(s): 62588ff

Update README.md

  1. README.md +61 -6
README.md CHANGED
@@ -11,18 +11,73 @@ library_name: Tevatron
  ---

  # DSE-Phi3-Docmatix-V1.0
- DSE is a bi-encoder that encodes document screenshots into dense vectors for document retrieval.

- Document Screenshot Embedding ([DSE](https://arxiv.org/abs/2406.11251)) proposes to encode documents in their original look to avoid tedious processes and information loss during content parsing.
- Specifically, DSE regards document screenshots as a unified input format that preserves all the information in a document (e.g., text, image and layout), encoding document (PDF, Webpage, Slides) directly into dense vector for document retrieval.
- `Tevatron/dse-phi3-docmatix-v1.0` is trained with the `Tevatron/docmatix-ir` dataset, a variant of `HuggingFaceM4/Docmatix` to train PDF retriever with Vision Language Model for open-domain question answering.
- Please see the dataset page of [docmatix-ir](https://huggingface.co/datasets/Tevatron/docmatix-ir/blob/main/README.md) for how we filter out questions that is not suitable for open domain retrieval and how we conduct hard negative mining with DSE-Phi3-V1.0 to get high query bi-encoder training data.

- ## How to use the model?

  ### Encode Text Query

  ### Encode Document Screenshot

  ### Encode Document Text
  ---

  # DSE-Phi3-Docmatix-V1.0

+ DSE-Phi3-Docmatix-V1.0 is a bi-encoder model designed to encode document screenshots into dense vectors for document retrieval. The Document Screenshot Embedding ([DSE](https://arxiv.org/abs/2406.11251)) approach captures documents in their original visual format, preserving all information such as text, images, and layout, thus avoiding tedious parsing and potential information loss.

+ The model, `Tevatron/dse-phi3-docmatix-v1.0`, is trained using the `Tevatron/docmatix-ir` dataset, a variant of `HuggingFaceM4/Docmatix` specifically adapted for training PDF retrievers with Vision Language Models in open-domain question answering scenarios. For more information on dataset filtering and hard negative mining, refer to the [docmatix-ir dataset page](https://huggingface.co/datasets/Tevatron/docmatix-ir/blob/main/README.md).
+
+ ## How to Use the Model
+
+ ### Load the Model and Processor
+
+ ```python
+ import torch
+ from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
+
+ processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
+ config = AutoConfig.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, use_cache=False)
+ model = AutoModelForCausalLM.from_pretrained('Tevatron/dse-phi3-docmatix-v1.0', trust_remote_code=True, config=config, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16).to('cuda:0')
+
+ def get_embedding(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+     # Take the hidden state of the last attended token in each sequence, then L2-normalize it.
+     sequence_lengths = attention_mask.sum(dim=1) - 1
+     bs = last_hidden_state.shape[0]
+     reps = last_hidden_state[torch.arange(bs, device=last_hidden_state.device), sequence_lengths]
+     reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
+     return reps
+ ```
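To make the pooling step concrete, here is a minimal CPU-only sketch that runs the same `get_embedding` logic on a toy batch. The tensors `hidden` and `mask` are made-up values for illustration only and are not produced by the model. Note that the index `attention_mask.sum(dim=1) - 1` points at the last real token only when sequences are padded on the right, which this sketch assumes.

```python
import torch


def get_embedding(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Index of the last attended token per sequence (assumes right padding).
    sequence_lengths = attention_mask.sum(dim=1) - 1
    bs = last_hidden_state.shape[0]
    reps = last_hidden_state[torch.arange(bs, device=last_hidden_state.device), sequence_lengths]
    # L2-normalize so that a dot product equals cosine similarity.
    return torch.nn.functional.normalize(reps, p=2, dim=-1)


# Toy batch (hypothetical values): 2 sequences, 3 positions, hidden size 2.
hidden = torch.tensor([
    [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]],  # third position is padding and must be ignored
    [[1.0, 1.0], [2.0, 2.0], [3.0, 4.0]],  # no padding, last position is used
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

toy_embeddings = get_embedding(hidden, mask)
print(toy_embeddings)
```

The first row is pooled from position 1 and the second from position 2, so padding values never reach the embedding.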

  ### Encode Text Query

+ ```python
+ queries = ["query: Where can we find Llama?", "query: What is the LLaMA model?"]
+ query_inputs = processor(queries, return_tensors="pt", padding="longest", max_length=128, truncation=True).to('cuda:0')
+ output = model(**query_inputs, return_dict=True, output_hidden_states=True)
+ query_embeddings = get_embedding(output.hidden_states[-1], query_inputs["attention_mask"])
+ ```
+
  ### Encode Document Screenshot

+ ```python
+ from PIL import Image
+
+ passage_image1 = Image.open("path/to/your/image1.png")
+ passage_image2 = Image.open("path/to/your/image2.png")
+ passage_images = [passage_image1, passage_image2]
+ # Phi-3-vision prompts refer to each image through an <|image_N|> placeholder.
+ passage_prompts = ["<|image_1|>\nWhat is shown in this image?</s>", "<|image_2|>\nWhat is shown in this image?</s>"]
+
+ passage_inputs = processor(passage_prompts, images=passage_images, return_tensors="pt", padding="longest", max_length=4096, truncation=True).to('cuda:0')
+ output = model(**passage_inputs, return_dict=True, output_hidden_states=True)
+ doc_embeddings = get_embedding(output.hidden_states[-1], passage_inputs["attention_mask"])
+ ```
+
+ ### Compute Similarity
+
+ ```python
+ from torch.nn.functional import cosine_similarity
+
+ similarities = cosine_similarity(query_embeddings, doc_embeddings)
+ print(similarities)
+ ```
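With its default `dim=1`, `cosine_similarity` compares row *i* of the first tensor with row *i* of the second, so the call above returns one score per query/document pair. For retrieval it is often more useful to score every query against every document. Because the embeddings are already L2-normalized, a matrix product gives that full table. The sketch below is one possible way to do it, using small stand-in tensors (`q` and `d` are hypothetical values, not model outputs); in practice they would be `query_embeddings` and `doc_embeddings`.

```python
import torch

# Stand-in embeddings (hypothetical values), L2-normalized like the output of get_embedding.
q = torch.nn.functional.normalize(torch.tensor([[1.0, 0.0], [0.0, 1.0]]), p=2, dim=-1)
d = torch.nn.functional.normalize(torch.tensor([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]), p=2, dim=-1)

# Full similarity table: scores[i, j] is the cosine similarity of query i and document j.
scores = q @ d.T

# Document indices for each query, best match first.
ranking = scores.argsort(dim=-1, descending=True)

print(scores)
print(ranking)
```

Each row of `ranking` lists document indices from most to least similar for the corresponding query.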
+
  ### Encode Document Text
+ This DSE checkpoint is warmed up with `Tevatron/msmarco-passage-aug`, so the model can also effectively encode documents given as text input.
+ ```python
+ passage_prompts = ["Llama is in Africa</s>", "LLaMA is an LLM released by Meta.</s>"]
+
+ passage_inputs = processor(passage_prompts, images=None, return_tensors="pt", padding="longest", max_length=4096, truncation=True).to('cuda:0')
+ output = model(**passage_inputs, return_dict=True, output_hidden_states=True)
+ doc_embeddings = get_embedding(output.hidden_states[-1], passage_inputs["attention_mask"])

+ similarities = cosine_similarity(query_embeddings, doc_embeddings)
+ print(similarities)
+ ```