---
license: apache-2.0
language:
- en
tags:
- ColBERT
- RAGatouille
- passage-retrieval
---

# answerai-colbert-small-v1

**answerai-colbert-small-v1** is a new proof-of-concept model by [Answer.AI](https://www.answer.ai), showing the strong performance that multi-vector models can reach with the new [JaColBERTv2.5 training recipe](https://arxiv.org/abs/2407.20750) and some extra tweaks, even with just **33 million parameters**.

Despite being MiniLM-sized, it outperforms all previous similarly-sized models on common benchmarks, and even outperforms much larger popular models such as e5-large-v2 or bge-base-en-v1.5.

For more information about this model or how it was trained, head over to the [announcement blogpost](https://www.answer.ai/posts/2024-08-13-small-but-mighty-colbert.html).

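For readers unfamiliar with multi-vector retrieval: ColBERT-style models keep one embedding per token and score a query against a document with "MaxSim" late interaction, i.e. each query token is matched to its most similar document token and those maxima are summed. The snippet below is a toy numpy sketch of that scoring rule, using random vectors in place of real token embeddings:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token embedding,
    take its maximum cosine similarity over all document token embeddings,
    then sum over query tokens."""
    # Normalize rows so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens

# Toy example: 2 query tokens, 3 document tokens, 4-dim embeddings.
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(2, 4))
doc_tokens = rng.normal(size=(3, 4))
score = maxsim_score(query_tokens, doc_tokens)
```

Because each per-token maximum is a cosine similarity, the score is bounded by the number of query tokens, and a document scored against itself attains that bound.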
## Results

### Against single-vector models

![](https://www.answer.ai/posts/images/minicolbert/small_results.png)


| Dataset / Model | answer-colbert-s | snowflake-s | bge-small-en | bge-base-en |
|:-----------------|:-----------------:|:-------------:|:-------------:|:-------------:|
| **Size** | 33M (1x) | 33M (1x) | 33M (1x) | **109M (3.3x)** |
| **BEIR AVG** | **53.79** | 51.99 | 51.68 | 53.25 |
| **FiQA2018** | **41.15** | 40.65 | 40.34 | 40.65 |
| **HotpotQA** | **76.11** | 66.54 | 69.94 | 72.6 |
| **MSMARCO** | **43.5** | 40.23 | 40.83 | 41.35 |
| **NQ** | **59.1** | 50.9 | 50.18 | 54.15 |
| **TRECCOVID** | **84.59** | 80.12 | 75.9 | 78.07 |
| **ArguAna** | 50.09 | 57.59 | 59.55 | **63.61** |
| **ClimateFEVER** | 33.07 | **35.2** | 31.84 | 31.17 |
| **CQADupstackRetrieval** | 38.75 | 39.65 | 39.05 | **42.35** |
| **DBPedia** | **45.58** | 41.02 | 40.03 | 40.77 |
| **FEVER** | **90.96** | 87.13 | 86.64 | 86.29 |
| **NFCorpus** | 37.3 | 34.92 | 34.3 | **37.39** |
| **QuoraRetrieval** | 87.72 | 88.41 | 88.78 | **88.9** |
| **SCIDOCS** | 18.42 | **21.82** | 20.52 | 21.73 |
| **SciFact** | **74.77** | 72.22 | 71.28 | 74.04 |
| **Touche2020** | 25.69 | 23.48 | **26.04** | 25.7 |

### Against ColBERTv2.0

| Dataset / Model | answerai-colbert-small-v1 | ColBERTv2.0 |
|:-----------------|:-----------------------:|:------------:|
| **BEIR AVG** | **53.79** | 50.02 |
| **DBPedia** | **45.58** | 44.6 |
| **FiQA2018** | **41.15** | 35.6 |
| **NQ** | **59.1** | 56.2 |
| **HotpotQA** | **76.11** | 66.7 |
| **NFCorpus** | **37.3** | 33.8 |
| **TRECCOVID** | **84.59** | 73.3 |
| **Touche2020** | 25.69 | **26.3** |
| **ArguAna** | **50.09** | 46.3 |
| **ClimateFEVER** | **33.07** | 17.6 |
| **FEVER** | **90.96** | 78.5 |
| **QuoraRetrieval** | **87.72** | 85.2 |
| **SCIDOCS** | **18.42** | 15.4 |
| **SciFact** | **74.77** | 69.3 |

## Usage

### Installation

This model was designed with the upcoming RAGatouille overhaul in mind. However, it's compatible with all recent ColBERT implementations!

To use it, you can use either the Stanford ColBERT library or RAGatouille. You can install either one, or both, by simply running:

```sh
pip install --upgrade ragatouille
pip install --upgrade colbert-ai
```

If you're interested in using this model as a re-ranker (it vastly outperforms cross-encoders of its size!), you can do so via the [rerankers](https://github.com/AnswerDotAI/rerankers) library:

```sh
pip install --upgrade rerankers[transformers]
```

### Rerankers

```python
from rerankers import Reranker

ranker = Reranker("answerdotai/answerai-colbert-small-v1", model_type='colbert')
docs = ['Hayao Miyazaki is a Japanese director, born on [...]', 'Walt Disney is an American author, director and [...]', ...]
query = 'Who directed spirited away?'
ranker.rank(query=query, docs=docs)
```

### RAGatouille

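The RAGatouille flow mirrors the Stanford ColBERT one below. A minimal sketch using RAGatouille's `RAGPretrainedModel` API; the index name and the documents here are placeholders:

```python
from ragatouille import RAGPretrainedModel

# Load the checkpoint from the Hugging Face Hub.
RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")

docs = ['Hayao Miyazaki is a Japanese director, born on [...]', 'Walt Disney is an American author, director and [...]']

# Build an index over the collection, then query it.
RAG.index(index_name="my_index", collection=docs)
results = RAG.search("Who directed spirited away?", k=10)
```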
### Stanford ColBERT

#### Indexing

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

INDEX_NAME = "DEFINE_HERE"

if __name__ == "__main__":
    config = ColBERTConfig(
        doc_maxlen=512,
        nbits=2
    )
    indexer = Indexer(
        checkpoint="answerdotai/answerai-colbert-small-v1",
        config=config,
    )
    docs = ['Hayao Miyazaki is a Japanese director, born on [...]', 'Walt Disney is an American author, director and [...]', ...]

    indexer.index(name=INDEX_NAME, collection=docs)
```

#### Querying

```python
from colbert import Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

INDEX_NAME = "THE_INDEX_YOU_CREATED"
k = 10

if __name__ == "__main__":
    config = ColBERTConfig(
        query_maxlen=32  # Adjust as needed; we recommend the nearest multiple of 16 above your typical query length
    )
    searcher = Searcher(
        index=INDEX_NAME,
        config=config
    )
    query = 'Who directed spirited away?'
    results = searcher.search(query, k=k)
```

#### Extracting Vectors

Finally, if you want to extract individual vectors, you can use the model this way:

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("answerdotai/answerai-colbert-small-v1", colbert_config=ColBERTConfig())
embedded_query = ckpt.queryFromText(["Who dubs Howl's in English?"], bsize=16)
```
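
If you also need document-side vectors, the same `Checkpoint` class exposes a `docFromText` counterpart to `queryFromText`; a sketch under that assumption, with a placeholder document:

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("answerdotai/answerai-colbert-small-v1", colbert_config=ColBERTConfig())
# docFromText mirrors queryFromText, but applies document-side preprocessing.
embedded_docs = ckpt.docFromText(["Hayao Miyazaki is a Japanese director, born on [...]"], bsize=16)
```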