File size: 12,170 Bytes

5ffcadf
 
e1607fd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c134165
e1607fd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c134165
 
 
 
 
 
 
3b026ec
641fe7e
a3a5d89
e86e62e
a3a5d89
3b026ec
5ffcadf
 
97f6a00
5ffcadf
272eab8
 
 
 
 
 
 
 
 
 
 
97f6a00
5ffcadf
165ae88
5ffcadf
97f6a00
5ffcadf
97f6a00
5ffcadf
6ed382f
 
 
 
 
 
 
 
0368c6a
 
6ed382f
 
d2818d3
 
 
 
 
 
 
 
 
 
 
 
 
 
20a6460
d2818d3
bc9cd3a
 
 
6ed382f
 
 
bc9cd3a
020c4a4
 
6ed382f
 
 
 
 
 
 
08d0055
6ed382f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
08d0055
6ed382f
 
 
 
 
 
 
 
 
 
 
 
bc9cd3a
 
6ed382f
 
 
020c4a4
 
 
6ed382f
 
 
 
 
 
 
 
 
 
 
08d0055
6ed382f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
08d0055
6ed382f
 
7a09fc3
6ed382f
 
 
 
 
 
 
 
 
fdcd67e
6ed382f
 
bc9cd3a
 
20a6460
 
 
020c4a4
 
 
20a6460
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
bc9cd3a
 
0126e86
97f6a00
5ffcadf
97f6a00
bdbafee
 
 
 
 
 
 
 
 
 
 
97f6a00
5ffcadf
97f6a00
5ffcadf
97f6a00
5ffcadf
bdbafee
 
 
5ffcadf
d2818d3
 
bdbafee
e1607fd
d2818d3
 
e1607fd

---
library_name: transformers
license: apache-2.0
language:
- en
- zh
- es
- de
- ar
- ru
- ja
- ko
- hi
- sk
- vi
- tr
- fi
- id
- fa
- 'no'
- th
- sv
- pt
- da
- bn
- te
- ro
- it
- fr
- nl
- sw
- pl
- hu
- cs
- el
- uk
- mr
- ta
- tl
- bg
- lt
- ur
- he
- gu
- kn
- am
- kk
- hr
- uz
- jv
- ca
- az
- ms
- sr
- sl
- yo
- lv
- is
- ha
- ka
- et
- bs
- hy
- ml
- pa
- mt
- km
- sq
- or
- as
- my
- mn
- af
- be
- ga
- mk
- cy
- gl
- ceb
- la
- yi
- lb
- tg
- gd
- ne
- ps
- eu
- ky
- ku
- si
- ht
- eo
- lo
- fy
- sd
- mg
- so
- ckb
- su
- nn
datasets:
- lightblue/reranker_continuous_filt_max7_train
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
pipeline_tag: text-generation
tags:
- reranker
widget:
- text: "<<<Query>>>\nHow many languages has LB-Reranker been trained on?\n\n\n<<<Context>>>\nLB-Reranker has been trained on more than 95 languages."
  example_title: Positive example (7/7)
- text: "<<<Query>>>\nHow many languages has LB-Reranker been trained on?\n\n\n<<<Context>>>\nAA-Reranker is applicable to a broad range of use cases."
  example_title: Negative example (2/7)

---

# LB Reranker v1.0

<div style="width: 100%; height: 160px; 
            display: flex; align-items: center; 
            justify-content: center; 
            border: 8px solid black; 
            font-size: 120px; font-weight: bold; 
            text-align: center;
            color: #438db8; 
            font-family: 'Helvetica Neue', sans-serif;">
  LBR
</div>

The LB Reranker has been trained to determine the relatedness of a given query to a piece of text, therefore allowing it to be used as a ranker or reranker in various retrieval-based tasks.

This model is fine-tuned from a [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co./Qwen/Qwen2.5-0.5B-Instruct) model checkpoint and was trained for roughly 5.5 hours using the 8 x L20 instance ([ecs.gn8is-8x.32xlarge](https://www.alibabacloud.com/help/en/ecs/user-guide/gpu-accelerated-compute-optimized-and-vgpu-accelerated-instance-families-1)) on [Alibaba Cloud](https://www.alibabacloud.com/).

The training data for this model can be found at [lightblue/reranker_continuous_filt_max7_train](https://huggingface.co./datasets/lightblue/reranker_continuous_filt_max7_train) and the code for generating this data as well as running the training of the model can be found on [our Github repo](https://github.com/lightblue-tech/lb-reranker).

Trained on data in over 95 languages, this model is applicable to a broad range of use cases.

This model has three main benefits over comparable rerankers.
1. It has shown slightly higher performance on evaluation benchmarks.
2. It has been trained on more languages than any previous model.
3. It is a simple Causal LM model trained to output a string between "1" and "7".

This last point means that this model can be used natively with many widely available inference packages, including vLLM and LMDeploy.
This in turns allows our reranker to benefit from improvements to inference as and when these packages release them.

Update: We have also found that this model works pretty well as a code snippet reranker too (P@1 of 96%)! See our [Colab](https://colab.research.google.com/drive/1ABL1xaarekLIlVJKbniYhXgYu6ZNwfBm?usp=sharing) for more details.

# How to use

The model was trained to expect an input such as:

```
<<<Query>>>
{your_query_here}

<<<Context>>>
{your_context_here}
```

And to output a string of a number between 1-7.

In order to make a continuous score that can be used for reranking query-context pairs (i.e. a method with few ties), we calculate the expectation value of the scores.

We include scripts to do this in vLLM, LMDeploy, and OpenAI (hosted for free on Huggingface):


<ul>
  <li><b>vLLM</b>

Install [vLLM](https://github.com/vllm-project/vllm/) using `pip install vllm`.

<details open>
  <summary>Show vLLM code</summary>
  
```python
from vllm import LLM, SamplingParams
import numpy as np

def make_reranker_input(t, q):
    return f"<<<Query>>>\n{q}\n\n<<<Context>>>\n{t}"

def make_reranker_inference_conversation(context, question):
    system_message = "Given a query and a piece of text, output a score of 1-7 based on how related the query is to the text. 1 means least related and 7 is most related."

    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": make_reranker_input(context, question)},
    ]

def get_prob(logprob_dict, tok_id):
    return np.exp(logprob_dict[tok_id].logprob) if tok_id in logprob_dict.keys() else 0

llm = LLM("lightblue/lb-reranker-v1.0")
sampling_params = SamplingParams(temperature=0.0, logprobs=14, max_tokens=1)
tok = llm.llm_engine.tokenizer.tokenizer
idx_tokens = [tok.encode(str(i))[0] for i in range(1, 8)]

query_texts = [
    ("What is the scientific name of apples?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
    ("What is the Chinese word for 'apple'?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
    ("What is the square root of 999?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
]

chats = [make_reranker_inference_conversation(c, q) for q, c in query_texts]
responses = llm.chat(chats, sampling_params)
probs = np.array([[get_prob(r.outputs[0].logprobs[0], y) for y in idx_tokens] for r in responses])

N = probs.shape[1]
M = probs.shape[0]
idxs = np.tile(np.arange(1, N + 1), M).reshape(M, N)

expected_vals = (probs * idxs).sum(axis=1)
print(expected_vals)
# [6.66570732 1.86686378 1.01102923]
```

</details></li>
  <li><b>LMDeploy</b>

Install [LMDeploy](https://github.com/InternLM/lmdeploy) using `pip install lmdeploy`.

<details>
  <summary>Show LMDeploy code</summary>
  
```python
# Un-comment this if running in a Jupyter notebook, Colab etc.
# import nest_asyncio
# nest_asyncio.apply()

from lmdeploy import GenerationConfig, ChatTemplateConfig, pipeline
import numpy as np

def make_reranker_input(t, q):
    return f"<<<Query>>>\n{q}\n\n<<<Context>>>\n{t}"

def make_reranker_inference_conversation(context, question):
    system_message = "Given a query and a piece of text, output a score of 1-7 based on how related the query is to the text. 1 means least related and 7 is most related."

    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": make_reranker_input(context, question)},
    ]

def get_prob(logprob_dict, tok_id):
    return np.exp(logprob_dict[tok_id]) if tok_id in logprob_dict.keys() else 0

pipe = pipeline(
    "lightblue/lb-reranker-v1.0",
    chat_template_config=ChatTemplateConfig(
                    model_name='qwen2d5',
                    capability='chat'
    )
)
tok = pipe.tokenizer.model
idx_tokens = [tok.encode(str(i))[0] for i in range(1, 8)]

query_texts = [
    ("What is the scientific name of apples?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
    ("What is the Chinese word for 'apple'?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
    ("What is the square root of 999?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
]

chats = [make_reranker_inference_conversation(c, q) for q, c in query_texts]
responses = pipe(
    chats, 
    gen_config=GenerationConfig(temperature=1.0, logprobs=14, max_new_tokens=1, do_sample=True)
)
probs = np.array([[get_prob(r.logprobs[0], y) for y in idx_tokens] for r in responses])

N = probs.shape[1]
M = probs.shape[0]
idxs = np.tile(np.arange(1, N + 1), M).reshape(M, N)

expected_vals = (probs * idxs).sum(axis=1)
print(expected_vals)
# [6.66415229 1.84342025 1.01133205]
```

</details></li>
  <li><b>OpenAI (Hosted on Huggingface)</b>

Install [openai](https://github.com/openai/openai-python) using `pip install openai`.

<details>
  <summary>Show OpenAI + Huggingface Inference code</summary>
  
```python
from openai import OpenAI
import numpy as np
from multiprocessing import Pool
from tqdm.auto import tqdm

client = OpenAI(
	base_url="https://api-inference.huggingface.co/v1/",
	api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" # Change this to an access token from https://huggingface.co./settings/tokens
)

def make_reranker_input(t, q):
    return f"<<<Query>>>\n{q}\n\n<<<Context>>>\n{t}"

def make_reranker_inference_conversation(context, question):
    system_message = "Given a query and a piece of text, output a score of 1-7 based on how related the query is to the text. 1 means least related and 7 is most related."

    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": make_reranker_input(context, question)},
    ]

def get_reranker_score(context_question_tuple):
    question, context = context_question_tuple

    messages = make_reranker_inference_conversation(context, question)

    completion = client.chat.completions.create(
        model="lightblue/lb-reranker-0.5B-v1.0", 
        messages=messages,
        max_tokens=1,
        temperature=0.0,
        logprobs=True,
        top_logprobs=5, # Max allowed by the openai API as top_n_tokens must be >= 0 and <= 5. If this gets changed, fix to > 7.
    )

    logprobs = completion.choices[0].logprobs.content[0].top_logprobs

    calculated_score = sum([int(x.token) * np.exp(x.logprob) for x in logprobs])

    return calculated_score

query_texts = [
    ("What is the scientific name of apples?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
    ("What is the Chinese word for 'apple'?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
    ("What is the square root of 999?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
]

with Pool(processes=16) as p: # Allows for parallel processing
    expected_vals = list(tqdm(p.imap(get_reranker_score, query_texts), total=len(query_texts)))

print(expected_vals)
# [6.64866580, 1.85144404, 1.010719508]
```

</details></li>
</ul>	

# Evaluation

We perform an evaluation on 9 datasets from the [BEIR benchmark](https://github.com/beir-cellar/beir) that none of the evaluated models have been trained upon (to our knowledge).

* Arguana
* Dbpedia-entity
* Fiqa
* NFcorpus
* Scidocs
* Scifact
* Trec-covid-v2
* Vihealthqa
* Webis-touche2020

We evaluate on a subset of all queries (the first 250) to save evaluation time.

We find that our model performs similarly or better than many of the state-of-the-art reranker models in our evaluation, without compromising on inference speed.

We make our evaluation code and results available [on our Github](https://github.com/lightblue-tech/lb-reranker/blob/main/run_bier.ipynb).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/xkNzCABFUmU7UmDXUduiz.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/P-XCA3TGHqDSX8k6c4hCE.png)

As we can see, this reranker attains greater IR evaluation metrics compared to the two benchmarks we include for all positions apart from @1.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/puhhWseBOcIyOEdW4L-B0.png)

We also show that our model is, on average, faster than the BGE reranker v2.

# License

We share this model under an Apache 2.0 license.

# Developed by

<a href="https://www.lightblue-tech.com">
<img src="https://www.lightblue-tech.com/wp-content/uploads/2023/08/color_%E6%A8%AA%E5%9E%8B-1536x469.png" alt="Lightblue technology logo" width="400"/>
</a>

This model was trained by Peter Devine ([ptrdvn](https://huggingface.co./ptrdvn)) for Lightblue