BAAI
/

Shitao committed on
Commit 65a3d2d · 1 Parent(s): 58de04a

Update README.md

  1. README.md +53 -24
README.md CHANGED
@@ -1,12 +1,41 @@
- # flag-text-embedding-chinese

- Map any text to a 1024-dimensional dense vector space and can be used for tasks like retrieval, classification, clustering, or semantic search.

- ## Usage (Sentence-Transformers)

  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

@@ -19,51 +48,51 @@ Then you can use the model like this:
  ```python
  from sentence_transformers import SentenceTransformer
  sentences = ["样例数据-1", "样例数据-2"]
-
- model = SentenceTransformer('Shitao/flag-text-embedding-chinese')
  embeddings = model.encode(sentences, normalize_embeddings=True)
  print(embeddings)
  ```

-
- ## Usage (HuggingFace Transformers)
  Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch
-
-
  # Sentences we want sentence embeddings for
  sentences = ["样例数据-1", "样例数据-2"]
-
  # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('Shitao/flag-text-embedding-chinese')
- model = AutoModel.from_pretrained('Shitao/flag-text-embedding-chinese')
-
  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
  # Compute token embeddings
  with torch.no_grad():
      model_output = model(**encoded_input)
  # Perform pooling. In this case, cls pooling.
  sentence_embeddings = model_output[0][:, 0]
-
  print("Sentence embeddings:")
  print(sentence_embeddings)
  ```

- ## Evaluation Results
-
- For an automated evaluation of this model, see the *Chinese Embedding Benchmark*: [link]()
-
-
-
-
- ## Citing & Authors

- <!--- Describe where people can find more information -->
 
+ # baai-general-embedding-large-zh-instruction

+ Map any text to a low-dimensional dense vector that can be used for tasks like retrieval, classification, clustering, or semantic search.
+ It can also be used in vector databases for LLMs.
+ For more details, please refer to our GitHub repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)

+ ## Model List
+ | Model | Language | Description | Query instruction for retrieval |
+ |:-------------------------------|:--------:|:--------:|:--------:|
+ | [BAAI/baai-general-embedding-large-en-instruction](https://huggingface.co/BAAI/baai-general-embedding-large-en-instruction) | English | ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
+ | [BAAI/baai-general-embedding-large-zh-instruction](https://huggingface.co/BAAI/baai-general-embedding-large-zh-instruction) | Chinese | ranks **1st** on the [C-MTEB]() benchmark | `为这个句子生成表示以用于检索相关文章:` |
+ | [BAAI/baai-general-embedding-large-zh](https://huggingface.co/BAAI/baai-general-embedding-large-zh) | Chinese | ranks **2nd** on the [C-MTEB]() benchmark | -- |

+ ## Evaluation Results
18
+
19
+ - **C-MTEB**:
20
+ We create a benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks.
21
+ More details and evaluation scripts see [evaluation](evaluation/README.md).
22
+
23
+ | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
24
+ |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
25
+ | [**baai-general-embedding-large-zh-instruction**](https://huggingface.co/BAAI/baai-general-embedding-large-zh-instruction) | 1024 | **63.84** | **71.53** | **53.23** | **78.94** | 72.26 | 62.33 | 48.39 |
26
+ | [baai-general-embedding-large-zh](https://huggingface.co/BAAI/baai-general-embedding-large-zh) | 1024 | 63.62 | 70.55 | 50.98 | 76.77 | **72.49** | **65.63** | **50.01** |
27
+ | [m3e-base](https://huggingface.co/moka-ai/m3e-base) | 768 | 57.10 |56.91 | 48.15 | 63.99 | 70.28 | 59.34 | 47.68 |
28
+ | [m3e-large](https://huggingface.co/moka-ai/m3e-large) | 1024 | 57.05 |54.75 | 48.64 | 64.3 | 71.22 | 59.66 | 48.88 |
29
+ | [text-embedding-ada-002(OpenAI)](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) | 1536 | 53.02 | 52.0 | 40.61 | 69.56 | 67.38 | 54.28 | 45.68 |
30
+ | [luotuo](https://huggingface.co/silk-road/luotuo-bert-medium) | 1024 | 49.37 | 44.4 | 39.41 | 66.62 | 65.29 | 49.25 | 44.39 |
31
+ | [text2vec](https://huggingface.co/shibing624/text2vec-base-chinese) | 768 | 47.63 | 38.79 | 41.71 | 67.41 | 65.18 | 49.45 | 37.66 |
32
+ | [text2vec-large](https://huggingface.co/GanymedeNil/text2vec-large-chinese) | 1024 | 47.36 | 41.94 | 41.98 | 70.86 | 63.42 | 49.16 | 30.02 |
33
+
34
+
35
+
36
+ ## Usage
37
+
38
+ ### Sentence-Transformers
39
 
40
  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
41
 
 
  ```python
  from sentence_transformers import SentenceTransformer
  sentences = ["样例数据-1", "样例数据-2"]
+ model = SentenceTransformer('BAAI/baai-general-embedding-large-zh-instruction')
  embeddings = model.encode(sentences, normalize_embeddings=True)
  print(embeddings)
  ```

+ ### HuggingFace Transformers
  Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch
  # Sentences we want sentence embeddings for
  sentences = ["样例数据-1", "样例数据-2"]
  # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/baai-general-embedding-large-zh-instruction')
+ model = AutoModel.from_pretrained('BAAI/baai-general-embedding-large-zh-instruction')
  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
  # Compute token embeddings
  with torch.no_grad():
      model_output = model(**encoded_input)
  # Perform pooling. In this case, cls pooling.
  sentence_embeddings = model_output[0][:, 0]
+ # normalize embeddings
+ sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
  print("Sentence embeddings:")
  print(sentence_embeddings)
  ```
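
To make the pooling step concrete, here is a minimal sketch that applies the same two operations to a made-up array with NumPy instead of real model output. The `(batch, seq_len, hidden)` shape and the helper name `cls_pool_and_normalize` are assumptions for illustration only; the indexing mirrors `model_output[0][:, 0]` and the division mirrors `torch.nn.functional.normalize(..., p=2, dim=1)`.

```python
import numpy as np

def cls_pool_and_normalize(last_hidden_state: np.ndarray) -> np.ndarray:
    """Take the first ([CLS]) token vector of each sequence and L2-normalize it."""
    cls_vectors = last_hidden_state[:, 0]  # shape: (batch, hidden)
    norms = np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    return cls_vectors / np.clip(norms, 1e-12, None)  # guard against zero norm

# Dummy hidden states: 2 sequences, 3 tokens each, hidden size 4 (illustrative only).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 3, 4))
embeddings = cls_pool_and_normalize(hidden)
print(embeddings.shape)
print(np.linalg.norm(embeddings, axis=1))
```

After normalization every row has unit length, so a dot product between two rows is their cosine similarity.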

+ ### Retrieval Task
+ For retrieval tasks, when you use a model whose name ends with `-instruction`,
+ each query should start with an instruction.
+ ```python
+ from sentence_transformers import SentenceTransformer
+ queries = ["手机开不了机怎么办?"]  # "What should I do if my phone won't turn on?"
+ passages = ["样例段落-1", "样例段落-2"]  # "sample passage 1", "sample passage 2"
+ instruction = "为这个句子生成表示以用于检索相关文章:"  # "Generate a representation of this sentence for retrieving relevant articles:"
+ model = SentenceTransformer('BAAI/baai-general-embedding-large-zh-instruction')
+ q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
+ p_embeddings = model.encode(passages, normalize_embeddings=True)
+ scores = q_embeddings @ p_embeddings.T
+ ```
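
Because both sets of embeddings are normalized, each entry of `scores` is a cosine similarity, and ranking passages for a query reduces to sorting one row. The sketch below shows only that final step, with a hand-written score matrix standing in for real model output; `top_k_passages` is a hypothetical helper, not part of sentence-transformers.

```python
import numpy as np

def top_k_passages(scores: np.ndarray, k: int) -> np.ndarray:
    """Return, for each query row, the indices of the k highest-scoring passages."""
    order = np.argsort(-scores, axis=1)  # sort each row in descending score order
    return order[:, :k]

# Illustrative scores for 1 query against 3 passages (not real model output).
scores = np.array([[0.12, 0.87, 0.45]])
ranked = top_k_passages(scores, k=2)
print(ranked)  # → [[1 2]]
```

Row `i` of the result lists passage indices for query `i`, best match first.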

+ ## Limitations
+ This model only works for Chinese text, and long texts are truncated to a maximum of 512 tokens.
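
One possible way to work around the truncation limit is to split a long input into windows of at most 512 tokens and embed each window separately. The sketch below covers only the windowing step on a plain list of token ids. The helper `chunk_token_ids`, its `overlap` parameter, and the idea of embedding chunks separately are all assumptions for illustration, not something this repository prescribes. In practice the budget would also need to leave room for any special tokens the tokenizer adds.

```python
from typing import List

def chunk_token_ids(token_ids: List[int], max_len: int = 512, overlap: int = 0) -> List[List[int]]:
    """Split token ids into consecutive windows of at most max_len tokens."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks

ids = list(range(1200))  # stand-in for real tokenizer output
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # → [512, 512, 176]
```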