---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
license: cc-by-sa-4.0
datasets:
- klue
language:
- ko
---

๋ณธ ๋ชจ๋ธ์€ multi-task loss (MultipleNegativeLoss -> AnglELoss) ๋กœ, KlueNLI ๋ฐ KlueSTS ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ํ•™์Šต ์ฝ”๋“œ๋Š” ๋‹ค์Œ [Github hyperlink](https://github.com/comchobo/SFT_sent_emb?tab=readme-ov-file)์—์„œ ๋ณด์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

## Usage (HuggingFace Inference API)

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/sorryhyun/sentence-embedding-klue-large"
headers = {"Authorization": "Bearer your_HF_token"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


output = query({
    "inputs": {
        "source_sentence": "좋아요, 추천, 알림설정까지",
        "sentences": [
            "좋아요 눌러주세요!!",
            "좋아요, 추천 등 유투버들이 좋아해요",
            "알림설정을 눌러주시면 감사드리겠습니다."
        ]
    },
})
print(output)  # a list of similarity scores, one per candidate sentence
```
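
Since the card is tagged `sentence-transformers`, the model should also load directly through that library, which applies mean pooling by default for plain Transformer checkpoints; a minimal sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sorryhyun/sentence-embedding-klue-large')
embeddings = model.encode(['좋아요 눌러주세요!!', '알림설정을 눌러주시면 감사드리겠습니다.'])
print(embeddings.shape)  # (2, 1024) for a roberta-large backbone
```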

## Usage (HuggingFace Transformers)

```python
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel, DataCollatorWithPadding

device = torch.device('cuda')
batch_size = 32

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sorryhyun/sentence-embedding-klue-large')
collator = DataCollatorWithPadding(tokenizer)
model = AutoModel.from_pretrained('sorryhyun/sentence-embedding-klue-large').to(device)
model.eval()

# Tokenize each sentence individually; the collator pads every batch dynamically
tokenized_data = [tokenizer(sentence, truncation=True) for sentence in sentences]
dataloader = DataLoader(tokenized_data, batch_size=batch_size, pin_memory=True, collate_fn=collator)
all_outputs = torch.zeros((len(tokenized_data), 1024), device=device)
start_idx = 0

# Mean pooling over the token embeddings, masked by attention_mask
with torch.no_grad():
    for inputs in tqdm(dataloader):
        inputs = {k: v.to(device) for k, v in inputs.items()}
        representations, _ = model(**inputs, return_dict=False)
        attention_mask = inputs['attention_mask']
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(representations.size()).to(representations.dtype)
        summed = torch.sum(representations * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        end_idx = start_idx + representations.shape[0]
        all_outputs[start_idx:end_idx] = summed / sum_mask
        start_idx = end_idx
```
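
With `all_outputs` filled, the similarity between two sentences is simply the cosine of their mean-pooled embeddings, for example:

```python
import torch.nn.functional as F

# Cosine similarity between the first two mean-pooled embeddings
similarity = F.cosine_similarity(all_outputs[0], all_outputs[1], dim=0)
print(similarity.item())
```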


## Evaluation Results

| Organization | Backbone Model | KlueSTS average | KorSTS average |
| -------- | ------- | ------- | ------- |
| team-lucid | DeBERTa-base | 54.15 | 29.72 |
| monologg | Electra-base | 66.97 | 40.98 |
| LMkor | Electra-base | 70.98 | 43.09 |
| deliciouscat | DeBERTa-base | - | 67.65 |
| BM-K    | Roberta-base | 82.93 | **85.77** |
| Klue    | Roberta-large | **86.71** | 71.70 |
| Klue (Hyperparameter searched) | Roberta-large | 86.21 | 75.54 |

Noting that existing Korean sentence embedding models were trained on machine-translated English datasets such as MNLI and SNLI, we trained on the Klue datasets instead.

As a result, the model trained on top of Klue-Roberta-large showed solid performance on both the KlueSTS and KorSTS test sets, which we take to indicate that it forms a more elaborate representation.

๋‹ค๋งŒ ํ‰๊ฐ€ ์ˆ˜์น˜๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ธํŒ…, ์‹œ๋“œ ๋„˜๋ฒ„ ๋“ฑ์œผ๋กœ ํฌ๊ฒŒ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์ฐธ๊ณ ํ•˜์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค.

## Training
The model was trained with NegativeRank loss -> SimCSE loss.
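
At their core, both of these objectives are contrastive losses over in-batch negatives. A minimal PyTorch sketch of that shared idea, with random tensors standing in for model outputs:

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchors, positives, temperature=0.05):
    """Each anchor's matching row is its positive; every other row
    in the batch acts as a negative."""
    sims = F.normalize(anchors, dim=-1) @ F.normalize(positives, dim=-1).T
    labels = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims / temperature, labels)

# Toy usage: batch of 8 anchor/positive embedding pairs of size 1024
loss = in_batch_negatives_loss(torch.randn(8, 1024), torch.randn(8, 1024))
```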