---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
license: cc-by-sa-4.0
datasets:
- klue
language:
- ko
---
This model was trained on the KlueNLI and KlueSTS data with a multi-task loss (MultipleNegativeLoss -> AnglELoss). The training code is available at the following [Github hyperlink](https://github.com/comchobo/SFT_sent_emb?tab=readme-ov-file).
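A minimal sketch of that two-stage setup with the sentence-transformers library is shown below. The base checkpoint, toy examples, and hyperparameters are illustrative assumptions rather than the actual recipe (see the linked repository for that), and it assumes a sentence-transformers version that provides `losses.AnglELoss`.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Assumed base checkpoint; the card reports results on top of Klue-Roberta-large.
model = SentenceTransformer("klue/roberta-large")

# Stage 1: MultipleNegativesRankingLoss over (anchor, entailment) pairs, e.g. from KlueNLI.
nli_examples = [InputExample(texts=["비가 온다", "밖에 비가 내리고 있다"])]  # toy pair
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=1)
model.fit(train_objectives=[(nli_loader, losses.MultipleNegativesRankingLoss(model))],
          epochs=1, warmup_steps=10)

# Stage 2: AnglELoss over graded similarity pairs, e.g. from KlueSTS (labels scaled to [0, 1]).
sts_examples = [InputExample(texts=["비가 온다", "눈이 온다"], label=0.3)]  # toy pair
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=1)
model.fit(train_objectives=[(sts_loader, losses.AnglELoss(model))],
          epochs=1, warmup_steps=10)
```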
## Usage (HuggingFace Inference API)
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/sorryhyun/sentence-embedding-klue-large"
# Replace with your personal Hugging Face access token
headers = {"Authorization": "Bearer your_HF_token"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

if __name__ == '__main__':
    output = query({
        "inputs": {
            "source_sentence": "좋아요, 추천, 알림설정까지",
            "sentences": [
                "좋아요 눌러주세요!!",
                "좋아요, 추천 등 저희 벗들이 좋아해요",
                "알림설정을 눌러주시면 감사드리겠습니다."
            ]
        },
    })
    print(output)
```
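For the sentence-similarity task, the Inference API should return one similarity score per entry in `sentences` (a plain list of floats, e.g. something shaped like `[0.8, 0.9, 0.7]`; the values here are illustrative, not actual outputs of this model).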
## Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel, DataCollatorWithPadding
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

device = torch.device('cuda')
batch_size = 32

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sorryhyun/sentence-embedding-klue-large')
collator = DataCollatorWithPadding(tokenizer)
model = AutoModel.from_pretrained('sorryhyun/sentence-embedding-klue-large').to(device)
model.eval()

# Tokenize each sentence; DataCollatorWithPadding pads every batch dynamically
tokenized_data = [tokenizer(sentence, truncation=True) for sentence in sentences]
dataloader = DataLoader(tokenized_data, batch_size=batch_size, pin_memory=True, collate_fn=collator)

all_outputs = torch.zeros((len(tokenized_data), 1024)).to(device)
start_idx = 0

# I used mean-pooling over token embeddings for the sentence representation
with torch.no_grad():
    for inputs in tqdm(dataloader):
        inputs = {k: v.to(device) for k, v in inputs.items()}
        representations, _ = model(**inputs, return_dict=False)
        attention_mask = inputs["attention_mask"]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(representations.size()).to(representations.dtype)
        summed = torch.sum(representations * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        end_idx = start_idx + representations.shape[0]
        all_outputs[start_idx:end_idx] = summed / sum_mask
        start_idx = end_idx
```
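The mean-pooled vectors in `all_outputs` can then be compared with cosine similarity, the standard scoring choice for sentence embeddings; a minimal follow-up, for example:

```python
import torch.nn.functional as F

# Cosine similarity between the two example sentences embedded above
similarity = F.cosine_similarity(all_outputs[0], all_outputs[1], dim=0)
print(float(similarity))
```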
## Evaluation Results
| Organization | Backbone Model | KlueSTS average | KorSTS average |
| -------- | ------- | ------- | ------- |
| team-lucid | DeBERTa-base | 54.15 | 29.72 |
| monologg | Electra-base | 66.97 | 40.98 |
| LMkor | Electra-base | 70.98 | 43.09 |
| deliciouscat | DeBERTa-base | - | 67.65 |
| BM-K | Roberta-base | 82.93 | **85.77** |
| Klue | Roberta-large | **86.71** | 71.70 |
| Klue (Hyperparameter searched) | Roberta-large | 86.21 | 75.54 |
Existing Korean sentence embedding models have mostly been trained on machine-translated English datasets such as MNLI and SNLI; with that in mind, this model was trained on the KLUE datasets instead. As a result, the model trained on top of Klue-Roberta-large showed solid performance on both the KlueSTS and KorSTS test sets, suggesting that it forms a more elaborate representation. Note, however, that the evaluation numbers can vary considerably with hyperparameter settings, random seeds, and so on, so please treat them as a reference point.
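For context on the table above: STS benchmarks such as KlueSTS and KorSTS are typically scored by correlating the cosine similarity of each sentence pair's embeddings with the human-annotated similarity label. A minimal sketch of that procedure, assuming Spearman correlation as the metric (the exact metric behind the reported averages is not stated here):

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def sts_spearman(emb_a: torch.Tensor, emb_b: torch.Tensor, gold: torch.Tensor) -> float:
    """Spearman correlation between pairwise cosine similarities and gold STS labels."""
    cos = F.cosine_similarity(emb_a, emb_b, dim=1)
    return spearmanr(cos.cpu().numpy(), gold.cpu().numpy()).correlation

# Toy illustration with random vectors standing in for real KlueSTS / KorSTS embeddings
print(sts_spearman(torch.randn(8, 1024), torch.randn(8, 1024), torch.rand(8)))
```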
## Training
The model was trained with NegativeRank loss -> SimCSE loss.