|
--- |
|
library_name: sentence-transformers |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- transformers |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- klue |
|
language: |
|
- ko |
|
--- |
|
|
|
This model was trained on the KlueNLI and KlueSTS datasets with a multi-task loss (MultipleNegativeLoss -> AnglELoss). The training code is available on [GitHub](https://github.com/comchobo/SFT_sent_emb?tab=readme-ov-file).
|
|
|
## Usage (HuggingFace Inference API)
|
|
|
```python |
|
import requests

API_URL = "https://api-inference.huggingface.co/models/sorryhyun/sentence-embedding-klue-large"
# The Inference API expects a bearer token in the Authorization header.
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": {
        "source_sentence": "좋아요, 추천, 알림설정까지",
        "sentences": [
            "좋아요 눌러주세요!!",
            "좋아요, 추천 등 저희들이 좋아해요",
            "알림설정을 눌러주시면 감사드리겠습니다."
        ]
    },
})

if __name__ == '__main__':
    print(output)
|
``` |
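For the sentence-similarity task, the Inference API responds with one similarity score per entry in `sentences`, in the same order. A minimal post-processing sketch, assuming that response shape:

```python
# `output` is the list returned by query() above: one float per candidate sentence,
# aligned with the "sentences" field of the payload.
best_idx = max(range(len(output)), key=lambda i: output[i])
print(f"most similar candidate: index {best_idx}, score {output[best_idx]:.3f}")
```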
|
|
|
## Usage (HuggingFace Transformers) |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel, DataCollatorWithPadding
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

device = torch.device('cuda')
batch_size = 32

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
model_name = 'sorryhyun/sentence-embedding-klue-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
collator = DataCollatorWithPadding(tokenizer)
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()

# Tokenize without padding; DataCollatorWithPadding pads each batch dynamically
tokenized_data = [tokenizer(sentence, truncation=True) for sentence in sentences]
dataloader = DataLoader(tokenized_data, batch_size=batch_size, pin_memory=True, collate_fn=collator)

# 1024 = hidden size of the RoBERTa-large backbone
all_outputs = torch.zeros((len(tokenized_data), 1024)).to(device)
start_idx = 0

# Mean pooling over the token embeddings is used for the sentence representation
with torch.no_grad():
    for inputs in tqdm(dataloader):
        inputs = {k: v.to(device) for k, v in inputs.items()}
        representations, _ = model(**inputs, return_dict=False)
        attention_mask = inputs["attention_mask"]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(representations.size()).to(representations.dtype)
        summed = torch.sum(representations * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        end_idx = start_idx + representations.shape[0]
        all_outputs[start_idx:end_idx] = summed / sum_mask
        start_idx = end_idx
``` |
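Since the model targets sentence similarity, the mean-pooled embeddings in `all_outputs` are typically compared with cosine similarity. A short follow-up sketch (the choice of cosine similarity is an assumption):

```python
import torch.nn.functional as F

# L2-normalize the pooled embeddings so that a dot product equals cosine similarity.
embeddings = F.normalize(all_outputs, p=2, dim=1)
similarity_matrix = embeddings @ embeddings.T
print(similarity_matrix[0, 1].item())  # similarity between the two example sentences
```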
|
|
|
|
|
## Evaluation Results |
|
|
|
| Organization | Backbone Model | KlueSTS average | KorSTS average | |
|
| -------- | ------- | ------- | ------- | |
|
| team-lucid | DeBERTa-base | 54.15 | 29.72 | |
|
| monologg | Electra-base | 66.97 | 40.98 | |
|
| LMkor | Electra-base | 70.98 | 43.09 | |
|
| deliciouscat | DeBERTa-base | - | 67.65 | |
|
| BM-K | Roberta-base | 82.93 | **85.77** | |
|
| Klue | Roberta-large | **86.71** | 71.70 | |
|
| Klue (Hyperparameter searched) | Roberta-large | 86.21 | 75.54 | |
|
|
|
Existing Korean sentence-embedding models have typically been trained on machine-translated English datasets such as MNLI and SNLI; with that in mind, this model was trained on the KLUE datasets instead.
|
|
|
As a result, training on top of Klue-Roberta-large yielded respectable performance on both the KlueSTS and KorSTS test sets, which we take as a sign that the model forms a more elaborate representation.
|
|
|
Note, however, that these evaluation figures can vary considerably depending on hyperparameter settings, the random seed, and so on.
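For reference, numbers of this kind can be reproduced with the sentence-transformers evaluator. The sketch below assumes the KlueSTS average refers to the embedding-similarity correlation on the KLUE STS validation split and that the column names follow the `klue` dataset on the Hub:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('sorryhyun/sentence-embedding-klue-large')

# KLUE STS dev set: gold scores sit in a nested `labels` struct on a 0-5 scale.
sts = load_dataset('klue', 'sts', split='validation')
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=[row['sentence1'] for row in sts],
    sentences2=[row['sentence2'] for row in sts],
    scores=[row['labels']['label'] / 5.0 for row in sts],
    name='klue-sts-dev',
)
print(evaluator(model))  # correlation between cosine similarity and the gold scores
```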
|
|
|
## Training |
|
The model was trained with a staged loss setup: NegativeRank loss -> SimCSE loss.
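Below is a minimal sketch of what a two-stage setup of this kind could look like with sentence-transformers, using the MultipleNegativesRankingLoss -> AnglELoss pairing named in the introduction. The dataset columns, entailment filtering, and hyperparameters are assumptions for illustration; the linked GitHub repository contains the actual training code.

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed backbone: the KLUE RoBERTa-large checkpoint listed in the table above.
model = SentenceTransformer('klue/roberta-large')

# Stage 1: in-batch contrastive training on KLUE NLI entailment pairs.
nli = load_dataset('klue', 'nli', split='train')
nli_examples = [InputExample(texts=[row['premise'], row['hypothesis']])
                for row in nli if row['label'] == 0]  # 0 == entailment
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=64)
model.fit(train_objectives=[(nli_loader, losses.MultipleNegativesRankingLoss(model))],
          epochs=1, warmup_steps=100)

# Stage 2: angle-optimized fine-tuning on KLUE STS pairs with scores normalized to [0, 1].
sts = load_dataset('klue', 'sts', split='train')
sts_examples = [InputExample(texts=[row['sentence1'], row['sentence2']],
                             label=row['labels']['label'] / 5.0)
                for row in sts]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=64)
model.fit(train_objectives=[(sts_loader, losses.AnglELoss(model))],
          epochs=1, warmup_steps=100)

model.save('sentence-embedding-klue-large')
```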
|
|
|
|