---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
license: cc-by-sa-4.0
datasets:
- klue
language:
- ko
---

๋ณธ ๋ชจ๋ธ์€ multi-task loss (MultipleNegativeLoss -> AnglELoss) ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

## Usage (HuggingFace Transformers)

```python
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel

device = torch.device('cuda')
batch_size = 32

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}').to(device)
model.eval()

# Tokenize once, then batch the per-sentence features with a DataLoader
tokenized_data = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
dataset = [{k: v[i] for k, v in tokenized_data.items()} for i in range(len(sentences))]
dataloader = DataLoader(dataset, batch_size=batch_size, pin_memory=True)

all_outputs = torch.zeros((len(sentences), model.config.hidden_size), device=device)
start_idx = 0

# Mean pooling over the token embeddings is used for the sentence representation
with torch.no_grad():
    for inputs in tqdm(dataloader):
        inputs = {k: v.to(device) for k, v in inputs.items()}
        representations, _ = model(**inputs, return_dict=False)
        attention_mask = inputs['attention_mask']
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(representations.size()).to(representations.dtype)
        summed = torch.sum(representations * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        end_idx = start_idx + representations.shape[0]
        all_outputs[start_idx:end_idx] = summed / sum_mask
        start_idx = end_idx
```
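
The resulting `all_outputs` tensor holds one mean-pooled embedding per sentence; as a quick illustration, pairwise similarity can then be read off via cosine similarity:

```python
import torch.nn.functional as F

# Pairwise cosine similarity between the example sentences
embeddings = F.normalize(all_outputs, p=2, dim=1)
print(embeddings @ embeddings.T)
```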


## Evaluation Results

| Organization | Backbone Model | KlueSTS average | KorSTS average |
| -------- | ------- | ------- | ------- |
| team-lucid | DeBERTa-base | 54.15 | 29.72 |
| monologg | Electra-base | 66.97 | 29.72 |
| LMkor | Electra-base | 70.98 | 43.09 |
| deliciouscat | DeBERTa-base | - | 67.65 |
| BM-K    | Roberta-base | 82.93 | **85.77** |
| Klue    | Roberta-large | **86.71** | 71.70 |
| Klue (Hyperparameter searched) | Roberta-large | 86.21 | 75.54 |

Noting that existing Korean sentence embedding models were trained on machine-translated English datasets such as MNLI and SNLI, we trained on the Klue datasets instead.

As a result, the model trained on the Klue-Roberta-large backbone showed solid performance on both the KlueSTS and KorSTS test sets, which we take as evidence that it forms a more elaborate representation.

Note, however, that the evaluation scores can vary considerably with hyperparameter settings, random seeds, and so on, so please treat them as a reference only.
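
For context, STS benchmarks such as KlueSTS and KorSTS are commonly scored as the Spearman correlation between the model's cosine similarities and the gold labels; the helper below is a sketch under that assumption, as the card does not specify the exact evaluation script.

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def sts_spearman(emb_a: torch.Tensor, emb_b: torch.Tensor, gold: list[float]) -> float:
    """Spearman correlation (x100) between cosine similarities and gold scores."""
    cos = F.cosine_similarity(emb_a, emb_b).cpu().numpy()
    return spearmanr(cos, gold).correlation * 100
```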

## Training
The model was trained with NegativeRank loss -> SimCSE loss.
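
For reference, the SimCSE-style objective is the standard in-batch contrastive loss; below is a minimal PyTorch sketch of what it computes (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of emb_a should match row i of emb_b;
    every other row in the batch acts as a negative."""
    sim = F.normalize(emb_a, dim=-1) @ F.normalize(emb_b, dim=-1).T / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```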