File size: 5,354 Bytes

---
language: ko
license: mit
tags:
- bart
---


# Model Card for kobart-base-v2 

 
# Model Details
 
## Model Description
 
[**BART**](https://arxiv.org/pdf/1910.13461.pdf)(**B**idirectional and **A**uto-**R**egressive **T**ransformers)는 입력 텍스트 일부에 노이즈를 추가하여 이를 다시 원문으로 복구하는 `autoencoder`의 형태로 학습이 됩니다. 한국어 BART(이하 **KoBART**) 는 논문에서 사용된 `Text Infilling` 노이즈 함수를 사용하여 **40GB** 이상의 한국어 텍스트에 대해서 학습한 한국어 `encoder-decoder` 언어 모델입니다. 이를 통해 도출된 `KoBART-base`를 배포합니다.

- **Developed by:** More information needed
- **Shared by [Optional]:** Heewon(Haven) Jeon
- **Model type:** Feature Extraction 
- **Language(s) (NLP):** Korean
- **License:** MIT
- **Parent Model:** BART
- **Resources for more information:**
  - [GitHub Repo](https://github.com/haven-jeon/KoBART)
   - [Model Demo Space](https://huggingface.co./spaces/gogamza/kobart-summarization)
 	


# Uses
 

## Direct Use
This model can be used for the task of Feature Extraction.
 
## Downstream Use [Optional]
 
More information needed.
 
## Out-of-Scope Use
 
The model should not be used to intentionally create hostile or alienating environments for people. 
 
# Bias, Risks, and Limitations
 
 
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.



## Recommendations
 
 
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

# Training Details
 
## Training Data
 
| Data  | # of Sentences |
|-------|---------------:|
| Korean Wiki |     5M   |  
| Other corpus |  0.27B    | 
 
한국어 위키 백과 이외, 뉴스, 책, [모두의 말뭉치 v1.0(대화, 뉴스, ...)](https://corpus.korean.go.kr/), [청와대 국민청원](https://github.com/akngs/petitions) 등의 다양한 데이터가 모델 학습에 사용되었습니다.
 
`vocab` 사이즈는 30,000 이며 대화에 자주 쓰이는 아래와 같은 이모티콘, 이모지 등을 추가하여 해당 토큰의 인식 능력을 올렸습니다. 
> 😀, 😁, 😆, 😅, 🤣, .. , `:-)`, `:)`, `-)`, `(-:`...
 
## Training Procedure

 
### Tokenizer
 
[`tokenizers`](https://github.com/huggingface/tokenizers) 패키지의 `Character BPE tokenizer`로 학습되었습니다. 
 
 


 
### Speeds, Sizes, Times
| Model       |  # of params |   Type   | # of layers  | # of heads | ffn_dim | hidden_dims | 
|--------------|:----:|:-------:|--------:|--------:|--------:|--------------:|
| `KoBART-base` |  124M  |  Encoder |   6     | 16      | 3072    | 768 | 
|               |        | Decoder |   6     | 16      | 3072    | 768 |

 
# Evaluation
 
 
## Testing Data, Factors & Metrics
 
### Testing Data
 
More information needed 
 
 
### Factors
More information needed
 
### Metrics
 
More information needed
 
 
## Results 
 
NSMC
- acc. : 0.901

The model authors also note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):

|   |  [NSMC](https://github.com/e9t/nsmc)(acc)  | [KorSTS](https://github.com/kakaobrain/KorNLUDatasets)(spearman) | [Question Pair](https://github.com/aisolab/nlp_classification/tree/master/BERT_pairwise_text_classification/qpair)(acc) |
|---|---|---|---|
| **KoBART-base**  | 90.24  | 81.66  | 94.34  |
 
# Model Examination
 
More information needed
 
# Environmental Impact
 
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed
 
# Technical Specifications [optional]
 
## Model Architecture and Objective

More information needed 
 
## Compute Infrastructure
 
More information needed 
 
### Hardware
 
 
More information needed
 
### Software
 
More information needed.
 
# Citation

 
**BibTeX:**
 
 
More information needed.

 
 
 
 
# Glossary [optional]
More information needed 
 
# More Information [optional]
More information needed 

 
# Model Card Authors [optional]
 
 Heewon(Haven) Jeon in collaboration with Ezi Ozoani and the Hugging Face team


# Model Card Contact
 The model authors note in the [GitHub Repo](https://github.com/haven-jeon/KoBART):
`KoBART` 관련 이슈는 [이곳](https://github.com/SKT-AI/KoBART/issues)에 올려주세요.
 
# How to Get Started with the Model
 
Use the code below to get started with the model.
 
<details>
<summary> Click to expand </summary>

```python
 from transformers import PreTrainedTokenizerFast, BartModel

tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-base-v2')
model = BartModel.from_pretrained('gogamza/kobart-base-v2')
 ```
</details>