|
---
base_model:
- openchat/openchat_3.5
language:
- ko
- en
library_name: adapter-transformers
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- finance
- biology
- legal
- art
- text-generation-inference
datasets:
- AIDX-ktds/ko_leaderboard
---
|
|
|
### ktdsbaseLM v0.11 was developed on openchat3.5 as its foundation model so that it can be applied to Korean and to Korea's diverse cultural contexts. It uses self-constructed Korean data covering 53 domains to understand the values and culture of Korean society.


# ❶ Model Description

- Model name and key features:
KTDSbaseLM v0.11 is a Mistral 7B / OpenChat 3.5-based model fine-tuned from the OpenChat 3.5 model using SFT.
It is designed to understand Korean and the diverse cultural contexts of Korea, drawing on self-constructed Korean data covering 135 domains to reflect the values and culture of Korean society.
Its key capabilities include text generation, conversational inference, document summarization, question answering, sentiment analysis, and a range of other NLP tasks,
and it can be applied in fields such as law, finance, science, education, business, and cultural research.

- Model architecture: KTDSBaseLM v0.11 is a high-performance language model with 7 billion parameters (7B), built on the Mistral 7B architecture.
It uses OpenChat 3.5 as its foundation model and was trained with supervised fine-tuning (SFT) to specialize in the Korean language and Korean culture.
Mistral 7B's lightweight structure provides fast inference and memory efficiency and is well suited to a wide range of NLP tasks,
delivering the performance needed for text generation, question answering, document summarization, and sentiment analysis.
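The 7B parameter count comes from the underlying Mistral 7B architecture. As a quick sanity check, the rough sketch below loads the foundation model listed in this card's metadata and counts its parameters (loading the fine-tuned checkpoint itself would give the same order of magnitude):

<pre><code>
from transformers import AutoModelForCausalLM

# Load the foundation model named in this card (openchat/openchat_3.5) and
# count its parameters; the Mistral 7B architecture has roughly 7.2B.
model = AutoModelForCausalLM.from_pretrained("openchat/openchat_3.5")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")
</code></pre>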
|
|
|
# ❷ Training Data

- ktdsbaseLM v0.11 was trained on a total of 3.6GB of data constructed in-house, comprising 2.33 million examples in all, including QnA, summarization, and classification data.
Of these, 1.33 million are multiple-choice questions spanning 53 domains, including Korean history, society, finance, law, tax, mathematics, biology, physics, and chemistry, trained with a Chain-of-Thought approach.
A further 1.3 million short-answer questions cover 38 domains such as Korean history, finance, law, tax, and mathematics.
The training data also includes examples that teach the model Korean social values and human emotions so that it responds according to the instructions it is given.

Training instruction dataset format:
`{"prompt": "prompt text", "completion": "ideal generated text"}`
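As a minimal sketch of how records in this prompt/completion format could be serialized to JSONL and loaded for SFT-style training (the file name and the two example records are illustrative assumptions, not items from the actual training set):

<pre><code>
import json
from datasets import load_dataset

# Hypothetical records in the {"prompt", "completion"} instruction format described above.
records = [
    {"prompt": "대한민국의 수도는 어디인가요?", "completion": "대한민국의 수도는 서울입니다."},
    {"prompt": "부가가치세의 기본 세율은 얼마인가요?", "completion": "대한민국의 부가가치세 기본 세율은 10%입니다."},
]

with open("sft_sample.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Load the JSONL file back as a Hugging Face dataset for supervised fine-tuning.
dataset = load_dataset("json", data_files="sft_sample.jsonl", split="train")
print(dataset[0]["prompt"])
</code></pre>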
|
|
|
|
|
# ❸ Use Cases

ktdsbaseLM v0.11 can be used across a wide range of applications. For example:

- Education: question answering and explanation generation for study material in subjects such as history, mathematics, and science.
- Business: answering legal, financial, and tax-related questions and summarizing documents.
- Research and culture: NLP tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.
- Customer service: generating conversations with users and providing personalized responses.

The model is highly versatile across many NLP tasks; the sketch below shows how one such query might be issued.
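A minimal sketch of the business use case, sending a tax-related question through the transformers `pipeline` API. The repository id, prompt, and generation settings are illustrative assumptions rather than recommended values:

<pre><code>
from transformers import pipeline

# Hypothetical Hub id; substitute the actual path of the ktdsbaseLM v0.11 checkpoint.
generator = pipeline(
    "text-generation",
    model="AIDX-ktds/ktdsbaseLM-v0.11",
    device_map="auto",
)

prompt = "연말정산에서 인적공제를 받을 수 있는 부양가족의 요건을 간단히 설명해 주세요."
outputs = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])
</code></pre>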
|
|
|
# ❹ Limitations

- ktdsBaseLM v0.11 is specialized for the Korean language and Korean culture.
Due to a lack of data in certain domains (e.g., the latest international material or highly specialized fields),
its responses about other languages or cultures may be less accurate.
It may also show limited reasoning ability on problems that require complex logical thinking,
and if biased data is included in training, biased responses may be generated.
|
|
|
# ❺ Usage

<pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = ''  #@param {type: "string"}  # path or Hub id of the ktdsbaseLM v0.11 checkpoint
device = 'auto'  #@param {type: "string"}

# Optional 4-bit quantized loading; drop quantization_config to load in full precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "안녕하세요."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
</code></pre>
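Since the foundation model, OpenChat 3.5, is conversation-tuned, multi-turn prompts generally work better when formatted with the tokenizer's chat template instead of raw text. A minimal sketch, assuming the tokenizer shipped with this checkpoint defines a chat template (if it does not, the raw-text call above still applies):

<pre><code>
# Wrap a user message in the tokenizer's chat template before generation.
messages = [
    {"role": "user", "content": "한국의 전통 명절을 두 가지만 소개해 주세요."}
]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    chat_outputs = model.generate(chat_inputs, max_new_tokens=512)

print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))
</code></pre>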
|
|
|
## In addition to OpenChat, KTDS plans to release LLMs fine-tuned on Korean culture and knowledge across a wide range of domains, based on other leading models such as LLaMA, Polyglot, and EEVE.
|