---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---
## Model Details

**Model Developers:** Sogang University SGEconFinlab

### Model Description
This model is a language model specialized in economics and finance, fine-tuned on a range of economics- and finance-related data. The data sources are listed below. We are not releasing the training data itself, as it was collected for research and policy purposes; if you wish to use the original data rather than our processed training data, please contact the respective original authors directly for permission.
- **Developed by:** Sogang University SGEconFinlab (https://sc.sogang.ac.kr/aifinlab/)
- **Language(s) (NLP):** Korean, English
- **License:** apache-2.0
- **Base Model:** yanolja/KoSOLAR-10.7B-v0.2
## How to Get Started with the Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

peft_model_id = "SGEcon/KoSOLAR-10.7B-v0.2_fin_v4"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model with 4-bit NF4 quantization so it fits on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, quantization_config=bnb_config, device_map={"": 0})
# Attach the LoRA adapter to the quantized base model
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model.eval()
```
```python
def gen(x):
    # The adapter was trained on a "### 질문: ... / ### 답변: ..." prompt template
    inputs = tokenizer(f"### 질문: {x}\n\n### 답변:", return_tensors='pt', return_token_type_ids=False)
    # Move inputs to the GPU if one is available
    inputs = {k: v.to(device="cuda" if torch.cuda.is_available() else "cpu") for k, v in inputs.items()}
    gened = model.generate(
        **inputs,
        max_new_tokens=256,
        num_return_sequences=4,
        do_sample=True,
        eos_token_id=tokenizer.eos_token_id,
        temperature=0.9,
        top_p=0.8,
        top_k=50
    )
    complete_answers = []
    for gen_seq in gened:
        decoded = tokenizer.decode(gen_seq, skip_special_tokens=True).strip()
        # Keep only the text after the first "### 답변:" marker
        first_answer_start_idx = decoded.find("### 답변:") + len("### 답변:")
        temp_answer = decoded[first_answer_start_idx:].strip()
        # If the model starts a second "### 답변:" turn, truncate there
        second_answer_start_idx = temp_answer.find("### 답변:")
        if second_answer_start_idx != -1:
            complete_answer = temp_answer[:second_answer_start_idx].strip()
        else:
            complete_answer = temp_answer  # no second marker, keep the whole answer
        complete_answers.append(complete_answer)
    return complete_answers
```
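A minimal usage sketch (the example question below is illustrative and not part of the original card):

```python
# Ask an economics question in Korean ("What is inflation?"); gen() returns
# the four sampled answers requested via num_return_sequences=4.
answers = gen("인플레이션이란 무엇인가요?")
for i, answer in enumerate(answers, 1):
    print(f"[{i}] {answer}")
```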
## Training Details

### Training Data
- Bank of Korea: 700 Economic and Financial Terms (https://www.bok.or.kr/portal/bbs/B0000249/view.do?nttId=235017&menuNo=200765)
- Financial Supervisory Service: Financial Terms Dictionary from the FINE financial consumer information portal (https://fine.fss.or.kr/fine/fnctip/fncDicary/list.do?menuNo=900021)
- KDI Economic Information Center: Current Affairs Glossary (https://eiec.kdi.re.kr/material/wordDic.do)
- The Korea Economic Daily / Hankyung.com: Hankyung Dictionary of Economic Terms (https://terms.naver.com/list.naver?cid=42107&categoryId=42107), Today's TESAT (https://www.tesat.or.kr/bbs.frm.list/tesat_study?s_cateno=1), Today's Junior TESAT (https://www.tesat.or.kr/bbs.frm.list/tesat_study?s_cateno=5), Saenggeul Saenggeul Hankyung (https://sgsg.hankyung.com/tesat/study)
- Ministry of SMEs and Startups / Government of the Republic of Korea: Ministry of SMEs and Startups specialized terminology (https://terms.naver.com/list.naver?cid=42103&categoryId=42103)
- Go Seong-sam / Beommun Publishing: Dictionary of Accounting and Tax Terms (https://terms.naver.com/list.naver?cid=51737&categoryId=51737)
- Word index from Mankiw's Principles of Economics, 8th edition
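Most of the sources above are term-definition dictionaries. The preprocessing pipeline is not published, but the inference code's prompt template suggests entries were rendered into a question/answer format; the sketch below is a hypothetical illustration of that step, where the question phrasing and the sample entry are our own assumptions:

```python
# Hypothetical rendering of a glossary entry into the "### 질문 / ### 답변"
# template used at inference time. The question phrasing and the sample entry
# are illustrative; the actual training-data preprocessing is not released.
def to_training_example(term: str, definition: str) -> str:
    return f"### 질문: {term}에 대해 설명해주세요.\n\n### 답변: {definition}"

print(to_training_example("기준금리", "중앙은행이 금융기관과 거래할 때 기준이 되는 정책 금리."))
```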
### Training Procedure

#### Training Hyperparameters
- LoRA
  - r = 16
  - lora_alpha = 16
  - target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"] (target module names differ between model architectures)
  - lora_dropout = 0.05
  - bias = "none"
  - task_type = "CAUSAL_LM"
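For reference, the values above map onto a peft `LoraConfig` roughly as follows (a sketch of how the listed hyperparameters fit the peft API; the full training script is not published):

```python
from peft import LoraConfig

# LoRA configuration assembled from the hyperparameters listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    # Target module names differ between model architectures.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```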
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
[More Information Needed]
### Results
[More Information Needed]