metadata
language:
  - ko
metrics:
  - accuracy
library_name: transformers

KLUE RoBERTa-base for legal documents

  • This model is the KLUE/RoBERTa-base model further trained on legal_text_merged02_light.txt, a file consisting of court rulings.

Model Details

Model Description

  • Developed by: J.Park @ KETI
  • Model type: klue/roberta-base
  • Language(s) (NLP): Korean
  • License: [More Information Needed]
  • Finetuned from model [optional]: klue/roberta-base

Training method

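# fpath_dataset and output_dir are not defined in this snippet; set them before running.
#   fpath_dataset: path to the training corpus (legal_text_merged02_light.txt)
#   output_dir: directory where checkpoints and the final model are written

# Base checkpoint and tokenizer to continue training from
base_model = 'klue/roberta-base'
base_tokenizer = 'klue/roberta-base'

from transformers import AutoTokenizer, RobertaForMaskedLM
model = RobertaForMaskedLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_tokenizer)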

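# Treat each line of the file as one example, truncated to block_size tokens.
# Note: LineByLineTextDataset is deprecated in recent transformers releases.
from transformers import LineByLineTextDataset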
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=fpath_dataset,
    block_size=512,
)

# Masked language modeling: 15% of tokens are selected for prediction in each batch
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=18,
    save_steps=100,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

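# Train, save the final model locally, then upload it
# (push_to_hub requires being logged in to the Hugging Face Hub)
train_metrics = trainer.train()
trainer.save_model(output_dir)
trainer.push_to_hub()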
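
The `mlm_probability=0.15` setting above controls how many tokens the collator selects for prediction. As a rough illustration of what that means, here is a minimal pure-Python sketch of BERT-style masking: of the selected positions, about 80% become the mask token, 10% become a random token, and 10% stay unchanged. This is a simplification and not the library's implementation; for example, it does not skip special tokens. The function name `mask_tokens` and the `mask_id` and `vocab_size` values are hypothetical, chosen only for the example.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=1):
    """Return (inputs, labels) using a simplified 80/10/10 masking scheme."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    # -100 marks positions that the loss function should ignore
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise leave the original token in place
    return inputs, labels


# Toy usage with made-up token ids (mask_id and vocab_size are assumptions)
toy_ids = list(range(100, 120))
toy_inputs, toy_labels = mask_tokens(toy_ids, mask_id=4, vocab_size=32000)
print(toy_inputs)
print(toy_labels)
```

Because the masking is redrawn for every batch, the same sentence is masked differently across the 5 training epochs.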