---
language:
- ko
metrics:
- bleu
pipeline_tag: text2text-generation
---

# Jeju-Standard Bidirectional Translation Model

## **1. Introduction**

### **Members**

- **Bitamin 12th cohort : ๊ตฌ์คํ, ์ด์ํ, ์ด์๋ฆฐ**
- **Bitamin 13th cohort : ๊น์ค์, ๊น์ฌ๊ฒธ, ์ดํ์**

### **Github Link**

- https://github.com/junhoeKu/Jeju_Translation.github.io

### **How to Use This Model**

- You can use this model with the `transformers` library to run inference.
- Below is an example of loading the model and generating a translation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Set up the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Junhoee/Kobart-Jeju-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("Junhoee/Kobart-Jeju-translation").to(device)

# Set up the input text
# Prepend the [제주] or [표준] token to match the translation direction,
# then add the sentence to translate
input_text = "[표준] 안녕하세요"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```

```text
Model Output: 안녕하수꽈
```

---

```python
# Set up the input text
# Prepend the [제주] or [표준] token to match the translation direction,
# then add the sentence to translate
input_text = "[제주] 안녕하수꽈"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```

```text
Model Output: 안녕하세요
```

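The two examples above differ only in the leading direction token: the model expects `[제주]` or `[표준]` in front of the sentence, naming the variety of the input text, and translates into the other variety. A small helper (hypothetical, not part of the original card) makes that convention explicit:

```python
def tag_input(sentence: str, source: str) -> str:
    """Prepend the direction token the model expects.

    source is "표준" if `sentence` is standard Korean, or "제주" if it
    is Jeju dialect; the model then translates into the other variety.
    Illustrative helper only, not part of the model card itself.
    """
    if source not in ("제주", "표준"):
        raise ValueError("source must be '제주' or '표준'")
    return f"[{source}] {sentence}"

# The tagged string is what gets passed to the tokenizer, e.g.:
# input_ids = tokenizer(tag_input("안녕하세요", "표준"), return_tensors="pt").input_ids
print(tag_input("안녕하세요", "표준"))  # [표준] 안녕하세요
```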
### **Parent Model**

- gogamza/kobart-base-v2
- https://huggingface.co./gogamza/kobart-base-v2

## **2. Dataset - about 930,000 rows**

- AI-Hub (Jeju dialect speech data + middle-aged and older speakers' dialect speech data)
- Github (Kakao Brain JIT data)
- Others
  - Jeju dictionary data (crawled from the Jeju Provincial Government website)
  - Song-lyric translation data (collected by hand from the 뭐랭하맨 YouTube channel)
  - Video data (collected by hand from the "제주방언 그 말과 멋" and "부에나도 지꺼져도" videos)
  - 2018 Jeju oral materials collection (collected by hand; used as evaluation data)

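Because one seq2seq network serves both directions, each parallel Jeju-standard sentence pair can yield two training examples, one per direction token. The card does not show its preprocessing, so the pair-construction sketch below is an assumption, consistent with the usage examples (a `[표준]`-tagged input produces Jeju output, and vice versa):

```python
def make_training_pairs(jeju: str, standard: str) -> list[tuple[str, str]]:
    """Build (source, target) examples for both translation directions.

    Assumption: the direction token names the variety of the input
    sentence, as in the usage examples above.
    """
    return [
        (f"[제주] {jeju}", standard),    # Jeju -> Standard
        (f"[표준] {standard}", jeju),    # Standard -> Jeju
    ]

pairs = make_training_pairs("안녕하수꽈", "안녕하세요")
print(pairs[0])  # ('[제주] 안녕하수꽈', '안녕하세요')
```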
## **3. Hyper Parameters**

- Epochs : 3
- Learning Rate : 2e-5
- Weight Decay : 0.01
- Batch Size : 32

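The listed hyper-parameters map directly onto a `transformers` `Seq2SeqTrainingArguments` object. The sketch below is an assumption about how training could be configured — the card does not ship the training script, and `output_dir` plus every argument not listed above is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

# Hedged sketch: only the four reported hyper-parameters come from the card;
# output_dir and all omitted arguments are placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="kobart-jeju-translation",  # placeholder path
    num_train_epochs=3,                    # Epochs : 3
    learning_rate=2e-5,                    # Learning Rate : 2e-5
    weight_decay=0.01,                     # Weight Decay : 0.01
    per_device_train_batch_size=32,        # Batch Size : 32
)
```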
## **4. Bleu Score**

- Measured on the 2018 Jeju oral materials collection (evaluation data)
  - Jeju -> Standard : 0.76
  - Standard -> Jeju : 0.5

- Measured on the validation split of the AI-Hub Jeju dialect speech data
  - Jeju -> Standard : 0.89
  - Standard -> Jeju : 0.77

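The reported scores are on a 0-1 scale. As a reference for how such numbers are computed, here is a minimal, self-contained corpus-BLEU sketch (uniform 4-gram weights, brevity penalty, no smoothing). It illustrates the metric only; it is not the authors' evaluation code, which is not included in the card:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU with uniform weights and brevity penalty (no smoothing)."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # candidate n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts = ngram_counts(h, n)
            r_counts = ngram_counts(r, n)
            clipped[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    log_precision = 0.0
    for n in range(max_n):
        if clipped[n] == 0:
            return 0.0   # no smoothing: an empty order zeroes the score
        log_precision += math.log(clipped[n] / total[n]) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return brevity * math.exp(log_precision)

# A perfect hypothesis scores 1.0
print(corpus_bleu(["the cat is on the mat"], ["the cat is on the mat"]))  # 1.0
```

In practice a maintained implementation such as sacrebleu is normally used for reported results.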
## **5. CREDIT**

- ๊ตฌ์คํ : [email protected]
- ๊น์ค์ : [email protected]
- ๊น์ฌ๊ฒธ : [email protected]
- ์ด์ํ : [email protected]
- ์ด์๋ฆฐ : [email protected]
- ์ดํ์ : [email protected]