PumeTu committed · Commit 2da5c2f · 1 Parent(s): 0815dd9

update readme

Files changed (1): README.md (+137, -29)

README.md CHANGED

@@ -11,30 +11,114 @@ model-index:
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # llama3.1-8b-legal-combine-ccl-16
-
- This model is a fine-tuned version of [/ist-project/scads/pumet/models/Meta-Llama-3.1-8B](https://huggingface.co//ist-project/scads/pumet/models/Meta-Llama-3.1-8B) on the /ist-project/scads/pumet/WangchanX/datasets/legal-combine-ccl dataset.
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
- The following hyperparameters were used during training:

@@ -49,13 +133,37 @@ The following hyperparameters were used during training:

- ### Training results
-
- ### Framework versions
-
- - Transformers 4.44.2
- - Pytorch 2.4.1+cu121
- - Datasets 3.0.1
- - Tokenizers 0.19.1

results: []
---

# Llama-3.1-Legal-ThaiCCL-8B

Llama-3.1-Legal-ThaiCCL-8B is a large language model built on Llama-3.1-8B and designed to answer Thai legal questions. It was fully fine-tuned on the [WangchanX Thai Legal dataset](https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG) using the [WangchanX Finetuning pipeline](https://github.com/vistec-AI/WangchanX). The model is intended to be used with a Retrieval-Augmented Generation (RAG) system that retrieves relevant legal documents for the model to reference when answering questions.
 
 
## Model description
- Base model: [Meta Llama 3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
- Training repository: [WangchanX Finetuning Pipeline](https://github.com/vistec-AI/WangchanX)
- Training dataset: [WangchanX Thai Legal dataset](https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG)
- License: [Meta's Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/)

## Model Usage
```python
import torch
import transformers

EN_QA_TEMPLATE = "Given the user's query in the context of Thai legal matters, the RAG system retrieves the top_n related documents. From these documents, it's crucial to identify and utilize only the most relevant ones to craft an accurate and informative response.Context information is below.\n\n---------------------\nContext: Thai legal domain\nQuery: {query_str}\nRetrieved Documents: {context_str}\n---------------------\n\n Using the provided context information and the list of retrieved documents, you will focus on selecting the documents that are most relevant to the user's query. This selection process involves evaluating the content of each document for its pertinency to the query, ensuring that the response is based on accurate and contextually appropriate information.Based on the selected documents, you will synthesize a response that addresses the user's query, drawing directly from the content of these documents to provide a precise, legally informed answer.You must answer in Thai.\nAnswer:"

EN_SYSTEM_PROMPT_STR = """You are a legal assistant named Sommai (สมหมาย in Thai). You provide legal advice in a friendly, clear, and approachable manner. When answering questions, you reference the relevant law sections, including the name of the act or code they are from. You explain what these sections entail, including any associated punishments, fees, or obligations. Your tone is polite yet informal, making users feel comfortable, like consulting a trusted friend. If a question falls outside your knowledge, you must respond with the exact phrase: 'สมหมายไม่สามารถตอบคำถามนี้ได้ครับ'. You avoid making up information and guide users based on accurate legal references relevant to their situation. Where applicable, you provide practical advice, such as preparing documents, seeking medical attention, or contacting authorities. If asked about past Supreme Court judgments, you must state that you do not have information on those judgments at this time."""

query = "การร้องขอให้ศาลสั่งให้บุคคลเป็นคนไร้ความสามารถมีหลักเกณฑ์การพิจารณาอย่างไร"

context = """ประมวลกฎหมายแพ่งและพาณิชย์ มาตรา 33 ในคดีที่มีการร้องขอให้ศาลสั่งให้บุคคลใดเป็นคนไร้ความสามารถเพราะวิกลจริต ถ้าทางพิจารณาได้ความว่าบุคคลนั้นไม่วิกลจริต แต่มีจิตฟั่นเฟือนไม่สมประกอบ เมื่อศาลเห็นสมควรหรือเมื่อมีคำขอของคู่ความหรือของบุคคลตามที่ระบุไว้ในมาตรา 28 ศาลอาจสั่งให้บุคคลนั้นเป็นคนเสมือนไร้ความสามารถก็ได้ หรือในคดีที่มีการร้องขอให้ศาลสั่งให้บุคคลใดเป็นคนเสมือนไร้ความสามารถเพราะมีจิตฟั่นเฟือนไม่สมประกอบ ถ้าทางพิจารณาได้ความว่าบุคคลนั้นวิกลจริต เมื่อมีคำขอของคู่ความหรือของบุคคลตามที่ระบุไว้ในมาตรา 28 ศาลอาจสั่งให้บุคคลนั้นเป็นคนไร้ความสามารถก็ได้"""

model_id = "airesearch/LLaMa3.1-8B-Legal-ThaiCCL-Combine"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

sample = [
    {"role": "system", "content": EN_SYSTEM_PROMPT_STR},
    {"role": "user", "content": EN_QA_TEMPLATE.format(context_str=context, query_str=query)},
]

prompt = pipeline.tokenizer.apply_chat_template(
    sample, tokenize=False, add_generation_prompt=True
)

# Stop generation at the standard end-of-sequence token or Llama 3's
# end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# The pipeline output echoes the prompt, so print only the generated text.
print(outputs[0]["generated_text"][len(prompt):])
```

## Training Data
The model was trained on the [WangchanX Legal ThaiCCL RAG dataset](https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG), a Thai legal question-answering dataset in which a RAG system retrieves relevant supporting legal documents for each question for the LLM to reference in its answer. For more information on how the dataset was created, please refer to this [blog post](https://medium.com/airesearch-in-th/ชุดข้อมูลกฎหมายไทยสำหรับการพัฒนา-rag-0eb2eab283a1).
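
To get oriented, here is a minimal sketch of loading and inspecting the dataset with the `datasets` library. The split name and the field names (`question`, `positive_contexts`, `positive_answer`) are taken from the formatting snippet later in this card, so treat them as assumptions rather than a documented schema:
```python
from datasets import load_dataset

# Assumption: the dataset exposes a "train" split with the fields used by
# the formatting snippet below (question, positive_contexts, positive_answer).
dataset = load_dataset("airesearch/WangchanX-Legal-ThaiCCL-RAG", split="train")

example = dataset[0]
print(example["question"])                # the Thai legal question
print(len(example["positive_contexts"]))  # number of supporting documents
print(example["positive_answer"][:200])   # reference answer, truncated
```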

To emulate a real-world use case, during training we incorporated both the positive and negative contexts (when available) into the prompt. We found that this produced a model that is more robust in cases where the RAG system passes in irrelevant contexts mixed with the correct one (refer to the [evaluation](#evaluation) section for results).
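
Continuing from the loading sketch above, that context mixing can be illustrated as follows. Note that `negative_contexts` is a hypothetical field name used purely for illustration (check the dataset schema for the actual column), and `EN_QA_TEMPLATE` is the question template defined in the Model Usage section:
```python
def build_context(example, max_docs=5):
    # Start with the relevant (positive) documents...
    docs = [d["text"] for d in example["positive_contexts"][:max_docs]]
    # ...then append distractor documents when the record has any.
    # NOTE: "negative_contexts" is a hypothetical field name.
    if example.get("negative_contexts"):
        docs += [d["text"] for d in example["negative_contexts"][:max_docs]]
    return "".join(docs)

user_turn = EN_QA_TEMPLATE.format(
    query_str=example["question"],
    context_str=build_context(example),
)
```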

### Prompt Format
We recommend using the same chat template (system prompt and question template of context, query, and retrieved documents) when using the provided weights, since the model was trained with that specific system prompt and question template. Example input prompt:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a legal assistant named Sommai (สมหมาย in Thai), you provide legal advice to users in a friendly and understandable manner. When answering questions, you specifically reference the law sections relevant to the query, including the name of the act or code they originated from, an explanation of what those sections entail, and any associated punishments or fees. Your tone is approachable and informal yet polite, making users feel as if they are seeking advice from a friend. If a question arises that does not match the information you possess, you must acknowledge your current limitations by stating this exact sentence: 'สมหมายไม่สามารถตอบคำถามนี้ได้ครับ'. You will not fabricate information but rather guide users based on actual law sections relevant to their situation. Additionally, you offer practical advice on next steps, such as gathering required documents, seeking medical attention, or visiting a police station, as applicable. If inquired about past Supreme Court judgments, you must reply that you do not have information on those judgments yet.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Given the user's query in the context of Thai legal matters, the RAG system retrieves the top_n related documents. From these documents, it's crucial to identify and utilize only the most relevant ones to craft an accurate and informative response.

Context information is below.
---------------------
Context: Thai legal domain
Query: {question}
Retrieved Documents: {retrieved legal documents}
---------------------

Using the provided context information and the list of retrieved documents, you will focus on selecting the documents that are most relevant to the user's query. This selection process involves evaluating the content of each document for its pertinency to the query, ensuring that the response is based on accurate and contextually appropriate information.
Based on the selected documents, you will synthesize a response that addresses the user's query, drawing directly from the content of these documents to provide a precise, legally informed answer.
You must answer in Thai.
Answer:
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```

Here is a Python snippet showing how to apply the chat template, with the provided system prompt and question template, to the WangchanX Legal Thai CCL dataset (the `load_dataset` call and split name are added here for completeness and may need adjusting):
```python
from datasets import load_dataset

EN_QA_TEMPLATE = "Given the user's query in the context of Thai legal matters, the RAG system retrieves the top_n related documents. From these documents, it's crucial to identify and utilize only the most relevant ones to craft an accurate and informative response.Context information is below.\n\n---------------------\nContext: Thai legal domain\nQuery: {query_str}\nRetrieved Documents: {context_str}\n---------------------\n\n Using the provided context information and the list of retrieved documents, you will focus on selecting the documents that are most relevant to the user's query. This selection process involves evaluating the content of each document for its pertinency to the query, ensuring that the response is based on accurate and contextually appropriate information.Based on the selected documents, you will synthesize a response that addresses the user's query, drawing directly from the content of these documents to provide a precise, legally informed answer.You must answer in Thai.\nAnswer:"

EN_SYSTEM_PROMPT_STR = """You are a legal assistant named Sommai (สมหมาย in Thai). You provide legal advice in a friendly, clear, and approachable manner. When answering questions, you reference the relevant law sections, including the name of the act or code they are from. You explain what these sections entail, including any associated punishments, fees, or obligations. Your tone is polite yet informal, making users feel comfortable, like consulting a trusted friend. If a question falls outside your knowledge, you must respond with the exact phrase: 'สมหมายไม่สามารถตอบคำถามนี้ได้ครับ'. You avoid making up information and guide users based on accurate legal references relevant to their situation. Where applicable, you provide practical advice, such as preparing documents, seeking medical attention, or contacting authorities. If asked about past Supreme Court judgments, you must state that you do not have information on those judgments at this time."""

def format_example(example):
    # Strip the "คำตอบ: " ("Answer: ") prefix from the reference answer.
    if "คำตอบ: " in example["positive_answer"]:
        example["positive_answer"] = example["positive_answer"].replace("คำตอบ: ", "")
    if example["positive_contexts"]:
        # Concatenate the top five positive contexts into one context string.
        context = "".join([v["text"] for v in example["positive_contexts"][:5]])
        message = [
            {"content": EN_SYSTEM_PROMPT_STR, "role": "system"},
            {"content": EN_QA_TEMPLATE.format(query_str=example["question"], context_str=context), "role": "user"},
        ]
    else:
        message = [
            {"content": EN_SYSTEM_PROMPT_STR, "role": "system"},
            {"content": EN_QA_TEMPLATE.format(query_str=example["question"], context_str=" "), "role": "user"},
        ]
    return dict(messages=message)

dataset = load_dataset("airesearch/WangchanX-Legal-ThaiCCL-RAG", split="train")  # split name is an assumption
dataset = dataset.map(format_example, batched=False)
```
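
To turn the mapped `messages` into a single prompt string for training or inference, apply the tokenizer's chat template. A minimal sketch, assuming the fine-tuned repository ships the same chat template used during training:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("airesearch/LLaMa3.1-8B-Legal-ThaiCCL-Combine")

# Render the first formatted record, ending with the assistant header so
# the model would generate the answer next.
prompt = tokenizer.apply_chat_template(
    dataset[0]["messages"],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```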

### Training hyperparameters
We fully fine-tuned Llama-3.1-8B using the following hyperparameters:

- learning_rate: 0.0002
- train_batch_size: 4
- eval_batch_size: 4
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 4

Total training time: 2:15:14.66
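
These values map directly onto Hugging Face `TrainingArguments`. The sketch below mirrors the list above under the assumption that training used the standard `transformers` Trainer stack; the [WangchanX Finetuning Pipeline](https://github.com/vistec-AI/WangchanX) repository holds the authoritative configuration, and settings not listed above (optimizer, scheduler type, seed, precision) are left at library defaults here:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3.1-8b-legal-thaiccl",  # hypothetical output path
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_ratio=0.1,
    num_train_epochs=4,
)
```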

## Evaluation
We tested our model on the test set of the WangchanX Legal Thai CCL dataset using both traditional MRC (machine reading comprehension) metrics and an LLM-as-judge technique based on the paper [CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects](https://aclanthology.org/2024.genbench-1.10.pdf).

Note: `LLaMa3.1-8B-Legal-ThaiCCL` was trained only on positive contexts, while `LLaMa3.1-8B-Legal-ThaiCCL-Combine` was trained on both positive and negative contexts.
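
For reference, the character and word error rates in Table 1 below can be computed with the `evaluate` library. This is a minimal sketch with hypothetical prediction and reference strings; ROUGE-L and BERTScore additionally require a Thai word tokenizer and a multilingual encoder, which are omitted here:
```python
import evaluate

# Hypothetical model output and gold answer, for illustration only.
predictions = ["มาตรา 33 กำหนดหลักเกณฑ์ไว้ว่า ..."]
references = ["ประมวลกฎหมายแพ่งและพาณิชย์ มาตรา 33 กำหนดว่า ..."]

cer = evaluate.load("cer")  # character error rate
wer = evaluate.load("wer")  # word error rate over whitespace tokens

print(cer.compute(predictions=predictions, references=references))
print(wer.compute(predictions=predictions, references=references))
```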

### Table 1: MRC Results
| Model | Context Type | Answer Type | ROUGE-L | Character Error Rate (CER) | Word Error Rate (WER) | BERT Score | F1-score XQuAD | Exact Match XQuAD |
|-------|--------------|-------------|---------|----------------------------|-----------------------|------------|----------------|-------------------|
| Zero-shot LLaMa3.1-8B-Instruct | Golden Passage | Only Positive | 0.553 | 1.181 | 1.301 | 0.769 | 48.788 | 0.0 |
| LLaMa3.1-8B-Legal-ThaiCCL | Golden Passage | Only Positive | 0.603 | 0.667 | 0.736 | 0.821 | 60.039 | 0.053 |
| LLaMa3.1-8B-Legal-ThaiCCL-Combine | Golden Passage | Only Positive | 0.715 | 0.695 | 0.758 | 0.833 | 64.578 | 0.614 |
| Zero-shot LLaMa3.1-70B-Instruct | Golden Passage | Only Positive | 0.830 | 0.768 | 0.848 | 0.830 | 61.497 | 0.0 |
| Zero-shot LLaMa3.1-8B-Instruct | Retrieval Passage | Only Positive | 0.422 | 1.631 | 1.773 | 0.757 | 39.639 | 0.0 |
| LLaMa3.1-8B-Legal-ThaiCCL | Retrieval Passage | Only Positive | 0.366 | 1.078 | 1.220 | 0.779 | 44.238 | 0.03 |
| LLaMa3.1-8B-Legal-ThaiCCL-Combine | Retrieval Passage | Only Positive | 0.516 | 0.884 | 0.884 | 0.816 | 54.948 | 0.668 |
| Zero-shot LLaMa3.1-70B-Instruct | Retrieval Passage | Only Positive | 0.616 | 0.934 | 1.020 | 0.816 | 54.930 | 0.0 |

### Table 2: CHIE Results
([H] = higher is better; [L] = lower is better)

| Model | Context Type | Answer Type | Q1: Correctness [H] | Q2: Helpfulness [H] | Q3: Irrelevancy [L] | Q4: Out-of-Context [L] |
|-------|--------------|-------------|---------------------|---------------------|---------------------|------------------------|
| Zero-shot LLaMa3.1-8B-Instruct | Golden Passage | Only Positive | 0.740 | 0.808 | 0.480 | 0.410 |
| LLaMa3.1-8B-Legal-ThaiCCL | Golden Passage | Only Positive | 0.705 | 0.486 | 0.294 | 0.208 |
| LLaMa3.1-8B-Legal-ThaiCCL-Combine | Golden Passage | Only Positive | 0.565 | 0.468 | 0.405 | 0.325 |
| Zero-shot LLaMa3.1-70B-Instruct | Golden Passage | Only Positive | 0.870 | 0.658 | 0.316 | 0.247 |
| Zero-shot LLaMa3.1-8B-Instruct | Retrieval Passage | Only Positive | 0.480 | 0.822 | 0.557 | 0.248 |
| LLaMa3.1-8B-Legal-ThaiCCL | Retrieval Passage | Only Positive | 0.274 | 0.470 | 0.720 | 0.191 |
| LLaMa3.1-8B-Legal-ThaiCCL-Combine | Retrieval Passage | Only Positive | 0.532 | 0.445 | 0.508 | 0.203 |
| Zero-shot LLaMa3.1-70B-Instruct | Retrieval Passage | Only Positive | 0.748 | 0.594 | 0.364 | 0.202 |

## License and use
The model is released under [Meta's Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/). Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc.