cris177 committed on
Commit afcd0c1 · verified · 1 Parent(s): 2000efe

Update README.md

Files changed (1)
  1. README.md +68 -1
README.md CHANGED
@@ -20,4 +20,71 @@ If officer smith found a broken window at the crime scene then the arson occurre
  [/INST] Premise 1: If officer smith found a broken window at the crime scene then the arson occurred on elm street Premise 2: Officer smith found a broken window at the crime scene Conclusion: The arson occurred on Elm Street Type of argument: modus ponen Validity: True </s>
  ```
 
- It was trained on my dataset cris177/Arguments (https://huggingface.co/datasets/cris177/Arguments)
+ It was trained on my dataset cris177/Arguments (https://huggingface.co/datasets/cris177/Arguments)
+
+
+ # Fine-Tuning a Large Language Model to Learn Arguments
+
+ Fine-tuning a large language model (LLM) to understand and generate logical arguments is a complex task. This article outlines the steps taken to fine-tune the LLaMA2-7B model, which included generating a dataset of arguments and evaluating the model's performance using a variety of benchmarks. Below are the detailed steps involved in this process.
+
+ ## Step 1: Generate a List of Statements with Their Respective Negations
+
+ The first step in building the dataset was to generate a list of statements along with their negations. This was done by prompting existing large language models (LLMs) to produce a diverse set of statement/negation pairs, which formed the foundation of the dataset (a sketch of this prompting step follows the examples below). For instance:
+
+ - Statement: "The sky is blue."
+ - Negation: "The sky is not blue."
+
+ - Statement: "Cats are mammals."
+ - Negation: "Cats are not mammals."
+
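+ The README does not include the generation script itself; the snippet below is only a minimal sketch of how such statement/negation pairs might be produced with the Hugging Face `transformers` text-generation pipeline. The model checkpoint, prompt wording, and parsing logic are illustrative assumptions, not the author's actual code.
+
+ ```python
+ from transformers import pipeline
+
+ # Assumption: any instruction-tuned chat model can seed the pairs; the
+ # checkpoint actually used for dataset generation is not stated here.
+ generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
+
+ PROMPT = (
+     "Write one simple factual statement and its negation, formatted exactly as:\n"
+     "Statement: <text>\nNegation: <text>"
+ )
+
+ def generate_pair():
+     """Ask the LLM for one statement/negation pair and parse the reply."""
+     out = generator(PROMPT, max_new_tokens=60, do_sample=True,
+                     temperature=0.9, return_full_text=False)
+     text = out[0]["generated_text"]
+     statement = negation = None
+     for line in text.splitlines():
+         if line.startswith("Statement:"):
+             statement = line.removeprefix("Statement:").strip()
+         elif line.startswith("Negation:"):
+             negation = line.removeprefix("Negation:").strip()
+     return statement, negation
+
+ pairs = [generate_pair() for _ in range(5)]  # scale the count up for a real dataset
+ ```
+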
+ ## Step 2: Generate Modus Ponens and Modus Tollens Arguments
+
+ Using combinations of the generated statements, we created lists of modus ponens and modus tollens arguments (see the sketch after the examples below).
+
+ - **Modus Ponens**:
+   - If P, then Q.
+   - P.
+   - Therefore, Q.
+
+   Example:
+   - If it rains, the ground will be wet.
+   - It is raining.
+   - Therefore, the ground is wet.
+
+ - **Modus Tollens**:
+   - If P, then Q.
+   - Not Q.
+   - Therefore, not P.
+
+   Example:
+   - If it rains, the ground will be wet.
+   - The ground is not wet.
+   - Therefore, it is not raining.
+
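+ The exact pairing and templating logic is not shown in the README; the following is a small sketch, under the assumption that Step 1 yields (statement, negation) tuples, of how the two valid forms could be instantiated from them.
+
+ ```python
+ import itertools
+
+ def modus_ponens(p, q):
+     """If P, then Q; P; therefore Q."""
+     return {
+         "premises": [f"If {p}, then {q}.", f"{p}."],
+         "conclusion": f"{q}.",
+         "type": "modus ponens",
+     }
+
+ def modus_tollens(p, q, not_p, not_q):
+     """If P, then Q; not Q; therefore not P."""
+     return {
+         "premises": [f"If {p}, then {q}.", f"{not_q}."],
+         "conclusion": f"{not_p}.",
+         "type": "modus tollens",
+     }
+
+ # Placeholder pairs; in practice these come from Step 1.
+ pairs = [
+     ("it rains", "it does not rain"),
+     ("the ground is wet", "the ground is not wet"),
+ ]
+
+ arguments = []
+ for (p, not_p), (q, not_q) in itertools.permutations(pairs, 2):
+     arguments.append(modus_ponens(p, q))
+     arguments.append(modus_tollens(p, q, not_p, not_q))
+ ```
+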
+ ## Step 3: Generate a Dataset of Arguments with Labels
+
+ Next, we created a comprehensive dataset of arguments, labeling each one with its premises, conclusion, argument type (modus ponens or modus tollens), and validity. This structured dataset provided the training material for fine-tuning the LLaMA2-7B model (a sketch of how such records can be serialized follows the example below). An example of a labeled data point is:
+
+ - Premises: "If it rains, the ground will be wet.", "It is raining."
+ - Conclusion: "The ground is wet."
+ - Argument Type: Modus Ponens
+ - Validity: Valid
+
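+ The precise serialization is defined by the cris177/Arguments dataset itself; as a hedged illustration, a labeled record could be rendered into a Llama-2-style training string whose completion side mirrors the example at the top of this README. The `[INST]` wording, the helper name, and the `text` field are assumptions.
+
+ ```python
+ import json
+
+ def to_training_text(argument_text, arg):
+     """Render one labeled argument as a single training string.
+
+     The completion mirrors the "Premise 1: ... Validity: ..." example
+     shown at the top of this README; the instruction side is a guess.
+     """
+     completion = " ".join(
+         [f"Premise {i}: {p}" for i, p in enumerate(arg["premises"], start=1)]
+         + [
+             f"Conclusion: {arg['conclusion']}",
+             f"Type of argument: {arg['type']}",
+             f"Validity: {arg['validity']}",
+         ]
+     )
+     return f"<s>[INST] Analyze this argument: {argument_text} [/INST] {completion} </s>"
+
+ example = {
+     "premises": ["If it rains, the ground will be wet.", "It is raining."],
+     "conclusion": "The ground is wet.",
+     "type": "modus ponens",
+     "validity": True,
+ }
+
+ with open("arguments.jsonl", "w") as f:
+     record = {"text": to_training_text("It is raining, so the ground is wet.", example)}
+     f.write(json.dumps(record) + "\n")
+ ```
+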
+ ## Step 4: Fine-Tune LLaMA2-7B on the Dataset
+
+ With the dataset prepared, the next step was to fine-tune the LLaMA2-7B model on it, adjusting its parameters over multiple training epochs, with periodic evaluations to confirm that it was learning to analyze arguments effectively. A sketch of a typical setup is shown below.
+
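+ The README does not specify the training recipe (full fine-tune vs. parameter-efficient tuning, hyperparameters, or hardware), so the following is only a minimal sketch of one common approach: LoRA adapters via `peft` with the `transformers` Trainer. The base checkpoint, hyperparameters, and the assumption that the dataset exposes a single `text` column are placeholders, not the author's settings.
+
+ ```python
+ from datasets import load_dataset
+ from peft import LoraConfig, get_peft_model
+ from transformers import (AutoModelForCausalLM, AutoTokenizer,
+                           DataCollatorForLanguageModeling, Trainer,
+                           TrainingArguments)
+
+ base = "meta-llama/Llama-2-7b-hf"              # placeholder base checkpoint
+ tokenizer = AutoTokenizer.from_pretrained(base)
+ tokenizer.pad_token = tokenizer.eos_token      # Llama 2 has no pad token by default
+ model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
+
+ # LoRA keeps a 7B fine-tune affordable; every value here is a placeholder.
+ model = get_peft_model(model, LoraConfig(
+     r=16, lora_alpha=32, lora_dropout=0.05,
+     target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))
+
+ dataset = load_dataset("cris177/Arguments")    # the dataset linked above
+
+ def tokenize(batch):
+     # Assumes each record carries the full prompt+completion in a `text` field.
+     return tokenizer(batch["text"], truncation=True, max_length=512)
+
+ train = dataset["train"].map(tokenize, batched=True,
+                              remove_columns=dataset["train"].column_names)
+
+ trainer = Trainer(
+     model=model,
+     args=TrainingArguments(
+         output_dir="llama2-7b-arguments",
+         num_train_epochs=3,
+         per_device_train_batch_size=4,
+         learning_rate=2e-4,
+         logging_steps=50,
+     ),
+     train_dataset=train,
+     # Causal-LM collator: labels are the input ids, no masking.
+     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+ )
+ trainer.train()
+ ```
+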
+ ## Step 5: Evaluate on the Open LLM Leaderboard
+
+ Finally, the fine-tuned model was evaluated through the Open LLM Leaderboard, which benchmarks LLMs on a standard suite of tests (a local spot-check with the underlying evaluation harness is sketched after the list):
+
+ - **AI2 Reasoning Challenge (25-shot)**: A set of grade-school science questions.
+ - **HellaSwag (10-shot)**: A test of commonsense inference that remains challenging for state-of-the-art models.
+ - **MMLU (5-shot)**: Measures multitask accuracy across 57 tasks, including mathematics, history, computer science, and law.
+ - **TruthfulQA (0-shot)**: Evaluates the model's tendency to reproduce common falsehoods found online. Although termed 0-shot, it includes six Q/A pairs for context.
+ - **Winogrande (5-shot)**: A difficult benchmark for commonsense reasoning.
+ - **GSM8K (5-shot)**: Tests the model's ability to solve multi-step mathematical word problems.
+
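+ Submitting to the leaderboard requires no local code, but individual tasks can be approximated locally with the `lm-eval` package (EleutherAI's lm-evaluation-harness, which the leaderboard builds on). The sketch below assumes its v0.4 Python API and a placeholder model path; leaderboard scores may differ because it pins its own harness version and settings.
+
+ ```python
+ import lm_eval
+
+ # Local spot-check of a single leaderboard task (ARC-Challenge, 25-shot).
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=llama2-7b-arguments",  # placeholder path to the fine-tuned model
+     tasks=["arc_challenge"],
+     num_fewshot=25,
+     batch_size=8,
+ )
+ print(results["results"]["arc_challenge"])
+ ```
+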
+ ### Evaluation Results
+
+ TBD