sdk: static
pinned: false
---

# Documentation for New Datasets of PIXIU

## Overview

This document explains how to create and integrate new dataset classes into the PIXIU large language model (LLM) dataset creation script. The script processes, constructs, and uploads custom datasets for specific tasks such as classification or abstractive summarization.

## Creating a New Dataset Class

To add a custom dataset to the script, create a new class in `preprocess.py` using the following template.

### Example Class: `MedMCQA`

```python
class MedMCQA(InstructionDataset):
    dataset = "MedMCQA"
    task_type = "classification"
    choices = ["A", "B", "C", "D"]
    prompt = """Given a medical context and a multiple choice question related to it, select the correct answer from the four options.
Question: {text}
Options: {options}.
Please answer with A, B, C, or D only.
Answer:
"""

    def fetch_data(self, datum):
        # Map the raw record's fields onto the prompt's placeholders.
        return {
            "text": datum["question"],
            "options": ", ".join(
                op + ": " + datum[k]
                for k, op in zip(["opa", "opb", "opc", "opd"], self.choices)
            ),
            "answer": self.choices[datum["cop"] - 1],
        }
```
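The template above relies on a base class, `InstructionDataset`, defined in `preprocess.py`. Its exact interface is not shown in this document, but a minimal sketch of how `prompt` and `fetch_data` might combine into an instruction record could look like the following (the `build_instruction` helper and the `_DemoMCQ` class are illustrative, not part of PIXIU):

```python
class InstructionDataset:
    """Minimal sketch of a base class; the real one in preprocess.py may differ."""

    dataset = ""
    task_type = ""
    prompt = ""

    def fetch_data(self, datum):
        raise NotImplementedError

    def build_instruction(self, datum):
        # Fill the prompt template with the fields extracted by fetch_data;
        # the "answer" key is kept separate as the target label.
        fields = self.fetch_data(datum)
        answer = fields.pop("answer")
        return {
            "instruction": self.prompt.format(**fields),
            "answer": answer,
            "task": self.task_type,
        }


class _DemoMCQ(InstructionDataset):
    # Hypothetical two-option dataset used only to demonstrate the flow.
    task_type = "classification"
    choices = ["A", "B"]
    prompt = "Question: {text}\nOptions: {options}.\nAnswer:"

    def fetch_data(self, datum):
        return {
            "text": datum["question"],
            "options": ", ".join(
                op + ": " + datum[k]
                for k, op in zip(["op1", "op2"], self.choices)
            ),
            "answer": self.choices[datum["cop"]],
        }


record = _DemoMCQ().build_instruction(
    {"question": "2 + 2?", "op1": "4", "op2": "5", "cop": 0}
)
```

With this sketch, `record["instruction"]` is the fully rendered prompt and `record["answer"]` is the gold label, which is the shape an instruction-tuning pipeline typically expects.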

#### Key Components:

- `dataset`: Name of the dataset.
- `task_type`: Type of the task (e.g., `classification`, `abstractivesummarization`).
- `choices`: Set of labels for classification tasks.
- `prompt`: Template for constructing the task prompt.
- `fetch_data`: Method to extract the necessary information from raw data.
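For a generation task such as abstractive summarization, `choices` is not needed and `fetch_data` returns the reference text as the answer. A sketch under assumed conventions follows; the dataset name (`PubMedSum`) and field names (`abstract`, `summary`) are hypothetical, and the base class is stubbed out here:

```python
class InstructionDataset:  # stub standing in for the base class in preprocess.py
    pass


class PubMedSum(InstructionDataset):
    # Hypothetical summarization dataset; names are illustrative, not from PIXIU.
    dataset = "PubMedSum"
    task_type = "abstractivesummarization"
    prompt = """Summarize the following medical abstract in one or two sentences.
Text: {text}
Summary:
"""

    def fetch_data(self, datum):
        # Generation tasks define no `choices`; the reference summary is the answer.
        return {
            "text": datum["abstract"],
            "answer": datum["summary"],
        }
```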

### Integrating the New Class

After creating the class, register it in the `DATASETS` dictionary in `preprocess.py`:

```python
DATASETS = {
    "MedMCQA": MedMCQA,
}
```
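The script can then look the class up by the name passed on the command line. The exact dispatch logic in `preprocess.py` is not shown in this document; one plausible sketch, with the dataset class stubbed out:

```python
# Hypothetical sketch of how the --dataset argument might select a class;
# the actual logic in preprocess.py may differ.
class MedMCQA:  # stub standing in for the class defined above
    dataset = "MedMCQA"


DATASETS = {
    "MedMCQA": MedMCQA,
}


def get_dataset_class(name):
    """Return the registered dataset class, failing loudly for unknown names."""
    try:
        return DATASETS[name]
    except KeyError:
        raise ValueError(
            f"Unknown dataset {name!r}; register it in DATASETS in preprocess.py"
        ) from None
```

Failing with an explicit error for unregistered names makes a missing `DATASETS` entry easy to diagnose.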

## Using the Script

To use the script with the new dataset, run the following command:

```bash
# Define the arguments (quote them, since the values may contain spaces)
DATASET="Your Dataset"
TRAIN_FILENAME="Train Filename"
VALID_FILENAME="Valid Filename"
TEST_FILENAME="Test Filename"

# Call the Python script with the defined arguments
python preprocess.py \
    --dataset "$DATASET" \
    --train_filename "$TRAIN_FILENAME" \
    --valid_filename "$VALID_FILENAME" \
    --test_filename "$TEST_FILENAME" \
    --for_eval
```

Note: Modify the parameters according to your dataset. Pass `--for_eval` for evaluation datasets and omit it for instruction tuning datasets.