---
title: README
emoji: π
colorFrom: gray
colorTo: red
sdk: static
pinned: false
---
## tasksource: 600+ dataset harmonization preprocessings with structured annotations for frictionless extreme multi-task learning and evaluation
Hugging Face Datasets is a great library, but it lacks standardization, and datasets often require preprocessing work before they can be used interchangeably.
`tasksource` automates this and makes reproducible multi-task learning easier to scale.
Each dataset is standardized to a `MultipleChoice`, `Classification`, or `TokenClassification` dataset with identical fields. We do not support generation tasks, as they are addressed by [promptsource](https://github.com/bigscience-workshop/promptsource). All implemented preprocessings are in [tasks.py](https://github.com/sileod/tasksource/blob/main/src/tasksource/tasks.py) or [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md). A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
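As a rough illustration of this idea (a sketch, not tasksource's internal annotation API; the helper `standardize_nli` and the field names `sentence1`/`sentence2`/`labels` are assumptions made for this example), a preprocessing can be thought of as a column-mapping function:

```python
from datasets import Dataset

# Hypothetical example of the preprocessing concept: map a source dataset's
# task-specific columns onto one shared Classification schema, assumed here
# to be sentence1 / sentence2 / labels.
def standardize_nli(dataset: Dataset) -> Dataset:
    dataset = dataset.rename_column("premise", "sentence1")
    dataset = dataset.rename_column("hypothesis", "sentence2")
    dataset = dataset.rename_column("label", "labels")
    return dataset

raw = Dataset.from_dict({
    "premise": ["A man is playing guitar."],
    "hypothesis": ["A person makes music."],
    "label": [0],
})
print(standardize_nli(raw).column_names)  # ['sentence1', 'sentence2', 'labels']
```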
GitHub: https://github.com/sileod/tasksource
### Installation and usage:
`pip install tasksource`
```python
from tasksource import list_tasks, load_task

df = list_tasks()
for id in df[df.task_type == "MultipleChoice"].id:
    dataset = load_task(id)
    # all yielded datasets can be used interchangeably
```
See the 600+ supported tasks in [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md) (200+ MultipleChoice tasks, 200+ Classification tasks) and feel free to request a new task. Datasets are downloaded to `$HF_DATASETS_CACHE` (like any Hugging Face dataset), so make sure you have more than 100 GB of free space there.
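Because every task follows the same schema, the same downstream code can consume any of them. A minimal sketch (it assumes `load_task` returns a Hugging Face `DatasetDict` with a `train` split, which may vary per task):

```python
from tasksource import list_tasks, load_task

df = list_tasks()
# Two arbitrary Classification tasks: both expose the same standardized
# columns, so identical training code works for either one.
for task_id in df[df.task_type == "Classification"].id.head(2):
    dataset = load_task(task_id)
    # Assumes a DatasetDict with a "train" split.
    print(task_id, dataset["train"].column_names)
```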
### Pretrained model:
A text encoder pretrained on tasksource reached state-of-the-art results: [🤗/deberta-v3-base-tasksource-nli](https://hf.co/sileod/deberta-v3-base-tasksource-nli)
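Since the released checkpoint is an NLI model, it can typically be plugged into the `transformers` zero-shot-classification pipeline. A minimal sketch, assuming the checkpoint is compatible with that pipeline (the candidate labels below are illustrative):

```python
from transformers import pipeline

# Zero-shot classification with an NLI-trained encoder; labels are illustrative.
classifier = pipeline(
    "zero-shot-classification",
    model="sileod/deberta-v3-base-tasksource-nli",
)
print(classifier(
    "The movie was a complete waste of time.",
    candidate_labels=["positive", "negative"],
))
```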
### Contact and citation
I can help you integrate tasksource into your experiments. `[email protected]`
More details in this [article](https://aclanthology.org/2024.lrec-main.1361/):
```bibtex
@inproceedings{sileo-2024-tasksource-large,
title = "tasksource: A Large Collection of {NLP} tasks with a Structured Dataset Preprocessing Framework",
author = "Sileo, Damien",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1361",
pages = "15655--15684",
}
``` |