**Model Summary**
Recently, IBM introduced GneissWeb, a large dataset of roughly 10 trillion tokens designed to meet both the data-quality and data-quantity requirements of training LLMs. Models trained on GneissWeb outperform those trained on FineWeb 1.1.0 by 2.14 percentage points in average score over a set of 11 commonly used benchmarks.
To enable reproduction of GneissWeb, we provide here GneissWeb.Tech_classifier, a technology-category fastText classifier. This fastText model is used as part of the ensemble filter in GneissWeb to detect documents with technology content.
**Intended Use**
The fastText model takes text as input and classifies it as either "technology" (labeled `__label__hq`) or other content, "cc" (labeled `__label__cc`).
The model can be used with Python (please refer to the [fastText documentation](https://fasttext.cc/docs/en/python-module.html) for details on using fastText classifiers)
or with [IBM Data Prep Kit](https://github.com/IBM/data-prep-kit/) (DPK) (please refer to the [example notebook](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/gneissweb_classification/gneissweb_classification.ipynb) for using a fastText model with DPK).
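As a minimal sketch of the Python route (the weight filename `model.bin` and the example text are assumptions, not taken from this model card), the classifier can be loaded and queried with the fastText package:

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download the classifier weights; the filename "model.bin" is assumed here.
model_path = hf_hub_download(
    repo_id="ibm-granite/GneissWeb.Tech_classifier",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# fastText expects single-line input, so strip newlines before predicting.
text = "The new GPU architecture doubles memory bandwidth for AI workloads."
labels, scores = model.predict(text.replace("\n", " "), k=2)
print(labels, scores)  # e.g. ('__label__hq', '__label__cc') with their confidence scores
```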
The GneissWeb ensemble filter uses the confidence score given to `__label__hq` for filtering documents based on an appropriately chosen threshold.
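A sketch of such threshold-based filtering is shown below, reusing the model loaded above; the threshold value of 0.9 and the example documents are illustrative only, not the values used in GneissWeb:

```python
documents = [
    "Quantum processors use superconducting qubits to run algorithms.",
    "The recipe calls for two cups of flour and a pinch of salt.",
]

def tech_confidence(model, text: str) -> float:
    # Confidence the classifier assigns to the technology label (__label__hq).
    labels, scores = model.predict(text.replace("\n", " "), k=2)
    return float(dict(zip(labels, scores)).get("__label__hq", 0.0))

# Keep only documents whose technology confidence clears the (illustrative) threshold.
THRESHOLD = 0.9
filtered_docs = [doc for doc in documents if tech_confidence(model, doc) >= THRESHOLD]
```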
This fastText model is used together with [GneissWeb.Edu_classifier](https://huggingface.co./ibm-granite/GneissWeb.Edu_classifier), [GneissWeb.Sci_classifier](https://huggingface.co./ibm-granite/GneissWeb.Sci_classifier), [GneissWeb.Med_classifier](https://huggingface.co./ibm-granite/GneissWeb.Med_classifier), and other quality annotators.
- **Developers**: IBM Research
- **Release Date**: Feb 10th, 2025
- **License**: Apache 2.0
**Training Data**
The model is trained on 800k documents labeled using [WatsonNLP hierarchical categorization](https://www.ibm.com/docs/en/watsonx/saas?topic=catalog-hierarchical-categorization). Please refer to the [fastText text classification tutorial](https://fasttext.cc/docs/en/python-module.html) for details.
Training data is selected as follows (a sketch of the corresponding fastText training format appears after this list):
- *Positive documents*: 400k documents randomly sampled from those labeled with the technology category with a confidence score of 0.95 or above.
- *Negative documents*: 400k documents randomly sampled from those labeled with any category other than the science, education, medical, and technology categories, with a confidence score of 0.95 or above.
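As a minimal sketch of how such labeled data is typically fed to fastText (the file name, example label lines, and hyperparameters below are assumptions, not the settings used to train this model):

```python
import fasttext

# Hypothetical training file: one document per line, prefixed with its label, e.g.
#   __label__hq <text of a technology document>
#   __label__cc <text of a document from another category>
model = fasttext.train_supervised(
    input="gneissweb_tech_train.txt",  # assumed file name
    lr=0.1,            # illustrative hyperparameters only
    epoch=5,
    wordNgrams=2,
)
model.save_model("GneissWeb.Tech_classifier.bin")
```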