**Model Summary**

Recently, IBM introduced GneissWeb, a large dataset yielding approximately 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Models trained on the GneissWeb dataset outperform those trained on FineWeb 1.1.0 by 2.14 percentage points in terms of the average score computed on a set of 11 commonly used benchmarks.

To enable reproduction of GneissWeb, we provide here GneissWeb.Tech_classifier, a technology-category fastText classifier trained on:

- Positive documents: 400k documents randomly sampled from the documents labeled with the technology category with a confidence score of 0.95 and above.
- Negative documents: 400k documents randomly sampled from the documents labeled with any category other than the science, education, medical, and technology categories, with a confidence score of 0.95 and above.

**Developers**: IBM Research

**Release Date**: Feb 10th, 2025

**License**: Apache 2.0
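
A minimal sketch of how a fastText classifier like this one can be applied to a document, assuming the `fasttext` Python package and a model file named `fasttext_gneissweb_tech.bin` (the actual filename in this repository may differ):

```python
import fasttext

# Load the classifier binary (filename is an assumption; substitute the
# actual model file distributed with this repository).
model = fasttext.load_model("fasttext_gneissweb_tech.bin")

# fastText expects a single line of text, so remove newlines first.
document = "The new GPU architecture doubles memory bandwidth over the previous generation."
document = document.replace("\n", " ")

# Predict the top label and its confidence score for the document.
labels, scores = model.predict(document, k=1)
print(labels[0], scores[0])
```

In a reproduction pipeline, the returned confidence score can be thresholded to decide whether a document is assigned to the technology category.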