SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning

📃 [SciGLM] [GitHub]

SciGLM is a suite of scientific language models able to conduct college-level scientific reasoning. Central to our approach is a novel self-reflective instruction annotation framework to address the data scarcity challenge in the science domain. This framework leverages existing LLMs to generate step-by-step reasoning for unlabelled scientific questions, followed by a process of self-reflective critic-and-revise. Applying this framework, we curated SciInstruct, a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
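The critic-and-revise idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `annotate_with_reflection`, the three callables (`generate`, `critique`, `revise`), and the `max_rounds` parameter are all hypothetical placeholders for whatever LLM calls an actual implementation would use. The stubs at the bottom stand in for real models so the loop can be run as written.

```python
from typing import Callable, Optional


def annotate_with_reflection(
    question: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 2,
) -> Optional[str]:
    """Return a step-by-step answer that passes critique, or None.

    All callables are hypothetical stand-ins for LLM calls:
      generate(question)                  -> draft reasoning
      critique(question, answer)          -> feedback, or None if acceptable
      revise(question, answer, feedback)  -> revised reasoning
    """
    answer = generate(question)
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        if feedback is None:
            return answer  # the critic found nothing to fix
        answer = revise(question, answer, feedback)
    # Check the last revision once more; drop the sample if it still fails.
    return answer if critique(question, answer) is None else None


# Toy stubs in place of real models (assumptions for illustration only).
def stub_generate(question: str) -> str:
    return "2 + 2 = 5"


def stub_critique(question: str, answer: str) -> Optional[str]:
    return None if answer.endswith("= 4") else "arithmetic error"


def stub_revise(question: str, answer: str, feedback: str) -> str:
    return "2 + 2 = 4"


result = annotate_with_reflection(
    "What is 2 + 2?", stub_generate, stub_critique, stub_revise
)
```

Returning `None` for samples that never pass the critic is one possible design choice; it lets the caller filter out low-quality annotations instead of keeping them.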

SciInstruct

SciInstruct is composed as follows:

Subject	Math	Physics & Chemistry	Formal Proofs (Lean)	Total
Number	89,934	123,869	40,248	254,051

We release our data and model for public use. If you wish to use SciInstruct or SciGLM, you can download them from the following links.

Download data: [Google Drive] [Tsinghua Cloud]

Download model: [Hugging Face]

Training & Inference

Fine-tuning

You can use the SciGLM model through Hugging Face's Transformers library. First, clone the repository and install the dependencies:

git clone https://github.com/THUDM/SciGLM.git
cd SciGLM
pip install -r requirements.txt

To train the 6B model, run:

bash /path/training/finetune.sh

Inference

cd /path/to/inference
python cli_demo.py
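As an alternative to the command-line demo, the model can probably be queried directly from Python. The sketch below rests on several assumptions that the text above does not confirm: that the checkpoint ships ChatGLM-style remote code exposing a `chat(tokenizer, query, history=...)` method, that it loads through `AutoModel` with `trust_remote_code=True`, and that you supply the correct Hugging Face model ID yourself. The helper names `load_sciglm` and `ask` are hypothetical.

```python
def load_sciglm(model_id: str, device: str = "cuda"):
    """Load tokenizer and model from the Hugging Face Hub.

    `model_id` must be supplied by the caller; it is not hard-coded here.
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    if device == "cuda":
        model = model.half()  # half precision to reduce GPU memory use
    return tokenizer, model.to(device).eval()


def ask(tokenizer, model, question: str) -> str:
    """Send one question and return the text reply.

    Assumes a ChatGLM-style `chat` method supplied by the model's remote code.
    """
    response, _history = model.chat(tokenizer, question, history=[])
    return response


# Example usage (downloads the weights; replace the placeholder ID):
# tokenizer, model = load_sciglm("<hub-user>/<model-name>")
# print(ask(tokenizer, model, "State Newton's second law."))
```

If the checkpoint does not expose `chat`, the `cli_demo.py` script in the repository remains the reference for how inference is actually run.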

Citation

If you find our work helpful, please cite our paper:

@article{zhang2024sciglm,
  title={{SciGLM}: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning},
  author={Zhang, Dan and Hu, Ziniu and Zhoubian, Sining and Du, Zhengxiao and Yang, Kaiyu and Wang, Zihan and Yue, Yisong and Dong, Yuxiao and Tang, Jie},
  journal={arXiv preprint arXiv:2401.07950},
  year={2024}
}
