---
license: agpl-3.0
language:
- en
- zh
tags:
- AI4S
- MoE
---

# SciDFM: Dialogue Foundation Model for Science

SciDFM is a pioneering open-source dialogue foundation model tailored for science. It integrates a mixture-of-experts architecture into a transformer-based framework to strengthen scientific reasoning and understanding. SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and it reaches state-of-the-art performance on domain-specific benchmarks among models of similar size.

## News
* **2024-06-28** The parameters of SciDFM-MoE-A5.6B-v1.0 are open-sourced! The technical report is coming soon.

## Model Details

SciDFM is based on the transformer architecture and follows the modifications introduced by Llama, i.e. RMSNorm, RoPE, and SwiGLU. It uses the same hyper-parameters as OpenLLaMa-3B. To better model knowledge of different disciplines, we replace the feed-forward blocks with Mixture-of-Experts (MoE) layers.
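To make the idea of swapping a feed-forward block for an MoE layer concrete, here is a minimal pure-Python sketch of top-k expert routing over SwiGLU experts. It is an illustration only, not SciDFM's implementation: the number of experts, the `top_k` value, the softmax router, the renormalization of the selected weights, and every function name below are assumptions that this card does not specify.

```python
import math
import random

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def silu(v):
    return v / (1.0 + math.exp(-v))

def swiglu_expert(x, w_gate, w_up, w_down):
    """One SwiGLU feed-forward expert: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def moe_forward(x, router, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs.

    Assumption: router scores are softmax-normalized and the weights of
    the selected experts are renormalized to sum to 1.
    """
    probs = softmax(matvec(router, x))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        weight = probs[i] / norm
        y_i = swiglu_expert(x, *experts[i])
        out = [o + weight * v for o, v in zip(out, y_i)]
    return out, chosen

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen for illustration; they are not the model's real dimensions.
rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 4
router = random_matrix(n_experts, d_model, rng)
experts = [
    (
        random_matrix(d_hidden, d_model, rng),  # gate projection
        random_matrix(d_hidden, d_model, rng),  # up projection
        random_matrix(d_model, d_hidden, rng),  # down projection
    )
    for _ in range(n_experts)
]

x = [0.1, -0.2, 0.3, 0.4]
y, chosen = moe_forward(x, router, experts, top_k=2)
print("selected experts:", chosen)
print("output dimension:", len(y))
```

Only the selected experts run for a given token, which is why an MoE model activates fewer parameters per token than it stores in total.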

## Training Details

SciDFM is pre-trained for two epochs on a large corpus containing ~300B science tokens and ~270B general tokens, so about 1.1T tokens are consumed in total. We then fine-tune SciDFM on ~9.3M instruction-following samples for 5 epochs to improve performance on downstream benchmarks.
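The token budget follows directly from the approximate figures above. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
# Approximate corpus sizes quoted in this section.
science_tokens = 300e9
general_tokens = 270e9
pretrain_epochs = 2

# Two passes over the ~570B-token corpus.
pretrain_tokens = (science_tokens + general_tokens) * pretrain_epochs
print(f"pre-training tokens: {pretrain_tokens / 1e12:.2f}T")  # 1.14T, i.e. about 1.1T

# Instruction tuning: ~9.3M samples seen 5 times each.
sft_samples = 9.3e6
sft_epochs = 5
sft_sample_passes = sft_samples * sft_epochs
print(f"fine-tuning sample passes: {sft_sample_passes / 1e6:.1f}M")  # 46.5M
```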

## Usage Details

### Local Inference

To load and run SciDFM locally, here is an example:

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig, LlamaTokenizer

model_name_or_id = "OpenDFM/SciDFM-MoE-A5.6B-v1.0"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

# Wrap the question in the chat template the model expects.
chat_template = "<|user|>:{instruction}<|assistant|>:"
input_text = "What is Mixture-of-Experts (MoE) in computer science?"
input_text = chat_template.format(instruction=input_text)

# Move the tokenized prompt to the device that holds the model.
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
generation_config = GenerationConfig(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.9,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id
)

outputs = model.generate(**inputs, generation_config=generation_config)
# batch_decode returns one string per sequence; drop the echoed prompt from the first one.
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0][len(input_text):]
print(generated_text.strip())
```

### SMILES Preprocessing

When your input contains SMILES notation, we recommend canonicalizing the SMILES with the `rdkit` package first. Here is an example:
```python
from rdkit import Chem
def canonicalize_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True, kekuleSmiles=False)
```
or directly:
```python
from rdkit import Chem
def canonicalize_smiles(smiles):
    return Chem.CanonSmiles(smiles, useChiral=True)
```

### Special Token Preprocessing

If your input contains a SMILES expression, please first process it with the following function:

```python
import sentencepiece as spm

smiles_model = spm.SentencePieceProcessor(model_file="smiles.model")

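def convert_smiles(smiles_str):
    # Tokenize the SMILES string and skip the first piece.
    pieces = smiles_model.encode_as_pieces(smiles_str)[1:]
    # Wrap every remaining piece in the SMILES start/end unit tokens.
    smiles = "".join([f"[ChemDFM_Start_SMILES_Unit]{piece}[ChemDFM_End_SMILES_Unit]" for piece in pieces])
    return smiles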

convert_smiles("C(C(=O)O)N")
```

If your input contains a protein sequence, please first process it with the following function:

```python
def convert_protein(p_str):
    # Prefix every residue with the <<protein>> marker token.
    res = [f"<<protein>>{s}" for s in p_str]
    return "".join(res)

convert_protein("MIRLGAPQTL")
```
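As a sketch of how this preprocessing might be combined with the chat template from the Local Inference example, the snippet below tags a protein sequence and places it in a prompt. The helper `build_protein_prompt` is hypothetical, and putting the tagged sequence after the question, separated by a space, is an assumption made for illustration rather than a documented format.

```python
CHAT_TEMPLATE = "<|user|>:{instruction}<|assistant|>:"

def convert_protein(p_str):
    # Same transformation as above: prefix every residue with <<protein>>.
    return "".join(f"<<protein>>{s}" for s in p_str)

def build_protein_prompt(question, sequence):
    """Hypothetical helper: append the tagged sequence to the question,
    then wrap the result in the chat template."""
    instruction = f"{question} {convert_protein(sequence)}"
    return CHAT_TEMPLATE.format(instruction=instruction)

prompt = build_protein_prompt("What is the likely function of this protein?", "MIR")
print(prompt)
# <|user|>:What is the likely function of this protein? <<protein>>M<<protein>>I<<protein>>R<|assistant|>:
```

The resulting string can be passed to the tokenizer in place of `input_text` in the Local Inference example.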

## Evaluation

We briefly compare SciDFM-MoE-A5.6B-v1.0 with similar-sized instruction-tuned LLMs on scientific evaluation benchmarks. The results are shown below:

| Model              | SciEval | SciQ  | ARC\_c | ARC\_e | GSM8K | MATH  | MedQA | MMCQA | PMQA  | Avg   |
|--------------------|---------|-------|--------|--------|-------|-------|-------|-------|-------|-------|
| LLaMa2-7B          | 27.06   | 57.00 | 36.43  | 46.59  | 3.94  | 3.96  | 26.32 | 29.84 | 66.80 | 32.95 |
| Galactica-6.7B     | 46.28   | 74.20 | 44.28  | 61.83  | 2.80  | 6.32  | 30.48 | 36.46 | 48.80 | 38.91 |
| LLaMa2-13B         | 33.88   | 78.10 | 56.66  | 72.35  | 22.82 | 3.90  | 32.68 | 34.28 | 77.80 | 45.45 |
| ChatGLM2-6B        | 54.25   | 75.80 | 57.08  | 73.57  | 25.09 | 7.18  | 27.42 | 34.21 | 60.40 | 45.94 |
| Galactica-30B      | 54.24   | 83.10 | 57.85  | 75.04  | 13.65 | 8.66  | 37.71 | 48.43 | 58.80 | 48.35 |
| LLaMa3-8B          | 59.70   | 90.00 | 71.16  | 84.05  | 5.91  | 7.00  | 48.78 | 52.74 | 26.60 | 49.59 |
| ChatGLM3-6B        | 51.13   | 77.60 | 60.84  | 75.97  | 60.27 | 23.52 | 24.59 | 31.39 | 51.80 | 50.53 |
| SciGLM-6B          | 61.22   | 88.70 | 77.47  | 86.57  | 42.23 | 16.40 | 42.81 | 44.94 | 73.60 | 59.12 |
| SciDFM             | 62.48   | 88.00 | 64.76  | 81.48  | 59.14 | 27.28 | 44.54 | 53.10 | 78.00 | 61.56 |
| ChatGLM3-6B-base   | 60.34   | 89.00 | 78.58  | 87.37  | 59.82 | 22.64 | 42.73 | 45.14 | 74.40 | 61.96 |
| Llama3-8B-Instruct | 64.91   | 91.60 | 76.45  | 87.33  | 76.57 | 26.26 | 56.48 | 59.31 | 72.00 | 67.44 |

## Citation

```
Coming soon...
```