add example training script
- 2023-08-14-mace-universal.sbatch +59 -0
- README.md +8 -2
2023-08-14-mace-universal.sbatch
ADDED
@@ -0,0 +1,59 @@
+#!/bin/bash
+#SBATCH -C gpu
+#SBATCH -G 40
+#SBATCH -N 10
+#SBATCH --ntasks=40
+#SBATCH --ntasks-per-node=4
+#SBATCH --cpus-per-task=4
+#SBATCH --time=6:00:00
+#SBATCH --time-min=02:00:00
+#SBATCH --error=%x-%j.err
+#SBATCH --output=%x-%j.out
+#SBATCH --requeue
+#SBATCH --exclusive
+#SBATCH --open-mode=append
+
+exp_name=$(basename "$SLURM_SUBMIT_DIR")
+
+srun python run_train.py \
+    --name=$exp_name \
+    --train_file="train.h5" \
+    --valid_file="valid.h5" \
+    --statistics_file="statistics.json" \
+    --energy_weight=1 \
+    --forces_weight=1 \
+    --eval_interval=1 \
+    --config_type_weights='{"Default":1.0}' \
+    --E0s='average' \
+    --error_table='PerAtomMAE' \
+    --stress_key='stress' \
+    --model="ScaleShiftMACE" \
+    --MLP_irreps="64x0e" \
+    --interaction_first="RealAgnosticResidualInteractionBlock" \
+    --interaction="RealAgnosticResidualInteractionBlock" \
+    --num_interactions=2 \
+    --num_channels=128 \
+    --max_ell=3 \
+    --hidden_irreps='64x0e + 64x1o + 64x2e' \
+    --num_cutoff_basis=10 \
+    --lr=1e-2 \
+    --correlation=3 \
+    --r_max=6.0 \
+    --num_radial_basis=10 \
+    --scaling='rms_forces_scaling' \
+    --distributed \
+    --num_workers=4 \
+    --batch_size=10 \
+    --valid_batch_size=30 \
+    --max_num_epochs=500 \
+    --patience=250 \
+    --amsgrad \
+    --weight_decay=1e-8 \
+    --ema \
+    --ema_decay=0.999 \
+    --default_dtype="float32" \
+    --clip_grad=100 \
+    --device=cuda \
+    --seed=3 \
+    --save_cpu \
+    --restart_latest
README.md
CHANGED
@@ -79,11 +79,17 @@ If you use the pretrained models in this repository, please cite all the following
 }
 ```
 
-# Training
+# Training Guide
 
 ## Training Data
 
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-
-## Training Procedure
+For now, please download the MPTrj dataset from [figshare](https://figshare.com/articles/dataset/Materials_Project_Trjectory_MPtrj_Dataset/23713842). We may upload it to HuggingFace Datasets in the future.
+
+## Fine-tuning
+
+<!-- This should link to a Training Procedure Card, perhaps with a short stub of information on what the training procedure is all about as well as documentation related to hyperparameters or additional training details. -->
+
+We provide an example multi-GPU training script, [2023-08-14-mace-universal.sbatch](https://huggingface.co/cyrusyc/mace-universal/blob/main/2023-08-14-mace-universal.sbatch), which trains on 40 A100 GPUs on NERSC Perlmutter. Please see the MACE `multi-gpu` branch for more detailed instructions.
+
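The sbatch script above consumes preprocessed HDF5 files rather than the raw MPTrj JSON. A hedged sketch of that preprocessing step is below: it assumes the figshare download has already been converted to an extxyz file (`mptrj.xyz`, a hypothetical name) with energies, forces, and stress attached, and the flag names follow the `preprocess_data.py` script on the MACE `multi-gpu` branch; treat them as assumptions and verify against `--help` locally.

```bash
# Hedged preprocessing sketch. Assumes mptrj.xyz already exists (converted
# from the figshare JSON) and that preprocess_data.py exposes these flags;
# check `python preprocess_data.py --help` on the multi-gpu branch.
python preprocess_data.py \
    --train_file="mptrj.xyz" \
    --valid_fraction=0.05 \
    --r_max=6.0 \
    --E0s="average" \
    --compute_statistics \
    --seed=3

# Expected outputs, matching what the sbatch script reads from the run
# directory: train.h5, valid.h5, and statistics.json.
```

Note that `--r_max=6.0` and `--seed=3` are chosen here to match the training script above, since the cutoff baked into the preprocessed graphs should agree with the one used at training time.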
|