Graph Machine Learning
AnemoI
English
jpxkqx committed on
Commit e9e20d1 · verified · 1 Parent(s): 8320d87

Upload config_finetuning.yaml

Files changed (1)
  1. config_finetuning.yaml +483 -0
config_finetuning.yaml ADDED
@@ -0,0 +1,483 @@
+ data:
+   format: zarr
+   resolution: n320
+   frequency: 6h
+   timestep: 6h
+   forcing:
+     - cos_latitude
+     - cos_longitude
+     - sin_latitude
+     - sin_longitude
+     - cos_julian_day
+     - cos_local_time
+     - sin_julian_day
+     - sin_local_time
+     - insolation
+     - lsm
+     - sdor
+     - slor
+     - z
+   diagnostic:
+     - tp
+     - cp
+     - sf
+     - tcc
+     - hcc
+     - lcc
+     - mcc
+     - ro
+     - ssrd
+     - strd
+     - 100u
+     - 100v
+   remapped: null
+   normalizer:
+     default: mean-std
+     remap:
+       cp: tp
+       sf: tp
+     std:
+       - tp
+       - cp
+       - sf
+       - ro
+       - tcw
+       - ssrd
+       - q_50
+       - q_100
+       - q_150
+       - q_200
+       - q_250
+       - q_300
+       - q_400
+       - q_500
+       - q_600
+       - q_700
+       - q_850
+       - q_925
+       - q_1000
+     min-max: null
+     max:
+       - sdor
+       - slor
+       - z
+     none:
+       - cos_latitude
+       - cos_longitude
+       - sin_latitude
+       - sin_longitude
+       - cos_julian_day
+       - cos_local_time
+       - sin_julian_day
+       - sin_local_time
+       - insolation
+       - lsm
+       - tcc
+       - mcc
+       - hcc
+       - lcc
+       - swvl1
+       - swvl2
+   imputer:
+     default: none
+   remapper:
+     default: none
+   processors:
+     normalizer:
+       _target_: anemoi.models.preprocessing.normalizer.InputNormalizer
+       _convert_: all
+       config:
+         default: mean-std
+         remap:
+           cp: tp
+           sf: tp
+         std:
+           - tp
+           - cp
+           - sf
+           - ro
+           - tcw
+           - ssrd
+           - q_50
+           - q_100
+           - q_150
+           - q_200
+           - q_250
+           - q_300
+           - q_400
+           - q_500
+           - q_600
+           - q_700
+           - q_850
+           - q_925
+           - q_1000
+         min-max: null
+         max:
+           - sdor
+           - slor
+           - z
+         none:
+           - cos_latitude
+           - cos_longitude
+           - sin_latitude
+           - sin_longitude
+           - cos_julian_day
+           - cos_local_time
+           - sin_julian_day
+           - sin_local_time
+           - insolation
+           - lsm
+           - tcc
+           - mcc
+           - hcc
+           - lcc
+           - swvl1
+           - swvl2
+   num_features: 115
+
+ dataloader:
+   prefetch_factor: 2
+   pin_memory: True
+   read_group_size: 4
+   num_workers:
+     training: 8
+     validation: 8
+     test: 8
+     predict: 8
+   batch_size:
+     training: 1
+     validation: 1
+     test: 4
+     predict: 4
+   limit_batches:
+     training: 1000
+     validation: 10
+     test: 20
+     predict: 20
+   dataset: ${hardware.paths.data}/${hardware.files.dataset}
+   land_dataset: ${hardware.paths.data}/${hardware.files.dataset_land}
+   land_variables: [100u, 100v, swvl1, swvl2, stl1, stl2, tcc, lcc, mcc, hcc, sf, ro, strd, ssrd]
+   training:
+     dataset:
+       - dataset: ${dataloader.dataset}
+         start: null
+         end: 2022
+         frequency: ${data.frequency}
+         drop: []
+       - dataset: ${dataloader.land_dataset}
+         start: null
+         end: 2022
+         frequency: ${data.frequency}
+         select: ${dataloader.land_variables}
+     start: null
+     end: 2022
+     drop: []
+   validation:
+     dataset:
+       - dataset: ${dataloader.dataset}
+         start: 2022
+         end: 2022
+         frequency: ${data.frequency}
+         drop: []
+       - dataset: ${dataloader.land_dataset}
+         start: 2022
+         end: 2022
+         frequency: ${data.frequency}
+         select: ${dataloader.land_variables}
+     start: 2022
+     end: 2022
+     drop: []
+   validation_rollout: 1
+
+ diagnostics:
+   plot:
+     asynchronous: False
+     datashader: True
+     frequency:
+       batch: 750
+       epoch: 10
+     parameters: [tp]
+     sample_idx: 0
+     precip_and_related_fields: [tp, cp]
+     callbacks: []
+     enabled: True
+     scatter: False
+     mode: asyncio
+   callbacks: {}
+   benchmark_profiler:
+     memory:
+       enabled: True
+       steps: 5
+       warmup: 2
+       extra_plots: False
+       trace_rank0_only: False
+     time:
+       enabled: True
+       verbose: False
+     speed:
+       enabled: True
+     system:
+       enabled: True
+     model_summary:
+       enabled: True
+     snapshot:
+       enabled: True
+       steps: 4
+       warmup: 0
+   debug:
+     anomaly_detection: False
+   profiler: False
+   enable_checkpointing: True
+   checkpoint:
+     every_n_minutes:
+       save_frequency: 30
+       num_models_saved: 3
+     every_n_epochs:
+       save_frequency: 1
+       num_models_saved: 3
+     every_n_train_steps:
+       save_frequency: null
+       num_models_saved: 0
+   log:
+     wandb:
+       enabled: False
+     tensorboard:
+       enabled: False
+     mlflow:
+       enabled: False
+     interval: 100
+   enable_progress_bar: True
+   print_memory_summary: False
+
+ hardware:
+   paths:
+     data: ${oc.decode:${oc.env:DATASETS_PATH}}
+     output: ${oc.decode:${oc.env:OUTPUT_DIR}}
+     logs:
+       base: ${hardware.paths.output}/logs
+       wandb: ${hardware.paths.output}/logs/wandb
+       mlflow: ${hardware.paths.output}/logs/mlflow
+       tensorboard: ${hardware.paths.output}/logs/tensorboard
+     checkpoints: ${hardware.paths.output}/checkpoint/
+     plots: ${hardware.paths.output}/plots/
+     profiler: ${hardware.paths.output}/profiler/
+     graph: ${hardware.paths.output}/graphs/
+   files:
+     dataset: aifs-od-an-oper-0001-mars-n320-2016-2023-6h-v6.zarr
+     dataset_land: aifs-od-an-oper-0001-mars-n320-2016-2023-6h-v1-land.zarr
+     graph: graph_enc_proc_dec_n320.pt
+     checkpoint:
+       every_n_epochs: aifs-by_epoch-epoch_{epoch:03d}-val_wmse_{val_wmse:.3e}
+       every_n_train_steps: aifs-by_step-epoch_{epoch:03d}-step_{step:06d}
+       every_n_minutes: aifs-by_time-epoch_{epoch:03d}-step_{step:06d}
+     warm_start: null
+   accelerator: auto
+   num_gpus_per_node: 4
+   num_nodes: 16
+   num_gpus_per_model: 4
+
+ graph:
+   overwrite: True
+   data: data
+   hidden: hidden
+   nodes:
+     data:
+       node_builder:
+         _target_: anemoi.graphs.nodes.ZarrDatasetNodes
+         dataset: ${dataloader.dataset}
+       attributes:
+         area_weight:
+           _target_: anemoi.graphs.nodes.attributes.AreaWeights
+           norm: unit-max
+     hidden:
+       node_builder:
+         _target_: anemoi.graphs.nodes.ReducedGaussianGridNodes
+         grid: o96
+   edges:
+     - source_name: data
+       target_name: hidden
+       edge_builder:
+         _target_: anemoi.graphs.edges.CutOffEdges
+         cutoff_factor: 0.6
+       attributes:
+         edge_length:
+           _target_: anemoi.graphs.edges.attributes.EdgeLength
+           norm: unit-std
+         edge_dirs:
+           _target_: anemoi.graphs.edges.attributes.EdgeDirection
+           norm: unit-std
+     - source_name: hidden
+       target_name: data
+       edge_builder:
+         _target_: anemoi.graphs.edges.KNNEdges
+         num_nearest_neighbours: 3
+       attributes:
+         edge_length:
+           _target_: anemoi.graphs.edges.attributes.EdgeLength
+           norm: unit-std
+         edge_dirs:
+           _target_: anemoi.graphs.edges.attributes.EdgeDirection
+           norm: unit-std
+   attributes:
+     nodes:
+       area_weight:
+         _target_: anemoi.graphs.nodes.attributes.AreaWeights
+         norm: unit-max
+     edges:
+       edge_length:
+         _target_: anemoi.graphs.edges.attributes.EdgeLength
+         norm: unit-std
+       edge_dirs:
+         _target_: anemoi.graphs.edges.attributes.EdgeDirection
+         norm: unit-std
+
+ model:
+   activation: GELU
+   num_channels: 1024
+   model:
+     _target_: anemoi.models.models.encoder_processor_decoder.AnemoiModelEncProcDec
+   processor:
+     _target_: anemoi.models.layers.processor.TransformerProcessor
+     _convert_: all
+     activation: GELU
+     num_layers: 16
+     num_chunks: 2
+     mlp_hidden_ratio: 4
+     num_heads: 16
+     window_size: 1120
+     dropout_p: 0.0
+   encoder:
+     _target_: anemoi.models.layers.mapper.GraphTransformerForwardMapper
+     _convert_: all
+     trainable_size: 8
+     sub_graph_edge_attributes: [edge_length, edge_dirs]
+     activation: GELU
+     num_chunks: 1
+     mlp_hidden_ratio: 4
+     num_heads: 16
+   decoder:
+     _target_: anemoi.models.layers.mapper.GraphTransformerBackwardMapper
+     _convert_: all
+     trainable_size: 8
+     sub_graph_edge_attributes: [edge_length, edge_dirs]
+     activation: GELU
+     num_chunks: 1
+     mlp_hidden_ratio: 4
+     num_heads: 16
+   trainable_parameters:
+     data: 8
+     hidden: 8
+     data2hidden: 8
+     hidden2data: 8
+   attributes:
+     edges: [edge_length, edge_dirs]
+     nodes: []
+   node_loss_weight: area_weight
+   bounding:
+     - _target_: anemoi.models.layers.bounding.ReluBounding
+       variables:
+         - tp
+         - ro
+         - tcw
+         - ssrd
+         - q_50
+         - q_100
+         - q_150
+         - q_200
+         - q_250
+         - q_300
+         - q_400
+         - q_500
+         - q_600
+         - q_700
+         - q_850
+         - q_925
+         - q_1000
+     - _target_: anemoi.models.layers.bounding.HardtanhBounding
+       variables: [tcc, swvl1, swvl2]
+       min_val: 0
+       max_val: 1
+     - _target_: anemoi.models.layers.bounding.FractionBounding
+       variables: [cp, sf]
+       min_val: 0
+       max_val: 1
+       total_var: tp
+     - _target_: anemoi.models.layers.bounding.FractionBounding
+       variables: [lcc, mcc, hcc]
+       min_val: 0
+       max_val: 1
+       total_var: tcc
+
+ training:
+   run_id: ${oc.decode:${oc.env:PRETRAINING_RUN_ID}}
+   fork_run_id: ${oc.decode:${oc.env:PRETRAINING_RUN_ID}}
+   load_weights_only: True
+   deterministic: False
+   precision: 16-mixed
+   multistep_input: 2
+   accum_grad_batches: 1
+   num_sanity_val_steps: 6
+   gradient_clip:
+     val: 32.0
+     algorithm: value
+   swa:
+     enabled: False
+     lr: 0.0001
+   zero_optimizer: False
+   training_loss:
+     _target_: anemoi.training.losses.mse.WeightedMSELoss
+     scalars:
+       - variable
+       - loss_weights_mask
+     ignore_nans: False
+   loss_gradient_scaling: False
+   validation_metrics:
+     - _target_: anemoi.training.losses.mse.WeightedMSELoss
+       scalars: []
+       ignore_nans: True
+   rollout:
+     start: 1
+     epoch_increment: 1
+     max: 12
+   max_epochs: 13
+   max_steps: 150000
+   lr:
+     rate: 8.0e-7
+     iterations: 7900
+     min: 3.0e-7
+     warmup_t: 100
+   variable_loss_scaling:
+     default: 1
+     pl:
+       q: 0.6
+       t: 6
+       u: 0.8
+       v: 0.5
+       w: 0.001
+       z: 12
+     sfc:
+       sp: 10
+       10u: 0.5
+       10v: 0.5
+       100u: 0.1
+       100v: 0.1
+       2d: 0.5
+       tp: 0.025
+       cp: 0.0025
+       ro: 0.005
+       sf: 0.025
+       tcc: 0.1
+       mcc: 0.1
+       lcc: 0.1
+       hcc: 0.1
+       swvl2: 200
+       swvl1: 100
+       stl2: 10
+       stl1: 1
+       ssrd: 0.05
+       strd: 0.1
+   metrics: [z_500, t_850, u_850, v_850]
+   pressure_level_scaler:
+     _target_: anemoi.training.data.scaling.ReluPressureLevelScaler
+     minimum: 0.2
+     slope: 0.001
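
Two quantities implied by this config can be worked out by hand: the data-parallel layout from the `hardware` block and the rollout length per epoch from `training.rollout`. The sketch below is illustrative only and is not anemoi code. The helper names `parallel_layout` and `rollout_schedule` are hypothetical, and it assumes that (a) each model instance is sharded over `num_gpus_per_model` GPUs, (b) the batch size applies per model instance, and (c) the rollout length grows by one step every `epoch_increment` epochs until it reaches `max`.

```python
def parallel_layout(num_nodes, num_gpus_per_node, num_gpus_per_model, batch_size):
    """Derive GPU counts and the effective batch size (assumptions a and b)."""
    total_gpus = num_nodes * num_gpus_per_node
    if total_gpus % num_gpus_per_model != 0:
        raise ValueError("total GPUs must be divisible by num_gpus_per_model")
    model_instances = total_gpus // num_gpus_per_model
    return {
        "total_gpus": total_gpus,
        "model_instances": model_instances,
        "effective_batch_size": model_instances * batch_size,
    }


def rollout_schedule(start, epoch_increment, maximum, max_epochs):
    """Rollout length for each epoch under assumption (c)."""
    if epoch_increment <= 0:
        # No growth: the rollout length stays at its starting value.
        return [min(start, maximum)] * max_epochs
    return [min(start + epoch // epoch_increment, maximum) for epoch in range(max_epochs)]


# Values copied from the hardware, dataloader and training sections of the config.
layout = parallel_layout(num_nodes=16, num_gpus_per_node=4,
                         num_gpus_per_model=4, batch_size=1)
schedule = rollout_schedule(start=1, epoch_increment=1, maximum=12, max_epochs=13)

print(layout)
print(schedule)
```

Under these assumptions the run uses 64 GPUs as 16 model instances, and the rollout reaches its maximum of 12 steps in the twelfth epoch and stays there for the final one.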