Feature: 1st converging cloud microphysics model #83

Merged: 6 commits into main from converge, Sep 12, 2023

Conversation

@rouson (Contributor) commented Sep 11, 2023

This pull request exhibits nearly monotonic convergence as measured by the cost function decreasing three orders of magnitude in the first 120 epochs:

[linked convergence plot]

To reproduce this behavior, execute

./build/run-fpm.sh run -- --base training --epochs 120 --stride 720

with the present working directory containing the 29.7 GB training_input.nc and training_output.nc produced for the "Colorado benchmark simulation" using commit d7aa958 on the neural-net branch of https://github.com/berkeleylab/icar, which uses the simplest of ICAR's cloud microphysics models. The Inference-Engine run uses

  • A single time instant (as determined by the above stride),
  • A 30% retention rate of grid points where time derivatives vanish (see the sketch after this list),
  • Zero initial weights and biases,
  • A batch size equal to the entire time instant,
  • Gradient descent with no optimizer, and
  • A single mini-batch.
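
The retention-rate bullet is the least self-explanatory setting above. One plausible reading, sketched below in standard Fortran with hypothetical array names (dqdt for time derivatives, harvest for uniform random numbers), is that every grid point with a nonzero time derivative is kept while only a random 30% of the points with vanishing derivatives survive; the actual filter in app/train-cloud-microphysics.f90 may differ.

  program retention_sketch
    implicit none
    integer, parameter :: n = 1000
    real, parameter :: retention_rate = 0.30, tolerance = 0.
    real :: dqdt(n), harvest(n)
    logical :: keep(n)

    call random_number(dqdt)          ! stand-in values for dq/dt
    where (dqdt < 0.7) dqdt = 0.      ! make roughly 70% of them vanish
    call random_number(harvest)

    ! Keep every point with a nonzero time derivative, plus a random 30%
    ! of the points where the time derivative vanishes.
    keep = (abs(dqdt) > tolerance) .or. (harvest < retention_rate)

    print *, count(keep), "of", n, "grid points retained for training"
  end program retention_sketch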

The program shuffles the data set in order to facilitate stochastic gradient descent. However, because a single mini-batch is used, the cost function is computed across the entire data set, which negates the value of shuffling and presumably makes this batch gradient descent rather than stochastic gradient descent.
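
To make that reduction concrete, the standalone sketch below fits a toy one-parameter least-squares model; it is not Inference-Engine code, x, y, and w are made-up names, and learning_rate reuses the 3.0 adopted later in this pull request. With num_mini_batches = 1, the single gradient in the loop averages over the entire data set, so the update is exactly one step of batch gradient descent and any prior shuffling is irrelevant.

  program one_mini_batch_sketch
    implicit none
    integer, parameter :: n = 8, num_mini_batches = 1
    real, parameter :: learning_rate = 3.0   ! value adopted later in this PR
    real :: x(n), y(n), w, gradient
    integer :: batch, first, last, points_per_batch

    call random_number(x)
    y = 2.*x          ! toy data for a one-parameter model y ~ w*x
    w = 0.            ! zero initial weight, matching the run described above

    points_per_batch = n/num_mini_batches
    do batch = 1, num_mini_batches
      first = (batch - 1)*points_per_batch + 1
      last  = batch*points_per_batch
      ! Mean-squared-error gradient over this mini-batch: with one mini-batch,
      ! [first,last] spans the whole data set, so shuffling cannot matter.
      gradient = sum(2.*(w*x(first:last) - y(first:last))*x(first:last)) &
                 /points_per_batch
      w = w - learning_rate*gradient         ! gradient descent, no optimizer
    end do

    print *, "weight after one epoch:", w
  end program one_mini_batch_sketch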

Because a single time instant is used, this case reflects the behavior that might be expected if Inference-Engine were integrated into ICAR and training happened during an ICAR run. In such a scenario, it might be desirable to iterate on each time instant as soon as the corresponding time step completes. Doing so could either

  • Pretrain the network, promoting faster convergence in a subsequent training session that uses data saved from the ICAR run, or
  • Obviate the need to save large training data sets for subsequent training.
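
A structural sketch of that in-situ scenario appears below. Both icar_time_step and train_on are hypothetical placeholders rather than actual ICAR or Inference-Engine procedures; their bodies only mimic the shape of the real work.

  program in_situ_training_sketch
    implicit none
    integer :: time_step
    real :: state(3), weights(3)

    weights = 0.                        ! zero initial weights, as above

    do time_step = 1, 5
      call icar_time_step(state)        ! placeholder for one ICAR step
      call train_on(state, weights)     ! train on this instant immediately,
    end do                              ! instead of writing it to disk

  contains

    subroutine icar_time_step(state)
      real, intent(out) :: state(:)
      call random_number(state)         ! stand-in for real model output
    end subroutine

    subroutine train_on(state, weights)
      real, intent(in)    :: state(:)
      real, intent(inout) :: weights(:)
      weights = weights + 0.1*state     ! stand-in for a training pass
    end subroutine

  end program in_situ_training_sketch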

Commit messages

rm notification of the (re)writing of the network file at each time step

Set the zero-time-derivative point retention to 1%, the learning
rate to 3.0, and the number of hidden layers to 6.

If a "stop-training" file exists in the directory where
app/train-cloud-microphysics.f90 is running, the program will
initiate normal termination after completing the first epoch
in which the file is found.
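
The stop-training check described in the last commit message can be expressed with standard Fortran intrinsics alone. The sketch below is not the app/train-cloud-microphysics.f90 source; it is a minimal illustration, assuming the file is polled once per epoch, of the end-of-epoch check the message describes.

  program stop_training_sketch
    implicit none
    integer, parameter :: max_epochs = 120   ! as in the run above
    integer :: epoch
    logical :: stop_requested

    do epoch = 1, max_epochs
      ! ... one epoch of training would run here ...
      inquire(file="stop-training", exist=stop_requested)
      if (stop_requested) then
        print *, "stop-training file found; stopping after epoch", epoch
        exit                               ! normal termination after this epoch
      end if
    end do
  end program stop_training_sketch
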
@rouson merged commit c51e218 into main Sep 12, 2023
@rouson deleted the converge branch September 12, 2023 04:32