Feature: 1st converging cloud microphysics model #83

Merged: 6 commits into main from converge, Sep 12, 2023

Conversation

@rouson (Contributor) commented Sep 11, 2023

This pull request exhibits nearly monotonic convergence as measured by the cost function decreasing three orders of magnitude in the first 120 epochs:

[linked convergence plot]

To reproduce this behavior, execute

./build/run-fpm.sh run -- --base training --epochs 120 --stride 720

with the present working directory containing the 29.7 GB training_input.nc and training_output.nc produced for the "Colorado benchmark simulation" using commit d7aa958 on the neural-net branch of https://github.com/berkeleylab/icar, which uses the simplest of ICAR's cloud microphysics models. The Inference-Engine run uses

  • A single time instant (as determined by the above stride),
  • A 30% retention rate of grid points where time derivatives vanish (see the sketch after this list),
  • Zero initial weights and biases,
  • A batch size equal to the entire time instant,
  • Gradient descent with no optimizer, and
  • A single mini-batch.
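
The retention-rate bullet is the least self-explanatory setting above. One plausible reading, sketched below in standard Fortran with hypothetical array names (dqdt for time derivatives, harvest for uniform random numbers), is that every grid point with a nonzero time derivative is kept while only a random 30% of the points with vanishing derivatives survive; the actual filter in app/train-cloud-microphysics.f90 may differ.

  program retention_sketch
    implicit none
    integer, parameter :: n = 1000
    real, parameter :: retention_rate = 0.30, tolerance = 0.
    real :: dqdt(n), harvest(n)
    logical :: keep(n)

    call random_number(dqdt)          ! stand-in values for dq/dt
    where (dqdt < 0.7) dqdt = 0.      ! make roughly 70% of them vanish
    call random_number(harvest)

    ! Keep every point with a nonzero time derivative, plus a random 30%
    ! of the points where the time derivative vanishes.
    keep = (abs(dqdt) > tolerance) .or. (harvest < retention_rate)

    print *, count(keep), "of", n, "grid points retained for training"
  end program retention_sketch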

The program shuffles the data set in order to facilitate stochastic gradient descent. However, because a single mini-batch is used, the cost function is computed across the entire data set, which negates the value of shuffling and presumably makes this batch gradient descent rather than stochastic gradient descent.
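
To make that reduction concrete, the standalone sketch below fits a toy one-parameter least-squares model; it is not Inference-Engine code, x, y, and w are made-up names, and learning_rate reuses the 3.0 adopted later in this pull request. With num_mini_batches = 1, the single gradient in the loop averages over the entire data set, so the update is exactly one step of batch gradient descent and any prior shuffling is irrelevant.

  program one_mini_batch_sketch
    implicit none
    integer, parameter :: n = 8, num_mini_batches = 1
    real, parameter :: learning_rate = 3.0   ! value adopted later in this PR
    real :: x(n), y(n), w, gradient
    integer :: batch, first, last, points_per_batch

    call random_number(x)
    y = 2.*x          ! toy data for a one-parameter model y ~ w*x
    w = 0.            ! zero initial weight, matching the run described above

    points_per_batch = n/num_mini_batches
    do batch = 1, num_mini_batches
      first = (batch - 1)*points_per_batch + 1
      last  = batch*points_per_batch
      ! Mean-squared-error gradient over this mini-batch: with one mini-batch,
      ! [first,last] spans the whole data set, so shuffling cannot matter.
      gradient = sum(2.*(w*x(first:last) - y(first:last))*x(first:last)) &
                 /points_per_batch
      w = w - learning_rate*gradient         ! gradient descent, no optimizer
    end do

    print *, "weight after one epoch:", w
  end program one_mini_batch_sketch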

Because a single time instant is used, this case reflects the behavior that might be expected if Inference-Engine were integrated into ICAR and training happened during an ICAR run. In such a scenario, it might be desirable to iterate on each time instant as soon as the corresponding time step completes. Doing so could either

  • Pretrain the network, promoting faster convergence in a subsequent training session that uses data saved from the ICAR run, or
  • Obviate the need to save large training data sets for subsequent training.
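
A structural sketch of that in-situ scenario appears below. Both icar_time_step and train_on are hypothetical placeholders rather than actual ICAR or Inference-Engine procedures; their bodies only mimic the shape of the real work.

  program in_situ_training_sketch
    implicit none
    integer :: time_step
    real :: state(3), weights(3)

    weights = 0.                        ! zero initial weights, as above

    do time_step = 1, 5
      call icar_time_step(state)        ! placeholder for one ICAR step
      call train_on(state, weights)     ! train on this instant immediately,
    end do                              ! instead of writing it to disk

  contains

    subroutine icar_time_step(state)
      real, intent(out) :: state(:)
      call random_number(state)         ! stand-in for real model output
    end subroutine

    subroutine train_on(state, weights)
      real, intent(in)    :: state(:)
      real, intent(inout) :: weights(:)
      weights = weights + 0.1*state     ! stand-in for a training pass
    end subroutine

  end program in_situ_training_sketch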

Commit messages

rm notification of the (re)writing of the network file at each time step

Set the zero-time-derivative point retention to 1%, the learning
rate to 3.0, and the number of hidden layers to 6.

If a "stop-training" file exists in the directory where
app/train-cloud-microphysics.f90 is running, the program will
initiate normal termination after completing the first epoch
in which the file is found.
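
The stop-training check described in the last commit message can be expressed with standard Fortran intrinsics alone. The sketch below is not the app/train-cloud-microphysics.f90 source; it is a minimal illustration, assuming the file is polled once per epoch, of the end-of-epoch check the message describes.

  program stop_training_sketch
    implicit none
    integer, parameter :: max_epochs = 120   ! as in the run above
    integer :: epoch
    logical :: stop_requested

    do epoch = 1, max_epochs
      ! ... one epoch of training would run here ...
      inquire(file="stop-training", exist=stop_requested)
      if (stop_requested) then
        print *, "stop-training file found; stopping after epoch", epoch
        exit                               ! normal termination after this epoch
      end if
    end do
  end program stop_training_sketch
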
@rouson merged commit c51e218 into main Sep 12, 2023
@rouson deleted the converge branch September 12, 2023 04:32