Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apparent unexpected difference in results when re-using a binary Network file #116

Closed
bwynneHEP opened this issue Apr 21, 2020 · 4 comments · Fixed by #121
Closed

Apparent unexpected difference in results when re-using a binary Network file #116

bwynneHEP opened this issue Apr 21, 2020 · 4 comments · Fixed by #121
Assignees

Comments

@bwynneHEP
Copy link
Collaborator

This is posted from the UK RAMP “RED TEAM” at Edinburgh University, as invited through UKRI and RAMP with agreement of Microsoft. We have taken the charge of kicking the tires on the code as a pseudo public user. We are testing documentation, stability and usability based upon the documentation we can find on the git repository. Thus, if we report something that is “well known” then it is in effect a report on the lack of documentation which would have made this clear.

Issue: Apparent unexpected difference in results when re-using a binary Network file

The expected mode of operation for the covid-sim package seems to be a single run with nominal conditions, followed by subsequent runs with variations to parameters: this is logical.

Time-saving measures are implemented by allowing the nominal run to save binary representations of the population density and network data. These files can be loaded and re-used in subsequent runs, rather than being re-created each time (using the /S: and /L: command line arguments)

However, there appears to be a bug in either the creation or re-use of the network file. If we attempt two completely identical runs, only varying in that the second should use the network file produced by the first, the results are quite different:

NetworkReuse

This mostly manifests as a time-offset, but there are numerical changes beyond that.

Re-use of the binary population density file seems to have no associated problems.

@matt-gretton-dann
Copy link
Collaborator

Thank you for raising this issue.

We are aware of some small non-determinisms when using multiple threads to set up the network of people and places. (Look for the omp critical pragmas in the code). This has historically been considered acceptable because of the general stochastic nature of the model. We do expect the code to be deterministic when using a single-thread.

Please also note that the networks are highly sensitive to the input files and command line options. The network files are also platform dependent.

We are planning on making the parsing and reading of input files more robust.

It would be helpful if you could give us the command lines you were using to demonstrate these differences, so that we can investigate further.

@matt-gretton-dann matt-gretton-dann self-assigned this Apr 21, 2020
@bwynneHEP
Copy link
Collaborator Author

This is on the same machine, running single threaded, with no variation in parameters. Completely identical command lines other than replacing /S: with /L: in the second case.

You can test this with any example you like, but I can provide the specific commands I ran if you don't mind all the paths and input files being local to my machine:

/home/bwynne/RAMP/ParameterScan/build/CovidSim /c:1 /A:/home/bwynne/RAMP/ParameterScan/justWales.txt /PP:/home/bwynne/covid-sim/data/param_files/preUK_R0=2.0.txt /P:/home/bwynne/covid-sim/data/param_files/p_PC7_CI_HQ_SD.txt /O:/home/bwynne/RAMP/ParameterScan/referenceOutput /D:/home/bwynne/RAMP/ParameterScan/referenceOutput/wpop_eur.txt /M:/home/bwynne/RAMP/ParameterScan/referenceOutput/United_Kingdom_pop_density.bin /S:/home/bwynne/RAMP/ParameterScan/referenceOutput/Network_United_Kingdom_T1_R3.0.bin /R:1.5 98798150 729101 17389101 4797132
/home/bwynne/RAMP/ParameterScan/build/CovidSim /c:1 /A:/home/bwynne/RAMP/ParameterScan/justWales.txt /PP:/home/bwynne/covid-sim/data/param_files/preUK_R0=2.0.txt /P:/home/bwynne/covid-sim/data/param_files/p_PC7_CI_HQ_SD.txt /O:/home/bwynne/RAMP/ParameterScan/referenceOutput /D:/home/bwynne/RAMP/ParameterScan/referenceOutput/wpop_eur.txt /M:/home/bwynne/RAMP/ParameterScan/referenceOutput/United_Kingdom_pop_density.bin /L:/home/bwynne/RAMP/ParameterScan/referenceOutput/Network_United_Kingdom_T1_R3.0.bin /R:1.5 98798150 729101 17389101 4797132

You can see numerical differences in the results output to stdout:
t=719 607717 0|0 2470444 21839 [=3100000] 21839 (0 0 0) 0.00186767
versus
t=719 611487 0|0 2466630 21883 [=3100000] 21883 (0 0 0) 0.00186767

You can also compare the severity/age outputs - they differ significantly as I showed in the plots above.

Simply removing the "/L:" argument causes the code to run correctly, albeit without the speed improvement of re-using the network.

Note that this is not a random fluctuation but a constant effect: multiple sets of results all re-using the same network file (with no variation of parameters) will be identical, all with the same discrepancy from the original.

@bwynneHEP
Copy link
Collaborator Author

Regarding randomness:

If the RNG is used in the generation of the network, but not when a cached network is loaded, this might explain the discrepancy. The random sequence in the former case would be advanced relative to the latter.

This would produce a constant effect since the random seeds are identical for the jobs. A simple fix would be to re-initialise the RNG after generation of the network.

However, it could of course be a straightforward bug in the loading/saving of the file.

igfoo added a commit that referenced this issue Apr 23, 2020
There were a couple of places where we use
    P.newseed1 = (int)(ranf() * 1e8);
to make new seeds, but Rand's setall already implements RNG splitting,
so we now use that instead.

This needed a few knock-on changes in the places where we seed the RNG,
as if we wanted to reuse the seed values then we now need to store them.

Finally, we now also split the seeds used for the setup phase, and
reinitialise the RNG with them after the network has been created or
loaded. It looks like historically the second pair of seeds had been
used at this point, to make the runs identical regardless of how the
network was made, but that this had been changed when seed-resetting was
implemented. Now the determinism is back, fixing #116.
@matt-gretton-dann matt-gretton-dann linked a pull request Apr 25, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
@igfoo @matt-gretton-dann @bwynneHEP and others