Apparent unexpected difference in results when re-using a binary Network file #116

bwynneHEP · 2020-04-21T15:27:21Z

This is posted from the UK RAMP “RED TEAM” at Edinburgh University, as invited through UKRI and RAMP with agreement of Microsoft. We have taken the charge of kicking the tires on the code as a pseudo public user. We are testing documentation, stability and usability based upon the documentation we can find on the git repository. Thus, if we report something that is “well known” then it is in effect a report on the lack of documentation which would have made this clear.

Issue: Apparent unexpected difference in results when re-using a binary Network file

The expected mode of operation for the covid-sim package seems to be a single run with nominal conditions, followed by subsequent runs with variations to parameters: this is logical.

Time-saving measures are implemented by allowing the nominal run to save binary representations of the population density and network data. These files can be loaded and re-used in subsequent runs, rather than being re-created each time (using the /S: and /L: command line arguments)

However, there appears to be a bug in either the creation or re-use of the network file. If we attempt two completely identical runs, only varying in that the second should use the network file produced by the first, the results are quite different:

This mostly manifests as a time-offset, but there are numerical changes beyond that.

Re-use of the binary population density file seems to have no associated problems.

matt-gretton-dann · 2020-04-21T17:29:49Z

Thank you for raising this issue.

We are aware of some small non-determinisms when using multiple threads to set up the network of people and places. (Look for the omp critical pragmas in the code). This has historically been considered acceptable because of the general stochastic nature of the model. We do expect the code to be deterministic when using a single-thread.

Please also note that the networks are highly sensitive to the input files and command line options. The network files are also platform dependent.

We are planning on making the parsing and reading of input files more robust.

It would be helpful if you could give us the command lines you were using to demonstrate these differences, so that we can investigate further.

bwynneHEP · 2020-04-21T17:57:02Z

This is on the same machine, running single threaded, with no variation in parameters. Completely identical command lines other than replacing /S: with /L: in the second case.

You can test this with any example you like, but I can provide the specific commands I ran if you don't mind all the paths and input files being local to my machine:

/home/bwynne/RAMP/ParameterScan/build/CovidSim /c:1 /A:/home/bwynne/RAMP/ParameterScan/justWales.txt /PP:/home/bwynne/covid-sim/data/param_files/preUK_R0=2.0.txt /P:/home/bwynne/covid-sim/data/param_files/p_PC7_CI_HQ_SD.txt /O:/home/bwynne/RAMP/ParameterScan/referenceOutput /D:/home/bwynne/RAMP/ParameterScan/referenceOutput/wpop_eur.txt /M:/home/bwynne/RAMP/ParameterScan/referenceOutput/United_Kingdom_pop_density.bin /S:/home/bwynne/RAMP/ParameterScan/referenceOutput/Network_United_Kingdom_T1_R3.0.bin /R:1.5 98798150 729101 17389101 4797132
/home/bwynne/RAMP/ParameterScan/build/CovidSim /c:1 /A:/home/bwynne/RAMP/ParameterScan/justWales.txt /PP:/home/bwynne/covid-sim/data/param_files/preUK_R0=2.0.txt /P:/home/bwynne/covid-sim/data/param_files/p_PC7_CI_HQ_SD.txt /O:/home/bwynne/RAMP/ParameterScan/referenceOutput /D:/home/bwynne/RAMP/ParameterScan/referenceOutput/wpop_eur.txt /M:/home/bwynne/RAMP/ParameterScan/referenceOutput/United_Kingdom_pop_density.bin /L:/home/bwynne/RAMP/ParameterScan/referenceOutput/Network_United_Kingdom_T1_R3.0.bin /R:1.5 98798150 729101 17389101 4797132

You can see numerical differences in the results output to stdout:
t=719 607717 0|0 2470444 21839 [=3100000] 21839 (0 0 0) 0.00186767
versus
t=719 611487 0|0 2466630 21883 [=3100000] 21883 (0 0 0) 0.00186767

You can also compare the severity/age outputs - they differ significantly as I showed in the plots above.

Simply removing the "/L:" argument causes the code to run correctly, albeit without the speed improvement of re-using the network.

Note that this is not a random fluctuation but a constant effect: multiple sets of results all re-using the same network file (with no variation of parameters) will be identical, all with the same discrepancy from the original.

bwynneHEP · 2020-04-21T18:54:53Z

Regarding randomness:

If the RNG is used in the generation of the network, but not when a cached network is loaded, this might explain the discrepancy. The random sequence in the former case would be advanced relative to the latter.

This would produce a constant effect since the random seeds are identical for the jobs. A simple fix would be to re-initialise the RNG after generation of the network.

However, it could of course be a straightforward bug in the loading/saving of the file.

There were a couple of places where we use P.newseed1 = (int)(ranf() * 1e8); to make new seeds, but Rand's setall already implements RNG splitting, so we now use that instead. This needed a few knock-on changes in the places where we seed the RNG, as if we wanted to reuse the seed values then we now need to store them. Finally, we now also split the seeds used for the setup phase, and reinitialise the RNG with them after the network has been created or loaded. It looks like historically the second pair of seeds had been used at this point, to make the runs identical regardless of how the network was made, but that this had been changed when seed-resetting was implemented. Now the determinism is back, fixing #116.

matt-gretton-dann self-assigned this Apr 21, 2020

matt-gretton-dann assigned igfoo and unassigned matt-gretton-dann Apr 21, 2020

igfoo mentioned this issue Apr 23, 2020

Refactor setting of RNG seeds #121

Merged

matt-gretton-dann linked a pull request Apr 25, 2020 that will close this issue

Refactor setting of RNG seeds #121

Merged

matt-gretton-dann closed this as completed in #121 Apr 25, 2020

This was referenced May 17, 2020

Replace monolithic design with modular code #208

Closed

Decompose into pure functions and eliminate mutable globals. #271

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Apparent unexpected difference in results when re-using a binary Network file #116

Apparent unexpected difference in results when re-using a binary Network file #116

bwynneHEP commented Apr 21, 2020

matt-gretton-dann commented Apr 21, 2020

bwynneHEP commented Apr 21, 2020

bwynneHEP commented Apr 21, 2020

Apparent unexpected difference in results when re-using a binary Network file #116

Apparent unexpected difference in results when re-using a binary Network file #116

Comments

bwynneHEP commented Apr 21, 2020

matt-gretton-dann commented Apr 21, 2020

bwynneHEP commented Apr 21, 2020

bwynneHEP commented Apr 21, 2020