Different parca output on Sherlock vs local #931
I've found simData from runParca.py to differ between my local machine and Google Cloud machines, and between different Google Cloud machines (one was running Debian, the other Ubuntu). That was true even before the py3 migration.
Hypotheses:
Serial vs parallel results were the same locally and on Sherlock, so it seems to be reproducible on a given machine, which is good, but not reproducible between machines. It is interesting that this showed up before py3 on Google Cloud, but it was at least consistent between my machine and Sherlock before the switch. Certainly some good hypotheses, Jerry!
I checked again and it looks like we might have consistent parca outputs (maybe it was a library issue with the recent update to Theano), but there is still a sim diff between the two. It looks like we've run into another OpenBLAS thread issue. I pushed a branch that does a dot product (the arrays were pulled from a sim that was producing different results) and produces different results depending on the OPENBLAS_NUM_THREADS environment variable. I'm not sure if there's an easy way to turn this into a unit test since I don't think we can change the number of threads after loading numpy, but this can at least show the problem. Local machine:
Sherlock with wcEcoli3 and 8 cpus:
The old pyenv (wcEcoli2) on Sherlock gave consistent results:
Awesome! A compact, repeatable test case is usually the hardest step in debugging. We could probably make a unit test that uses multiprocessing to exec a process with a different `OPENBLAS_NUM_THREADS` value, since the environment variable has to be set before numpy loads.
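Something like this minimal sketch, using `subprocess` rather than `multiprocessing` since each child just needs a fresh interpreter with the variable set before numpy is imported (the array values and the exact assertion are illustrative, not the actual `test-openblas` branch):

```python
import os
import subprocess
import sys
import unittest

# Child program: OPENBLAS_NUM_THREADS is already in its environment before
# numpy loads, so OpenBLAS picks up the thread count at import time.
_CHILD = """
import numpy as np
a = np.linspace(0.0, 1.0, 10_000)
b = np.linspace(1.0, 0.0, 10_000)
print(repr(float(a.dot(b))))
"""

class TestOpenblasThreads(unittest.TestCase):
    def test_dot_product_is_thread_count_independent(self):
        results = set()
        for n_threads in (1, 2, 4, 8):
            env = dict(os.environ, OPENBLAS_NUM_THREADS=str(n_threads))
            out = subprocess.run(
                [sys.executable, '-c', _CHILD],
                env=env, capture_output=True, text=True, check=True)
            results.add(out.stdout.strip())
        # Zero tolerance: every thread count must give bit-identical output.
        self.assertEqual(len(results), 1, results)

if __name__ == '__main__':
    unittest.main()
```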
I turned test-openblas into a unit test. Here's the output on my Mac, running with Numpy's embedded OpenBLAS:
|
Nice! I guess now we just need to check versions of openblas that would produce the right results and then we can add that unit test in. We could also check the number of CPUs available and only run up to that number instead of running 12 each time.
I made a small update to the branch to include a test with OPENBLAS_NUM_THREADS being unset and only up to the number of available CPUs.
Cool. I'm testing this in various environments, so far just on my Mac:
@U8NWXD were your tests on GCloud in a Docker container? What was `OPENBLAS_NUM_THREADS` set to?

Q: Parallelizing a vector dot product presumably divides the two vectors into one chunk per thread and adds the threads' portions of the dot product. Floating point rounding will thus vary with the number of chunks. How much variation should be acceptable? Where should we set the tolerance?

Some measurements:
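To make the chunking point concrete, here's a self-contained illustration (pure numpy slicing, not OpenBLAS itself): computing the same dot product in one piece vs. as two chunks whose partial sums are then added generally lands on a slightly different value, because float addition is not associative.

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.standard_normal(100_001)   # odd length so the "threads" split unevenly
b = rng.standard_normal(100_001)

whole = a.dot(b)                    # computed as one chunk
half = len(a) // 2
split = a[:half].dot(b[:half]) + a[half:].dot(b[half:])   # two "per-thread" partials

print(whole == split)               # usually False
print(whole - split)                # tiny but nonzero

# The root cause is plain float non-associativity:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
```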
Maybe this is just a problem that can't be solved if we allow parallelization in numpy, since we'll always have issues with floating point rounding. I think the main issue arises from the stochastic nature of our sims. Any bit of floating point imprecision could lead to a slightly different random result which produces one molecule instead of another, which can greatly amplify the difference (even if it's only off by 1e-17, it could produce an essential protein that is needed vs some other protein). I think for that reason we would need zero tolerance for the difference. wcEcoli2 on Sherlock produced the same results no matter how many threads were specified, but that might just be because that OpenBLAS library was installed to only use 1 thread, and not actually an issue with any of the other OpenBLAS versions tried. I've run with OPENBLAS_NUM_THREADS=1 for a long time, which is why I could always reproduce the issues from Jenkins runs on Sherlock with wcEcoli2.
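As a toy illustration (not code from the model) of why even ~1e-16 matters: once a value that feeds an integer count lands on the other side of a boundary, the two runs differ by a whole molecule, and the stochastic trajectories never re-converge.

```python
import numpy as np

# Hypothetical "demand" values that differ only by summation-order noise,
# e.g. the same dot product computed with 1 thread vs. 8 threads.
demand_one_thread = 3.0000000000000004
demand_eight_threads = 2.9999999999999996   # ~4e-16 lower

# Converting to a discrete molecule count amplifies that noise to a count of 1,
# and every later timestep then evolves from a different state.
print(int(np.floor(demand_one_thread)))     # 3
print(int(np.floor(demand_eight_threads)))  # 2
```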
@1fish2 I was not using Docker, and OPENBLAS_NUM_THREADS was not set. The number of threads available to openblas could have explained the different behavior.
Agreed. We might have to limit it to 1 thread.

`NO_AVX2=1` produced identical results with 1-5 threads and slightly closer results with 6+ threads. What does that imply? It's not the wildly wrong results that made us set `NO_AVX2=1` for Docker for Mac. I guess using the vectorization hardware entails breaking the chunks into smaller vectorization subchunks. (FWIW, 3 runs each at `NO_AVX2=1` and `NO_AVX2=0` took essentially the same amount of elapsed time, but this duration is probably dominated by process exec/quit, piping, and printing.)
BTW the embedded openblas in numpy & scipy gets the same results in this test as the manually-compiled OpenBLAS.
Hypotheses:
Q. Is this OpenBLAS threads difference the only cause of parca differences? @tahorst, did this test data come from a parca run?
More tests on Mac
@tahorst are you getting similar results on your local machine? Did it ever produce consistent results on macOS with varying `OPENBLAS_NUM_THREADS`?

Hypotheses left?
I first noticed a difference between Sherlock and local runs when I couldn't reproduce the Jenkins failures locally. I checked the parca output and it showed differences. Then there were some updates (newer Theano version and other changes) and it looked like the parca output was the same but sim output differed, so I don't think there are actually any parca differences any more (it could be worth it to confirm again with THREADS unset). This test data is actually from a sim calculating the mass update for a time step.
I get the same results for all local environments tested, and it looks like it matches yours up to 5 threads: Python 3.8.3, numpy==1.19.0, openblas default from numpy
I'm thinking at this point we just make sure we set OPENBLAS_NUM_THREADS=1 (or install openblas with an option for only one thread). It seems like the OpenBLAS library linked on Sherlock for wcEcoli2 (0.3.9) is the only one that is independent of the number of threads specified. Did you test OpenBLAS version 0.3.9 in any other setups?
I got the unit test to pass locally by installing a single-thread version of OpenBLAS (instructions from Sherlock with
I think we can either specify instructions to install OpenBLAS with a single thread or use OPENBLAS_NUM_THREADS=1, and that should fix this issue.
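If we go the `OPENBLAS_NUM_THREADS=1` route, the main gotcha is that it has to be in the environment before numpy is imported anywhere in the process. A sketch of what that could look like at the very top of an entry-point script (the placement and warning are illustrative, not the actual wcEcoli entry points):

```python
# Must run before numpy/scipy (or anything that imports them) is loaded.
import os
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")

import numpy as np  # noqa: E402  -- imported intentionally after setting the env var

if os.environ["OPENBLAS_NUM_THREADS"] != "1":
    # A pre-set value (e.g. from a shell profile) would reintroduce multithreaded BLAS.
    print("Warning: OPENBLAS_NUM_THREADS != 1; results may not be reproducible")
```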
Yes, and lately I've been using 0.3.10 on my Mac and in

Initially we had to clone the OpenBLAS repo to get an unreleased bug fix. Now there are downloadable releases up to 0.3.10. Travis found that building OpenBLAS with
Testing a cloud-built wcm-code using the

(1) That further validates Travis' idea that we can build Docker Images with

(2) That's surprising since the macOS host outside Docker was not getting consistent results (see above) but -- whoa! -- it does now! I do not know what change made that happen, but testing with a Python 3.8.5 virtualenv that has

(3) Inside Docker-for-Mac without the
@tahorst, does this happen in Docker-for-Windows?

(4) I tried keeping the openblas build directory, then running

(5) runParca gets nearly done then prints:

and exits with code 137. ==> I had to adjust Docker Desktop Preferences/Resources/Memory from 2GB to 4GB. (Would 3GB suffice?) I'm adding a note to the README.

(6)
The results were inside a Docker container? Do you think there was only 1 CPU available when it was built in the cloud so when compiling OpenBLAS, the max number of threads was set to 1?
Docker-for-Windows actually runs through WSL2, which is a Linux kernel, so unsurprisingly it gets the same behavior we see with other Linux environments, and writes as root if the
That's for the new image run on Mac? You still get good results when running the parca/sim though?
There's actually already a note about this in docs/README.md:
Results (1) were inside the container (where it thinks there are 6 CPUs -- default Docker allocation), while results (2) were outside the container (where it thinks there are 12 CPUs).
Could be. Any idea how to tell by examining the
Yes.
I just started 2 sim gens. We'll see.
Arg. I'll change it to a setup step so it's harder to miss.
Do you have the build log? It should say if it was single-threaded when it was compiled or how many CPUs it was configured for. Not sure how to get that info after compilation...
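One thing that might help answer that without the build log is inspecting the loaded library at runtime: `numpy.show_config()` reports which BLAS numpy was built against, and the `threadpoolctl` package (an extra dependency, not assumed to be installed already) reports the OpenBLAS version and the thread count it's configured to use. A sketch:

```python
import numpy as np
from threadpoolctl import threadpool_info

np.show_config()  # which BLAS/LAPACK numpy was linked against

# Each entry describes one loaded threadpool-managed library, including
# its file path, version string, and current number of threads.
for lib in threadpool_info():
    if lib.get("internal_api") == "openblas":
        print(lib["filepath"], lib.get("version"), lib.get("num_threads"))
```

Whether a `USE_THREAD=0` build shows up there at all is something I'd have to check.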
The build log confirms that OpenBLAS was single-threaded:

I'll test `USE_THREAD=0` on Mac to see if that gets results consistent with the others. In the previous tests inside and outside Docker, the parca and sim outputs differ, although I didn't run analysis plots yet. Here are the behavior metrics, if that helps judge whether the results are good:

docker-numpy-1.19.1:
mac-numpy-1.19.1:
@tahorst do you have a test case for different output with
What kind of output do you want for the test case?
Maybe more than one test case.
I'm not sure exactly where the differences are coming from, so I don't know if we can create a simple test case for the
We could brainstorm debugging techniques. E.g. run w/ and w/o
You're thinking of looking at different points when the parca is running? My guess from the diffs that show up when comparing the final sim_data objects from the different Docker containers w/ and w/o |
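If we try that, a rough sketch of one way to do it: dump intermediate arrays at a few numbered checkpoints during the parca on both machines, then report the first checkpoint where the runs diverge (the directory and checkpoint names here are made up):

```python
import os
import numpy as np

CHECKPOINT_DIR = "out/parca_checkpoints"  # hypothetical location

def checkpoint(step, name, array):
    """Save an intermediate array; prefix with a step number to preserve call order."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    fname = "{:04d}_{}.npy".format(step, name)
    np.save(os.path.join(CHECKPOINT_DIR, fname), np.asarray(array))

def compare_checkpoints(dir_a, dir_b):
    """Print the first checkpoint (in step order) where two runs differ."""
    for fname in sorted(os.listdir(dir_a)):
        a = np.load(os.path.join(dir_a, fname))
        b = np.load(os.path.join(dir_b, fname))
        if a.shape != b.shape:
            print("first divergence at", fname, "(shape mismatch)")
            return
        if not np.array_equal(a, b):
            print("first divergence at", fname,
                  "max abs diff", np.max(np.abs(a - b)))
            return
    print("no divergence found in the saved checkpoints")
```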
Good idea. Top level questions:
Findings on Sherlock (pyenv wcEcoli3-staging):
Findings in Docker:
Findings on Mac & Linux? Also:
The different ways to install OpenBLAS and NumPy × different runtime environments produce at least 7 different equivalence classes of Parca outputs. There does not seem to be an installation approach (even if the instructions are environment-specific) to get cross-platform consistent Parca output. See OpenMathLib/OpenBLAS#2244 (comment)

It could be useful to investigate where these Parca computations diverge, but I'm not inclined to spend the time on that.
* `test_openblas_threads.py` tests openblas with varying numbers of threads. The results are inconsistent in some installations, so **always** set `OPENBLAS_NUM_THREADS=1`.
* I looked for ways to set up runtime environments on each platform for consistent Parca results, but it did not pan out. See #931 (comment)
* Get the bug fix releases Python 3.8.5, numpy 1.19.2, and scipy 1.5.2. They didn't help this problem. I updated the Dockerfile and the pyenv `wcEcoli3-staging` on Sherlock and will update `wcEcoli3`, but it's unclear if it matters for y'all to follow suit in your development environments since results will still vary by OS and other details.
* It turns out that building a Docker Image on Linux needn't disable AVX2 vector instructions to use the Image on macOS. We have to do that only when building the Image on macOS, and it'll get different Parca results, so consider cloud-building Images with `cloud/build.sh` (= `cloud/build-runtime.sh` + `cloud/build-wcm.sh`) rather than `cloud/build-containers-locally.sh`.
* Improve the docs on setting up your pyenv, explaining this stuff, fixing some confusions like Sherlock setup, and adding workarounds like `runtime_library_dirs` and `LDFLAGS` for roadblocks. These docs are more informative but more complicated. A tech writer could help.
* I don't know when it's better to compile openblas vs. let numpy & scipy install their embedded copies, but I've written off using a package manager to install openblas.
* `compareParca.py` and `diff_simouts.py` now count the difference lines as a single figure of merit, and there's a CLI option to print just the count. This helped compare environments (table in #931 (comment)).
* Catch attempts to use Atlas MongoDB with GCloud. Atlas could be made to work there, but there's no point since the GCloud MongoDB is better. We just can't use the GCloud MongoDB without opening it to external access.
* Adjust `augment-pycharm-package-list.sh` for the new Vivarium package names. People might have to manually replace the old Vivarium package names in PyCharm's exception list. This'd be easier if the package and PyPI names matched: `vivarium_core`, `vivarium_cell`.
* `mypy.ini`: Include the new `colony/` directory in the source dirs to type-check.
* Drop Python 2.7 from PyCharm compatibility inspections.
The bug fix #980 for portability issue #979 raises this question: Is numpy's random number generator returning different results between @ggsun's local machine and Sherlock? What could cause that?
There are uses of the default random instance in prototypes/, tests, and sim variants (those should be OK), analysis scripts (problematic when analyses run in parallel but won't affect the sims), and polymerize (that's concerning). When moving to Python 3, we also moved from numpy 1.14.6 to 1.19.2. NEP 19 also says
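For reference, the distinction (illustrative snippet, not taken from the model code): module-level `np.random` calls all share one hidden global state, so their results depend on every other caller, whereas an explicitly seeded instance is self-contained. As I understand NEP 19, the legacy `RandomState` bit stream itself stays stable across numpy versions, so a changed call pattern is a likelier culprit than a changed stream.

```python
import numpy as np

# Shared global state: any other code that calls np.random.* in between
# (e.g. a parallel analysis script) shifts the sequence this code sees.
np.random.seed(42)
a = np.random.random(3)

# Self-contained, explicitly seeded instance: the draw sequence depends only
# on the seed and the calls made on this object.
rng = np.random.RandomState(42)
b = rng.random_sample(3)

print(np.array_equal(a, b))  # True here, but only because nothing else touched
                             # the global state between seeding and drawing
```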
When building a `wcm-runtime` Docker Image, make the default be to install numpy & scipy from wheels with their embedded copies of OpenBLAS. It's **much faster** and it's the usual approach so we're less likely to hit installer bugs like #1040. The reason for compiling OpenBLAS from source was originally to get a bug fix and later to avoid the AVX2 problem building it in Docker-for-Mac. This does change Parca outputs but that's all part of #931.
Also see #1097
The CACM article *Keeping Science on Keel When Software Moves* describes techniques and tools that a climate modeling project uses to debug reproducibility problems in their large models. The article lacks key details, but the FLiT tool's technique as described might be relevant. Also, all the examples in the article came down to compiler optimizations that used Fused Multiply-Add (FMA) floating point instructions.
It's nice to see there's a tool to test for the floating point issues, and it's interesting that other people have run into these problems. Do you think there are compiler flags for BLAS (I'm assuming that's where this is coming from, but it could be other dependencies) that make it slightly less efficient but reproducible? I also wonder if we can truncate/round floating point results in our code where these differences come up, so our precision is lower but we get above the machine-level noise.
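A rough sketch of the truncate/round idea, assuming we round to a fixed number of significant digits right after the BLAS-heavy steps so that ~1e-16 summation-order noise can't flip a downstream integer count (where to apply it and how many digits to keep would need experimentation):

```python
import numpy as np

def round_to_sig_figs(x, sig=12):
    """Round each element to `sig` significant digits to swamp last-bit noise."""
    x = np.asarray(x, dtype=float)
    with np.errstate(divide="ignore"):         # log10(0) -> -inf is handled below
        exponent = np.floor(np.log10(np.abs(x)))
    magnitude = np.where(x == 0.0, 1.0, np.power(10.0, exponent))
    return np.round(x / magnitude, sig - 1) * magnitude

# Two results that differ only by summation-order noise collapse to one value.
print(round_to_sig_figs(3.0000000000000004) == round_to_sig_figs(2.9999999999999996))  # True
```

The caveat is that the rounding boundary itself becomes a new (much rarer) place where two runs can land on different sides, so this reduces rather than eliminates the risk.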
It'd be good to brainstorm on this problem. More observations:
Interesting idea! It might require localizing the cause to particular pieces of code. Then it could erase some hardware variability and some order-of-evaluation variability. This option sounds promising for performance in Docker but riskier for reproducibility:
Another variable is the C++ compiler: clang on macOS vs. gcc on Ubuntu, Docker, and CentOS (Sherlock). We could set up clang (how?) for use by pyenv, pip, Aesara, Cython, & Numba.

What triggered this realization is that Santiago and Niels hit a problem on Ubuntu where the Parca calls Aesara, which fails to compile & link some code. A workaround is to install Python using
I noticed that Sherlock output with python3 does not match my local output. This starts with small floating point differences in nearly all expression-related items in `sim_data`. Does anyone else see a difference in parca output on their local environment vs Sherlock? I was using `python runscripts/debug/compareParca.py out/local out/sherlock` to compare local output to Sherlock output that I copied locally. Just wondering if it's an installation difference or a machine difference.