
Fix GitHub action failure bug #107

Merged: 41 commits merged into main from 101_fix_github_action_failure_bug on Dec 6, 2022

Conversation

fankiat (Owner) commented on Dec 1, 2022

[WIP] Fixes #101

I suspect the CI failure might be coming from numba. This PR is opened before the issue is completely fixed so that the GitHub actions run with each attempt to debug the underlying problem.

@fankiat added the bug (Something isn't working) and prio:medium (Medium priority) labels on Dec 1, 2022
@fankiat self-assigned this on Dec 1, 2022
fankiat (Owner, Author) commented on Dec 1, 2022

UPDATE: fastmath does not seem to be the issue after further investigation. See update in comments below.

It seems that fastmath in numba is doing something weird in generate_eulerian_to_lagrangian_grid_interpolation_kernel_2d. Removing fastmath has passed the CI consistently so far. This could again be system dependent, since both with and without fastmath work fine on my MBA 2022 (ARM M2) laptop; I will test this again when I get my hands on my MBP 2018 (Intel). In any case, removing fastmath does not seem to change the runtime much. I've measured some preliminary timings with 4 MPI processes; below are the comparisons averaged over 1000 kernel calls for a Lagrangian grid of size (3, 1000) and an Eulerian grid of size (2048, 2048).

(1) With fastmath

Scalar eul-to-lag grid interp took 2.7408207999542356e-05 s/call
Vector eul-to-lag grid interp took 5.4061957984231415e-05 s/call

(2) Without fastmath

Scalar eul-to-lag grid interp took 2.9133250005543232e-05 s/call
Vector eul-to-lag grid interp took 5.913654190953821e-05 s/call

I will perform a few more tests before proceeding with removing the fastmath option from numba.
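
For context, here is a minimal sketch of how the fastmath flag enters a numba-jitted interpolation-style kernel; the generator name and kernel body below are illustrative stand-ins, not the actual sopht-mpi implementation:

```python
import numba as nb


def generate_eul_to_lag_interp_kernel(enable_fastmath=False):
    # fastmath relaxes strict IEEE-754 semantics (e.g. it allows operation
    # reordering), which can change floating point results slightly and may
    # behave differently across systems/compilers.
    @nb.njit(fastmath=enable_fastmath)
    def eul_to_lag_interp(lag_field, eul_field, weights, start_idx):
        # Illustrative gather: each lagrangian node accumulates weighted
        # eulerian values from its local support.
        for i in range(lag_field.shape[0]):
            acc = 0.0
            for j in range(weights.shape[1]):
                acc += weights[i, j] * eul_field[start_idx[i] + j]
            lag_field[i] = acc

    return eul_to_lag_interp
```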

fankiat (Owner, Author) commented on Dec 1, 2022

UPDATE: fastmath does not seem to be the issue after further investigation. See update in comments below.

I did a quick and dirty profiling of the kernels in EulerianLagrangianGridCommunicator with and without fastmath, for the same settings as above (mpirun -np 4, Eulerian grid size (2048, 2048), Lagrangian grid size (3, 1000)), with timings averaged over 1000 kernel calls. The results below suggest little benefit from fastmath. Therefore, for the time being, until a proper profiling tool is available and an extensive performance evaluation is done, I am removing the fastmath option from all the numba kernels.

With fastmath

local_eulerian_grid_support_of_lagrangian_grid_kernel took 8.015662501566112e-05 s/call
eulerian_to_lagrangian_grid_interpolation_kernel (scalar) took 2.886795892845839e-05 s/call
eulerian_to_lagrangian_grid_interpolation_kernel (vector) took 6.536329200025648e-05 s/call
lagrangian_to_eulerian_grid_interpolation_kernel (scalar) took 0.0002470621250104159 s/call
lagrangian_to_eulerian_grid_interpolation_kernel (vector) took 0.00039066899998579176 s/call
interpolation_weights_kernel took 0.00016611195809673518 s/call

Without fastmath

local_eulerian_grid_support_of_lagrangian_grid_kernel took 7.912083296105266e-05 s/call
eulerian_to_lagrangian_grid_interpolation_kernel (scalar) took 3.2219332992099224e-05 s/call
eulerian_to_lagrangian_grid_interpolation_kernel (vector) took 5.795166699681431e-05 s/call
lagrangian_to_eulerian_grid_interpolation_kernel (scalar) took 0.0001622442079242319 s/call
lagrangian_to_eulerian_grid_interpolation_kernel (vector) took 0.00030771862505935135 s/call
interpolation_weights_kernel took 0.00016169204202014952 s/call
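
For reference, the per-call numbers above come from a simple averaging loop of roughly this shape (a sketch only; time_kernel and its arguments are illustrative and not part of the repository):

```python
import time

from mpi4py import MPI


def time_kernel(name, kernel, args, num_calls=1000):
    comm = MPI.COMM_WORLD
    # Warm-up call so numba JIT compilation time is excluded from the timing.
    kernel(*args)
    comm.Barrier()
    start = time.perf_counter()
    for _ in range(num_calls):
        kernel(*args)
    elapsed_per_call = (time.perf_counter() - start) / num_calls
    # Report the slowest rank's average as the effective per-call cost.
    max_elapsed = comm.reduce(elapsed_per_call, op=MPI.MAX, root=0)
    if comm.rank == 0:
        print(f"{name} took {max_elapsed} s/call")
```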

bhosale2 (Collaborator) commented on Dec 2, 2022

@fankiat a minor comment here: once you figure out the settings that work, I would recommend one of two options:

  1. Rebase the commit history to that final commit, to keep the commit history clean.
  2. Make a new PR to resolve this, with that commit as the only commit. This PR can then be closed without merging and kept in the log for future reference.

fankiat (Owner, Author) commented on Dec 2, 2022

@bhosale2 I'll probably squash the commits before merging once it's done. What do you think?

bhosale2 (Collaborator) commented on Dec 2, 2022

@fankiat that works as well.

@fankiat added the prio:high (High priority) label and removed the prio:medium (Medium priority) label on Dec 3, 2022
fankiat (Owner, Author) commented on Dec 5, 2022

@bhosale2 some updates here. As I work on this, I am starting to think this is potentially a combined issue of numba (likely caching) and possibly flaky tests (which test suites running in parallel are typically more susceptible to). To mitigate both potential sources of the issue, I have done the following, and it seems to have resolved the problem (GitHub actions passed ~10 times without any issue, see 4bb81eb).

For mitigating possible issues from numba

  • Remove cache=True for all numba kernels. The rationale is that sopht-mpi is meant to be executed with a larger number of processes, and it is probably better for all processes to spend some time compiling their own numba kernels than to load them from disk (see the issue mentioned by one of the main contributors of numba). This makes sense especially given that we don't know whether numba's cache read/write is tested to be robust at a decently large scale (again mentioned here, by the same contributor).
  • Rearrange the order of compiling eul-lag interpolation kernels where the kernel can be either scalar or vector. The updated EulerianLagrangianGridCommunicator2D.py ensures that only the required kernel is compiled and returned, i.e. kernel generation moves into the if-else branch that checks n_components, instead of generating both scalar and vector versions and returning only the requested one (a sketch follows below).
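
A minimal sketch of the second point, with illustrative kernel bodies (the actual generators in EulerianLagrangianGridCommunicator2D.py are more involved): kernel generation sits inside the n_components check, so only the requested variant is compiled, and neither cache=True nor fastmath is passed to njit.

```python
import numba as nb


def generate_eul_to_lag_interp_kernel_2d(n_components=1):
    if n_components == 1:
        # Scalar variant: compiled only when a scalar kernel is requested.
        @nb.njit
        def scalar_kernel(lag_field, eul_field, weights, start_idx):
            for i in range(lag_field.shape[0]):
                acc = 0.0
                for j in range(weights.shape[1]):
                    acc += weights[i, j] * eul_field[start_idx[i] + j]
                lag_field[i] = acc

        return scalar_kernel

    # Vector variant: compiled only when n_components > 1.
    @nb.njit
    def vector_kernel(lag_field, eul_field, weights, start_idx):
        for c in range(lag_field.shape[0]):
            for i in range(lag_field.shape[1]):
                acc = 0.0
                for j in range(weights.shape[1]):
                    acc += weights[i, j] * eul_field[c, start_idx[i] + j]
                lag_field[c, i] = acc

    return vector_kernel
```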

For mitigating possible issues coming from pytest

  • Introduce pytest cache clearing, as recommended for invocations from Continuous Integration servers where isolation and correctness are more important than speed.
  • Update the test approach in the immersed boundary test cases by first reducing the results from all ranks before asserting correctness. This way, all ranks fail a test if any single rank performing that test fails. The goal is to open up mitigation options such as rerunning failed (flaky) tests, as suggested here, without running into notorious deadlock situations (see the sketch after this list).
  • Allow for reruns via pytest-rerunfailures when flaky test results occur.
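
A sketch of the reduce-before-assert pattern from the second bullet (the test body and field values are illustrative placeholders, not the actual immersed boundary tests): every rank computes a local pass/fail flag, the flags are combined with a logical AND across ranks, and all ranks assert on the combined result, so a failure on any rank fails the test everywhere instead of leaving ranks stuck in mismatched collective calls.

```python
import numpy as np
import pytest
from mpi4py import MPI


@pytest.mark.flaky(reruns=2)  # rerun on failure via pytest-rerunfailures
def test_interpolation_matches_reference_on_all_ranks():
    comm = MPI.COMM_WORLD
    # Placeholder stand-ins for the rank-local interpolated field and the
    # reference solution drawn from sopht.
    local_result = np.full(8, comm.rank, dtype=np.float64)
    local_reference = np.full(8, comm.rank, dtype=np.float64)
    # Each rank computes its own pass/fail flag ...
    local_ok = bool(np.allclose(local_result, local_reference))
    # ... and the flags are reduced with a logical AND, so a failure on any
    # single rank fails the test on every rank.
    all_ok = comm.allreduce(local_ok, op=MPI.LAND)
    assert all_ok
```

On the CI side, the cache clearing corresponds to pytest's --cache-clear flag, and reruns can also be requested globally with the --reruns option from pytest-rerunfailures instead of per-test markers.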

@fankiat changed the title from "[WIP] fix GitHub action failure bug" to "Fix GitHub action failure bug" on Dec 5, 2022
@fankiat requested a review from bhosale2 on December 5, 2022 at 23:31
fankiat (Owner, Author) commented on Dec 6, 2022

Some additional notes on this: there are two errors that typically cause the CI to fail.

  1. Broadcast error, where numba kernels encounter inconsistent array shapes. I believe this error comes from numba caching when multiple processes are involved. The changes to the numba kernels in this PR should resolve it.
  2. Value error, where the interpolated values on the Lagrangian grid sometimes have one element that is slightly different from the solution we draw from sopht. This issue is a little different; I am starting to think it comes from the test case being set up assuming zero rounding error. I will probably look into this in a separate issue. In any case, the pytest-side changes in this PR nonetheless provide an additional safeguard against pytest caching issues and flaky tests.

Please let me know if you want to discuss this further to clarify things. Thanks @bhosale2 !

@fankiat merged commit 68f30cf into main on Dec 6, 2022
@fankiat deleted the 101_fix_github_action_failure_bug branch on December 6, 2022 at 04:25
Labels: bug (Something isn't working), prio:high (High priority)
Linked issue that merging may close: Random occasional failure in CI