
Fix GitHub action failure bug #107

Merged: 41 commits merged into main from 101_fix_github_action_failure_bug on Dec 6, 2022

Conversation

fankiat (Owner) commented on Dec 1, 2022

[WIP] Fixes #101

I suspect the CI failure might be coming from numba. This PR is opened before the issue is completely fixed so that the GitHub actions run with each attempt to debug the underlying problem.

@fankiat added the bug (Something isn't working) and prio:medium (Medium priority) labels on Dec 1, 2022
@fankiat self-assigned this on Dec 1, 2022
fankiat (Owner, Author) commented on Dec 1, 2022

UPDATE: fastmath does not seem to be the issue after further investigation. See update in comments below.

It seems that fastmath in numba is doing something weird in generate_eulerian_to_lagrangian_grid_interpolation_kernel_2d. Removing fastmath has passed the CI consistently so far. This could again be system dependent, since both with and without fastmath work fine on my MBA 2022 (ARM M2) laptop; I will test this again when I get my hands on my MBP 2018 (Intel). In any case, removing fastmath does not seem to change the runtime much. I've measured some preliminary timings with 4 MPI processes; below are the comparisons averaged over 1000 kernel calls for a Lagrangian grid of size (3, 1000) and an Eulerian grid of size (2048, 2048).

(1) With fastmath

Scalar eul-to-lag grid interp took 2.7408207999542356e-05 s/call
Vector eul-to-lag grid interp took 5.4061957984231415e-05 s/call

(2) Without fastmath

Scalar eul-to-lag grid interp took 2.9133250005543232e-05 s/call
Vector eul-to-lag grid interp took 5.913654190953821e-05 s/call

I will perform a few more tests before proceeding with removing the fastmath option from numba.
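
For context, here is a minimal sketch of how the fastmath flag enters a numba-jitted interpolation-style kernel; the generator name and kernel body below are illustrative stand-ins, not the actual sopht-mpi implementation:

```python
import numba as nb


def generate_eul_to_lag_interp_kernel(enable_fastmath=False):
    # fastmath relaxes strict IEEE-754 semantics (e.g. it allows operation
    # reordering), which can change floating point results slightly and may
    # behave differently across systems/compilers.
    @nb.njit(fastmath=enable_fastmath)
    def eul_to_lag_interp(lag_field, eul_field, weights, start_idx):
        # Illustrative gather: each lagrangian node accumulates weighted
        # eulerian values from its local support.
        for i in range(lag_field.shape[0]):
            acc = 0.0
            for j in range(weights.shape[1]):
                acc += weights[i, j] * eul_field[start_idx[i] + j]
            lag_field[i] = acc

    return eul_to_lag_interp
```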

fankiat (Owner, Author) commented on Dec 1, 2022

UPDATE: fastmath does not seem to be the issue after further investigation. See update in comments below.

I did a quick and dirty profiling of the kernels in EulerianLagrangianGridCommunicator with and without fastmath, for the same settings as above (mpirun -np 4, Eulerian grid size (2048, 2048), Lagrangian grid size (3, 1000)), with timings averaged over 1000 kernel calls. The results below suggest little benefit from fastmath. Therefore, for the time being, until a proper profiling tool is available and an extensive performance evaluation is done, I am removing the fastmath option from all the numba kernels.

With fastmath

local_eulerian_grid_support_of_lagrangian_grid_kernel took 8.015662501566112e-05 s/call
eulerian_to_lagrangian_grid_interpolation_kernel (scalar) took 2.886795892845839e-05 s/call
eulerian_to_lagrangian_grid_interpolation_kernel (vector) took 6.536329200025648e-05 s/call
lagrangian_to_eulerian_grid_interpolation_kernel (scalar) took 0.0002470621250104159 s/call
lagrangian_to_eulerian_grid_interpolation_kernel (vector) took 0.00039066899998579176 s/call
interpolation_weights_kernel took 0.00016611195809673518 s/call

Without fastmath

local_eulerian_grid_support_of_lagrangian_grid_kernel took 7.912083296105266e-05 s/call
eulerian_to_lagrangian_grid_interpolation_kernel (scalar) took 3.2219332992099224e-05 s/call
eulerian_to_lagrangian_grid_interpolation_kernel (vector) took 5.795166699681431e-05 s/call
lagrangian_to_eulerian_grid_interpolation_kernel (scalar) took 0.0001622442079242319 s/call
lagrangian_to_eulerian_grid_interpolation_kernel (vector) took 0.00030771862505935135 s/call
interpolation_weights_kernel took 0.00016169204202014952 s/call
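
For reference, the per-call numbers above come from a simple averaging loop of roughly this shape (a sketch only; time_kernel and its arguments are illustrative and not part of the repository):

```python
import time

from mpi4py import MPI


def time_kernel(name, kernel, args, num_calls=1000):
    comm = MPI.COMM_WORLD
    # Warm-up call so numba JIT compilation time is excluded from the timing.
    kernel(*args)
    comm.Barrier()
    start = time.perf_counter()
    for _ in range(num_calls):
        kernel(*args)
    elapsed_per_call = (time.perf_counter() - start) / num_calls
    # Report the slowest rank's average as the effective per-call cost.
    max_elapsed = comm.reduce(elapsed_per_call, op=MPI.MAX, root=0)
    if comm.rank == 0:
        print(f"{name} took {max_elapsed} s/call")
```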

bhosale2 (Collaborator) commented on Dec 2, 2022

@fankiat a minor comment here: once you figure out the settings that work, I would recommend one of two options:

  1. Rebase the commit history to that final commit, to keep the commit history clean.
  2. Make a new PR to resolve this, with that commit as the only commit. This PR can then be closed without merging and kept in the log for future reference.

fankiat (Owner, Author) commented on Dec 2, 2022

@bhosale2 I'll probably squash the commits before merging once it's done. What do you think?

bhosale2 (Collaborator) commented on Dec 2, 2022

@fankiat that works as well.

@fankiat added the prio:high (High priority) label and removed the prio:medium (Medium priority) label on Dec 3, 2022
fankiat (Owner, Author) commented on Dec 5, 2022

@bhosale2 some updates here. As I work on this, I am starting to think this is potentially a combined issue of numba (likely caching) and possibly flaky tests (which test suites running in parallel are typically more susceptible to). To mitigate both potential sources of the issue, I have done the following, and it seems to have resolved the problem (GitHub actions passed ~10 times without any issue, see 4bb81eb).

For mitigating possible issues from numba

  • Remove cache=True for all numba kernels. The rationale is that sopht-mpi is meant to be executed with a larger number of processes, and it is probably better for all processes to spend some time compiling their own numba kernels than to load them from disk (see the issue mentioned by one of the main contributors of numba). This makes sense especially given that we don't know whether numba's cache read/write is tested to be robust at a decently large scale (again mentioned here, by the same contributor).
  • Rearrange the order of compiling eul-lag interpolation kernels where the kernel can be either scalar or vector. The updated EulerianLagrangianGridCommunicator2D.py ensures that only the required kernel is compiled and returned, i.e. kernel generation moves into the if-else branch that checks n_components, instead of generating both scalar and vector versions and returning only the requested one (a sketch follows below).
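
A minimal sketch of the second point, with illustrative kernel bodies (the actual generators in EulerianLagrangianGridCommunicator2D.py are more involved): kernel generation sits inside the n_components check, so only the requested variant is compiled, and neither cache=True nor fastmath is passed to njit.

```python
import numba as nb


def generate_eul_to_lag_interp_kernel_2d(n_components=1):
    if n_components == 1:
        # Scalar variant: compiled only when a scalar kernel is requested.
        @nb.njit
        def scalar_kernel(lag_field, eul_field, weights, start_idx):
            for i in range(lag_field.shape[0]):
                acc = 0.0
                for j in range(weights.shape[1]):
                    acc += weights[i, j] * eul_field[start_idx[i] + j]
                lag_field[i] = acc

        return scalar_kernel

    # Vector variant: compiled only when n_components > 1.
    @nb.njit
    def vector_kernel(lag_field, eul_field, weights, start_idx):
        for c in range(lag_field.shape[0]):
            for i in range(lag_field.shape[1]):
                acc = 0.0
                for j in range(weights.shape[1]):
                    acc += weights[i, j] * eul_field[c, start_idx[i] + j]
                lag_field[c, i] = acc

    return vector_kernel
```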

For mitigating possible issues coming from pytest

  • Introduce pytest cache clearing, as recommended for invocations from Continuous Integration servers where isolation and correctness are more important than speed.
  • Update the test approach in the immersed boundary test cases by first reducing the results from all ranks before asserting correctness. This way, all ranks fail a test if any single rank performing that test fails. The goal is to open up mitigation options such as rerunning failed (flaky) tests, as suggested here, without running into notorious deadlock situations (see the sketch after this list).
  • Allow for reruns via pytest-rerunfailures when flaky test results occur.
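
A sketch of the reduce-before-assert pattern from the second bullet (the test body and field values are illustrative placeholders, not the actual immersed boundary tests): every rank computes a local pass/fail flag, the flags are combined with a logical AND across ranks, and all ranks assert on the combined result, so a failure on any rank fails the test everywhere instead of leaving ranks stuck in mismatched collective calls.

```python
import numpy as np
import pytest
from mpi4py import MPI


@pytest.mark.flaky(reruns=2)  # rerun on failure via pytest-rerunfailures
def test_interpolation_matches_reference_on_all_ranks():
    comm = MPI.COMM_WORLD
    # Placeholder stand-ins for the rank-local interpolated field and the
    # reference solution drawn from sopht.
    local_result = np.full(8, comm.rank, dtype=np.float64)
    local_reference = np.full(8, comm.rank, dtype=np.float64)
    # Each rank computes its own pass/fail flag ...
    local_ok = bool(np.allclose(local_result, local_reference))
    # ... and the flags are reduced with a logical AND, so a failure on any
    # single rank fails the test on every rank.
    all_ok = comm.allreduce(local_ok, op=MPI.LAND)
    assert all_ok
```

On the CI side, the cache clearing corresponds to pytest's --cache-clear flag, and reruns can also be requested globally with the --reruns option from pytest-rerunfailures instead of per-test markers.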

@fankiat changed the title from "[WIP] fix GitHub action failure bug" to "Fix GitHub action failure bug" on Dec 5, 2022
@fankiat requested a review from bhosale2 on December 5, 2022 at 23:31
fankiat (Owner, Author) commented on Dec 6, 2022

Some additional notes on this: there are two errors that typically cause the CI to fail.

  1. Broadcast error, where numba kernels encounter inconsistent array shapes. I believe this error comes from numba caching when multiple processes are involved. The changes to the numba kernels in this PR should resolve it.
  2. Value error, where the interpolated values on the Lagrangian grid sometimes have one element that is slightly different from the solution we draw from sopht. This issue is a little different; I am starting to think it comes from the test case being set up assuming zero rounding error. I will probably look into this in a separate issue. In any case, the pytest-side changes in this PR nonetheless provide an additional safeguard against pytest caching issues and flaky tests.

Please let me know if you want to discuss this further to clarify things. Thanks @bhosale2 !

@fankiat merged commit 68f30cf into main on Dec 6, 2022
@fankiat deleted the 101_fix_github_action_failure_bug branch on December 6, 2022 at 04:25
Labels: bug (Something isn't working), prio:high (High priority)
Linked issue that merging may close: Random occasional failure in CI