Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure build fails when test command fails #175

Merged
merged 7 commits into from
Nov 9, 2019

Conversation

GjjvdBurg
Copy link
Contributor

First of all, thank you for creating this project!

I'm running the latest version (master) to test my package in a virtualenv and noticed that on Linux the return code of the test isn't used. This means that the build continues even though the test command fails. This PR fixes this by keeping the return code of the test command and making that the return code of the bash script.

Note that this is not needed on mac or windows because they use subprocess.check_call.

@joerick
Copy link
Contributor

joerick commented Oct 19, 2019

Hi! Thanks for the contribution. I'm a little confused why it's needed though, since our bash script uses set -o errexit. I was wondering if that wasn't passing to the subshell, so I tried this test:

joerick@joerick2 ~/Desktop> cat testsh.bash                                   1
#!/bin/bash

set -o errexit
set -o xtrace

echo "Starting"
true

(
  echo subshell 1
  (
    echo subshell 2
    false
    true
  )
  true
)

echo Complete
joerick@joerick2 ~/Desktop> bash testsh.bash 
+ echo Starting
Starting
+ true
+ echo subshell 1
subshell 1
+ echo subshell 2
subshell 2
+ false

In this test, the first failing command false stops the execution and exits the script with an error. Did you have an example (perhaps a build log) of that not happening?

@GjjvdBurg
Copy link
Contributor Author

GjjvdBurg commented Oct 19, 2019

Thanks for checking this out! You're right that your test scripts exits properly, so I'm now also confused why this didn't happen in my tests.

Here is an example of a failing build: https://travis-ci.org/alan-turing-institute/CleverCSV/jobs/599248169 (around line 670).

The mv "$delocated_wheel /output command still runs even though the test failed. The error itself is not really relevant, it can be fixed by removing the *.so files before running cibuildwheel.

Edit: And here is a build using the proposed fix that errors out at the point of the failing test: https://travis-ci.org/alan-turing-institute/CleverCSV/jobs/599577038 (line 670)

@GjjvdBurg
Copy link
Contributor Author

Here is a build that prints out set -o when running the test command: https://travis-ci.org/GjjvdBurg/TestingTravis/builds/600010616 Line 309 shows that errexit is off in the subshell.

@YannickJadoul
Copy link
Member

Hmmm, so, can we then just add set -o errexit (and maybe set -o xtrace as well) to the subshell?

@GjjvdBurg
Copy link
Contributor Author

Yeah, that would be much cleaner than my solution. I'll make that change and update this PR when I have it.

@joerick joerick mentioned this pull request Oct 20, 2019
Some trial and error to make sure it works reliably.

* remove test_retcode
* debug local bash settings
* set errexit explicitly in subshell
* set errexit also in the subshell in parenthesis
* debugging
* more debugging (exit code of sh)
* is the venv subshell returning exit 1?
* what if we remove the additional subshell?
* then we need to unquote the command
* if statement needs test command quoted
* simple if statement
* echo command needs to be removed of course
* remove debug statements
* check if setting errexit here is needed
* these settings can be removed
  At this point ("before") the shell already has errexit set.
* setting errexit in the sh invocation is needed
  This is not inherited, so it's good to specify it.
* just double checking that the subshell doesn't exit
* use solution from fvue.nl
* Solution doesn't seem to work, even popd is run
  So now I'm putting back the if statement, it seems
  to be the simplest and most effective solution.
* add a comment
* prove that errexit is necessary
* errexit is indeed needed for chaining
@GjjvdBurg
Copy link
Contributor Author

The new commit simplifies the test and seems to work reliably.

Bash 4.4 adds the inherit_errexit shell option, which would be good to use. Unfortunately, the manylinux docker image comes with bash version 3.2.25. In that version the script given above by @joerick does not work as expected. It seems that errexit is a bit of an issue in bash. Here is a more detailed overview of all the caveats.

I found that even when I place set -o errexit in every subshell, the top-level bash script doesn't exit if the subshell for the virtualenv returns a nonzero exit code. The simplest and most reliable solution was to check the exit code of that subshell explicitly using an if statement. Also, the sh -c {testcommand} has been changed to add -o errexit in case the user chains test commands using ; instead of &&.

@YannickJadoul
Copy link
Member

YannickJadoul commented Oct 21, 2019

Great, thanks!

Are we now just focusing on the CIBW_TEST_COMMAND, though? What happens if pip install "$delocated_wheel"{test_extras} would fail?
Is there no way of putting set -o errexit and set -o xtrace before source "$venv_dir/bin/activate, so we don't have to check if [ $? -ne 0 ] afterwards? (I'm not familiar enough with bash to know if this is possible, but it would be great if the error would "cascade" down the subshells.)

@GjjvdBurg
Copy link
Contributor Author

Good question! Any failure in the subshell for the virtualenv will make the returncode of that subshell nonzero. The problem is only that that does not cause the toplevel shell to register an error. Here is an example where the pip install "$delocated_wheel"{test_extras} command fails due to a non-existing package: https://travis-ci.org/GjjvdBurg/TestingTravis/builds/600808779 As you can see the subshell exits at that point, but the if statement is reached (line 305) indicating that the top-level script didn't exit right away.

The links in my previous comment give some more details about the problems of errexit and subshells. Here's a demonstration that prints set -o after activating the virtualenv and in the test command, which shows that errexit is on in both, but despite that the if statement is reached (which it shouldn't, if errexit would be sufficient): https://travis-ci.org/GjjvdBurg/TestingTravis/builds/600813804

I agree that the if statement isn't a particularly clean solution, but it was the most robust approach I could find (with some trial and error).

@YannickJadoul
Copy link
Member

Ha, that's funny. But apparently this whole error-handling is quite fragile in bash, so fair enough. Then you have indeed found a very clean solution!

Thanks a lot for your investigation and extensive explanation! Always good to have this for future reference, if we forget the reasons for this change! :)

@joerick, I thought I've heard you mention long time ago you'd like to replace the bash script with proper Python? If we ever get to doing it, this can be added to the list of reasons why it's a maybe a good idea ;)

Copy link
Contributor

@joerick joerick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @GjjvdBurg for catching this!

oh boy, how many bash scripts with errexit have I written that might have bugs in them! Okay, so I think I understand - the erroring subshell isn't being detected by the parent shell.

But, I am still confused by why my test PR #177 is passing. That test ensures that CIBW_TEST_COMMAND=false causes the build to fail... and it passes. Any idea why? Could we get a failing test that this PR fixes?

cibuildwheel/linux.py Outdated Show resolved Hide resolved
@YannickJadoul
Copy link
Member

YannickJadoul commented Oct 22, 2019

That test ensures that CIBW_TEST_COMMAND=false causes the build to fail... and it passes.

You're not alone to be confused, @joerick. I just tried this, thinking I knew it was going to fail (well, so, fail in this case meaning "not detecting the failure"), but ... nope

set -o errexit
set -o xtrace

echo A
(
        echo B
        sh -c "false"
        echo C
)
echo D
$ bash test.sh; echo $?
+ echo A
A
+ echo B
B
+ sh -c false
1

@GjjvdBurg
Copy link
Contributor Author

Thanks for your comments! I'll add a failing test that this PR fixes and get back to you soon.

@joerick
Copy link
Contributor

joerick commented Nov 4, 2019

Hi @GjjvdBurg, just checking in, any progress on this?

@GjjvdBurg
Copy link
Contributor Author

Apologies for the delay, I'll try to figure it out this week.

@GjjvdBurg
Copy link
Contributor Author

GjjvdBurg commented Nov 7, 2019

So, essentially, I can no longer reproduce my original problem with the latest version of cibuildwheel. It seems that upgrading the linux images to manylinux2010 has solved it, probably because manylinux2010 comes with an upgraded version of bash (3.2.25 -> 4.1.2).

Here is a build for a test package with the version of cibuildwheel before #155 was merged:
https://travis-ci.org/GjjvdBurg/TestingTravis/builds/608870166#L184 (see line 386, where failure is not caught).

And here is one with the version after manylinux was upgraded: https://travis-ci.org/GjjvdBurg/TestingTravis/builds/608871534#L184 (which correctly fails).

As further proof that the manylinux change solved it, here is a version of cibuildwheel branched off before #155 was merged and with the test from #177 added (in a temporary copy of the repo). With manylinux1 the failing test did indeed fail.

So there's no need to merge this PR anymore! The problem has gone away in another way and the test added in #177 ensures it can be detected in the future. Thanks for your time on this @joerick and @YannickJadoul!

@GjjvdBurg GjjvdBurg closed this Nov 7, 2019
@YannickJadoul
Copy link
Member

Huh, that's really weird! Thanks for further checking that.
We're still allowing manylinux1, so should we still try to merge this, though? I guess it can't hurt either, and it might solve discrepancies on sóme docker images?

@joerick Yet another reason to take some time for #99? Maybe after the next release, there's a bit more time to do these kinds of refactoring.

@GjjvdBurg
Copy link
Contributor Author

Yes, with manylinux1 this fix would still be needed and it can't hurt for the manylinux2010 images.

I've updated my branch with the changes requested above and have clarified the comment. I'll reopen the PR so you can decide on whether to merge or not.

@joerick
Copy link
Contributor

joerick commented Nov 9, 2019

Yes, with manylinux1 this fix would still be needed

We have a winner! I just ran a test with manylinux1 and indeed it's erroneously passing.

    def test_failing_test():
        '''Ensure a failing test causes cibuildwheel to error out and exit'''
        project_dir = os.path.dirname(__file__)
    
        with pytest.raises(subprocess.CalledProcessError):
>           utils.cibuildwheel_run(project_dir, add_env={
                'CIBW_TEST_COMMAND': 'false',
                # manylinux1 has a version of bash that's been shown to have
                # problems with this, so let's check that.
                'CIBW_MANYLINUX_I686_IMAGE': 'manylinux1',
                'CIBW_MANYLINUX_X86_64_IMAGE': 'manylinux1',
            })
E           Failed: DID NOT RAISE <class 'subprocess.CalledProcessError'>

Thank you all for raising this! Very happy we spotted this before a release.

@joerick joerick merged commit 42d1610 into pypa:master Nov 9, 2019
@YannickJadoul
Copy link
Member

YannickJadoul commented Nov 9, 2019

Thanks, @GjjvdBurg, for all the work and for holding on during the discussion!

@GjjvdBurg
Copy link
Contributor Author

Glad I could help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants