Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emscripten test worker sometimes fails witth "BlockingIOError: write could not complete without blocking" #94026

Closed
tiran opened this issue Jun 20, 2022 · 4 comments
Labels
3.11 only security fixes 3.12 bugs and security fixes tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@tiran
Copy link
Member

tiran commented Jun 20, 2022

Bug report

Sometimes test worker processes on wasm32-emscripten build bot fails with BlockingIOError: write could not complete without blocking. Example https://buildbot.python.org/all/#/builders/1044/builds/27/steps/10/logs/stdio

0:37:30 load avg: 4.42 [422/436/3] test_pathlib crashed (Exit code 1)
... 
result=Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/buildbot/bcannon-wasm/3.x.bcannon-wasm.emscripten-node/build/Lib/test/regrtest.py", line 47, in <module>
    _main()
    ^^^^^^^
  File "/opt/buildbot/bcannon-wasm/3.x.bcannon-wasm.emscripten-node/build/Lib/test/regrtest.py", line 43, in _main
    main()
    ^^^^^^
  File "/opt/buildbot/bcannon-wasm/3.x.bcannon-wasm.emscripten-node/build/Lib/test/libregrtest/main.py", line 754, in main
    Regrtest().main(tests=tests, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/buildbot/bcannon-wasm/3.x.bcannon-wasm.emscripten-node/build/Lib/test/libregrtest/main.py", line 692, in main
    self._main(tests, kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/buildbot/bcannon-wasm/3.x.bcannon-wasm.emscripten-node/build/Lib/test/libregrtest/main.py", line 713, in _main
    run_tests_worker(self.ns, self.worker_test_name)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/buildbot/bcannon-wasm/3.x.bcannon-wasm.emscripten-node/build/Lib/test/libregrtest/runtest_mp.py", line 102, in run_tests_worker
    print(json.dumps(result, cls=EncodeTestResult), flush=True)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
BlockingIOError: [Errno 6] write could not complete without blocking

The error seems to occur at the end of a test job when the worker process returns its status to the control process. The control process runs CPython natively (x86_64-pc-linux-gnu in case of the build bot). The worker processes is wasm32-emscripten CPython with NodeJS as runner. I guess that every now and then, the test worker process produces output faster than the control process can consume from the pipe that connects stdout of the worker and read pipe of the control process. The runtime does not like blocking IO and returns EAGAIN.

>>> import sys, errno
>>> sys.platform
'emscripten'
>>> errno.errorcode[6]
'EAGAIN'
@tiran tiran added type-bug An unexpected behavior, bug, or error tests Tests in the Lib/test dir 3.11 only security fixes 3.12 bugs and security fixes labels Jun 20, 2022
@tiran
Copy link
Member Author

tiran commented Jun 20, 2022

The fcntl module is currently not available on Emscripten. I hacked the module manually and was able to get flags from the stdout fd. Emscripten does not report that the fd is set to non-blocking mode.

>>> import os, fcntl
>>> fcntl.fcntl(1, fcntl.F_GETFL)
577
>>> fcntl.fcntl(1, fcntl.F_GETFL) & os.O_NONBLOCK
0

@tiran
Copy link
Member Author

tiran commented Jun 21, 2022

I applied a hack on the buildbot worker host. Buildbot now starts NodeJS with nice -n5. That seems to be good enough to make the tests pass.

tiran added a commit to tiran/cpython that referenced this issue Jun 23, 2022
…ipten

`runtest_mp` test worker now handles :exc:`BlockingIOError` to work around
non-blocking pipes on Emscripten.
@serhiy-storchaka
Copy link
Member

Why is it set to non-blocking mode? Can it be set to blocking mode? Can it be replaced with an fd set to blocking mode? Can TextIOWrapper and BufferedWriter be made to handle this internally? Can we use an alternate channel for communication between processes?

@tiran
Copy link
Member Author

tiran commented Jun 23, 2022

It is set to non-blocking mode because different rules apply to browser-like environments. It is not possible to change the property. In fact there is no way to even detect that the pipe fd is in non-blocking mode. I can only deduce that it is in non-blocking mode because tests sometimes fail when the system is under high load and the other side cannot read from the pipe fast enough.

Emscripten tries hard to mimic a POSIX-like OS on top of a JavaScript and WebAssembly engine. Ultimately it is bound by restriction of the runtime environment.

To handle the special case in Python's io module we would first have to detect the special case. That is not possible on Emscripten yet. The fcntl module is not available because I disabled ioctl as it leads to crashes. The fcntl(F_GETFD) syscall does not return O_NONBLOCK and fstat does not work on pipes and standard streams.

We could only use a temporary file for communication. Socket's don't work as expected. IPC and socketpair are not available.

tiran added a commit to tiran/cpython that referenced this issue Jun 29, 2022
tiran added a commit that referenced this issue Jun 29, 2022
tiran added a commit to tiran/cpython that referenced this issue Jun 29, 2022
pythonGH-94253)

Co-authored-by: Victor Stinner <[email protected]>.
(cherry picked from commit 199ba23)

Co-authored-by: Christian Heimes <[email protected]>
@tiran tiran closed this as completed Jun 29, 2022
tiran added a commit that referenced this issue Jun 29, 2022
gvanrossum pushed a commit to gvanrossum/cpython that referenced this issue Jun 30, 2022
vstinner added a commit to vstinner/cpython that referenced this issue Sep 2, 2023
vstinner pushed a commit to vstinner/cpython that referenced this issue Sep 2, 2023
vstinner added a commit that referenced this issue Sep 3, 2023
…108820)

* Revert "[3.11] gh-101634: regrtest reports decoding error as failed test (#106169) (#106175)"

This reverts commit d5418e9.

* Revert "[3.11] bpo-46523: fix tests rerun when `setUp[Class|Module]` fails (GH-30895) (GH-103342)"

This reverts commit ecb09a8.

* Revert "gh-95027: Fix regrtest stdout encoding on Windows (GH-98492)"

This reverts commit b2aa28e.

* Revert "[3.11] gh-94026: Buffer regrtest worker stdout in temporary file (GH-94253) (GH-94408)"

This reverts commit 0122ab2.

* Revert "Run Tools/scripts/reindent.py (GH-94225)"

This reverts commit f0f3a42.

* Revert "gh-94052: Don't re-run failed tests with --python option (GH-94054)"

This reverts commit 1347607.

* Revert "[3.11] gh-84461: Fix Emscripten umask and permission issues (GH-94002) (GH-94006)"

This reverts commit 1073184.

* gh-93353: regrtest checks for leaked temporary files (#93776)

When running tests with -jN, create a temporary directory per process
and mark a test as "environment changed" if a test leaks a temporary
file or directory.

(cherry picked from commit e566ce5)

* gh-93353: Fix regrtest for -jN with N >= 2 (GH-93813)

(cherry picked from commit 36934a1)

* gh-93353: regrtest supports checking tmp files with -j2 (#93909)

regrtest now also implements checking for leaked temporary files and
directories when using -jN for N >= 2. Use tempfile.mkdtemp() to
create the temporary directory. Skip this check on WASI.

(cherry picked from commit 4f85cec)

* gh-84461: Fix Emscripten umask and permission issues (GH-94002)

- Emscripten's default umask is too strict, see
  emscripten-core/emscripten#17269
- getuid/getgid and geteuid/getegid are stubs that always return 0
  (root). Disable effective uid/gid syscalls and fix tests that use
  chmod() current user.
- Cannot drop X bit from directory.

(cherry picked from commit 2702e40)

* gh-94052: Don't re-run failed tests with --python option (#94054)

(cherry picked from commit 0ff7b99)

* Run Tools/scripts/reindent.py (#94225)

Reindent files which were not properly formatted (PEP 8: 4 spaces).

Remove also some trailing spaces.

(cherry picked from commit e87ada4)

* gh-94026: Buffer regrtest worker stdout in temporary file (GH-94253)

Co-authored-by: Victor Stinner <[email protected]>
(cherry picked from commit 199ba23)

* gh-96465: Clear fractions hash lru_cache under refleak testing (GH-96689)

Automerge-Triggered-By: GH:zware
(cherry picked from commit 9c8f379)

* gh-95027: Fix regrtest stdout encoding on Windows (#98492)

On Windows, when the Python test suite is run with the -jN option,
the ANSI code page is now used as the encoding for the stdout
temporary file, rather than using UTF-8 which can lead to decoding
errors.

(cherry picked from commit ec1f6f5)

* gh-98903: Test suite fails with exit code 4 if no tests ran (#98904)

The Python test suite now fails wit exit code 4 if no tests ran. It
should help detecting typos in test names and test methods.

* Add "EXITCODE_" constants to Lib/test/libregrtest/main.py.
* Fix a typo: "NO TEST RUN" becomes "NO TESTS RAN"

(cherry picked from commit c76db37)

* gh-100086: Add build info to test.libregrtest (#100093)

The Python test runner (libregrtest) now logs Python build information like
"debug" vs "release" build, or LTO and PGO optimizations.

(cherry picked from commit 3c89202)

* bpo-46523: fix tests rerun when `setUp[Class|Module]` fails (#30895)

Co-authored-by: Jelle Zijlstra <[email protected]>
Co-authored-by: Łukasz Langa <[email protected]>
(cherry picked from commit 9953860)

* gh-82054: allow test runner to split test_asyncio to execute in parallel by sharding. (#103927)

This runs test_asyncio sub-tests in parallel using sharding from Cinder. This suite is typically the longest-pole in runs because it is a test package with a lot of further sub-tests otherwise run serially. By breaking out the sub-tests as independent modules we can run a lot more in parallel.

After porting we can see the direct impact on a multicore system.

Without this change:
  Running make test is 5 min 26 seconds
With this change:
  Running make test takes 3 min 39 seconds

That'll vary based on system and parallelism. On a `-j 4` run similar to what CI and buildbot systems often do, it reduced the overall test suite completion latency by 10%.

The drawbacks are that this implementation is hacky and due to the sorting of the tests it obscures when the asyncio tests occur and involves changing CPython test infrastructure but, the wall time saved it is worth it, especially in low-core count CI runs as it pulls a long tail. The win for productivity and reserved CI resource usage is significant.

Future tests that deserve to be refactored into split up suites to benefit from are test_concurrent_futures and the way the _test_multiprocessing suite gets run for all start methods. As exposed by passing the -o flag to python -m test to get a list of the 10 longest running tests.

---------

Co-authored-by: Carl Meyer <[email protected]>
Co-authored-by: Gregory P. Smith <[email protected]> [Google, LLC]
(cherry picked from commit 9e011e7)

* Display the sanitizer config in the regrtest header. (#105301)

Display the sanitizers present in libregrtest.

Having this in the CI output for tests with the relevant environment
variable displayed will help make it easier to do what we need to
create an equivalent local test run.

(cherry picked from commit 852348a)

* gh-101634: regrtest reports decoding error as failed test (#106169)

When running the Python test suite with -jN option, if a worker stdout
cannot be decoded from the locale encoding report a failed testn so the
exitcode is non-zero.

(cherry picked from commit 2ac3eec)

* gh-108223: test.pythoninfo and libregrtest log Py_NOGIL (#108238)

Enable with --disable-gil --without-pydebug:

    $ make pythoninfo|grep NOGIL
    sysconfig[Py_NOGIL]: 1

    $ ./python -m test
    ...
    == Python build: nogil debug
    ...

(cherry picked from commit 5afe0c1)

* gh-90791: test.pythoninfo logs ASAN_OPTIONS env var (#108289)

* Cleanup libregrtest code logging ASAN_OPTIONS.
* Fix a typo on "ASAN_OPTIONS" vs "MSAN_OPTIONS".

(cherry picked from commit 3a1ac87)

* gh-108388: regrtest splits test_asyncio package (#108393)

Currently, test_asyncio package is only splitted into sub-tests when
using command "./python -m test". With this change, it's also
splitted when passing it on the command line:
"./python -m test test_asyncio".

Remove the concept of "STDTESTS". Python is now mature enough to not
have to bother with that anymore. Removing STDTESTS simplify the
code.

(cherry picked from commit 174e9da)

* regrtest computes statistics (#108793)

test_netrc, test_pep646_syntax and test_xml_etree now return results
in the test_main() function.

Changes:

* Rewrite TestResult as a dataclass with a new State class.
* Add test.support.TestStats class and Regrtest.stats_dict attribute.
* libregrtest.runtest functions now modify a TestResult instance
  in-place.
* libregrtest summary lists the number of run tests and skipped
  tests, and denied resources.
* Add TestResult.has_meaningful_duration() method.
* Compute TestResult duration in the upper function.
* Use time.perf_counter() instead of time.monotonic().
* Regrtest: rename 'resource_denieds' attribute to 'resource_denied'.
* Rename CHILD_ERROR to MULTIPROCESSING_ERROR.
* Use match/case syntadx to have different code depending on the
  test state.

Co-authored-by: Alex Waygood <[email protected]>
(cherry picked from commit d4e534c)

* gh-108822: Add Changelog entry for regrtest statistics (#108821)

---------

Co-authored-by: Christian Heimes <[email protected]>
Co-authored-by: Zachary Ware <[email protected]>
Co-authored-by: Nikita Sobolev <[email protected]>
Co-authored-by: Joshua Herman <[email protected]>
Co-authored-by: Gregory P. Smith <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes 3.12 bugs and security fixes tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

2 participants