Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky CI tests with Scipy 1.11.0 on Windows Python 3.11 #10345

Open
jakelishman opened this issue Jun 26, 2023 · 7 comments
Open

Flaky CI tests with Scipy 1.11.0 on Windows Python 3.11 #10345

jakelishman opened this issue Jun 26, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@jakelishman
Copy link
Member

Environment

  • Qiskit Terra version: main
  • Python version: Python 3.11
  • Operating system: Windows

What is happening?

CI is flaky on Windows / Python 3.11 since the release of Scipy 1.11.

How can we reproduce the issue?

Example CI traceback:

test.python.providers.test_backend_v2.TestBackendV2.test_5q_ghz_22_ecr_level_3_bidirectional_False
--------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):

      File "D:\a\1\s\test-job\Lib\site-packages\ddt.py", line 220, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\test\python\providers\test_backend_v2.py", line 113, in test_5q_ghz
    tqc = transpile(qc, backend, optimization_level=opt_level)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\compiler\transpiler.py", line 375, in transpile
    _serial_transpile_circuit(

      File "D:\a\1\s\qiskit\compiler\transpiler.py", line 457, in _serial_transpile_circuit
    result = pass_manager.run(circuit, callback=callback, output_name=output_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passmanager.py", line 428, in run
    return super().run(circuits, output_name, callback)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passmanager.py", line 447, in wrapper
    return meth(*meth_args, **meth_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passmanager.py", line 154, in run
    return super().run(
           ^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\passmanager\passmanager.py", line 225, in run
    out_program = self._run_single_circuit(in_programs[0], callback, **metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passmanager.py", line 173, in _run_single_circuit
    out_program = pass_runner.run(input_program, callback=callback, **metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\runningpassmanager.py", line 226, in wrapper
    return meth(*meth_args, **meth_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\runningpassmanager.py", line 148, in run
    return super().run(
           ^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\passmanager\passrunner.py", line 235, in run
    passmanager_ir = self._run_pass_generic(
                     ^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\passmanager\passrunner.py", line 190, in _run_pass_generic
    passmanager_ir = self._run_pass_generic(
                     ^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\passmanager\passrunner.py", line 151, in _run_pass_generic
    passmanager_ir = self._run_base_pass(
                     ^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\runningpassmanager.py", line 177, in _run_base_pass
    new_dag = pass_.run(passmanager_ir)
              ^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\transpiler\passes\optimization\consolidate_blocks.py", line 123, in run
    or self.decomposer.num_basis_gates(unitary) < basis_count
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\quantum_info\synthesis\two_qubit_decompose.py", line 1414, in num_basis_gates
    a, b, c = weyl_coordinates(unitary)[:]
              ^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\qiskit\quantum_info\synthesis\weyl.py", line 65, in weyl_coordinates
    D = la.eigvals(Up.T @ Up)
        ^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\test-job\Lib\site-packages\scipy\linalg\_decomp.py", line 895, in eigvals
    return eig(a, b=b, left=0, right=0, overwrite_a=overwrite_a,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

      File "D:\a\1\s\test-job\Lib\site-packages\scipy\linalg\_decomp.py", line 249, in eig
    _check_info(info, 'eig algorithm (geev)',

      File "D:\a\1\s\test-job\Lib\site-packages\scipy\linalg\_decomp.py", line 1371, in _check_info
    raise LinAlgError(("%s " + positive) % (driver, info,))

    numpy.linalg.LinAlgError: eig algorithm (geev) did not converge (only eigenvalues with order >= 2 have converged)

What should happen?

Reliable CI.

Any suggestions?

I suspect / hope that this is just something like Scipy's internal ARPACK build being done with a new compiler that's tweaked things a little bit. Eigensystem routines are never 100% reliable, and we call them a lot, so it's pretty likely that it just so happens that one matrix in our test suite happens to be a bit unreliable.

In the longer term, we can potentially attempt to stabilise the routines by the insertion of some very slight amounts of noise if the initial decomposition fails, but we'd first need to work out if the Scipy 1.11 is actually significantly less stable than what we've already got; it could just be an unlucky isolated failure.

@jakelishman jakelishman added the bug Something isn't working label Jun 26, 2023
@levbishop
Copy link
Member

Are all the failures in weyl_coordinates?
If you can pull out a few failing examples I can try to dive in

@jakelishman
Copy link
Member Author

Yeah, the traceback I linked is the only failing example I've seen, and it only occurs on Windows.

@levbishop
Copy link
Member

levbishop commented Jun 26, 2023

in particular if it's only weyl_coordinates that fails we could go back to the original algorithm from TwoQubitWeylDecomposition.__new__() that uses eigh and should therefore be much more stable, if possibly a little slower

Or you could dust off your slower but non-randomized algorithm from a couple years ago :-) As I recall it was mainly slow because it was doing a lot of python-level array indexing and manipulation and therefore slow because less numpy-accelerated. I bet a rust implementation of that algorithm would be faster and more robust than both TwoQubitWeylDecomposition randomized hermitian eval/evec solver and the weyl_coordinates non-hermitian eval-only algorithm. :-)

@jakelishman
Copy link
Member Author

Thanks Lev. Yeah, that's a good point - we could definitely just do the eigensystem bits using the Scipy dynamically linked BLAS, then pass the array views down to Rust to sort out the orthonormalisation step. I think for now I'll pin scipy to get CI rolling again, and then I'll try and figure things out better when I've got a bit more of a moment - I've got a bunch of dynamics-circuits-related feature stuff that needs to take priority first.

@raynelfss
Copy link
Contributor

I was able to replicate this issue on my Windows machine by running transpile on a very big circuit (1024 qubits, 128 depth) with optimization level 3. I kept running into the same error occasionally on the following versions of scipy: 1.11.0, 1.10.1, 1.10.0, 1.9.3, and 1.9.2.

 numpy.linalg.LinAlgError: eig algorithm (geev) did not converge (only eigenvalues with order >= 2 have converged)

Side note: It should be noted that there were occasions where this issue didn't happen and that was whenever the ConsolidateBlocks ran a certain decomposer (num_basis_gates):
image
This happens whenever I installed a different version of scipy without restarting the kernel and it slows down the pass considerably. After restarting the environment/kernel, the error would still happen.

@jakelishman
Copy link
Member Author

Thanks Ray, that's (of sorts) good news for the Terra 0.25 release, because it means that the problem was pre-existing, so we're fine to ship with our requirements allowing Scipy 1.11. I can safely remove this from the 0.25 milestone, and I might be able to make that change that Lev's suggested above to hopefully fortify our eigensystem routines a little bit. It's a little easier now that we've built up a lot more infrastructure in Rust around accelerated compiled routines.

@jakelishman jakelishman removed this from the 0.25.0 milestone Jul 19, 2023
@raynelfss
Copy link
Contributor

I also found this error happening on Windows with Python 3.10 but only with any version with the 1.11 label. It appears that this error happens whenever a very large circuit is processed, but it is particularly more sensitive with anything in the 1.11 version of Scipy. I was able to transpile circuits with 512 qubits and a depth of 64 with 1.10.1, 1.10.0, and 1.9.3 without any errors, but not with 1.11.1 and 1.11.0.

For now, I can only recommend we run CI on Windows machines with Python<=3.10 and with scipy<=1.10.1.

mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Jul 21, 2023
In CI we're seeing a reliable failure on windows when scipy is calling
LAPACK. This appears to be unrelated to the change in this PR branch and
is isolated to specific windows environments. This commit adds a skip
condition to skip that one test to workaround the failure in CI. The
wider issue with scipy compatibility on windows is being tracked in
issue Qiskit#10345, when we have a conclusion to that and can reliably run
this test we should remove this skip condition.
github-merge-queue bot pushed a commit that referenced this issue Jul 21, 2023
* Allow control flow in opt levels 2 and 3.

* Add release note.

* Enable testing for opt 2 and 3.

* Remove guard on translation_method='synthesis'.

* Fix test for synthesis guard.

* Skip o3 qpy full path transpile test on windows

In CI we're seeing a reliable failure on windows when scipy is calling
LAPACK. This appears to be unrelated to the change in this PR branch and
is isolated to specific windows environments. This commit adds a skip
condition to skip that one test to workaround the failure in CI. The
wider issue with scipy compatibility on windows is being tracked in
issue #10345, when we have a conclusion to that and can reliably run
this test we should remove this skip condition.

---------

Co-authored-by: Matthew Treinish <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants