Simulation performance improvements #359
Merged
While investigating why the Qulacs backend does not seem to use multiple threads in my code, I noticed that very little time is spent in the actual simulation.
For testing, I used the Quantum Fourier Transform. By running either just a QFT, or a QFT and then an inverse QFT, we can make the resulting wavefunction dense or sparse, which is important for testing the performance.
Code
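A rough stand-in for the test script, for illustration only: the qubit count and the `add_cphase`/`add_qft` helpers are assumptions, and the original script ran through the framework's `simulate` wrapper (where the overhead discussed below lives) rather than driving qulacs directly.

```python
import time
import numpy as np
from qulacs import QuantumCircuit, QuantumState
from qulacs.gate import DenseMatrix

def add_cphase(circuit, control, target, angle):
    # controlled phase rotation: a single-qubit phase gate plus a control qubit
    gate = DenseMatrix(target, [[1, 0], [0, np.exp(1j * angle)]])
    gate.add_control_qubit(control, 1)
    circuit.add_gate(gate)

def add_qft(circuit, n, inverse=False):
    # textbook QFT (final swaps omitted, they only relabel qubits); the
    # inverse reverses the gate order and negates the angles
    ops = []
    for i in range(n):
        ops.append(("H", i, i, 0.0))
        for j in range(i + 1, n):
            ops.append(("CP", j, i, np.pi / 2 ** (j - i)))
    if inverse:
        ops = [(g, c, t, -a) for (g, c, t, a) in reversed(ops)]
    for gate, control, target, angle in ops:
        if gate == "H":
            circuit.add_H_gate(target)
        else:
            add_cphase(circuit, control, target, angle)

n = 24  # assumed qubit count
circuit = QuantumCircuit(n)
add_qft(circuit, n)                    # QFT of |0...0>: a dense, uniform wavefunction
# add_qft(circuit, n, inverse=True)    # QFT + inverse: identity, a sparse result

state = QuantumState(n)
start = time.perf_counter()
circuit.update_quantum_state(state)
print(f"{time.perf_counter() - start:.3f}s")
```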
Running this code on the `devel` branch takes ~160 seconds for the plain QFT and ~20 seconds for the combined QFT and inverse QFT on my machine. Note that the second run is much faster even though the simulated circuit is twice as large; this is because the result is a sparse wavefunction.

When measuring the time of `self.circuit.update_quantum_state(state)` in `do_simulate` in `simulator_qulacs.py` (which I believe is the actual simulation), it is only ~0.17s / ~0.28s (QFT / QFT + inverse), so there is a lot of room for speedups.

All the times here are simple measurements of single runs, not proper benchmarks, but the differences are so large that it shouldn't matter if the numbers are somewhat inaccurate.
I made four commits which speed this up:
For a dense wavefunction, a lot of time is spent in the `apply_keymap` function in `simulate`. This doesn't seem necessary when all qubits are active, since the mapping then has no effect. So I added a check that skips the keymap entirely in that case, reducing the runtime to ~24s / ~20s.
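Schematically, the added guard looks like this (plain dicts stand in for the project's wavefunction and keymap classes; the names here are illustrative, not the actual code):

```python
def apply_keymap(wfn: dict, keymap: dict) -> dict:
    # rebuilds every (basis index, amplitude) pair: O(2^n) work for a dense state
    return {keymap[k]: v for k, v in wfn.items()}

def map_result(wfn: dict, keymap: dict, n_active: int, n_qubits: int) -> dict:
    # the added guard: when every qubit is active the keymap is the identity
    # permutation, so the remapping pass can be skipped outright
    if n_active == n_qubits:
        return wfn
    return apply_keymap(wfn, keymap)
```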
The `numpy.isclose` call in `from_array` in `qubit_wavefunction.py` is expensive, likely because it checks for edge cases (infinity, NaN) and because it can handle arrays. Assuming we simply want to skip values close to zero, this can be done much faster using a simple comparison. Also, we only need to create the index bitstring when we actually want to set the value, so moving it inside the if clause saves a lot of time for sparse wavefunctions. These two changes reduce the runtime to ~6s / ~0.7s.
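A simplified version of the hot loop with both changes applied (the real `from_array` builds bitstring keys rather than plain integers, and the threshold value is an assumption):

```python
import numpy as np

def from_array(arr: np.ndarray, threshold: float = 1e-6) -> dict:
    wfn = {}
    for i, v in enumerate(arr):
        # plain scalar comparison instead of numpy.isclose(v, 0.0), which
        # pays for nan/inf handling and array broadcasting on every call
        if abs(v) > threshold:
            # the key is only constructed for kept entries, so for a sparse
            # wavefunction almost every iteration now does no extra work
            wfn[i] = v
    return wfn
```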
Still in the `from_array` function, if the numbering of the array doesn't match the numbering of the wavefunction, a keymap is applied to adjust it. This should not be necessary, because the `initialize_bitstring` function should already return the correct type of bitstring. However, the returned type currently depends on the `numbering_in` parameter; changing it to the `numbering_out` parameter allows removing the keymap and reduces the time to ~3.7s / ~0.6s.
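A minimal sketch of that one-word change (the classes below are stubs; the project's real `BitString`/`BitStringLSB` carry more machinery, and the numbering conversion itself is omitted):

```python
from enum import Enum

class BitNumbering(Enum):
    MSB = 0
    LSB = 1

class BitString:                 # stub for the MSB-numbered bitstring class
    @classmethod
    def from_int(cls, integer, nbits):
        obj = cls()
        obj.integer, obj.nbits = int(integer), nbits
        return obj

class BitStringLSB(BitString):   # stub for the LSB-numbered variant
    pass

def initialize_bitstring(integer, nbits,
                         numbering_in=BitNumbering.MSB,
                         numbering_out=BitNumbering.MSB):
    # was: `if numbering_in == BitNumbering.LSB` -- typing the result by the
    # *input* numbering forced from_array to apply a keymap afterwards
    if numbering_out == BitNumbering.LSB:
        return BitStringLSB.from_int(integer, nbits)
    return BitString.from_int(integer, nbits)
```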
Reversing the bits in a bitstring is currently done by converting the integer to a binary string, reversing the characters, and converting them back to an integer. This is inefficient and can be replaced with a function that uses bit operations (see the sketch at the end of this post). One way to read that function is that each of its lines flips one bit of every bit's position, so once all position bits are flipped, position p has moved to 31 - p. This assumes the integer has exactly 32 bits; if it has fewer, the reversal leaves padding zeros, which we can get rid of with a bitshift. Another small optimization is to calculate a logarithm to get the bitstring length, instead of converting to a binary string and counting (the integer `bit_length()` method can't be used because some code passes NumPy ints). Together, these changes reduce the runtime to ~2.5s / ~0.6s.

In total, the speedup is ~60x / ~30x for the script above, but it is likely much smaller for other code. I have not tested whether these speedups transfer to other backends. Further improvements are definitely possible, but I think I have covered a good amount of the low-hanging fruit.
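The sketch referenced above: one standard formulation of the 32-bit bit reversal, plus the log-based length, as standalone functions (the integration into the bitstring class is omitted):

```python
import math

def reverse_int_bits(x: int, nbits: int) -> int:
    # Swap adjacent bits, then 2-bit pairs, nibbles, bytes, and 16-bit
    # halves: each step flips one bit of every position, so bit p ends up
    # at bit 31 - p and the whole 32-bit word is reversed.
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    # an nbits-wide value was padded to 32 bits, so its reversed bits sit at
    # the top of the word; shift the padding zeros away
    return x >> (32 - nbits)

def bit_length(x) -> int:
    # replaces len(bin(x)) - 2; math.log2 accepts NumPy ints, whose missing
    # .bit_length() ruled out the plain int method
    return int(math.log2(x)) + 1 if x else 0

assert reverse_int_bits(0b0011, 4) == 0b1100
assert bit_length(0b1000) == 4
```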