Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of parametrized pulse evaluation #48

Conversation

wshanks
Copy link

@wshanks wshanks commented Mar 16, 2022

I played around with some different things on the upgrade/serializable-parametric-pulse branch. I thought a PR might be a good format for discussing what I saw. Here are some findings:

  1. For sympy, the really slow step was trying to simplify the symbolic expression. Since we just end up evaluating the function any way, simplification seems unnecessary if it ends up taking more time than it might save with a simpler expression. Removing simplify() reduced the time needed to draw the schedule for the whole circuit in the performance check notebook from too long for me to wait (I think you said 10 minutes) to 2.7 seconds.
  2. I got further savings for sympy by switching from the list comprehension to direct evaluation of the numpy array. That dropped the time to draw the circuit's schedule from 2.7 seconds to 0.6 seconds.
  3. I encountered one issue with the constant function function after making these changes. sympy simplifies the expression to the point that it is scalar, so the lambda just returns a scalar instead of an array. I handled this case with special code that checks for an array. It worked for the notebook but maybe there is a better way.
  4. Did you test the symengine case much? For me, it can't handle the full circuit schedule or the drag pulse. From further testing, I concluded that symengine does not support complex numbers in lambdify. This might be something to look into further (ask the symengine developers?). I also find it odd that the signature of lambdify is different for sympy and symengine (expression vs. iterable of expressions).
  5. symengine did not have the issue of lambdify returning a function that returns a scalar when there is no t dependence.
  6. For Gaussian and GaussianSquare, symengine was more than 10 time slower than sympy for the notebook cell tests.
  7. symengine is probably slower becuase I left in the simplify() calls for it. I found that if I did not use simplify() that I got RuntimeError: Not Implemented for Gaussian, GaussianSquare, and Constant. This is the same error that I see for Drag (it's from a Cython file, so a bit tricky to debug). Perhaps there is something complex in the expressions that gets simplified to real or maybe it is something else.

I have not looked into qiskit's stance on sympy and symengine. Is it necessary to support both independently? Is casting symengine to sympy an option? I think symengine has a function for this but I don't know if qiskit needs to support having either one installed without the other. I think the symengine developers are pretty responsive if we have specific issues (but I don't know how much work these issues could require).

@wshanks wshanks requested a review from nkanazawa1989 as a code owner March 16, 2022 21:14
@nkanazawa1989
Copy link
Owner

nkanazawa1989 commented Mar 17, 2022

Thanks Will for the detailed investigation. This report is really helpful.

I got further savings for sympy by switching from the list comprehension to direct evaluation of the numpy array. That dropped the time to draw the circuit's schedule from 2.7 seconds to 0.6 seconds.

I didn't investigate sympy much because I believed it is slower (symengine is much faster in principle because this is cython based implementation), but 0.6 sec is pretty good number. Originally Qiskit parameter expression was based on Sympy but now it is migrating to symengine for performance reason.

For Gaussian and GaussianSquare, symengine was more than 10 time slower than sympy for the notebook cell tests.

This is indeed surprising result. Probably we can drop symengine support here (if lamdify works such nicely).

sympy simplifies the expression to the point that it is scalar, so the lambda just returns a scalar instead of an array.

Probably we can write the constant as piecewise, i.e. amp * Piecewise((1, 0<= t <= duration), (0, True))?

symengine is probably slower becuase I left in the simplify() calls for it.

Yes, I noticed this and wrote about the simplification issue of piecewise in the symengine.py github. However even if I removed simplify (this is not necessary if the function is not based on piecewise), it still very slow to visualize full circuit schedule.
symengine/symengine.py#393

@mtreinish
Copy link

Reading through all of this I think using symengine for this application won't work yet. I feel like if we can work with symengine to fix the issues ideally it would be a faster solution (we saw a 7-10x speedup in terra for ParameterExpression moving to symengine). But if we've found a performant solution that's sympy based I think we're ok to just go with that.

Yeah, for the symengine() with symplify() usage that will be super slow because it relies on calling sympy to do the symbolic simplification (at least that's what I saw when I looked at this a couple weeks ago) and sympy is slow. But the need for that is the issue that @nkanazawa1989 reported upstream and they're working on a fix to the underlying c++ lib.

I have not looked into qiskit's stance on sympy and symengine. Is it necessary to support both independently? Is casting symengine to sympy an option? I think symengine has a function for this but I don't know if qiskit needs to support having either one installed without the other. I think the symengine developers are pretty responsive if we have specific issues (but I don't know how much work these issues could require).

In general for qiskit we try to use symengine for performance, but we can't use it everywhere because symengine doesn't have full platform support for all the platforms that qiskit supports (mainly 32bit platforms) and building it from source is next to impossible for most users. So in general for qiskit we use symengine as the default when we can, but support a fall back to sympy if symengine isn't available. The one thing sympy is not allowed for in qiskit though is a module level import, that is because importing sympy is significantly slower to import than everything else in qiskit.

So for this I think relying on just sympy is fine here if we have a performant solution using it.

This works around an issue with sympy where its lambda returns a scalar
instead of an array when the expression evaluates to a scalar:

sympy/sympy#5642
@wshanks wshanks force-pushed the upgrade/serializable-parametric-pulse branch from 6eb57c4 to 6f2496a Compare March 17, 2022 13:56
@wshanks
Copy link
Author

wshanks commented Mar 17, 2022

I didn't investigate sympy much because I believed it is slower

Ah, okay. I thought you were looking at sympy because you had put the list comprehension in and the only benefit I saw to the list comprehension over direct evaluation of the numpy array was that it handled the case of the sympy lambda function returning a scalar.

Probably we can write the constant as piecewise, i.e. amp * Piecewise((1, 0<= t <= duration), (0, True))?

I tried this out and it worked for the pulses in the notebook. I pushed another commit to the branch in this PR with the change.

However even if I removed simplify (this is not necessary if the function is not based on piecewise),

When I remove simplify, Gaussian also stops working. I found that this is because amp is cast to complex in Gaussian.__init__ and Lambdify also does not seem to support complex numbers. I opened symengine/symengine.py#396 about symengine's Lambdify supporting complex numbers.

we saw a 7-10x speedup in terra for ParameterExpression moving to symengine

Do you know which kind of operations were sped up in the ParameterExpression benchmarks? I would have guessed simplification could be sped up a lot but symengine falls back to sympy like you point out. I guess it is numerically evaluating an expression that gets sped up?

I think the sympy lambdify prints the expression as a string and then calls exec on it with locals (like sin, exp, etc) imported from the requested backend (like numpy). Maybe the printing and exec are a little slow, but I would expect the resulting function to run about as fast as writing the expression directly in numpy when using the numpy backend. Perhaps it would help for ParametricPulse.get_waveform() to cache the lambda so that when evaluating a pulse multiple times (like drawing a circuit) it does not need to be generated for each usage (or just the final array could be cached but that could be larger). The caching would need to check that the pulse parameters had not changed.

So I think the current status is that:

  1. sympy's lambdify seems to be reasonably performant when not using simplify() and acting on numpy arrays. We could investigate caching if we wanted to make multiple calls to get_waveform() on the same pulse faster.
  2. symengine's lambdify is missing support for piecewise functions and complex values and we have issues opened for both of those cases (Piecewise cannot typecasted into number even if all conditions are evaluated. symengine/symengine.py#393, Support for complex numbers in expressions passed to Lambdify symengine/symengine.py#396).
  3. If we can assume sympy is installed (it is still in the requirements for qiskit-terra so that seems okay), then we could change the HAS_SYMENGINE branch in get_waveform() from:
    lambda_func = sym.lambdify(t, [assigned_expr])

to

    import sympy
    lambda_func = sympy.lambdify(t, sym.sympify(assigned_expr))

as a workaround until the issues are fixed in symengine. This would let everything else related to parametric pulses keep using symengine and just change the lambda function creation which is only used internally within the ParametricPulse class.

@wshanks
Copy link
Author

wshanks commented Mar 17, 2022

One other caveat -- I have only tried the performance notebook that @nkanazawa1989 committed in this branch. We should check further for edge cases. Hopefully, the qiskit-terra unit tests are sufficient?

@nkanazawa1989
Copy link
Owner

nkanazawa1989 commented Mar 17, 2022

Qiskit terra unittest is bit outdated and probably we need to review all the tests there. I don't think it does test for weird pulses. I feel just dropping symengine for pulse is good approach for now rather than converting it into sympy expression because it's in the requirement.
https://github.com/Qiskit/qiskit-terra/blob/e0e7cf8d0dfbd5918a5b58016212faf0aff4be62/requirements.txt#L6

Unfortunately the caching approach won't work because

pulse.play(Gaussian(160, 0.1, 40), channel_d0)
pulse.play(Gaussian(160, 0.1, 40), channel_d0)

will give you different instance. I'm not sure if we can turn get_waveform into class method and apply LRU cache on it.

@wshanks
Copy link
Author

wshanks commented Mar 18, 2022

Does

pulse.play(Gaussian(160, 0.1, 40), channel_d0)
pulse.play(Gaussian(160, 0.1, 40), channel_d0)

occur often in practice? I was thinking you would have an InstructionScheduleMap or Target with the schedule instances that would get reused as a circuit was processed.

For reference, here is a little bit of profiling that I tried:

git symengine sched gaussian gaussian_sq drag constant
1bf11fa True 0.38 0.00144 0.00684 0.00329 0.00108
1bf11fa False 0.403 0.00213 0.00742 0.00371 0.00109
44f794a True 0.281 0.000143 0.00025 0.000136 6.77e-05
2335f32 True 0.249 0.000148 0.000292 0.000156 0.000127
2335f32 False 0.252 0.000141 0.000387 0.000157 0.000122

Here "sched" is calling sched.draw() on the schedule from the notebook. The other columns are just calling get_waveform() instead of draw(). The first two rows are the second most recent commit in the PR here where the symengine case falls back to sympy before producing a lambda dynamically within the get_waveform() call. It seems like there is some hit from converting to sympy, but not too bad -- in the 10% range (actually, it looks like the symengine case is faster! running more though these might just be within the margin of error). The third row is the commit before @nkanazawa1989's changes in the branch (so some commit from terra main). There is quite a hit with the sympy lambdify method compared to the old way of implementing get_waveform() -- a factor of 20-40 slower.

To address the cost of the lambdify call, I added a new (messy) commit where I lambdify the symbolic expression on the first call to __init__ and store it in a class level variable, so it should only ever be generated once per Python process. In the test code, I instantiated each symbolic class once before the timed calls, so the time for the lambdify calls does not show up in the table. With this approach, the times are very close. We would need to measure more carefully to tell if there is an actual difference (the two symengine cases should be indistinguishable because the symengine part is in the lambdify that gets done before the measurement starts).

I feel just dropping symengine for pulse is good approach for now rather than converting it into sympy expression because it's in the requirement.

I don't have a strong feeling about this. It seems like handling symengine does not add much overhead. These symbolic pulse shapes are new so they are not connected to the rest of qiskit, but there might be some convenience to using the same symbolic format as other parameter expressions.

Qiskit terra unittest is bit outdated and probably we need to review all the tests there.

I did see 20 tests fail with my next to last commit. I did not try to dig into them to see why they failed. Some seemed to be just a deprecation warning? Some seemed to be a pickling issue with symengine. A couple seemed to involve pulse values not matching what was expected. The failures might give extra motivation for avoiding symengine for now.

Here is the exact code I used:

from subprocess import run
from time import perf_counter

from qiskit import circuit, pulse, transpile, schedule, quantum_info as qi
from qiskit.test.mock import FakeBogota

from qiskit.utils import optionals
optionals.HAS_SYMENGINE = False


backend = FakeBogota()

circ = circuit.QuantumCircuit(2)
su4 = qi.random_unitary(4, seed=123).to_instruction()
circ.compose(su4, [0, 1], inplace=True)
circ.measure_all()

REPS = 4


schedule(transpile(circ, backend), backend)

sched = schedule(transpile(circ, backend), backend)

# Warm up
sched.draw()
pulse.Constant(800, 0.1)
pulse.Gaussian(160, 0.1, 40)

results = [
    (
        "git",
        run(
            "git describe --tags --exact-match || git symbolic-ref -q --short HEAD || git rev-parse --short HEAD",
            capture_output=True,
            text=True,
            check=True,
            shell=True,
        ).stdout.strip(),
    ),
    ("symengine", optionals.HAS_SYMENGINE)
]

start = perf_counter()
for _ in range(REPS):
    sched.draw()
results.append(("sched", (perf_counter() - start)/REPS))

start = perf_counter()
for _ in range(REPS):
    gaussian = pulse.Gaussian(160, 0.1, 40)
    gaussian.get_waveform()
results.append(("gaussian", (perf_counter() - start)/REPS))


start = perf_counter()
for _ in range(REPS):
    gaussian_sq = pulse.GaussianSquare(800, 0.1, 64, risefall_sigma_ratio=2)
    gaussian_sq.get_waveform()
results.append(("gaussian_sq", (perf_counter() - start)/REPS))


start = perf_counter()
for _ in range(REPS):
    drag = pulse.Drag(160, 0.1, 40, 0.3)
    drag.get_waveform()
results.append(("drag", (perf_counter() - start)/REPS))


start = perf_counter()
for _ in range(REPS):
    constant = pulse.Constant(800, 0.1)
    constant.get_waveform()
results.append(("constant", (perf_counter() - start)/REPS))

print("| " + " | ".join(r[0] for r in results) + " |")
print("| " + " | ".join(f"{r[1]:0.3g}" if isinstance(r[1], float) else str(r[1]) for r in results) + " |")

@nkanazawa1989
Copy link
Owner

nkanazawa1989 commented Mar 22, 2022

Thanks @wshanks this is really cool professional job. I was really impressed by the numbers in the table. I didn't realize we can use lamdify with more than one arguments (and only first one is array). Seems like adding lamdified function to class member as you suggest is reasonable approach. This will overcome the independent instance issue I raised above.

Another option would be converting _define method into class method and call lamdify in the __init_subclass__ so that lambda functions are generated when pulse module is loaded (as far as I can read lamdify method doesn't need to be an instance method). I can cover the test issues. I'll merge these commits and open a PR co-authored with you in terra if this PR is ready.

@nkanazawa1989 nkanazawa1989 merged commit 2af235d into nkanazawa1989:upgrade/serializable-parametric-pulse Mar 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants