Improve performance of parametrized pulse evaluation #48

wshanks · 2022-03-16T21:14:16Z

I played around with some different things on the upgrade/serializable-parametric-pulse branch. I thought a PR might be a good format for discussing what I saw. Here are some findings:

For sympy, the really slow step was trying to simplify the symbolic expression. Since we just end up evaluating the function any way, simplification seems unnecessary if it ends up taking more time than it might save with a simpler expression. Removing simplify() reduced the time needed to draw the schedule for the whole circuit in the performance check notebook from too long for me to wait (I think you said 10 minutes) to 2.7 seconds.
I got further savings for sympy by switching from the list comprehension to direct evaluation of the numpy array. That dropped the time to draw the circuit's schedule from 2.7 seconds to 0.6 seconds.
I encountered one issue with the constant function function after making these changes. sympy simplifies the expression to the point that it is scalar, so the lambda just returns a scalar instead of an array. I handled this case with special code that checks for an array. It worked for the notebook but maybe there is a better way.
Did you test the symengine case much? For me, it can't handle the full circuit schedule or the drag pulse. From further testing, I concluded that symengine does not support complex numbers in lambdify. This might be something to look into further (ask the symengine developers?). I also find it odd that the signature of lambdify is different for sympy and symengine (expression vs. iterable of expressions).
symengine did not have the issue of lambdify returning a function that returns a scalar when there is no t dependence.
For Gaussian and GaussianSquare, symengine was more than 10 time slower than sympy for the notebook cell tests.
symengine is probably slower becuase I left in the simplify() calls for it. I found that if I did not use simplify() that I got RuntimeError: Not Implemented for Gaussian, GaussianSquare, and Constant. This is the same error that I see for Drag (it's from a Cython file, so a bit tricky to debug). Perhaps there is something complex in the expressions that gets simplified to real or maybe it is something else.

I have not looked into qiskit's stance on sympy and symengine. Is it necessary to support both independently? Is casting symengine to sympy an option? I think symengine has a function for this but I don't know if qiskit needs to support having either one installed without the other. I think the symengine developers are pretty responsive if we have specific issues (but I don't know how much work these issues could require).

nkanazawa1989 · 2022-03-17T01:37:26Z

Thanks Will for the detailed investigation. This report is really helpful.

I got further savings for sympy by switching from the list comprehension to direct evaluation of the numpy array. That dropped the time to draw the circuit's schedule from 2.7 seconds to 0.6 seconds.

I didn't investigate sympy much because I believed it is slower (symengine is much faster in principle because this is cython based implementation), but 0.6 sec is pretty good number. Originally Qiskit parameter expression was based on Sympy but now it is migrating to symengine for performance reason.

For Gaussian and GaussianSquare, symengine was more than 10 time slower than sympy for the notebook cell tests.

This is indeed surprising result. Probably we can drop symengine support here (if lamdify works such nicely).

sympy simplifies the expression to the point that it is scalar, so the lambda just returns a scalar instead of an array.

Probably we can write the constant as piecewise, i.e. amp * Piecewise((1, 0<= t <= duration), (0, True))?

symengine is probably slower becuase I left in the simplify() calls for it.

Yes, I noticed this and wrote about the simplification issue of piecewise in the symengine.py github. However even if I removed simplify (this is not necessary if the function is not based on piecewise), it still very slow to visualize full circuit schedule.
symengine/symengine.py#393

mtreinish · 2022-03-17T11:45:30Z

Reading through all of this I think using symengine for this application won't work yet. I feel like if we can work with symengine to fix the issues ideally it would be a faster solution (we saw a 7-10x speedup in terra for ParameterExpression moving to symengine). But if we've found a performant solution that's sympy based I think we're ok to just go with that.

Yeah, for the symengine() with symplify() usage that will be super slow because it relies on calling sympy to do the symbolic simplification (at least that's what I saw when I looked at this a couple weeks ago) and sympy is slow. But the need for that is the issue that @nkanazawa1989 reported upstream and they're working on a fix to the underlying c++ lib.

I have not looked into qiskit's stance on sympy and symengine. Is it necessary to support both independently? Is casting symengine to sympy an option? I think symengine has a function for this but I don't know if qiskit needs to support having either one installed without the other. I think the symengine developers are pretty responsive if we have specific issues (but I don't know how much work these issues could require).

In general for qiskit we try to use symengine for performance, but we can't use it everywhere because symengine doesn't have full platform support for all the platforms that qiskit supports (mainly 32bit platforms) and building it from source is next to impossible for most users. So in general for qiskit we use symengine as the default when we can, but support a fall back to sympy if symengine isn't available. The one thing sympy is not allowed for in qiskit though is a module level import, that is because importing sympy is significantly slower to import than everything else in qiskit.

So for this I think relying on just sympy is fine here if we have a performant solution using it.

This works around an issue with sympy where its lambda returns a scalar instead of an array when the expression evaluates to a scalar: sympy/sympy#5642

wshanks · 2022-03-17T16:11:04Z

I didn't investigate sympy much because I believed it is slower

Ah, okay. I thought you were looking at sympy because you had put the list comprehension in and the only benefit I saw to the list comprehension over direct evaluation of the numpy array was that it handled the case of the sympy lambda function returning a scalar.

Probably we can write the constant as piecewise, i.e. amp * Piecewise((1, 0<= t <= duration), (0, True))?

I tried this out and it worked for the pulses in the notebook. I pushed another commit to the branch in this PR with the change.

However even if I removed simplify (this is not necessary if the function is not based on piecewise),

When I remove simplify, Gaussian also stops working. I found that this is because amp is cast to complex in Gaussian.__init__ and Lambdify also does not seem to support complex numbers. I opened symengine/symengine.py#396 about symengine's Lambdify supporting complex numbers.

we saw a 7-10x speedup in terra for ParameterExpression moving to symengine

Do you know which kind of operations were sped up in the ParameterExpression benchmarks? I would have guessed simplification could be sped up a lot but symengine falls back to sympy like you point out. I guess it is numerically evaluating an expression that gets sped up?

I think the sympy lambdify prints the expression as a string and then calls exec on it with locals (like sin, exp, etc) imported from the requested backend (like numpy). Maybe the printing and exec are a little slow, but I would expect the resulting function to run about as fast as writing the expression directly in numpy when using the numpy backend. Perhaps it would help for ParametricPulse.get_waveform() to cache the lambda so that when evaluating a pulse multiple times (like drawing a circuit) it does not need to be generated for each usage (or just the final array could be cached but that could be larger). The caching would need to check that the pulse parameters had not changed.

So I think the current status is that:

sympy's lambdify seems to be reasonably performant when not using simplify() and acting on numpy arrays. We could investigate caching if we wanted to make multiple calls to get_waveform() on the same pulse faster.
symengine's lambdify is missing support for piecewise functions and complex values and we have issues opened for both of those cases (Piecewise cannot typecasted into number even if all conditions are evaluated. symengine/symengine.py#393, Support for complex numbers in expressions passed to Lambdify symengine/symengine.py#396).
If we can assume sympy is installed (it is still in the requirements for qiskit-terra so that seems okay), then we could change the HAS_SYMENGINE branch in get_waveform() from:

    lambda_func = sym.lambdify(t, [assigned_expr])

to

    import sympy
    lambda_func = sympy.lambdify(t, sym.sympify(assigned_expr))

as a workaround until the issues are fixed in symengine. This would let everything else related to parametric pulses keep using symengine and just change the lambda function creation which is only used internally within the ParametricPulse class.

wshanks · 2022-03-17T17:22:09Z

One other caveat -- I have only tried the performance notebook that @nkanazawa1989 committed in this branch. We should check further for edge cases. Hopefully, the qiskit-terra unit tests are sufficient?

nkanazawa1989 · 2022-03-17T17:29:41Z

Qiskit terra unittest is bit outdated and probably we need to review all the tests there. I don't think it does test for weird pulses. I feel just dropping symengine for pulse is good approach for now rather than converting it into sympy expression because it's in the requirement.
https://github.com/Qiskit/qiskit-terra/blob/e0e7cf8d0dfbd5918a5b58016212faf0aff4be62/requirements.txt#L6

Unfortunately the caching approach won't work because

pulse.play(Gaussian(160, 0.1, 40), channel_d0)
pulse.play(Gaussian(160, 0.1, 40), channel_d0)

will give you different instance. I'm not sure if we can turn get_waveform into class method and apply LRU cache on it.

wshanks · 2022-03-18T19:27:16Z

Does

pulse.play(Gaussian(160, 0.1, 40), channel_d0)
pulse.play(Gaussian(160, 0.1, 40), channel_d0)

occur often in practice? I was thinking you would have an InstructionScheduleMap or Target with the schedule instances that would get reused as a circuit was processed.

For reference, here is a little bit of profiling that I tried:

git	symengine	sched	gaussian	gaussian_sq	drag	constant
`1bf11fa`	True	0.38	0.00144	0.00684	0.00329	0.00108
`1bf11fa`	False	0.403	0.00213	0.00742	0.00371	0.00109
`44f794a`	True	0.281	0.000143	0.00025	0.000136	6.77e-05
`2335f32`	True	0.249	0.000148	0.000292	0.000156	0.000127
`2335f32`	False	0.252	0.000141	0.000387	0.000157	0.000122

Here "sched" is calling sched.draw() on the schedule from the notebook. The other columns are just calling get_waveform() instead of draw(). The first two rows are the second most recent commit in the PR here where the symengine case falls back to sympy before producing a lambda dynamically within the get_waveform() call. It seems like there is some hit from converting to sympy, but not too bad -- in the 10% range (actually, it looks like the symengine case is faster! running more though these might just be within the margin of error). The third row is the commit before @nkanazawa1989's changes in the branch (so some commit from terra main). There is quite a hit with the sympy lambdify method compared to the old way of implementing get_waveform() -- a factor of 20-40 slower.

To address the cost of the lambdify call, I added a new (messy) commit where I lambdify the symbolic expression on the first call to __init__ and store it in a class level variable, so it should only ever be generated once per Python process. In the test code, I instantiated each symbolic class once before the timed calls, so the time for the lambdify calls does not show up in the table. With this approach, the times are very close. We would need to measure more carefully to tell if there is an actual difference (the two symengine cases should be indistinguishable because the symengine part is in the lambdify that gets done before the measurement starts).

I feel just dropping symengine for pulse is good approach for now rather than converting it into sympy expression because it's in the requirement.

I don't have a strong feeling about this. It seems like handling symengine does not add much overhead. These symbolic pulse shapes are new so they are not connected to the rest of qiskit, but there might be some convenience to using the same symbolic format as other parameter expressions.

Qiskit terra unittest is bit outdated and probably we need to review all the tests there.

I did see 20 tests fail with my next to last commit. I did not try to dig into them to see why they failed. Some seemed to be just a deprecation warning? Some seemed to be a pickling issue with symengine. A couple seemed to involve pulse values not matching what was expected. The failures might give extra motivation for avoiding symengine for now.

Here is the exact code I used:

from subprocess import run
from time import perf_counter

from qiskit import circuit, pulse, transpile, schedule, quantum_info as qi
from qiskit.test.mock import FakeBogota

from qiskit.utils import optionals
optionals.HAS_SYMENGINE = False


backend = FakeBogota()

circ = circuit.QuantumCircuit(2)
su4 = qi.random_unitary(4, seed=123).to_instruction()
circ.compose(su4, [0, 1], inplace=True)
circ.measure_all()

REPS = 4


schedule(transpile(circ, backend), backend)

sched = schedule(transpile(circ, backend), backend)

# Warm up
sched.draw()
pulse.Constant(800, 0.1)
pulse.Gaussian(160, 0.1, 40)

results = [
    (
        "git",
        run(
            "git describe --tags --exact-match || git symbolic-ref -q --short HEAD || git rev-parse --short HEAD",
            capture_output=True,
            text=True,
            check=True,
            shell=True,
        ).stdout.strip(),
    ),
    ("symengine", optionals.HAS_SYMENGINE)
]

start = perf_counter()
for _ in range(REPS):
    sched.draw()
results.append(("sched", (perf_counter() - start)/REPS))

start = perf_counter()
for _ in range(REPS):
    gaussian = pulse.Gaussian(160, 0.1, 40)
    gaussian.get_waveform()
results.append(("gaussian", (perf_counter() - start)/REPS))


start = perf_counter()
for _ in range(REPS):
    gaussian_sq = pulse.GaussianSquare(800, 0.1, 64, risefall_sigma_ratio=2)
    gaussian_sq.get_waveform()
results.append(("gaussian_sq", (perf_counter() - start)/REPS))


start = perf_counter()
for _ in range(REPS):
    drag = pulse.Drag(160, 0.1, 40, 0.3)
    drag.get_waveform()
results.append(("drag", (perf_counter() - start)/REPS))


start = perf_counter()
for _ in range(REPS):
    constant = pulse.Constant(800, 0.1)
    constant.get_waveform()
results.append(("constant", (perf_counter() - start)/REPS))

print("| " + " | ".join(r[0] for r in results) + " |")
print("| " + " | ".join(f"{r[1]:0.3g}" if isinstance(r[1], float) else str(r[1]) for r in results) + " |")

nkanazawa1989 · 2022-03-22T19:55:15Z

Thanks @wshanks this is really cool professional job. I was really impressed by the numbers in the table. I didn't realize we can use lamdify with more than one arguments (and only first one is array). Seems like adding lamdified function to class member as you suggest is reasonable approach. This will overcome the independent instance issue I raised above.

Another option would be converting _define method into class method and call lamdify in the __init_subclass__ so that lambda functions are generated when pulse module is loaded (as far as I can read lamdify method doesn't need to be an instance method). I can cover the test issues. I'll merge these commits and open a PR co-authored with you in terra if this PR is ready.

Improve performance of parametrized pulse evaluation

9a8a582

wshanks requested a review from nkanazawa1989 as a code owner March 16, 2022 21:14

Implement Constant pulse using Piecewise

6f2496a

This works around an issue with sympy where its lambda returns a scalar instead of an array when the expression evaluates to a scalar: sympy/sympy#5642

wshanks force-pushed the upgrade/serializable-parametric-pulse branch from 6eb57c4 to 6f2496a Compare March 17, 2022 13:56

wshanks added 3 commits March 17, 2022 13:43

Fall back to sympy for parametric pulse evaluation

0424636

Use sympy Symbol with sympy lambdify

1bf11fa

WIP: cache symbolic pulse lambda functions

2335f32

nkanazawa1989 merged commit 2af235d into nkanazawa1989:upgrade/serializable-parametric-pulse Mar 23, 2022

wshanks mentioned this pull request Apr 7, 2022

Serializable parametric pulse Qiskit/qiskit#7821

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of parametrized pulse evaluation #48

Improve performance of parametrized pulse evaluation #48

wshanks commented Mar 16, 2022

nkanazawa1989 commented Mar 17, 2022 •

edited

Loading

mtreinish commented Mar 17, 2022

wshanks commented Mar 17, 2022

wshanks commented Mar 17, 2022

nkanazawa1989 commented Mar 17, 2022 •

edited

Loading

wshanks commented Mar 18, 2022 •

edited

Loading

nkanazawa1989 commented Mar 22, 2022 •

edited

Loading

Improve performance of parametrized pulse evaluation #48

Improve performance of parametrized pulse evaluation #48

Conversation

wshanks commented Mar 16, 2022

nkanazawa1989 commented Mar 17, 2022 • edited Loading

mtreinish commented Mar 17, 2022

wshanks commented Mar 17, 2022

wshanks commented Mar 17, 2022

nkanazawa1989 commented Mar 17, 2022 • edited Loading

wshanks commented Mar 18, 2022 • edited Loading

nkanazawa1989 commented Mar 22, 2022 • edited Loading

nkanazawa1989 commented Mar 17, 2022 •

edited

Loading

nkanazawa1989 commented Mar 17, 2022 •

edited

Loading

wshanks commented Mar 18, 2022 •

edited

Loading

nkanazawa1989 commented Mar 22, 2022 •

edited

Loading