Transpiling thin and long circuits requires large memory resources in execution #5895
If I directly use CX gates then the memory consumption is ~2 GB and it completes in a fraction of the time, indicating that the issue is likely in the unroller. The memory is still not freed, however. Moreover, if I do parallel transpilation, each process is spawned with this additional memory attached, so every process is larger in memory than it otherwise should be. With the CX circuit, this means that later processes started with 3.3 GB of memory usage versus just 800 MB if I did not transpile the CX circuit beforehand.
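For context, each swap gate decomposes into three CX gates, so the CX-only comparison mentioned above can be rebuilt with a sketch like the following (the qubit count and depth are assumptions; the thread only describes the circuit as thin and long, with roughly a million swaps):

```python
from qiskit import QuantumCircuit

n_layers = 1_000_000  # assumed depth, matching the ~1M swap circuit discussed in the thread

qc = QuantumCircuit(2)
for _ in range(n_layers):
    # One swap is equivalent to three CX gates: cx(0, 1); cx(1, 0); cx(0, 1)
    qc.cx(0, 1)
    qc.cx(1, 0)
    qc.cx(0, 1)
```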
I'm able to reproduce this locally. Building the 1M swap circuit takes ~500 MB, and then after transpiling it the Python process is using a constant 5-6 GB. Assuming a worst case where the circuit object's resident memory grows linearly with gate count, the transpiled output would be ~1.5 GB (3 CX for each swap), so you'd only be using ~2 GB. What that feels like to me is that gc is not able to clean up a bunch of the intermediate objects created during transpile(), and they're left sitting around, inaccessible, until the process exits. I tried manually running gc to force it to clean things up and there was no change in size. My first guess is that this is actually a retworkx bug: after doing some reading of the PyO3 (the Rust Python binding library that retworkx uses) docs, it looks like you need to implement the garbage collection protocol manually, otherwise Python won't know how to clean things up. Let me try putting together a patch for retworkx and test it. If this is what's going on, I'm hoping to have a retworkx release at some point next week so we can include this in that.
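A minimal reproduction sketch of the experiment described above, assuming a 2-qubit, ~1M-swap circuit and using psutil to read the resident set size (psutil and the cx/u3 target basis are editorial choices here, not taken from the original comment):

```python
import gc
import os

import psutil
from qiskit import QuantumCircuit, transpile

proc = psutil.Process(os.getpid())


def rss_gb():
    """Resident set size of the current process, in GB."""
    return proc.memory_info().rss / 1024**3


qc = QuantumCircuit(2)
for _ in range(1_000_000):
    qc.swap(0, 1)
print(f"after build:     {rss_gb():.2f} GB")

out = transpile(qc, basis_gates=["cx", "u3"], optimization_level=0)
print(f"after transpile: {rss_gb():.2f} GB")

del out
gc.collect()  # forcing collection did not shrink the footprint in the tests above
print(f"after gc:        {rss_gb():.2f} GB")
```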
Can you try running with Qiskit/rustworkx#258 for retworkx? For me this seems to fix the leaking memory so after
I do not see any change with the above branch (updated from retworkx 0.7.1 to the 0.8.0 branch). I am still seeing a max memory of 6.5 GB and a final amount of 6.3 GB. This is on Ubuntu 20.04.
[EDITED] Interestingly, if I just run the above circuit through an empty pass manager (`pm = PassManager()` followed by `pm.run(qc)`), then immediately there is a 1.5 GB memory overhead (~3x the original circuit size).
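A minimal sketch of that empty-pass-manager experiment, assuming the same ~1M-swap circuit; even with no passes registered, PassManager.run round-trips the circuit through a DAG, which is where the extra memory is suspected to come from:

```python
from qiskit import QuantumCircuit
from qiskit.transpiler import PassManager

# Assumed circuit: ~1M swaps on 2 qubits, as discussed earlier in the thread.
qc = QuantumCircuit(2)
for _ in range(1_000_000):
    qc.swap(0, 1)

pm = PassManager()   # no passes registered
out = pm.run(qc)     # still converts circuit -> DAG -> circuit internally
```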
Here is the profiling for an empty PM:
Oddly, I see the opposite timing for dag -> circuit vs circuit -> dag when running outside of the PM (dag -> circuit is faster). However, it seems that the underlying culprit is that the DAG structure somehow takes 3-4x more memory than the original circuit does.
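A sketch of how the two conversions can be timed outside of any pass manager, assuming the same ~1M-swap circuit as above:

```python
import time

from qiskit import QuantumCircuit
from qiskit.converters import circuit_to_dag, dag_to_circuit

qc = QuantumCircuit(2)
for _ in range(1_000_000):
    qc.swap(0, 1)

t0 = time.perf_counter()
dag = circuit_to_dag(qc)       # circuit -> DAG
t1 = time.perf_counter()
qc2 = dag_to_circuit(dag)      # DAG -> circuit
t2 = time.perf_counter()

print(f"circuit -> dag: {t1 - t0:.1f} s")
print(f"dag -> circuit: {t2 - t1:.1f} s")
```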
Yeah, max memory doesn't decrease with that change; it still peaks at multiple GB. The difference for me locally was that when I ran in an interpreter without the retworkx PR, the Python process was still using ~6 GB until it exited, but when I used the retworkx PR branch it cleaned up. But I'm on my laptop now and I'm not able to reproduce that behavior anymore (and I'm thinking I was actually misreading the number of digits in the resident set size reported by the kernel). I still think the retworkx PR is useful, but it probably doesn't address this.
Is there any news on this front? Doing circuits over 127Q quickly blows through the 8 GB of memory on the IQX, i.e. memory is not freed, and I am assuming this is the cause.
This probably needs a large refactor to change how data is stored, but it's on the radar.
So I took a quick look at this before (back in April) and forgot to comment here with my findings. The issue here is primarily around how the data is stored. Running with a recent-ish main branch and using guppy3 to dump the heap contents, you can see this pretty easily; for example, after running the circuit creation the Python heap shows:
then after running transpile() the heap shows:
But I'm not seeing the 3x memory growth as before. That being said, I can only see the memory growth issue in interactive sessions; when I run a standalone script I'm not able to reproduce it. This kind of goes with the fact that when I run garbage collection the heap drops down, because it sees the references to the output circuit from
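For reference, a minimal sketch of the guppy3 heap inspection described above (the actual heap tables from the comment are not preserved in this copy):

```python
from guppy import hpy           # pip install guppy3
from qiskit import QuantumCircuit, transpile

h = hpy()

qc = QuantumCircuit(2)
for _ in range(1_000_000):
    qc.swap(0, 1)
print(h.heap())                 # heap contents after circuit creation

out = transpile(qc, basis_gates=["cx", "u3"], optimization_level=0)
print(h.heap())                 # heap contents after transpile()
```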
I think to fix this we will need to change how instructions are used (or at least what we put in every instance), because right now we end up with multiple copies of instructions everywhere: we can't reliably copy by reference, as things can and do get mutated, which would have unintended side effects. This is actually something I looked at a really long time ago in #1445, which just made what is now the definition a shared global for each gate class so we didn't keep a copy for every instruction. But I gave up on that PR because it wasn't feasible to use a global, as the definition got modified (for example when conditions were added). It might be easier to do now, since things have been restructured since Qiskit 0.6/0.7 when I last tried this. Ideally, at least for standard gates, we could make them global singletons and just use references for everything to reduce the memory overhead.
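A toy sketch of the shared-singleton idea being proposed here; this is not Qiskit's actual implementation, just an illustration of sharing one immutable instance and copying only when something needs to be mutated:

```python
class TinyGate:
    """Toy stand-in for a parameter-free standard gate."""

    _singleton = None

    def __new__(cls):
        # Reuse one shared instance instead of allocating a new object per use.
        if cls._singleton is None:
            cls._singleton = super().__new__(cls)
        return cls._singleton

    def with_condition(self, condition):
        # Mutation-like operations hand back a distinct, non-shared copy so the
        # shared instance is never modified behind other circuits' backs.
        new = object.__new__(type(self))   # bypass the singleton machinery
        new.condition = condition
        return new


a, b = TinyGate(), TinyGate()
assert a is b                          # every plain use shares one object
c = a.with_condition("creg == 1")
assert c is not a                      # the conditioned copy is a separate object
```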
I'm going to close this issue as complete: with the introduction of singleton gates and the follow-up that caches the permutation of qargs and cargs in circuit instructions, the memory overhead of the million-swap circuit is drastically reduced. In my local testing the heap size after creating the circuit on
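A quick way to observe the singleton behavior, assuming a Qiskit release that includes singleton gates (roughly 0.45 or later):

```python
from qiskit.circuit.library import SwapGate, XGate

# With singleton gates, parameter-free standard gates share a single instance,
# so a million-swap circuit no longer stores a million independent gate objects.
assert SwapGate() is SwapGate()
assert XGate() is XGate()
```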
Latest release
What is the expected enhancement?
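The construction code from the original report is not preserved in this copy. Based on the rest of the thread, which describes a circuit of roughly one million swap gates, a hedged reconstruction using the qiskit.execute and BasicAer APIs available at the time might look like this:

```python
from qiskit import BasicAer, QuantumCircuit, execute

# Assumed reconstruction: ~1M swap gates on 2 qubits; the original report's
# code block is not preserved here, so the size and backend are guesses.
qc = QuantumCircuit(2)
for _ in range(1_000_000):
    qc.swap(0, 1)          # the report sees similar memory usage with only h gates

backend = BasicAer.get_backend("qasm_simulator")
job = execute(qc, backend, optimization_level=0)   # ~6.5 GB reported during this call
```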
The circuit requires 6.5 GB of memory when calling execute with optimization level 0. A similar usage is seen with only h gates. It also appears that some memory is not freed upon finishing this process: if I run it again, the memory usage recorded in the Linux system monitor becomes approximately double the value; the memory is only freed upon terminating the Python instance.