
Transpiling thin and long circuits requires large memory resources in execution #5895

Closed
nonhermitian opened this issue Feb 24, 2021 · 11 comments
Labels: bug Something isn't working

@nonhermitian
Contributor

nonhermitian commented Feb 24, 2021

Latest release

What is the expected enhancement?

from qiskit import QuantumCircuit

qc = QuantumCircuit(2)
qc.h(0)
for _ in range(1000000):
    qc.swap(0, 1)
qc.measure_all()

requires 6.5 GB of memory when calling execute() with optimization_level=0. Similar memory usage is seen with only H gates.

It also appears that some memory is not freed once this finishes: if I run it again, the memory usage recorded in the Linux system monitor roughly doubles, and the memory is only released when the Python instance is terminated.
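For reference, a minimal sketch of how the peak memory can be measured on Linux (the measurement code below, using the standard-library resource module and transpile() as a stand-in for the transpilation step inside execute(), is an addition for illustration, not part of the original report):

import resource

from qiskit import QuantumCircuit, transpile

qc = QuantumCircuit(2)
qc.h(0)
for _ in range(1000000):
    qc.swap(0, 1)
qc.measure_all()

# Transpile with no optimization, mirroring optimization_level=0 in execute()
transpiled = transpile(qc, optimization_level=0)

# ru_maxrss is reported in kilobytes on Linux
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak resident memory: {peak_kb / 1024 ** 2:.2f} GB")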

@nonhermitian nonhermitian added the type: enhancement It's working, but needs polishing label Feb 24, 2021
@nonhermitian
Contributor Author

If I directly use CX gates then the memory consumption is ~2 GB and it completes in a fraction of the time, indicating that the issue is likely in the unroller. The memory is still not freed, however. Moreover, if I then do parallel transpilation, each spawned process inherits this extra memory, so every process is larger than it otherwise should be. With the CX circuit this means that later processes started at 3.3 GB of memory usage versus just 800 MB if I had not transpiled the CX circuit beforehand.
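For comparison, a sketch of the CX-only variant described above (assuming the intent was simply to replace each swap with a direct cx call, so the unroller has nothing to decompose):

from qiskit import QuantumCircuit, transpile

qc_cx = QuantumCircuit(2)
qc_cx.h(0)
for _ in range(1000000):
    qc_cx.cx(0, 1)  # already a basis gate, so no unrolling is needed
qc_cx.measure_all()

transpiled = transpile(qc_cx, optimization_level=0)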

@nonhermitian nonhermitian added bug Something isn't working and removed type: enhancement It's working, but needs polishing labels Feb 24, 2021
@mtreinish
Member

I'm able to reproduce this locally. Building the 1M-swap circuit takes ~500 MB, and after transpiling it the Python process sits at a constant 5-6 GB. Assuming a worst case where the circuit object's resident memory grows linearly with gate count, the transpiled output would be ~1.5 GB (3 CX for each swap), so you'd only expect to be using ~2 GB. What that feels like to me is that the GC is not able to clean up a bunch of the intermediate objects created during transpile(), and they're left sitting around inaccessible until the process exits. I tried manually running gc to force it to clean things up and there was no change in size.

My first guess is that this is actually a retworkx bug. After reading the PyO3 docs (the Rust-Python binding library that retworkx uses), it looks like you need to implement the garbage collection protocol manually, otherwise Python won't know how to clean things up. Let me try putting together a patch for retworkx and test it. If this is what's going on, I'm hoping to have a retworkx release at some point next week so we can include this in it.

@mtreinish
Member

Can you try running with Qiskit/rustworkx#258 for retworkx? For me this seems to fix the leaked memory, so after transpile() the process is not using appreciably more memory (I still need to check what performance impact, if any, there is from doing that).

@nonhermitian
Contributor Author

nonhermitian commented Mar 2, 2021

I do not see any change with the above branch (updated from retworkx 0.7.1 to the 0.8.0 branch). I am still seeing a max memory of 6.5 GB and a final amount of 6.3 GB. This is on Ubuntu 20.04.

@nonhermitian
Contributor Author

nonhermitian commented Mar 2, 2021

[EDITED]

Interestingly, if I just run the above circuit on an empty pass manager

from qiskit.transpiler import PassManager

pm = PassManager()
pm.run(qc)

then immediately there is a 1.5 GB memory overhead (~3x the original circuit size). However, just doing circuit_to_dag seems to consume only about 200 MB of memory. So it seems that the pass manager is copying large amounts of data somewhere even though it is empty. It looks to be the dag_to_circuit call where all the memory is being consumed.
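A sketch of how the two conversion steps can be separated out (this snippet is a reconstruction of the experiment, with the memory readings taken from the system monitor between steps):

from qiskit.converters import circuit_to_dag, dag_to_circuit

dag = circuit_to_dag(qc)    # only ~200 MB of additional memory
qc2 = dag_to_circuit(dag)   # this step appears to account for most of the ~1.5 GB overhead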

@nonhermitian
Contributor Author

nonhermitian commented Mar 2, 2021

Here is the profiling for an empty PM:

         123000557 function calls in 51.146 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000003   11.623    0.000   24.157    0.000 dagcircuit.py:352(apply_operation_back)
  4000012    4.440    0.000   12.972    0.000 copy.py:66(copy)
  2000006    2.657    0.000    2.657    0.000 {method 'predecessors' of 'retworkx.PyDiGraph' objects}
  2000006    2.125    0.000    2.125    0.000 {built-in method __new__ of type object at 0x55c24f5126e0}
  2000006    1.996    0.000    5.684    0.000 copy.py:269(_reconstruct)
  1000011    1.493    0.000    4.318    0.000 dagnode.py:32(__init__)
  1000003    1.383    0.000    8.039    0.000 quantumcircuit.py:843(_append)
  2000004    1.312    0.000    1.698    0.000 register.py:97(__repr__)
  2000006    1.147    0.000   14.119    0.000 instruction.py:326(__deepcopy__)
  2000004    1.127    0.000    2.825    0.000 bit.py:69(__repr__)
  4000012    1.099    0.000    3.485    0.000 {built-in method builtins.all}
  1000003    1.071    0.000    5.641    0.000 dagcircuit.py:322(_add_op_node)
        1    1.062    1.062   18.046   18.046 dag_to_circuit.py:17(dag_to_circuit)
  4000012    0.884    0.000    0.884    0.000 {built-in method builtins.getattr}
  1000003    0.871    0.000    1.225    0.000 quantumcircuit.py:938(_check_dups)
        1    0.855    0.855   32.489   32.489 circuit_to_dag.py:18(circuit_to_dag)
  3000007    0.852    0.000    1.647    0.000 quantumcircuit.py:948(<genexpr>)
 10000046    0.848    0.000    0.848    0.000 bit.py:73(__hash__)
  4000016    0.796    0.000    0.796    0.000 {method 'add_edge' of 'retworkx.PyDiGraph' objects}
  2000006    0.777    0.000    0.777    0.000 {method '__reduce_ex__' of 'object' objects}
  2000006    0.706    0.000    0.880    0.000 dagcircuit.py:293(_check_bits)
  9000068    0.687    0.000    0.687    0.000 {built-in method builtins.isinstance}
  1000003    0.625    0.000    3.562    0.000 quantumcircuit.py:944(_check_qargs)
  2000006    0.611    0.000    2.736    0.000 copyreg.py:87(__newobj__)
  2000006    0.607    0.000    0.607    0.000 {method 'update' of 'dict' objects}
  6000022    0.584    0.000    0.584    0.000 bit.py:47(register)
  6000027    0.579    0.000    0.579    0.000 register.py:73(name)
  6000018    0.576    0.000    0.576    0.000 {method 'get' of 'dict' objects}
  2000006    0.554    0.000   14.673    0.000 instruction.py:308(copy)

Oddly, I see the opposite timing for dag_to_circuit vs circuit_to_dag when running outside of the PM (dag_to_circuit is faster). However, it seems that the underlying culprit is that the DAG structure somehow takes 3-4x more memory than the original circuit does.
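For reference, a profile like the one above can be produced with something along these lines (the exact cProfile invocation is a reconstruction, not necessarily the command used):

import cProfile

from qiskit.transpiler import PassManager

pm = PassManager()
cProfile.run("pm.run(qc)", sort="tottime")  # qc is the 1M-swap circuit from above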

@mtreinish
Member

> I do not see any change with the above branch (updated from retworkx 0.7.1 to the 0.8.0 branch). I am still seeing a max memory of 6.5 GB and a final amount of 6.3 GB. This is on Ubuntu 20.04.

Yeah, max memory doesn't decrease with that change; it still peaks at multiple GB. The difference for me locally was that when I ran in an interpreter without the retworkx PR, the Python process was still using ~6 GB until it exited, but when I used the retworkx PR branch it cleaned up. But I'm on my laptop now and I'm not able to reproduce that behavior anymore (I'm thinking I actually misread the number of digits in the resident set size reported by the kernel). I still think the retworkx PR is useful, but it probably doesn't address this.

@nonhermitian
Contributor Author

Is there any news on this front? Doing circuits over 127Q quickly blows through the 8 GB of memory on the IQX, i.e. memory is not freed, and I am assuming this is the cause.

@ajavadia
Member

This probably needs a large refactor to change how data is stored, but it's on the radar.

@mtreinish
Member

I took a quick look at this a while back (in April) and forgot to comment here with my findings. The issue here is primarily around how Instruction objects are used: because each instruction in a circuit is its own instance, we end up keeping a lot of objects in memory even if they're all identical.
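A small sketch of what this means in practice (assuming the tuple-based circuit.data format of the Terra releases at the time):

from qiskit import QuantumCircuit

qc = QuantumCircuit(2)
qc.swap(0, 1)
qc.swap(0, 1)

# Each swap() call constructs a fresh SwapGate, so the two operations are
# distinct objects in memory even though they are logically identical.
op_a, _, _ = qc.data[0]
op_b, _, _ = qc.data[1]
print(op_a is op_b)  # False on releases before singleton gates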

Running with a recent-ish main branch and using guppy3 to dump the heap contents, you can see this pretty easily. For example, after creating the circuit the Python heap shows:

Partition of a set of 6562312 objects. Total size = 522637101 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 3004004  46 192945584  37 192945584  37 list
     1 1000000  15 144000000  28 336945584  64 dict of
                                               qiskit.circuit.library.standard_gates.swap.SwapGate
     2 1136049  17 74023296  14 410968880  79 tuple
     3 1000000  15 48000000   9 458968880  88 qiskit.circuit.library.standard_gates.swap.SwapGate
     4 167091   3 24847710   5 483816590  93 str
     5  40331   1  7153604   1 490970194  94 types.CodeType
     6  76096   1  6123735   1 497093929  95 bytes
     7  36376   1  4947136   1 502041065  96 function
     8  12148   0  3915920   1 505956985  97 dict (no owner)
     9   3959   0  3867904   1 509824889  98 type
<1112 more rows. Type e.g. '_.more' to view.>

Then after running PassManager().run(qc) the heap shows 2x the number of SwapGate objects:

>>> pm = PassManager()
>>> pm.run(qc)
<qiskit.circuit.quantumcircuit.QuantumCircuit object at 0x7faf5e8b87c0>
>>> h.heap()
Partition of a set of 11561892 objects. Total size = 899047140 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 5004016  43 313386768  35 313386768  35 list
     1 2000000  17 288000000  32 601386768  67 dict of
                                               qiskit.circuit.library.standard_gates.swap.SwapGate
     2 2135909  18 138013168  15 739399936  82 tuple
     3 2000000  17 96000000  11 835399936  93 qiskit.circuit.library.standard_gates.swap.SwapGate
     4 167091   1 24847710   3 860247646  96 str
     5  40329   0  7156519   1 867404165  96 types.CodeType
     6  76092   1  6123547   1 873527712  97 bytes
     7  36230   0  4927280   1 878454992  98 function
     8  12151   0  3916616   0 882371608  98 dict (no owner)
     9   3959   0  3868440   0 886240048  99 type
<1113 more rows. Type e.g. '_.more' to view.>

But I'm not seeing the 3x memory growth as before.

That being said, I can only see the memory growth issue in interactive sessions; when I run a standalone script I'm not able to reproduce it. This fits with the fact that when I run garbage collection the heap drops back down, because it sees that the references to the output circuit from PassManager().run() have been dropped:

>>> gc.collect()
5004481
>>> h.heap()
Partition of a set of 6561852 objects. Total size = 522594697 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 3004001  46 192937032  37 192937032  37 list
     1 1000000  15 144000000  28 336937032  64 dict of
                                               qiskit.circuit.library.standard_gates.swap.SwapGate
     2 1135902  17 74012752  14 410949784  79 tuple
     3 1000000  15 48000000   9 458949784  88 qiskit.circuit.library.standard_gates.swap.SwapGate
     4 167092   3 24847763   5 483797547  93 str
     5  40329   1  7156519   1 490954066  94 types.CodeType
     6  76092   1  6123547   1 497077613  95 bytes
     7  36230   1  4927280   1 502004893  96 function
     8  12148   0  3915920   1 505920813  97 dict (no owner)
     9   3959   0  3868440   1 509789253  98 type
<1113 more rows. Type e.g. '_.more' to view.>

I think to fix this we will need to change how instructions are used (or at least what we put in every instance). Right now we end up with multiple copies of instructions everywhere because we can't reliably copy by reference: things can and do get mutated, which would have unintended side effects. This is actually something I looked at a long time ago in #1445, which made what is now the definition a shared global for each gate class so we didn't keep a copy for every instruction. But I gave up on that PR because it wasn't feasible to use a global, as the definition got modified (for example when conditions were added). It might be easier to do now, as things have been restructured since Qiskit 0.6/0.7 when I last tried this. Ideally, at least for standard gates, we could make them global singletons and just use references everywhere to reduce the memory overhead.
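A toy illustration of the singleton idea (this is not the actual Qiskit implementation, just a sketch of the shape of it):

class SingletonSwapGate:
    """Toy example: one shared, immutable instance reused everywhere."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = SingletonSwapGate()
b = SingletonSwapGate()
print(a is b)  # True -- every reference points at the same shared object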

@mtreinish
Member

I'm going to close this issue as complete. With the introduction of singleton gates and the follow-up that caches permutations of qargs and cargs in circuit instructions, the memory overhead of the million-swap circuit is drastically reduced. In my local testing the heap size after creating the circuit on main is now 35 MB, and after transpile it's sitting at 36 MB. The peak resident memory size is still quite large during transpilation, but that's temporary while the process is running, and we'll work on ways to decrease the overhead in future releases. Please feel free to re-open this if I'm missing something or there is something else to track with this issue.
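As a quick sanity check on a release that includes singleton gates (a verification sketch assuming the current CircuitInstruction-based circuit.data format):

from qiskit import QuantumCircuit
from qiskit.circuit.library import SwapGate

print(SwapGate() is SwapGate())  # True once standard gates are singletons

qc = QuantumCircuit(2)
qc.swap(0, 1)
qc.swap(0, 1)
print(qc.data[0].operation is qc.data[1].operation)  # True: both entries share one instance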
