Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Channel.__hash__ in multiprocessing contexts #11251

Merged
merged 2 commits into from
Nov 15, 2023

Conversation

jakelishman
Copy link
Member

Summary

Storing an explicit hash key is fragile in cases that a channel might be created in a different process to where it might be compared or the hash used, because the hash seeding can vary depending on how the new interpreter process was created, especially if it's not done by fork.

In this case, transmitting the stored _hash over pickle meant that a DriveChannel(0) created in the main process of a macOS runner could compare equal to a DriveChannel(0) created in a separate process (standard start method spawn) and pickled over the wire to the main process, but have different hashes, violating the Python data model.

Instead, we can just use the standard Python behaviour of creating the hash on demand when requested; this should typically be preferred unless absolutely necessary for critical performance reasons, because it will generally fail safe.

Details and comments

This fixes a test failure that appears on Mac and Windows runners locally (but for some reason not in CI) in test.python.compiler.test_transpiler.TestTranspileParallel.test_parallel_dispatch_lazy_cal_loading.

Storing an explicit hash key is fragile in cases that a channel might be
created in a different process to where it might be compared or the hash
used, because the hash seeding can vary depending on how the new
interpreter process was created, especially if it's not done by `fork`.

In this case, transmitting the stored `_hash` over pickle meant that a
`DriveChannel(0)` created in the main process of a macOS runner could
compare equal to a `DriveChannel(0)` created in a separate process
(standard start method `spawn`) and pickled over the wire to the main
process, but have different hashes, violating the Python data model.

Instead, we can just use the standard Python behaviour of creating the
hash on demand when requested; this should typically be preferred unless
absolutely necessary for critical performance reasons, because it will
generally fail safe.
@jakelishman jakelishman added stable backport potential The bug might be minimal and/or import enough to be port to stable Changelog: Bugfix Include in the "Fixed" section of the changelog mod: pulse Related to the Pulse module labels Nov 15, 2023
@jakelishman jakelishman requested review from eggerdj, wshanks and a team as code owners November 15, 2023 16:08
@qiskit-bot
Copy link
Collaborator

One or more of the the following people are requested to review this:

  • @Qiskit/terra-core
  • @nkanazawa1989

@coveralls
Copy link

coveralls commented Nov 15, 2023

Pull Request Test Coverage Report for Build 6880788672

  • 13 of 19 (68.42%) changed or added relevant lines in 5 files are covered.
  • 22 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.03%) to 85.902%

Changes Missing Coverage Covered Lines Changed/Added Lines %
qiskit/pulse/model/mixed_frames.py 0 1 0.0%
qiskit/pulse/model/pulse_target.py 2 4 50.0%
qiskit/pulse/model/frames.py 9 12 75.0%
Files with Coverage Reduction New Missed Lines %
crates/qasm2/src/expr.rs 1 93.76%
qiskit/pulse/library/waveform.py 3 93.75%
crates/qasm2/src/lex.rs 6 91.92%
crates/qasm2/src/parse.rs 12 96.67%
Totals Coverage Status
Change from base Build 6879718232: -0.03%
Covered Lines: 65938
Relevant Lines: 76760

💛 - Coveralls

wshanks
wshanks previously approved these changes Nov 15, 2023
Copy link
Contributor

@wshanks wshanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird. I have never thought about hash values getting compared across subprocesses with different hash seeds. It's nice that a test flagged this.

Checking the rest of the pulse code, should we fix Instruction while we are at it? That one takes more work to trigger the problem because it caches the hash on first call rather than in __init__.

def __hash__(self) -> int:
if self._hash is None:
self._hash = hash((type(self), self.operands, self.name))
return self._hash

@jakelishman
Copy link
Member Author

jakelishman commented Nov 15, 2023

I can do if you prefer:

[~/code/qiskit/terra] (qiskit/3.10) (main *)
jake@ninetales$ rg '_hash =' qiskit/pulse
qiskit/pulse/channels.py
99:        self._hash = hash((self.__class__.__name__, self._index))

qiskit/pulse/instructions/instruction.py
56:        self._hash = None
305:            self._hash = hash((type(self), self.operands, self.name))

qiskit/pulse/model/frames.py
42:        self._hash = hash((type(self), identifier))
53:        return type(self) is type(other) and self._hash == other._hash

qiskit/pulse/model/logical_elements.py
42:        self._hash = hash((self._index, type(self)))

qiskit/pulse/model/mixed_frames.py
50:        self._hash = hash((self._logical_element, self._frame))

Looks like there's a few places where it could be happening.

Our CI test suite didn't actually catch this, it's just something that happens locally for me. Matt and I are trying to track down why the CI isn't catching it, because as far as we can tell, it should be. It's probably something to do with multiprocessing detection. It popped up again when Matt was trying to get PGO set up for our wheel builds, which caused us to use GHA runners to run the test suite (as a load for profiling), and they also show the problem.

@wshanks
Copy link
Contributor

wshanks commented Nov 15, 2023

Looks like there's a few places where it could be happening.

Good catch. I was grepping the stable channel and missed the new instances in the new model package.

Our CI test suite didn't actually catch this, it's just something that happens locally for me. Matt and I are trying to track down why the CI isn't catching it, because as far as we can tell, it should be.

Linux defaults to using fork which you noted uses the same hash seed across subprocesses and Windows and macOS default to disabling multiprocessing for transpilation, so I don't think it is surprising that the CI didn't catch this, unless CI changes one of these defaults?

@jakelishman
Copy link
Member Author

We override the parallel default in the relevant test, so it should appear. On macOS, I'm running the same stestr run command that Azure CI runs (as far as I recall / saw), so in theory if I see it locally, CI should see it as well. It's all a bit weird.

The new models implementations of __eq__ are actually quite wrong and definitely need to be changed - __eq__ absolutely should not depend on hash as its principal source of equality; it's totally possible for there to be a hash collision without implying the two objects are equal. I can fix that in this PR - just running the test suite locally now. Good to know that that code is new, though, so doesn't need a release note, thanks.

@wshanks
Copy link
Contributor

wshanks commented Nov 15, 2023

It's all a bit weird.

What is the failure? I am guessing that self.assertEqual(added_cal, ref_cal) fails, but I thought that would just call __eq__ which doesn't use the hash (unlike the model classes that you pointed out).

@jakelishman
Copy link
Member Author

Yeah, that's the failure, and the trick is that __eq__ somewhere in there calls set(Tuple[Channel, ...]), and Python's set equality is implemented by effectively len(x) == len(y) and all(x_item in y for x_item in x), which is why the hash of the channels comes in.

This removes all caching of items' `hash`es.  This practice is quite
fraught in multiprocessing contexts, and should only be done when it is
absolutely performance critical.

In a couple of cases, the pulse objects were using the cached `hash` as
the main component of their `__eq__` methods, which is not correct; it's
totally valid to have hash collisions without implying that two objects
are equal.
@jakelishman
Copy link
Member Author

All objects now updated in the latest commit.

@wshanks
Copy link
Contributor

wshanks commented Nov 15, 2023

the trick is that eq somewhere in there calls set(Tuple[Channel, ...])

Right, that makes sense because we are focused on Channel.__eq__ here but the test is comparing ScheduleBlock objects that have channels nested inside of them.

Copy link
Contributor

@wshanks wshanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. We could mention Instruction in the release note but it is a pretty minor case.

@TsafrirA should approve since he just added the model classes. One thing I wonder about with those is if the Frame base class wants to codify storing an unhashed string identifier. With the changes here, the base class no longer indicates that subclasses need to be hashable.

@jakelishman
Copy link
Member Author

The magic methods interact a bit unusually with subclassing, especially __hash__. For safety in the data model, Python actually sets __hash__ = None for subclasses if they override __eq__; hashing isn't inherited if equality is overriden, and that's something a subclass would frequently do, so it can be a little misleading to think that a base class having __hash__ ensures everything is hashable.

Of course, well behaved subclasses should know that if they override equality they should override hashing as well, but I've seen enough examples across Qiskit to know that this is an area of the Python data model that's not super well understood by everyone.

@wshanks
Copy link
Contributor

wshanks commented Nov 15, 2023

I've seen enough examples across Qiskit to know that this is an area of the Python data model that's not super well understood by everyone.

Definitely something I forget about. Doing another code search of the pulse package, I see that RelativeBarrier and Call (which is deprecated any way) fall into this trap. Also, ScheduleBlock explicitly sets __hash__ = None though it doesn't need to.

@jakelishman
Copy link
Member Author

Yeah, even if you understand it fully, it's definitely very easy to overlook or get wrong. Hashing is not my favourite corner of Python as a language, for sure.

@TsafrirA
Copy link
Collaborator

Thanks @jakelishman . To be honest, I didn't have a good reason to use self._hash outside of the fact that the legacy Channel used that pattern.

@jakelishman
Copy link
Member Author

Thanks both - Tsafrir, are you happy for me to tag this for merge?

@TsafrirA
Copy link
Collaborator

@jakelishman sure.

@jakelishman jakelishman added this pull request to the merge queue Nov 15, 2023
Merged via the queue into Qiskit:main with commit 3c1a87c Nov 15, 2023
13 checks passed
mergify bot pushed a commit that referenced this pull request Nov 15, 2023
* Fix `Channel.__hash__` in multiprocessing contexts

Storing an explicit hash key is fragile in cases that a channel might be
created in a different process to where it might be compared or the hash
used, because the hash seeding can vary depending on how the new
interpreter process was created, especially if it's not done by `fork`.

In this case, transmitting the stored `_hash` over pickle meant that a
`DriveChannel(0)` created in the main process of a macOS runner could
compare equal to a `DriveChannel(0)` created in a separate process
(standard start method `spawn`) and pickled over the wire to the main
process, but have different hashes, violating the Python data model.

Instead, we can just use the standard Python behaviour of creating the
hash on demand when requested; this should typically be preferred unless
absolutely necessary for critical performance reasons, because it will
generally fail safe.

* Fix `hash` and equality in other pulse objects

This removes all caching of items' `hash`es.  This practice is quite
fraught in multiprocessing contexts, and should only be done when it is
absolutely performance critical.

In a couple of cases, the pulse objects were using the cached `hash` as
the main component of their `__eq__` methods, which is not correct; it's
totally valid to have hash collisions without implying that two objects
are equal.

(cherry picked from commit 3c1a87c)

# Conflicts:
#	qiskit/pulse/model/frames.py
#	qiskit/pulse/model/mixed_frames.py
#	qiskit/pulse/model/pulse_target.py
@jakelishman jakelishman deleted the fix-pulse-channel-hash branch November 15, 2023 21:35
jakelishman added a commit that referenced this pull request Nov 16, 2023
* Fix `Channel.__hash__` in multiprocessing contexts

Storing an explicit hash key is fragile in cases that a channel might be
created in a different process to where it might be compared or the hash
used, because the hash seeding can vary depending on how the new
interpreter process was created, especially if it's not done by `fork`.

In this case, transmitting the stored `_hash` over pickle meant that a
`DriveChannel(0)` created in the main process of a macOS runner could
compare equal to a `DriveChannel(0)` created in a separate process
(standard start method `spawn`) and pickled over the wire to the main
process, but have different hashes, violating the Python data model.

Instead, we can just use the standard Python behaviour of creating the
hash on demand when requested; this should typically be preferred unless
absolutely necessary for critical performance reasons, because it will
generally fail safe.

* Fix `hash` and equality in other pulse objects

This removes all caching of items' `hash`es.  This practice is quite
fraught in multiprocessing contexts, and should only be done when it is
absolutely performance critical.

In a couple of cases, the pulse objects were using the cached `hash` as
the main component of their `__eq__` methods, which is not correct; it's
totally valid to have hash collisions without implying that two objects
are equal.

(cherry picked from commit 3c1a87c)
github-merge-queue bot pushed a commit that referenced this pull request Nov 16, 2023
* Fix `Channel.__hash__` in multiprocessing contexts

Storing an explicit hash key is fragile in cases that a channel might be
created in a different process to where it might be compared or the hash
used, because the hash seeding can vary depending on how the new
interpreter process was created, especially if it's not done by `fork`.

In this case, transmitting the stored `_hash` over pickle meant that a
`DriveChannel(0)` created in the main process of a macOS runner could
compare equal to a `DriveChannel(0)` created in a separate process
(standard start method `spawn`) and pickled over the wire to the main
process, but have different hashes, violating the Python data model.

Instead, we can just use the standard Python behaviour of creating the
hash on demand when requested; this should typically be preferred unless
absolutely necessary for critical performance reasons, because it will
generally fail safe.

* Fix `hash` and equality in other pulse objects

This removes all caching of items' `hash`es.  This practice is quite
fraught in multiprocessing contexts, and should only be done when it is
absolutely performance critical.

In a couple of cases, the pulse objects were using the cached `hash` as
the main component of their `__eq__` methods, which is not correct; it's
totally valid to have hash collisions without implying that two objects
are equal.

(cherry picked from commit 3c1a87c)

Co-authored-by: Jake Lishman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: Bugfix Include in the "Fixed" section of the changelog mod: pulse Related to the Pulse module stable backport potential The bug might be minimal and/or import enough to be port to stable
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants