Clean gem5 code #2
Comments
dhschall added a commit that referenced this issue on Dec 24, 2021:
Does not work at this moment.
Change-Id: I97ce92f799f132d9800774940ab9d5233f747358
dhschall pushed a commit that referenced this issue on Feb 21, 2022:
In the GPU VIPER TCC, programs that mix atomics and data accesses to the same address, in the same kernel, can deadlock when large applications (e.g., Pannotia's graph analytics algorithms) run on very small GPUs (e.g., the default 4 CU GPU configuration). The deadlocks occur because resource stalls interact badly with the current implementation's handling of races between atomic accesses. The specific order of events causing this deadlock is:

1. The TCC is waiting on an atomic to return from the directory.
2. In the meantime it receives another atomic to the same address. When this happens, the TCC increments the number of atomics pending to this address in the TBE (numAtomics = 2) and writes the atomic through to the directory.
3. When the first atomic returns from the Directory, it decrements the numAtomics counter. Because numAtomics was at 2 (from step 2), the TCC does not deallocate the TBE entry and instead triggers Event:AtomicNotDone.
4. Another request (an LD) to the same address arrives. The LD does z_stall since the second atomic is pending, so the LD retries every cycle until the deadlock counter times out (or until the second atomic comes back).
5. The second atomic returns to the TCC. However, because so many LDs are pending in the cache, all doing z_stalls and retrying every cycle, there are many resource stalls. When the second atomic returns, it is therefore forced to retry its operation multiple times, and each retry decrements the atomicDoneCnt counter (which was added in 7246f70 to catch a race between atomics arriving at and leaving the TCC). As a result, atomicDoneCnt becomes negative.
6. Since atomicDoneCnt determines when Event:AtomicDone fires, and the resource stalls drove it negative, the atomic never completes. The pending LD can therefore never access the line, because it is stuck waiting for the atomic to finish.
7. Eventually the deadlock threshold is reached.

To fix this issue, this commit changes the VIPER TCC protocol from using z_stall to using the stall_and_wait buffer method that the Directory level of the SLICC protocol already uses. This prevents resource stalls from dominating the TCC level by putting pending requests for a given address into a per-address stall buffer; these requests are woken up when the pending request returns.

This commit also makes two small changes to the Directory-level protocol (MOESI_AMD_BASE-dir):

1. Updated the names of the wakeup actions to match the TCC wakeup actions, to avoid confusion.
2. Changed transition(B, UnblockWriteThrough, U) to check all stall buffers, as some requests were being placed later in the stall buffer than was being checked. This mirrors the changes in 187c44f to other Directory transitions, which resolved races between GPU and DMA requests, but for transitions that prior workloads did not stress.

Change-Id: I60ac9830a87c125e9ac49515a7fc7731a65723c2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51367
Reviewed-by: Jason Lowe-Power <[email protected]>
Reviewed-by: Matthew Poremba <[email protected]>
Maintainer: Jason Lowe-Power <[email protected]>
Tested-by: kokoro <[email protected]>
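For readers unfamiliar with the stall_and_wait mechanism, here is a minimal, self-contained C++ sketch of the per-address stall-buffer idea the commit describes. It is not gem5 or SLICC code: the `StallBuffer` class and its method names are hypothetical, chosen only to illustrate why parking requests per address (and waking them when the pending access returns) avoids the retry-every-cycle resource pressure caused by z_stall.

```cpp
// Illustrative sketch of a per-address stall buffer (NOT the gem5 implementation).
#include <cstdint>
#include <deque>
#include <iostream>
#include <string>
#include <unordered_map>

using Addr = uint64_t;

struct Request {
    std::string desc;
};

class StallBuffer {
  public:
    // Park a request that must wait for a pending access to `addr`.
    // Unlike a z_stall-style retry, the request leaves the ready queue
    // and consumes no resources until it is explicitly woken up.
    void stallAndWait(Addr addr, Request req) {
        stalled_[addr].push_back(std::move(req));
    }

    // Wake every request parked on `addr`, e.g. when the pending atomic
    // for that address returns from the directory.
    void wakeUpBuffers(Addr addr, std::deque<Request> &readyQueue) {
        auto it = stalled_.find(addr);
        if (it == stalled_.end())
            return;
        for (auto &req : it->second)
            readyQueue.push_back(std::move(req));
        stalled_.erase(it);
    }

    // Wake everything, mirroring the "check all stall buffers" change the
    // commit makes to transition(B, UnblockWriteThrough, U).
    void wakeUpAllBuffers(std::deque<Request> &readyQueue) {
        for (auto &entry : stalled_)
            for (auto &req : entry.second)
                readyQueue.push_back(std::move(req));
        stalled_.clear();
    }

  private:
    std::unordered_map<Addr, std::deque<Request>> stalled_;
};

int main() {
    StallBuffer buf;
    std::deque<Request> ready;

    // A load to 0x40 arrives while an atomic to 0x40 is outstanding:
    // park it instead of retrying it every cycle.
    buf.stallAndWait(0x40, Request{"LD 0x40"});

    // The atomic to 0x40 completes; only requests to that address wake up.
    buf.wakeUpBuffers(0x40, ready);
    for (const auto &req : ready)
        std::cout << "woken: " << req.desc << "\n";
    return 0;
}
```

The essential design point is that a stalled request is removed from the ready path entirely, so it cannot starve other messages of protocol resources; it only re-enters the ready queue when the access it depends on completes.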
Istream model: Workload hooks