Project 1: Carolina Zheng #21

Open · wants to merge 9 commits into master
26 changes: 21 additions & 5 deletions README.md
**University of Pennsylvania, CIS 565: GPU Programming and Architecture,
Project 1 - Flocking**

* Carolina Zheng
* Tested on: Windows 7, i7-6700 @ 3.40GHz 16GB, Quadro K620 (Moore 100 Lab)

### Performance Analysis
(Sorry, screenshots and a GIF will be added soon.)

FPS was measured using CUDA timers on the simulation kernels only.
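For reference, here is a minimal sketch of how simulation-only timing can be done with CUDA events; the trivial kernel, array, and sizes are placeholders, not this project's actual code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for the real simulation kernels.
__global__ void kernSimulateStep(int N, float *data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) data[i] += 0.1f;
}

int main() {
  const int N = 5000, blockSize = 128;
  float *dev_data;
  cudaMalloc(&dev_data, N * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Time only the simulation kernel, not rendering or host-side work.
  cudaEventRecord(start);
  kernSimulateStep<<<(N + blockSize - 1) / blockSize, blockSize>>>(N, dev_data);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds
  printf("simulation-only FPS: %f\n", 1000.0f / ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(dev_data);
  return 0;
}
```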

#### Number of boids vs. FPS
![](images/boids-fps.png)

Increasing the number of boids decreased performance for all three simulation types. This was the result I expected: as the number of boids grows, so does the number of threads and, more importantly, the number of neighbors that must be checked each time a boid's velocity is updated.
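To make the scaling concrete, here is a hypothetical sketch of a brute-force velocity update (cohesion rule only, for brevity). Every thread scans all N boids, so per-boid work grows with N; the kernel name, signature, and rule weight are illustrative and not taken from this repo.

```cuda
#include <cuda_runtime.h>

// Hypothetical brute-force velocity update: each thread scans all N boids,
// so the per-boid cost grows linearly with N (this is what the grids avoid).
__global__ void kernUpdateVelocityBruteForce(int N, const float3 *pos,
                                             const float3 *vel, float3 *velNew,
                                             float neighborDist, float ruleScale) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;

  float3 center = make_float3(0.f, 0.f, 0.f);
  int neighbors = 0;
  for (int j = 0; j < N; ++j) {        // O(N) neighbor scan per thread
    if (j == i) continue;
    float dx = pos[j].x - pos[i].x;
    float dy = pos[j].y - pos[i].y;
    float dz = pos[j].z - pos[i].z;
    if (sqrtf(dx * dx + dy * dy + dz * dz) < neighborDist) {
      center.x += pos[j].x; center.y += pos[j].y; center.z += pos[j].z;
      ++neighbors;
    }
  }

  float3 v = vel[i];
  if (neighbors > 0) {                 // steer toward the perceived center of mass
    v.x += (center.x / neighbors - pos[i].x) * ruleScale;
    v.y += (center.y / neighbors - pos[i].y) * ruleScale;
    v.z += (center.z / neighbors - pos[i].z) * ruleScale;
  }
  velNew[i] = v;
}
```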

#### Block size vs. FPS
![](images/blocksize-fps.png)

Changing the block size had no noticeable impact on performance for any of the three simulation types. This makes sense to me, since I wouldn't expect any of the kernel operations in this assignment to stall, nor did the kernels I wrote use an excessive amount of memory.
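As a sketch of where block size enters the picture (variable names are assumptions, reusing the hypothetical kernel above): it only changes how threads are packed into blocks, not the per-thread work, so as long as there are enough warps to keep the SMs busy, throughput stays roughly flat.

```cuda
// Block size only affects the launch configuration; every boid still gets
// exactly one thread and does the same work regardless of how the threads
// are grouped into blocks. (Names like dev_pos are assumed for this sketch.)
int blockSize = 128;  // the value being swept in the experiment
dim3 fullBlocksPerGrid((numBoids + blockSize - 1) / blockSize);  // ceil(N / blockSize)
kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(
    numBoids, dev_pos, dev_vel1, dev_vel2, neighborDist, ruleScale);
```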

#### Coherent vs. scattered uniform grid
My coherent grid was slower than the scattered grid. Initially this surprised me, since I expected that eliminating a layer of indirection would improve performance. On reflection, the coherent grid may have been slower because the execution time of the extra kernel that reorders position and velocity outweighed the benefit of contiguous memory access to those two arrays. It is also possible that I implemented the nested loops incorrectly and did not fully achieve contiguous memory access.
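To illustrate that extra step, here is a hypothetical sketch of the gather kernel a coherent grid needs (array names are assumptions, not this repo's exact code): after the boid indices are sorted by grid cell, positions and velocities are copied into cell order so the neighbor loop can read them contiguously. That copy is the added cost which can offset the coalescing benefit.

```cuda
#include <cuda_runtime.h>

// Hypothetical gather kernel for the coherent grid (array names assumed).
// particleArrayIndices[i] holds the original index of the boid that sorted
// into slot i; pos/vel are gathered into that order so boids in the same
// cell end up adjacent in memory.
__global__ void kernGatherCoherentData(int N, const int *particleArrayIndices,
                                       const float3 *pos, const float3 *vel,
                                       float3 *posCoherent, float3 *velCoherent) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  int src = particleArrayIndices[i];  // where this boid's data currently lives
  posCoherent[i] = pos[src];          // contiguous writes, scattered reads
  velCoherent[i] = vel[src];
}
```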

#### Varying cell width
These results are not graphed, but they were interesting. For the scattered grid, decreasing the cell width (and thereby increasing the number of neighboring cells checked) decreased performance by about 20%, whereas for the coherent grid it *increased* performance by roughly the same amount. My best explanation is that with more neighbor cells to check, the benefit of the coherent grid's contiguous memory access is magnified, whereas the scattered grid gains no such advantage and simply has to read more entries from the cell start and end index arrays.
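A rough sketch of why the cell width changes the amount of bookkeeping (function and array names are assumptions, not this repo's exact code): inside the velocity-update kernel, the search radius measured in cells grows as the cell width shrinks relative to the rule distance, so each boid reads more cell start/end entries, even though each visited cell holds fewer irrelevant boids.

```cuda
// Sketch of the neighbor-cell loop inside the velocity-update kernel.
// Halving the cell width roughly doubles the search radius in cells, so more
// gridCellStartIndices / gridCellEndIndices entries are read per boid.
int radius = (int)ceilf(maxRuleDistance / cellWidth);   // cells to scan on each side
for (int dz = -radius; dz <= radius; ++dz) {
  for (int dy = -radius; dy <= radius; ++dy) {
    for (int dx = -radius; dx <= radius; ++dx) {
      int cell = gridIndex3Dto1D(cellX + dx, cellY + dy, cellZ + dz, gridResolution);
      int start = gridCellStartIndices[cell];           // -1 if the cell is empty
      if (start == -1) continue;                        // skip empty cells
      int end = gridCellEndIndices[cell];               // last boid in this cell (inclusive)
      for (int k = start; k <= end; ++k) {
        // apply the flocking rules to boid k
        // (coherent grid: pos/vel are already in cell-sorted order here)
      }
    }
  }
}
```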
Binary file added images/blocksize-fps.png
Binary file added images/boids-fps.png