Initial Proposal
https://jrohozen.github.io/418-website/
The Barnes-Hut algorithm is an O(n log n) algorithm for simulating galaxy evolution. The algorithm is performed over many time steps. The time steps themselves run serially, but we will parallelize the work within each time step. The main data structure for this algorithm is the quadtree, which partitions stars by location. Each leaf node corresponds to exactly one star, and each interior node holds the aggregate mass and center of mass of its children.
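To make the data structure concrete, a quadtree node might carry fields along the following lines; this is only a rough sketch in C++/CUDA style, and the names and layout are our own placeholders rather than a final design.

    // Sketch of a quadtree node (illustrative only; 2-D simulation).
    struct QuadNode {
        float centerX, centerY;  // center of this node's square region
        float halfWidth;         // half the side length of the region
        float mass;              // aggregate mass of all stars underneath
        float comX, comY;        // center of mass of all stars underneath
        int   children[4];       // indices of child nodes, -1 if absent
        int   star;              // star index if this node is a leaf, else -1
    };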
On each time step, the original Barnes-Hut algorithm builds the quadtree and then fills in aggregate mass and center of mass for each interior node. As a modification, our algorithm will also build an interaction list before iterating over particles. The interaction list will be used to calculate the forces on particles efficiently on a GPU.
The pseudocode for the algorithm we will use follows. It is the pseudocode given in lecture, with our addition being the steps that assemble and use the interaction lists.
for each time step in simulation:
    build tree structure
    compute (aggregate mass, center-of-mass) for interior nodes
    traverse tree to assemble interactions list
    for each particle:
        use interactions list to accumulate gravitational forces
        update particle position based on gravitational forces
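To make the added steps concrete, here is a minimal CUDA sketch of the per-particle force accumulation once an interaction list exists. It assumes the lists have been flattened into a CSR-style layout (listStart, listNodes) indexing into arrays of aggregate node masses and centers of mass; that layout and all of the names are our own assumptions, not a finished design.

    // Hedged sketch: particle i interacts with the approximated nodes
    // listNodes[listStart[i] .. listStart[i+1]).
    __global__ void accumulateForces(const float2* pos, float2* acc,
                                     const float* nodeMass, const float2* nodeCom,
                                     const int* listStart, const int* listNodes,
                                     int numParticles, float softening2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numParticles) return;

        float2 p = pos[i];
        float ax = 0.0f, ay = 0.0f;
        for (int k = listStart[i]; k < listStart[i + 1]; k++) {
            float2 c = nodeCom[listNodes[k]];
            float dx = c.x - p.x, dy = c.y - p.y;
            float d2 = dx * dx + dy * dy + softening2;           // softening avoids division by zero
            float inv = rsqrtf(d2);
            float s = nodeMass[listNodes[k]] * inv * inv * inv;  // m / r^3 (G folded in later)
            ax += s * dx;
            ay += s * dy;
        }
        acc[i] = make_float2(ax, ay);
    }

The loop body is identical for every thread, so once the lists exist this step maps cleanly onto SIMD execution; all of the irregularity has been pushed into list construction.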
The portion of the algorithm that is of primary interest is the method of traversing the quadtree to compose interaction lists for each star in a way that performs well on a GPU.
It is difficult to traverse the tree efficiently in a GPU-friendly way. The GPU-friendly part of Barnes-Hut is the force calculation for a given star once the inputs to that calculation have already been decided; shared memory, cache locality, and SIMD execution can be leveraged to accelerate that step.
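One way shared memory could help, assuming particles are grouped so that every thread in a block reuses the same interaction list (as in grouped schemes like the one Liu et al. describe), is to stream that list through shared memory in tiles. The sketch below is only an illustration under that assumption; the grouping itself is not shown.

    // Hedged sketch: launched once per particle group, with pos/acc pointing at
    // the group's particles and list holding the group's shared interaction list.
    // Assumes blockDim.x == TILE.
    #define TILE 128

    struct NodeData { float mass, comX, comY; };

    __global__ void accumulateForcesShared(const float2* pos, float2* acc,
                                           const NodeData* nodes,
                                           const int* list, int listLen,
                                           int numParticles, float softening2)
    {
        __shared__ NodeData tile[TILE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float2 p = (i < numParticles) ? pos[i] : make_float2(0.0f, 0.0f);
        float ax = 0.0f, ay = 0.0f;

        for (int base = 0; base < listLen; base += TILE) {
            if (base + threadIdx.x < listLen)
                tile[threadIdx.x] = nodes[list[base + threadIdx.x]];  // cooperative load
            __syncthreads();

            int count = min(TILE, listLen - base);
            for (int k = 0; k < count; k++) {  // same work in every thread: no divergence
                float dx = tile[k].comX - p.x;
                float dy = tile[k].comY - p.y;
                float d2 = dx * dx + dy * dy + softening2;
                float inv = rsqrtf(d2);
                float s = tile[k].mass * inv * inv * inv;
                ax += s * dx;
                ay += s * dy;
            }
            __syncthreads();
        }
        if (i < numParticles) acc[i] = make_float2(ax, ay);
    }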
The communication-to-computation ratio will be high, especially if the time to compose the interaction lists is included. Once an interaction list is composed, computing the forces on the stars is relatively quick.
However, getting to that point is the tricky part, because traversing a tree causes thread divergence. Recursion and thread divergence are poor fits for a GPU, which relies on performing the same or similar calculations many times. Interestingly, even though the dual-tree walk has better asymptotic complexity than the single-tree approach, Liu et al. found that a single-tree implementation still performed better on a GPU, largely for these reasons. We want to compare our findings to those in the "Efficient GPU Tree Walks" paper and evaluate any differences we find.
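For reference, the single-tree walk itself does not need to be written recursively; an explicit stack gives the same traversal, and is how we expect to prototype interaction-list construction on the CPU first. The sketch below uses the QuadNode layout sketched earlier and the standard opening-angle criterion; the names are our own, and self-interactions are assumed to be filtered or softened elsewhere.

    #include <cmath>
    #include <vector>

    // Hedged sketch: build one particle's interaction list with an explicit stack.
    // A node is accepted as a single approximated body when it is a leaf or when
    // it is far enough away (node width / distance < theta); otherwise it is opened.
    void buildInteractionList(const std::vector<QuadNode>& tree, int root,
                              float px, float py, float theta,
                              std::vector<int>& list)
    {
        std::vector<int> stack = {root};
        while (!stack.empty()) {
            int n = stack.back();
            stack.pop_back();
            const QuadNode& node = tree[n];
            float dx = node.comX - px, dy = node.comY - py;
            float dist = std::sqrt(dx * dx + dy * dy);
            bool farEnough = (2.0f * node.halfWidth) < theta * dist;
            if (node.star >= 0 || farEnough) {
                list.push_back(n);               // approximate this subtree by one body
            } else {
                for (int c : node.children)      // open the node and visit its children
                    if (c >= 0) stack.push_back(c);
            }
        }
    }

Different particles take very different paths through this loop, which is exactly the divergence problem described above when many such walks run in lockstep on a GPU.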
Jianqiao Liu, Michael Robson, Thomas Quinn, and Milind Kulkarni. 2019. Efficient GPU tree walks for effective distributed n-body simulations. In Proceedings of the ACM International Conference on Supercomputing (ICS '19). Association for Computing Machinery, New York, NY, USA, 24–34. https://doi.org/10.1145/3330345.3330348
M. Burtscher and K. Pingali. "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm". Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92. January 2011.
Article about Barnes-Hut: The Barnes-Hut Algorithm
Another article: The Barnes-Hut Galaxy Simulator, and the README of beltoforion/Barnes-Hut-Simulator on GitHub.
We plan to achieve the following goals:
We hope to achieve the following goals:
We will use the GPUs in the GHC machines. We also have personal devices with older GPUs, and we may test our code on these if we have time. We are choosing GPUs because we think pushing GPU performance for Barnes-Hut is an interesting question. On the one hand, the force-calculation portion of Barnes-Hut is excellent for a GPU because it runs many of the same calculations over a large set of data at once. On the other hand, preparing that data is not easily done on a GPU because it involves tree traversal and thread divergence. We want to see if we can reduce this bottleneck by finding a better way to compose the interaction lists on a GPU, or in parallel on a CPU.
Milestone Report
Date | Goal | Assignee
Thursday, April 10, 11:59pm | | Delaynie + Jacob
Friday, April 18, 11:59pm | | Delaynie + Jacob
Tuesday, April 22, 11:59pm | | Delaynie + Jacob
Friday, April 25, 11:59pm | | Delaynie + Jacob
Monday, April 28, 11:59pm | | Delaynie + Jacob
April 29, 5:30-8:30pm | | Delaynie + Jacob
So far, we have a serial implementation of Barnes-Hut and a program for generating inputs to validate our parallel implementations. We also have a Python program that visualizes the simulation over time. We have spent a lot of time reading several papers and websites, focusing in particular on how to implement tree building and tree walks in parallel. We have brainstormed a data structure, based on Burtscher and Pingali's paper, that represents a quadtree as a set of arrays stored in shared memory.
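As a rough picture of the direction we are considering (field names and the split between shared and global memory are still undecided, so this is only a sketch), the array-based layout might look like:

    // Hedged sketch of an array-based quadtree in the spirit of Burtscher and
    // Pingali: node i's fields live at index i of each array, so the tree can be
    // built and traversed with index arithmetic instead of pointer chasing.
    struct QuadtreeArrays {
        int    numNodes;
        float* mass;       // aggregate mass per node
        float* comX;       // center-of-mass x per node
        float* comY;       // center-of-mass y per node
        float* halfWidth;  // half the side length of each node's region
        int*   child;      // 4 * numNodes entries; child[4*i + q] is node i's
                           // child in quadrant q, or -1 if empty
        int*   star;       // star index for leaf nodes, -1 for interior nodes
    };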
Relative to our original proposal, we are a bit behind on code deliverables. Because of the last exam and Carnival, we did not initially spend as much time as we needed on the background research required to start writing our Barnes-Hut implementations. We still plan to deliver both single-tree and dual-tree CUDA implementations of Barnes-Hut, along with runtime breakdowns, synchronization stall measurements, and speedup data comparing the two.
Our poster will mainly consist of graphs of the data we collected comparing the Barnes-Hut implementations. We also plan to include some images from our simulation visualizer.
Our main concern is our ability to write both implementations in time. Since we’re synthesizing ideas inspired by several different sources, our interfaces and data structures need to be carefully thought out so that they’re both compatible and efficient across sections of our code.