Final Report:
Efficient Force Calculation for Galaxy Simulation in CUDA
Delaynie McMillan
Jacob Rohozen
Tuesday, April 29, 2025
For this project, we set out to implement the Barnes-Hut galaxy simulation algorithm in CUDA. Project deliverables include a highly optimized force calculation kernel written in CUDA, along with previous iterations of that kernel. We wrote a serial implementation of the algorithm, and measured and compared the performance of our different force calculation kernels. Additionally, we investigated the impact that changing the number of stars and the theta parameter has on runtime and speedup. Analysis of the serial implementation runtime showed that force calculation took up over 95% of the total runtime. Therefore, the main focus of the project was to see how we could optimize the force calculation kernel in particular.
The Barnes-Hut algorithm is an O(n log n) algorithm to simulate star interaction and galaxy evolution over time. The algorithm is performed over many time steps. Time steps must be done in order, but an individual time step can be parallelized.
The algorithm takes in arrays of data that represent all the stars in the system. Specifically, these arrays characterize the mass, initial position, and initial velocity of each star in the system. We implemented the algorithm in 2D, so the position and velocity arrays contain 2D values.
On each time step, the net force on each star is computed, and this force results in a change in position and velocity over time. In a naive implementation, the force calculations would be O(n²), with n being the number of stars, since each star interacts with every other star (n * (n-1) interactions). The Barnes-Hut algorithm uses an approximation to reduce the number of force calculations, and the algorithm becomes O(n log n).
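For contrast with the tree-based approach, the naive all-pairs computation can be sketched in serial C++ as follows. The names (computeForcesNaive, Vec2) and the softening term are illustrative, not from our codebase, and G is normalized to 1:

```cpp
#include <cmath>
#include <vector>

struct Vec2 { double x, y; };

// Naive O(n^2) net-force computation: every star interacts with every other.
std::vector<Vec2> computeForcesNaive(const std::vector<double>& mass,
                                     const std::vector<Vec2>& pos) {
    const double G = 1.0;           // gravitational constant, normalized
    const double softening = 1e-9;  // avoids division by zero for coincident stars
    std::vector<Vec2> force(pos.size(), {0.0, 0.0});
    for (size_t i = 0; i < pos.size(); i++) {
        for (size_t j = 0; j < pos.size(); j++) {
            if (i == j) continue;
            double dx = pos[j].x - pos[i].x;
            double dy = pos[j].y - pos[i].y;
            double distSq = dx * dx + dy * dy + softening;
            double dist = std::sqrt(distSq);
            double f = G * mass[i] * mass[j] / distSq;  // Newtonian gravity
            force[i].x += f * dx / dist;                // accumulate components
            force[i].y += f * dy / dist;
        }
    }
    return force;
}
```

The inner loop over all other stars is exactly the work that Barnes-Hut replaces with a tree traversal.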
The quadtree is the main data structure for Barnes-Hut. The quadtree divides stars up based on their locations in 2D space. The root of the quadtree represents the entire system and is the bounding box for the current step in the simulation. The quadtree is built by recursively subdividing quadrants of the bounding box until every quadrant contains no more than one star. The following diagram shows a set of stars on the left and the resulting quadtree on the right.
Image above from https://arborjs.org/docs/barnes-hut
The gray nodes (we will call them quads) have four pointers which can either point to another quad, a star, or nothing (empty). Each quad has a center of mass and a total mass. This information represents the aggregation of all stars within the bounds of that quad.
The operations performed on a quadtree include traversing the tree, adding a quad, and adding a star. Once the tree is fully constructed, force calculations can be performed. The benefit of the quadtree is that it allows for approximations. Instead of calculating the force on a single star from every other star, Barnes-Hut calculates the force on a single star from other quads if they are sufficiently far away.
To add a star to the tree, the star “starts” at the tree’s root. First, the algorithm determines which of the four quads at the root level the star belongs to. This is determined using the star’s location and the center of the root quad. If there’s already a star in that quad, the quad is further subdivided into four subquads, and the process repeats until the quad is small enough, and this star is the only star in its quad. Similarly, if it is determined that a star belongs in a certain quad, and that quad is already subdivided into smaller quads, we “follow” that quad until we arrive at an “empty” quad with no star. Below are illustrations of adding a star to an empty quad, and sub-dividing a quad before inserting a star.
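The insertion procedure described above can be sketched as a small serial C++ quadtree. The class and field names (Quad, subquad, insert) are illustrative, not our actual serial classes, and the sketch assumes distinct star positions:

```cpp
#include <memory>

struct Star { double x, y, mass; };

struct Quad {
    double cx, cy, half;            // center and half side length of this region
    const Star* star = nullptr;     // an occupied leaf holds exactly one star
    std::unique_ptr<Quad> child[4]; // NW, NE, SW, SE subquads (nullptr = empty)
    bool subdivided = false;

    Quad(double cx, double cy, double half) : cx(cx), cy(cy), half(half) {}

    // Which of the four subquads does (x, y) fall in?
    int quadrantOf(double x, double y) const {
        return (x >= cx ? 1 : 0) + (y < cy ? 2 : 0);
    }

    // Lazily create and return subquad q.
    Quad* subquad(int q) {
        if (!child[q]) {
            double h = half / 2;
            double nx = cx + (q % 2 ? h : -h);
            double ny = cy + (q / 2 ? -h : h);
            child[q] = std::make_unique<Quad>(nx, ny, h);
        }
        return child[q].get();
    }

    void insert(const Star* s) {
        if (subdivided) {                  // interior node: descend
            subquad(quadrantOf(s->x, s->y))->insert(s);
        } else if (!star) {                // empty leaf: place the star
            star = s;
        } else {                           // occupied leaf: subdivide and
            const Star* old = star;        // "demote" the resident star
            star = nullptr;
            subdivided = true;
            subquad(quadrantOf(old->x, old->y))->insert(old);
            subquad(quadrantOf(s->x, s->y))->insert(s);
        }
    }
};
```

When two stars land in the same subquad, the recursive calls keep subdividing until each star sits alone, matching the "keep demoting" behavior described above.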
Even though the Barnes-Hut approximation reduces the algorithmic complexity to O(n log n), the force calculation part of the algorithm still takes more than 95% of the total runtime for a serial implementation. This is why we decided to optimize the force calculations.
At first glance, there is plenty of work that can be done in parallel. The net force on each star can be independently calculated, and the quadtree is not modified (only read) in this step. This means that no synchronization or atomics are needed for correctness. The same arithmetic steps are being done on large quantities of stars to calculate the net forces on them, which would be a great use of SIMD parallelism. This sounds good, but the Barnes-Hut algorithm maps poorly to GPU execution.
One problem is that force calculation has low arithmetic intensity. Another problem is that traversing the tree is highly irregular, and this leads to poor locality. Traversing a tree requires accessing a lot of different portions of the quadtree data structure, which results in poor cache locality, especially when one block has threads whose working sets of data do not overlap much. When implemented naively, these problems lead to the algorithm being bandwidth-limited because of poor data reuse. Because of this, our project involved figuring out ways to take advantage of shared memory between threads to reduce memory bandwidth requirements.
Additionally, the irregular nature of the tree causes another problem: thread divergence. In CUDA, all 32 threads in a warp execute in lockstep. When a warp is traversing a tree some threads will finish before others. Even if only one thread is still traversing the tree, all the other threads will sit idle and wait. Thread divergence reduces the efficiency of the GPU, leading to lower overall throughput.
Furthermore, the recursive nature of the Barnes-Hut algorithm maps poorly to the CUDA model.
A later section will discuss our approach to Barnes-Hut and making the force calculations step more amenable to GPU execution.
We started by implementing a serial Barnes-Hut algorithm in C++. This served to verify our parallel implementation as we built it up, and we also used the serial code as a benchmark for speedup comparison. The serial implementation had classes and methods for manipulating stars and quads. We did not spend time optimizing the serial version, but we don’t feel that it is unreasonably slow (as this would make for an unfair comparison). The serial code was run on the GHC machines.
We developed a class called Galaxy to randomly generate inputs for both the serial and the CUDA implementations. Galaxies can be created with a customizable size and a customizable number of stars orbiting a central supermassive black hole (just another star with a lot more mass). A galaxy can be given an initial position and an initial velocity. This is how we generate multiple galaxies and have them collide. The results of the simulation can be seen using our Python visualizer, which can also output a GIF of the simulation.
Our parallel implementation is written in CUDA and was also run on the GHC machines, specifically on the NVIDIA GeForce RTX 2080 GPUs. These GPUs have 46 streaming multiprocessors, and this helped inform our kernel launches.
The first part of making an efficient GPU implementation was changing our data structures.
In our serial implementation, we used classes for stars and quads and pointers to objects to manipulate class fields. This required dynamically allocating memory for quads and later freeing that memory on every timestep because the quadtree must be rebuilt from updated star positions.
For our parallel implementation, we chose a more primitive data structure scheme as that gave us more explicit control over data locality. We used arrays, and this approach allowed us to allocate all memory only once for an entire simulation. This saves a lot of time that would otherwise be spent dynamically allocating memory on each timestep.
Drawing inspiration from Burtscher and Pingali's paper, we broke our star and quad class properties into separate arrays. For example, instead of using a single array to hold all star objects, we used one array for each property of the star (mass, position, velocity, force). Additionally, we have a separate array that keeps track of the location of the quads' sub-quads.
For properties that both stars and quads share (i.e. mass and center of mass), we extended the arrays to make space to hold quad properties as well. Since arrays are now shared between stars and quads, we placed star indexes at the front of the array and quad indexes at the end of the array.
The data structure for representing the entire quadtree is an array of integers, the quads array. The quads array is basically an array of indexes of quads or stars. Instead of using pointers to objects in memory, we use array indexes that “point to” other parts of the array.
We allocate one quad for each star in the system, and each quad requires four integers (its four children). Since some of the arrays (mass and position) contain both star and quad values, the quads array grows from the back towards the front as more quads are used. In summary, we allocate quad_limit = 2 × num_stars quads for the quads array, so the number of integers allocated is 4 × quad_limit. The following image shows how our data structures are arranged.
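The layout described above can be sketched as a plain C++ struct. The names (FlatTree, children, rootQuad) and the encoding of quads as indices at or above num_stars are illustrative, not our exact scheme:

```cpp
#include <vector>

// Flat-array quadtree layout: index values below num_stars refer to stars,
// values from num_stars upward refer to quads (quad j stored as num_stars + j),
// and sentinels mark empty (-1) and locked (-2) child slots.
struct FlatTree {
    int num_stars;
    int quad_limit;             // = 2 * num_stars quads reserved up front
    std::vector<int> children;  // 4 * quad_limit ints: four child slots per quad
    std::vector<double> mass;   // stars at the front, quad totals behind them
    std::vector<double> posx, posy;

    explicit FlatTree(int n)
        : num_stars(n), quad_limit(2 * n),
          children(4 * quad_limit, -1),   // all child slots start empty
          mass(n + quad_limit, 0.0),
          posx(n + quad_limit, 0.0), posy(n + quad_limit, 0.0) {}

    int rootQuad() const { return quad_limit - 1; }   // root is the last quad

    // Child slot c (0..3) of quad q lives at a fixed offset in `children`.
    int& child(int q, int c) { return children[4 * q + c]; }

    bool isStar(int idx) const { return idx >= 0 && idx < num_stars; }
    bool isQuad(int idx) const { return idx >= num_stars; }
};
```

Because every buffer is sized once from num_stars, the whole structure can be allocated a single time for the entire simulation, as described above.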
Stars in the simulation will be divided among all the threads launched in the tree build kernel. To be more specific, each thread will be responsible for adding a certain number of stars to the quadtree. Building the quadtree involves adding stars one by one, each time starting at the root of the quadtree (the root’s quad index is quad_limit - 1). Adding a star to any quad is similar to the serial implementation with additional steps for handling locks and atomics to ensure correctness. When one thread wants to insert either a star or more quads into an empty quad, atomic CAS is used to ensure that only one thread can successfully modify a single part of the tree at once.
Locks are necessary when updating the tree requires more than changing (swapping) a single value. This occurs when attempting to add a star to a quadrant already containing a star because this requires shuffling around some values. Before inserting our current star, we must “demote” the star that’s already in the quad, and keep “demoting” it until the quads are small enough such that both stars are in their own quad. In this case, the thread will attempt to acquire the lock by using atomic CAS to swap the current value (the other star’s index) with -2, signifying to other threads that the position is locked.
Since changes only occur at the leaf values of the quadtree (at stars or empty quadrants), we reduced the granularity of our locks to only child values (not entire quads). This allows other threads to make progress if they are not attempting to modify the same position.
The state maintained outside the while loop for inserting a new quad includes: the coordinates of the current quad we are trying to insert into, the size of said quad, the index of that quad, and the tree depth. This state is maintained so that, should a CAS for inserting a star or locking a quad fail, the loop restarts without having to traverse the entire tree up to that point again. It simply resumes at the index and depth where it left off, once again reading the child value it is attempting to modify, branching on that value, and repeating until success.
While developing a scheme for building the quadtree, we drew a lot of inspiration from Burtscher and Pingali’s CUDA code for the Barnes-Hut algorithm. This helped us begin reasoning about the benefits and limitations of interacting with a tree using a GPU. The pseudocode for inserting stars is below:
// assume s1 is the star to be added
while (more stars to add) {
    if (first time attempting to add star to tree) {
        set current quad to root
        find which of root's children the star will follow
    }
    get child value of current quad
    while (child is a quad) {
        traverse the tree
    }
    // child is not a quad: could be a star, empty, or locked
    if (child is locked) {
        // can't do anything; retry later
    } else if (child is empty) {
        attempt to replace empty value with star value (atomic CAS)
        if (atomic CAS successful) {
            move on to next star
        }
    } else { // child is a star
        attempt to lock (atomic CAS)
        if (lock acquired) {
            s2 = child value
            do {
                "allocate" space for a new quad using atomicSub
                determine which quadrants of the new quad s1 and s2 should be added to
                if (stars occupy separate quadrants) {
                    add stars to their respective quadrants
                } else {
                    move on // need to try again for newly allocated quad
                }
            } while (s1 and s2 in same quad)
        }
    }
    __threadfence() // ensures update has completed before releasing lock
    if (lock held) {
        release lock
        put correct quad value back
        // -2 is the lock value (not a valid quad index)
    }
}
The design of the force calculations kernel was refined over many iterations. Along the way, we used the C++ chrono library to determine wall clock runtime and used the NVIDIA Nsight profiler to determine what we needed to optimize next.
Since recursion is not well-suited for GPU execution, our implementation uses a stack to traverse the quadtree. The stack is an array of integers, and it tracks which child value to look at next. We defined the maximum depth of the stack to be 32. This allowed us to statically allocate the stack, but also required us to limit the maximum depth when building the quadtree. A maximum depth of 32 still allows for a large quadtree with many levels.
The overall idea for each implementation is to statically assign stars to threads, and each thread computes the net force on each of its stars. The problem with statically assigning stars is that it results in bad load balancing. This is because the amount of work needed to calculate the net force on a star varies significantly depending on the star’s location, and not only is the amount of work not known at assignment time, it changes every timestep.
In the first implementation of the kernel, each thread had a stack all to itself, and each level of the stack had four spots to put child values. A -1 would represent an empty position in the stack. We launched this kernel with many blocks and 1024 threads per block. The pseudocode for this implementation follows.
// note that push and pop take a depth argument
depth = 0
push(root, depth)
while (stack is not empty) {
    child_val = pop(depth)
    if (child_val == -1) {
        // empty slot: nothing to do
    } else if (child_val is a star) {
        // interact with this star
        calculate force from other star
    } else if (child_val is a quad) {
        if (far enough away) {
            calculate force from quad
        } else {
            depth++
            push all the quad's children onto stack // at the new depth
        }
    }
    if (current depth of stack is empty) {
        depth--
    }
}
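The same traversal can be expressed as runnable CPU-side C++. The pointer-based Node layout here is illustrative (the real kernel walks flat index arrays), G is normalized to 1, and the target star is treated as having unit mass:

```cpp
#include <cmath>
#include <vector>

struct Node {
    double mass, comx, comy;          // total mass and center of mass of this quad
    double size;                      // side length of the quad
    bool isLeafStar;                  // true if this node holds a single star
    std::vector<const Node*> kids;    // subquads
};

struct Force { double x = 0, y = 0; };

// Net force on a unit-mass star at (sx, sy), using an explicit stack
// in place of recursion, as in the kernel.
Force netForce(const Node* root, double sx, double sy, double theta) {
    const double G = 1.0;
    Force f;
    std::vector<const Node*> stack;
    stack.push_back(root);
    while (!stack.empty()) {
        const Node* n = stack.back();
        stack.pop_back();
        double dx = n->comx - sx, dy = n->comy - sy;
        double dist = std::sqrt(dx * dx + dy * dy);
        if (dist == 0) continue;                       // skip the star itself
        if (n->isLeafStar || n->size / dist < theta) {
            double fmag = G * n->mass / (dist * dist); // treat as aggregate body
            f.x += fmag * dx / dist;
            f.y += fmag * dy / dist;
        } else {
            for (const Node* k : n->kids) stack.push_back(k);  // descend
        }
    }
    return f;
}
```

On the GPU the stack holds child indices rather than pointers, and (in the later kernels) is shared by all 32 threads of a warp.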
There are multiple performance problems with this initial implementation. Below is the relevant part of the Nsight profile output for the force kernel.
As you can see, we had a very significant DRAM bottleneck. This is because force calculation has a low arithmetic intensity and there is no memory reuse in this implementation.
This implementation also suffers from thread divergence as thread work depends significantly on the locations of the stars.
The first improvement to the original kernel involved sharing the stack. We realized that since every thread in a warp executes instructions in lockstep, it makes sense for each thread in a warp to share the same stack. This means that each thread in the warp traverses the same part of the tree at the same time, so memory accesses are coalesced. This helps reduce memory bandwidth use because each thread can make use of the same memory.
To get this new design to work, we reduced the block size to 32 (the number of threads in a warp) and allocated a shared stack for each block (warp). Additionally, the stack was modified to only push one child value at a time, instead of all four.
Each block has a “stack manager” (thread 0 in the block) that manipulates the stack for the rest of the threads. At the beginning, the stack manager also precomputes and saves values that depend only on depth in another shared array to avoid computation later.
Since each thread in the warp shares the stack, we needed a way to determine whether the warp should compute forces on its stars from a quad or continue traversing the quad's children. The CUDA warp-level primitive __any_sync() solved this problem: if any thread in the warp needs to continue traversing, all threads in the warp continue.
An interesting outcome is that using __any_sync() makes the algorithm slightly more accurate. Suppose one thread (Thread 1) determines that a quad is far enough away from its own star, but another thread (Thread 2) determines that this same quad is not far enough away. Thread 2 must continue traversing the tree at this point, while Thread 1 does not need to. However, it does not benefit Thread 1 to stop traversing and use this quad's aggregate values, because Thread 1 still has to wait on all the other threads that continue to traverse the tree. Furthermore, continuing down the quadtree yields a more accurate approximation of the star system, and all threads in the warp must continue anyway. For this kernel design, this leads to a better approximation at no additional cost.
The resulting Nsight profile shows that the compute throughput increased and memory throughput decreased, as intended.
The next optimization for force calculation involved introducing another kernel - a sorting kernel. Sorting occurred before the force calculation kernel, and the output of the sort is an inorder traversal of the quadtree. We implemented this kernel serially, so the kernel itself is not very interesting. However, the sorting is crucial for the force calculation kernel. The inorder traversal results in stars that are nearby in space being nearby in the resulting array.
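The serial sorting pass amounts to a depth-first walk of the quadtree that emits star indices in traversal order. A minimal C++ sketch (the TreeNode layout and function name are illustrative, not our actual kernel):

```cpp
#include <vector>

struct TreeNode {
    int starIndex = -1;                  // >= 0 if this leaf holds a star
    std::vector<const TreeNode*> kids;   // subquads, in a fixed order
};

// Depth-first walk: stars in neighboring quads come out adjacent in `out`,
// giving the spatial locality the force kernel relies on.
void inorderStars(const TreeNode* n, std::vector<int>& out) {
    if (!n) return;
    if (n->starIndex >= 0) out.push_back(n->starIndex);
    for (const TreeNode* k : n->kids) inorderStars(k, out);
}
```

The force kernel then assigns consecutive entries of the sorted order to consecutive threads, so a warp works on spatially nearby stars.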
Since threads in the warp now calculate forces for stars that are closer together, there is significantly less thread divergence because each thread needs to traverse similar parts of the quadtree, and this results in a substantial speedup. Final speedup values and analysis are available in the results section.
Our final design limited us to using a block size of 32 for our kernel launch. We determined the number of blocks to launch empirically by sweeping over this value and comparing kernel runtimes. The final launch configuration was 46 × 8 = 368 blocks. The value 46 is significant because it is the number of streaming multiprocessors on a GeForce RTX 2080.
Unfortunately, our final design had poor occupancy. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.
Increasing the block size would result in better occupancy because there would be more warps per block, and the theoretical occupancy is limited by the number of blocks. In our current implementation, this is not possible because the stack is only designed for a single warp.
Given more time, we would have improved our stack to work with more warps, enabling us to increase the block size. In this scenario, the shared stack would be “wider,” but each warp’s respective stack manager would only manipulate the part of the stack corresponding to its warp.
In addition to results and approaches listed above, we also took performance measurements of our final implementation, and how they are affected by problem size. Specifically, we investigated this implementation’s sensitivity to star count and the theta parameter.
We were most concerned with optimizing the force calculation step as this took the longest time to compute serially. We were less concerned with the overall speedup of the entire algorithm because not all of our kernels were optimized. In fact, the sorting kernel is executed serially. This would tend to reduce our overall performance and also cause us to underestimate the percentage of time spent on the force calculation kernel (compared to a parallel sort). Still, we achieve a significant speedup over our sequential implementation. If we had more time, we would implement a parallel sorting kernel.
We measured the impact that the number of stars has on force calculation time as well as total runtime for both our CUDA implementation and our serial implementation. To maintain consistency, all tests (both ones run on the CPU and GPU) were given the same seed for randomness. This ensures consistency in the initial star positions across tests. Additionally, all tests were run with a timestep size of 0.0001 years, computing 60 time steps, with a theta value of 0.2. Below are the results from experiments using 100-1,000,000 stars.
Serial Runtime vs Star Count

Stars | Force Calculation Time (s) | Total Runtime (s) | Force Calc % of Total
100 | 0.0070 | 0.0081 | 85.9117%
1,000 | 0.3541 | 0.3650 | 97.0243%
10,000 | 7.9880 | 8.0994 | 98.6239%
100,000 | 141.8995 | 143.6660 | 98.7704%
1,000,000 | 3090.7263 | 3121.5092 | 99.0138%
CUDA Runtime vs Star Count

Stars | Force Calculation Time (s) | Total Runtime (s) | Force Calc % of Total
100 | 0.0044 | 0.1081 | 4.0835%
1,000 | 0.0219 | 0.1484 | 14.7334%
10,000 | 0.3179 | 0.6404 | 49.6382%
100,000 | 4.5898 | 7.0570 | 65.0385%
1,000,000 | 68.7030 | 96.2523 | 71.3780%
The results above demonstrate that, as one would expect, increasing the star count increases runtime. Our implementation exhibits good speedup compared to our serial implementation for large numbers of stars. In practice, it doesn't make sense to use small numbers of stars on GPUs, since GPUs can launch thousands of threads.
We found it interesting that on a log scale for both time and star count, the CUDA implementation and serial implementation had very similar slopes; the curves were just offset by a consistent factor of 10.
We also looked at the impact that star count has on the speedup of our force calculation kernel. We found a clear trend: speedup increases as the number of stars (i.e., the problem size) increases. Our implementation was able to reach up to a 44.9x speedup at 1,000,000 stars. Although running a recursive, non-uniform algorithm on a GPU presents difficulties, the high core count and SIMD parallelism make the tradeoffs worth it, especially for large problem sizes. It is definitely viable to do Barnes-Hut on a GPU, provided the algorithm is optimized enough. Below is a table and a graph of our data.
Force Calculation Speedup vs Star Count

Stars | Speedup
100 | 1.5774
1,000 | 16.1950
10,000 | 25.1280
100,000 | 30.9164
1,000,000 | 44.9868
The main limitations to speedup are thread divergence, memory bandwidth, and inefficient use of the GPU streaming multiprocessors. Not every thread in a warp is always doing “useful” work. For example, all threads in a warp traverse all parts of the tree that only a subset of threads actually need to traverse. In this situation, the other threads in that warp could have just used aggregate calculations if they didn’t have to continue executing in lockstep. Since our implementation does not explicitly disable these threads, the Nsight profiler will report them as doing work, but in actuality they could have been doing something more useful for the progress of the simulation (such as calculating forces on another star).
As discussed in earlier sections, force calculation has low arithmetic intensity. We increased memory reuse in the force calculation kernel, and this helped speedup, but it is still a limiting factor. The following Nsight output shows our final values for throughput.
Finally, our implementation is limited by poor SM occupancy. As discussed in earlier sections, occupancy relates to the number of active warps. Our stack implementation limited our block size to 32 (the size of one warp), and this reduced our maximum possible occupancy. The Nsight output is repeated again here for convenience.
When computing the net force exerted on a star by the other stars in the system, the O(n²) N-body problem becomes an O(n log n) problem because a star cluster that is sufficiently far away from a given star can be approximated as a single body exerting a force on that star.
The theta parameter is used to decide whether a quad is sufficiently far away from (and sufficiently small relative to) a given star. Given a star and a quad, if the side length of the quad divided by the distance between the star and the center of the quad is less than theta, then that star treats the quad as a single aggregate body exerting a force on it. If the quotient is not less than theta, the same test is repeated on the quad's sub-quads, until either the quotient is less than theta or the quad's child is a star.
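The opening criterion above is a one-line test; the helper name below is ours, for illustration:

```cpp
// Treat a quad as a single aggregate body when (side length / distance) < theta.
// Larger theta accepts more quads as "far enough", trading accuracy for speed.
bool useAggregate(double quadSide, double distance, double theta) {
    return quadSide / distance < theta;
}
```

For example, with theta = 0.2, a quad with side length 1 is only approximated as an aggregate body once it is more than 5 units away from the star.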
Similar to the previous experiment, all tests were given the same seed for randomness, a timestep size of 0.0001 years, computing 60 time steps, with a star count of 1,000 stars. Below are the results for values of theta ranging from 0.10 to 0.40.
Serial Runtime (in seconds) vs Theta Value

Theta | Force Calc | Total | Percentage
0.10 | 0.5743 | 0.5836 | 98.3970%
0.15 | 0.4154 | 0.4247 | 97.8124%
0.20 | 0.3216 | 0.3319 | 96.8777%
0.25 | 0.2483 | 0.2579 | 96.2422%
0.30 | 0.2059 | 0.2156 | 95.5192%
0.35 | 0.1710 | 0.1806 | 94.7004%
0.40 | 0.1504 | 0.1599 | 94.0419%
CUDA Runtime (in seconds) vs Theta Value

Theta | Force Calc | Total | Percentage
0.10 | 0.0262 | 0.1418 | 18.4882%
0.15 | 0.0226 | 0.1105 | 20.4782%
0.20 | 0.0201 | 0.1219 | 16.4496%
0.25 | 0.0176 | 0.1014 | 17.3407%
0.30 | 0.0153 | 0.0993 | 15.4262%
0.35 | 0.0139 | 0.1250 | 11.1018%
0.40 | 0.0125 | 0.1130 | 11.0289%
As one would expect, increasing theta (which effectively decreases the problem size) decreases overall runtime for both the CUDA and serial implementations. The serial implementation shows a strong downward trend on the time vs. theta graph. The speedup also gets worse because, at smaller problem sizes, the CUDA implementation has more overhead relative to the actual force calculations. Moreover, as theta grows, fewer threads deem it necessary to traverse the tree further, so the amount of necessary parallel work also shrinks. Below is the speedup data we collected, depicting a downward trend as theta increases.
Force Calculation Speedup vs Theta Value

Theta | Speedup
0.10 | 21.8997
0.15 | 18.3483
0.20 | 16.0339
0.25 | 14.1126
0.30 | 13.4460
0.35 | 12.3207
0.40 | 12.0629
Overall, we were pleased with our final implementation and performance results. We showed that an optimized GPU kernel for force calculation is a viable method to implement Barnes-Hut in parallel, with speedup increasing drastically as problem size increases.
In our undertaking of this project, we set out to build and optimize a Barnes-Hut implementation in CUDA. Based on our findings while writing and testing this code, we are still left with ideas on how to further improve our implementation, along with some more questions about the effects certain changes could have on our performance.
If we had more time, we would like to:
Jianqiao Liu, Michael Robson, Thomas Quinn, and Milind Kulkarni. 2019. Efficient GPU tree walks for effective distributed n-body simulations. In Proceedings of the ACM International Conference on Supercomputing (ICS '19). Association for Computing Machinery, New York, NY, USA, 24–34. https://doi.org/10.1145/3330345.3330348
M. Burtscher and K. Pingali. "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm". Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92. January 2011.
https://userweb.cs.txstate.edu/~mb92/research/ECL-BH/ - Barnes-Hut CUDA code
https://arborjs.org/docs/barnes-hut - diagrams and understanding of the Barnes-Hut algorithm
https://beltoforion.de/en/barnes-hut-galaxy-simulator/ - more Barnes-Hut descriptions and pseudocode
Initial Proposal
https://jrohozen.github.io/418-website/
The Barnes-Hut algorithm is an O(n log n) algorithm to simulate galaxy evolution. The algorithm is performed over many time steps. Each time step is done serially, but we will parallelize within time steps. The main data structure for this algorithm is the quadtree. The quadtree divides stars up based on their location. Each leaf node corresponds to exactly one star. The interior nodes of the tree hold the aggregate mass and center of mass of their children.
On each time step, the original Barnes-Hut algorithm builds the quadtree and then fills in aggregate mass and center of mass for each interior node. As a modification, our algorithm will also build an interaction list before iterating over particles. The interaction list will be used to calculate the forces on particles efficiently on a GPU.
The pseudocode for the algorithm we will use follows. This is the pseudocode given in lecture with the modification of the lines in red.
for each time step in simulation:
    build tree structure
    compute (aggregate mass, center-of-mass) for interior nodes
    traverse tree to assemble interactions list
    for each particle:
        use interactions list to accumulate gravitational forces
        update particle position based on gravitational forces
The portion of the algorithm that is of primary interest is the method of traversing the quadtree to compose interaction lists for each star in a way that performs well on a GPU.
It is difficult to efficiently traverse the tree in a GPU-friendly way. The part about Barnes-Hut that is GPU-friendly is when calculating the forces on a given star when the inputs to that calculation are already decided. Shared memory, cache locality, and SIMD registers can be leveraged to accelerate this process.
There will be a high communication-to-computation ratio, especially if you include the time it takes to compose the interactions list. The portion computing the forces on the stars will be relatively quick once an interactions list is composed.
However, optimizing how to get to that point is the tricky part, due to thread divergence when traversing a tree. Recursion and thread divergence are not ideal for a GPU, because GPUs rely on doing the same or similar calculations many times. Interestingly, although the dual-tree walk has a better asymptotic complexity than the single-tree approach for these reasons, Liu et al. found that a single-tree implementation still performed better on a GPU. We want to compare our findings to those in the "Efficient GPU Tree Walk" paper and evaluate any differences we may find.
Jianqiao Liu, Michael Robson, Thomas Quinn, and Milind Kulkarni. 2019. Efficient GPU tree walks for effective distributed n-body simulations. In Proceedings of the ACM International Conference on Supercomputing (ICS '19). Association for Computing Machinery, New York, NY, USA, 24–34. https://doi.org/10.1145/3330345.3330348
M. Burtscher and K. Pingali. "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm". Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92. January 2011.
Article about Barnes-Hut: The Barnes-Hut Algorithm
Another article: The Barnes-Hut Galaxy Simulator (beltoforion/Barnes-Hut-Simulator on GitHub)
We plan to achieve the following goals:
We hope to achieve the following goals:
We will use the GPUs in the GHC machines. We also have other (personal) devices with older GPUs, and we may test our code on these if we have time. We are choosing to use GPUs because we think it is an interesting question to push GPU performance for Barnes-Hut. On the one hand, the force calculation portion of Barnes-Hut is excellent for a GPU because it runs a lot of the same calculations on a large set of data at once. However, the preparation of that data is not easily done on a GPU because it involves tree traversal and thread divergence. We want to see if we can reduce this bottleneck by finding a better way to compose the interactions lists on a GPU, or in parallel on a CPU.
Milestone Report
Date | Goal | Assignee
Thursday, April 10, 11:59pm | | Delaynie + Jacob
Friday, April 18, 11:59pm | | Delaynie + Jacob
Tuesday, April 22, 11:59pm | | Delaynie + Jacob
Friday, April 25, 11:59pm | | Delaynie + Jacob
Monday, April 28, 11:59pm | | Delaynie + Jacob
April 29, 5:30-8:30pm | | Delaynie + Jacob
So far, we have a serial implementation of Barnes-Hut and a program for generating inputs for validating our parallel implementations. Additionally, we have a Python program to visualize the simulation over time. We have spent a lot of time reading through several papers and websites, particularly focusing on how to implement tree-building and tree walks in parallel. We've brainstormed a data structure based on Burtscher and Pingali's paper that represents a quadtree as a set of arrays stored in shared memory.
Relative to our original proposal, we are a bit behind on code deliverables. Because of the last exam and Carnival, we didn't initially spend as much time as we needed on background research to be prepared to write implementations of Barnes-Hut. We still plan to deliver both single-tree and dual-tree CUDA implementations of Barnes-Hut, along with runtime breakdowns, synchronization stalls, and speedup data comparing the two.
Our poster will mainly consist of graphs of the data we collected comparing the Barnes-Hut implementations. We also plan to include some images from our simulation visualizer.
Our main concern is our ability to write both implementations in time. Since we’re synthesizing ideas inspired by several different sources, our interfaces and data structures need to be carefully thought out so that they’re both compatible and efficient across sections of our code.