Final Report:
Efficient Force Calculation for Galaxy Simulation in CUDA
Delaynie McMillan
Jacob Rohozen
Tuesday, April 29, 2025
For this project, we set out to implement the Barnes-Hut galaxy simulation algorithm in CUDA. Project deliverables include a highly optimized force calculation kernel written in CUDA, along with previous iterations of that kernel. We wrote a serial implementation of the algorithm, and measured and compared the performance of our different force calculation kernels. Additionally, we investigated the impact that changing the number of stars and the theta parameter has on runtime and speedup. Analysis of the serial implementation runtime showed that force calculation took up over 95% of the total runtime. Therefore, the main focus of the project was to see how we could optimize the force calculation kernel in particular.
The Barnes-Hut algorithm is an O(n log n) algorithm to simulate star interaction and galaxy evolution over time. The algorithm is performed over many time steps. Time steps must be done in order, but an individual time step can be parallelized.
The algorithm takes in arrays of data that represent all the stars in the system. Specifically, these arrays characterize the mass, initial position, and initial velocity of each star in the system. We implemented the algorithm in 2D, so the position and velocity arrays contain 2D values.
On each time step, the net force on each star is computed, and this force results in a change in position and velocity over time. In a naive implementation, the force calculations would be O(n²), with n being the number of stars, since each star interacts with every other star (n * (n-1) interactions). The Barnes-Hut algorithm uses an approximation to reduce the number of force calculations, and the algorithm becomes O(n log n).
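For contrast with the tree-based approach, the naive all-pairs computation can be sketched in serial C++ as follows. The names (computeForcesNaive, Vec2) and the softening term are illustrative, not from our codebase, and G is normalized to 1:

```cpp
#include <cmath>
#include <vector>

struct Vec2 { double x, y; };

// Naive O(n^2) net-force computation: every star interacts with every other.
std::vector<Vec2> computeForcesNaive(const std::vector<double>& mass,
                                     const std::vector<Vec2>& pos) {
    const double G = 1.0;           // gravitational constant, normalized
    const double softening = 1e-9;  // avoids division by zero for coincident stars
    std::vector<Vec2> force(pos.size(), {0.0, 0.0});
    for (size_t i = 0; i < pos.size(); i++) {
        for (size_t j = 0; j < pos.size(); j++) {
            if (i == j) continue;
            double dx = pos[j].x - pos[i].x;
            double dy = pos[j].y - pos[i].y;
            double distSq = dx * dx + dy * dy + softening;
            double dist = std::sqrt(distSq);
            double f = G * mass[i] * mass[j] / distSq;  // Newtonian gravity
            force[i].x += f * dx / dist;                // accumulate components
            force[i].y += f * dy / dist;
        }
    }
    return force;
}
```

The inner loop over all other stars is exactly the work that Barnes-Hut replaces with a tree traversal.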
The quadtree is the main data structure for Barnes-Hut. The quadtree divides stars up based on their locations in 2D space. The root of the quadtree represents the entire system and is the bounding box for the current step in the simulation. The quadtree is built by recursively subdividing quadrants of the bounding box until every quadrant contains no more than one star. The following diagram shows a set of stars on the left and the resulting quadtree on the right.
Image above from https://arborjs.org/docs/barnes-hut
The gray nodes (we will call them quads) have four pointers which can either point to another quad, a star, or nothing (empty). Each quad has a center of mass and a total mass. This information represents the aggregation of all stars within the bounds of that quad.
The operations performed on a quadtree include traversing the tree, adding a quad, and adding a star. Once the tree is fully constructed, force calculations can be performed. The benefit of the quadtree is that it allows for approximations. Instead of calculating the force on a single star from every other star, Barnes-Hut calculates the force on a single star from other quads if they are sufficiently far away.
To add a star to the tree, the star “starts” at the tree’s root. First, the algorithm determines which of the four quads at the root level the star belongs to. This is determined using the star’s location and the center of the root quad. If there’s already a star in that quad, the quad is further subdivided into four subquads, and the process repeats until the quad is small enough, and this star is the only star in its quad. Similarly, if it is determined that a star belongs in a certain quad, and that quad is already subdivided into smaller quads, we “follow” that quad until we arrive at an “empty” quad with no star. Below are illustrations of adding a star to an empty quad, and sub-dividing a quad before inserting a star.
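The insertion procedure described above can be sketched as a small serial C++ quadtree. The class and field names (Quad, subquad, insert) are illustrative, not our actual serial classes, and the sketch assumes distinct star positions:

```cpp
#include <memory>

struct Star { double x, y, mass; };

struct Quad {
    double cx, cy, half;            // center and half side length of this region
    const Star* star = nullptr;     // an occupied leaf holds exactly one star
    std::unique_ptr<Quad> child[4]; // NW, NE, SW, SE subquads (nullptr = empty)
    bool subdivided = false;

    Quad(double cx, double cy, double half) : cx(cx), cy(cy), half(half) {}

    // Which of the four subquads does (x, y) fall in?
    int quadrantOf(double x, double y) const {
        return (x >= cx ? 1 : 0) + (y < cy ? 2 : 0);
    }

    // Lazily create and return subquad q.
    Quad* subquad(int q) {
        if (!child[q]) {
            double h = half / 2;
            double nx = cx + (q % 2 ? h : -h);
            double ny = cy + (q / 2 ? -h : h);
            child[q] = std::make_unique<Quad>(nx, ny, h);
        }
        return child[q].get();
    }

    void insert(const Star* s) {
        if (subdivided) {                  // interior node: descend
            subquad(quadrantOf(s->x, s->y))->insert(s);
        } else if (!star) {                // empty leaf: place the star
            star = s;
        } else {                           // occupied leaf: subdivide and
            const Star* old = star;        // "demote" the resident star
            star = nullptr;
            subdivided = true;
            subquad(quadrantOf(old->x, old->y))->insert(old);
            subquad(quadrantOf(s->x, s->y))->insert(s);
        }
    }
};
```

When two stars land in the same subquad, the recursive calls keep subdividing until each star sits alone, matching the "keep demoting" behavior described above.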
Even though the Barnes-Hut approximation reduces the algorithmic complexity to O(n log n), the force calculation part of the algorithm still takes more than 95% of the total runtime for a serial implementation. This is why we decided to optimize the force calculations.
At first glance, there is plenty of work that can be done in parallel. The net force on each star can be independently calculated, and the quadtree is not modified (only read) in this step. This means that no synchronization or atomics are needed for correctness. The same arithmetic steps are being done on large quantities of stars to calculate the net forces on them, which would be a great use of SIMD parallelism. This sounds good, but the Barnes-Hut algorithm maps poorly to GPU execution.
One problem is that force calculation has low arithmetic intensity. Another problem is that traversing the tree is highly irregular, and this leads to poor locality. Traversing a tree requires accessing a lot of different portions of the quadtree data structure, which results in poor cache locality, especially when one block has threads whose working sets of data do not overlap much. When implemented naively, these problems lead to the algorithm being bandwidth-limited because of poor data reuse. Because of this, our project involved figuring out ways to take advantage of shared memory between threads to reduce memory bandwidth requirements.
Additionally, the irregular nature of the tree causes another problem: thread divergence. In CUDA, all 32 threads in a warp execute in lockstep. When a warp is traversing a tree some threads will finish before others. Even if only one thread is still traversing the tree, all the other threads will sit idle and wait. Thread divergence reduces the efficiency of the GPU, leading to lower overall throughput.
Furthermore, the recursive nature of the Barnes-Hut algorithm maps poorly to the CUDA model.
A later section will discuss our approach to Barnes-Hut and making the force calculations step more amenable to GPU execution.
We started by implementing a serial Barnes-Hut algorithm in C++. This served to verify our parallel implementation as we built it up, and we also used the serial code as a benchmark for speedup comparison. The serial implementation had classes and methods for manipulating stars and quads. We did not spend time optimizing the serial version, but we don’t feel that it is unreasonably slow (as this would make for an unfair comparison). The serial code was run on the GHC machines.
We developed a class called Galaxy to randomly generate inputs for both the serial and the CUDA implementations. Galaxies can be created with a customizable size and a customizable number of stars orbiting a central supermassive black hole (just another star with a lot more mass). A galaxy can be given an initial position and an initial velocity. This is how we generate multiple galaxies and have them collide. The results of the simulation can be seen using our Python visualizer, which can also output a GIF of the simulation.
Our parallel implementation is written in CUDA and was also run on the GHC machines, specifically on the NVIDIA GeForce RTX 2080 GPUs. These GPUs have 46 streaming multiprocessors, and this helped inform our kernel launches.
The first part of making an efficient GPU implementation was changing our data structures.
In our serial implementation, we used classes for stars and quads and pointers to objects to manipulate class fields. This required dynamically allocating memory for quads and later freeing that memory on every timestep because the quadtree must be rebuilt from updated star positions.
For our parallel implementation, we chose a more primitive data structure scheme as that gave us more explicit control over data locality. We used arrays, and this approach allowed us to allocate all memory only once for an entire simulation. This saves a lot of time that would otherwise be spent dynamically allocating memory on each timestep.
Drawing inspiration from Burtscher and Pingali's paper, we broke our star and quad class properties into separate arrays. For example, instead of using a single array to hold all star objects, we used one array for each property of the star (mass, position, velocity, force). Additionally, we have a separate array that keeps track of the location of the quads' sub-quads.
For properties that both stars and quads share (i.e. mass and center of mass), we extended the arrays to make space to hold quad properties as well. Since arrays are now shared between stars and quads, we placed star indexes at the front of the array and quad indexes at the end of the array.
The data structure for representing the entire quadtree is an array of integers, the quads array. The quads array is basically an array of indexes of quads or stars. Instead of using pointers to objects in memory, we use array indexes that “point to” other parts of the array.
We allocate one quad for each star in the system, and each quad requires four integers (its four children). Since some of the arrays (mass and position) contain both star and quad values, the quads array grows from the back towards the front as more quads are used. In summary, we allocate quad_limit = 2 × num_stars quads for the quads array, so the number of integers allocated is 4 × quad_limit. The following image shows how our data structures are arranged.
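The layout described above can be sketched as a plain C++ struct. The names (FlatTree, children, rootQuad) and the encoding of quads as indices at or above num_stars are illustrative, not our exact scheme:

```cpp
#include <vector>

// Flat-array quadtree layout: index values below num_stars refer to stars,
// values from num_stars upward refer to quads (quad j stored as num_stars + j),
// and sentinels mark empty (-1) and locked (-2) child slots.
struct FlatTree {
    int num_stars;
    int quad_limit;             // = 2 * num_stars quads reserved up front
    std::vector<int> children;  // 4 * quad_limit ints: four child slots per quad
    std::vector<double> mass;   // stars at the front, quad totals behind them
    std::vector<double> posx, posy;

    explicit FlatTree(int n)
        : num_stars(n), quad_limit(2 * n),
          children(4 * quad_limit, -1),   // all child slots start empty
          mass(n + quad_limit, 0.0),
          posx(n + quad_limit, 0.0), posy(n + quad_limit, 0.0) {}

    int rootQuad() const { return quad_limit - 1; }   // root is the last quad

    // Child slot c (0..3) of quad q lives at a fixed offset in `children`.
    int& child(int q, int c) { return children[4 * q + c]; }

    bool isStar(int idx) const { return idx >= 0 && idx < num_stars; }
    bool isQuad(int idx) const { return idx >= num_stars; }
};
```

Because every buffer is sized once from num_stars, the whole structure can be allocated a single time for the entire simulation, as described above.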
Stars in the simulation will be divided among all the threads launched in the tree build kernel. To be more specific, each thread will be responsible for adding a certain number of stars to the quadtree. Building the quadtree involves adding stars one by one, each time starting at the root of the quadtree (the root’s quad index is quad_limit - 1). Adding a star to any quad is similar to the serial implementation with additional steps for handling locks and atomics to ensure correctness. When one thread wants to insert either a star or more quads into an empty quad, atomic CAS is used to ensure that only one thread can successfully modify a single part of the tree at once.
Locks are necessary when updating the tree requires more than changing (swapping) a single value. This occurs when attempting to add a star to a quadrant already containing a star because this requires shuffling around some values. Before inserting our current star, we must “demote” the star that’s already in the quad, and keep “demoting” it until the quads are small enough such that both stars are in their own quad. In this case, the thread will attempt to acquire the lock by using atomic CAS to swap the current value (the other star’s index) with -2, signifying to other threads that the position is locked.
Since changes only occur at the leaf values of the quadtree (at stars or empty quadrants), we reduced the granularity of our locks to only child values (not entire quads). This allows other threads to make progress if they are not attempting to modify the same position.
The state maintained outside the while loop for inserting a new quad includes: the coordinates of the current quad we are trying to insert into, the size of said quad, the index of that quad, and the tree depth. This state is maintained so that, should a CAS for inserting a star or locking a quad fail, the loop restarts without having to traverse the entire tree up to that point again. It simply resumes at the index and depth where it left off, once again reading the child value it is attempting to modify, branching on that value, and repeating until success.
While developing a scheme for building the quadtree, we drew a lot of inspiration from Burtscher and Pingali’s CUDA code for the Barnes-Hut algorithm. This helped us begin reasoning about the benefits and limitations of interacting with a tree using a GPU. The pseudocode for inserting stars is below:
// assume s1 is the star to be added
while (more stars to add) {
    if (first time attempting to add star to tree) {
        set current quad to root
        find which of root's children the star will follow
    }
    get child value of current quad
    while (child is a quad) {
        traverse the tree
    }
    // child is not a quad: could be a star, empty, or locked
    if (child is locked) {
        // can't do anything; retry later
    } else if (child is empty) {
        attempt to replace empty value with star value (atomic CAS)
        if (atomic CAS successful) {
            move on to next star
        }
    } else { // child is a star
        attempt to lock (atomic CAS)
        if (lock acquired) {
            s2 = child value
            do {
                "allocate" space for a new quad using atomicSub
                determine which quadrants of the new quad s1 and s2 should be added to
                if (stars occupy separate quadrants) {
                    add stars to their respective quadrants
                } else {
                    move on // need to try again for newly allocated quad
                }
            } while (s1 and s2 in same quad)
        }
    }
    __threadfence() // ensures update has completed before releasing lock
    if (lock held) {
        release lock
        put correct quad value back
        // -2 is the lock value (not a valid quad index)
    }
}
The design of the force calculations kernel was refined over many iterations. Along the way, we used the C++ chrono library to determine wall clock runtime and used the NVIDIA Nsight profiler to determine what we needed to optimize next.
Since recursion is not well-suited for GPU execution, our implementation uses a stack to traverse the quadtree. The stack is an array of integers, and it tracks which child value to look at next. We defined the maximum depth of the stack to be 32. This allowed us to statically allocate the stack, but also required us to limit the maximum depth when building the quadtree. A maximum depth of 32 still allows for a large quadtree with many levels.
The overall idea for each implementation is to statically assign stars to threads, and each thread computes the net force on each of its stars. The problem with statically assigning stars is that it results in bad load balancing. This is because the amount of work needed to calculate the net force on a star varies significantly depending on the star’s location, and not only is the amount of work not known at assignment time, it changes every timestep.
In the first implementation of the kernel, each thread had a stack all to itself, and each level of the stack had four spots to put child values. A -1 would represent an empty position in the stack. We launched this kernel with many blocks and 1024 threads per block. The pseudocode for this implementation follows.
// note that push and pop take a depth argument
depth = 0
push(root, depth)
while (stack is not empty) {
    child_val = pop(depth)
    if (child_val == -1) {
        // empty slot: nothing to do
    } else if (child_val is a star) {
        // interact with this star
        calculate force from other star
    } else if (child_val is a quad) {
        if (far enough away) {
            calculate force from quad
        } else {
            depth++
            push all the quad's children onto stack // at the new depth
        }
    }
    if (current depth of stack is empty) {
        depth--
    }
}
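The same traversal can be expressed as runnable CPU-side C++. The pointer-based Node layout here is illustrative (the real kernel walks flat index arrays), G is normalized to 1, and the target star is treated as having unit mass:

```cpp
#include <cmath>
#include <vector>

struct Node {
    double mass, comx, comy;          // total mass and center of mass of this quad
    double size;                      // side length of the quad
    bool isLeafStar;                  // true if this node holds a single star
    std::vector<const Node*> kids;    // subquads
};

struct Force { double x = 0, y = 0; };

// Net force on a unit-mass star at (sx, sy), using an explicit stack
// in place of recursion, as in the kernel.
Force netForce(const Node* root, double sx, double sy, double theta) {
    const double G = 1.0;
    Force f;
    std::vector<const Node*> stack;
    stack.push_back(root);
    while (!stack.empty()) {
        const Node* n = stack.back();
        stack.pop_back();
        double dx = n->comx - sx, dy = n->comy - sy;
        double dist = std::sqrt(dx * dx + dy * dy);
        if (dist == 0) continue;                       // skip the star itself
        if (n->isLeafStar || n->size / dist < theta) {
            double fmag = G * n->mass / (dist * dist); // treat as aggregate body
            f.x += fmag * dx / dist;
            f.y += fmag * dy / dist;
        } else {
            for (const Node* k : n->kids) stack.push_back(k);  // descend
        }
    }
    return f;
}
```

On the GPU the stack holds child indices rather than pointers, and (in the later kernels) is shared by all 32 threads of a warp.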
There are multiple performance problems with this initial implementation. Below is the relevant part of the Nsight profile output for the force kernel.
As you can see, we had a very significant DRAM bottleneck. This is because force calculation has a low arithmetic intensity and there is no memory reuse in this implementation.
This implementation also suffers from thread divergence as thread work depends significantly on the locations of the stars.
The first improvement to the original kernel involved sharing the stack. We realized that since every thread in a warp executes instructions in lockstep, it makes sense for each thread in a warp to share the same stack. This means that each thread in the warp traverses the same part of the tree at the same time, so memory accesses are coalesced. This helps reduce memory bandwidth use because each thread can make use of the same memory.
To get this new design to work, we reduced the block size to 32 (the number of threads in a warp) and allocated a shared stack for each block (warp). Additionally, the stack was modified to only push one child value at a time, instead of all four.
Each block has a “stack manager” (thread 0 in the block) that manipulates the stack for the rest of the threads. At the beginning, the stack manager also precomputes and saves values that depend only on depth in another shared array to avoid computation later.
Since each thread in the warp shares the stack, we needed a way to determine whether the warp should compute forces on its stars from a quad or continue traversing the quad's children. The CUDA warp-level primitive __any_sync() solved this problem: if any thread in the warp needs to continue traversing, all threads in the warp continue.
An interesting outcome is that using __any_sync() makes the algorithm slightly more accurate. Suppose one thread (Thread 1) determines that a quad is far enough away from its own star, but another thread (Thread 2) determines that this same quad is not far enough away. Thread 2 must continue traversing the tree at this point, while Thread 1 does not need to. However, it does not benefit Thread 1 to stop traversing and use this quad's aggregate values, because Thread 1 still has to wait on all the other threads that continue to traverse the tree. Furthermore, continuing down the quadtree yields a more accurate approximation of the star system, and all threads in the warp must continue anyway. For this kernel design, this leads to a better approximation at no additional cost.
The resulting Nsight profile shows that the compute throughput increased and memory throughput decreased, as intended.
The next optimization for force calculation involved introducing another kernel - a sorting kernel. Sorting occurred before the force calculation kernel, and the output of the sort is an inorder traversal of the quadtree. We implemented this kernel serially, so the kernel itself is not very interesting. However, the sorting is crucial for the force calculation kernel. The inorder traversal results in stars that are nearby in space being nearby in the resulting array.
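The serial sorting pass amounts to a depth-first walk of the quadtree that emits star indices in traversal order. A minimal C++ sketch (the TreeNode layout and function name are illustrative, not our actual kernel):

```cpp
#include <vector>

struct TreeNode {
    int starIndex = -1;                  // >= 0 if this leaf holds a star
    std::vector<const TreeNode*> kids;   // subquads, in a fixed order
};

// Depth-first walk: stars in neighboring quads come out adjacent in `out`,
// giving the spatial locality the force kernel relies on.
void inorderStars(const TreeNode* n, std::vector<int>& out) {
    if (!n) return;
    if (n->starIndex >= 0) out.push_back(n->starIndex);
    for (const TreeNode* k : n->kids) inorderStars(k, out);
}
```

The force kernel then assigns consecutive entries of the sorted order to consecutive threads, so a warp works on spatially nearby stars.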
Since threads in the warp now calculate forces for stars that are closer together, there is significantly less thread divergence because each thread needs to traverse similar parts of the quadtree, and this results in a substantial speedup. Final speedup values and analysis are available in the results section.
Our final design limited us to using a block size of 32 for our kernel launch. We determined the number of blocks to launch empirically by sweeping over this value and comparing kernel runtimes. The final launch configuration was 46 × 8 = 368 blocks. The value 46 is significant because it is the number of streaming multiprocessors on a GeForce RTX 2080.
Unfortunately, our final design had poor occupancy. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps.
Increasing the block size would result in better occupancy because there would be more warps per block, and the theoretical occupancy is limited by the number of blocks. In our current implementation, this is not possible because the stack is only designed for a single warp.
Given more time, we would have improved our stack to work with more warps, enabling us to increase the block size. In this scenario, the shared stack would be “wider,” but each warp’s respective stack manager would only manipulate the part of the stack corresponding to its warp.
In addition to results and approaches listed above, we also took performance measurements of our final implementation, and how they are affected by problem size. Specifically, we investigated this implementation’s sensitivity to star count and the theta parameter.
We were most concerned with optimizing the force calculation step as this took the longest time to compute serially. We were less concerned with the overall speedup of the entire algorithm because not all of our kernels were optimized. In fact, the sorting kernel is executed serially. This would tend to reduce our overall performance and also cause us to underestimate the percentage of time spent on the force calculation kernel (compared to a parallel sort). Still, we achieve a significant speedup over our sequential implementation. If we had more time, we would implement a parallel sorting kernel.
We measured the impact that the number of stars has on force calculation time as well as total runtime for both our CUDA implementation and our serial implementation. To maintain consistency, all tests (both ones run on the CPU and GPU) were given the same seed for randomness. This ensures consistency in the initial star positions across tests. Additionally, all tests were run with a timestep size of 0.0001 years, computing 60 time steps, with a theta value of 0.2. Below are the results from experiments using 100-1,000,000 stars.
Serial Runtime vs Star Count

Stars | Force Calculation Time (s) | Total Runtime (s) | Force Calc % of Total
100 | 0.0070 | 0.0081 | 85.9117%
1,000 | 0.3541 | 0.3650 | 97.0243%
10,000 | 7.9880 | 8.0994 | 98.6239%
100,000 | 141.8995 | 143.6660 | 98.7704%
1,000,000 | 3090.7263 | 3121.5092 | 99.0138%
CUDA Runtime vs Star Count

Stars | Force Calculation Time (s) | Total Runtime (s) | Force Calc % of Total
100 | 0.0044 | 0.1081 | 4.0835%
1,000 | 0.0219 | 0.1484 | 14.7334%
10,000 | 0.3179 | 0.6404 | 49.6382%
100,000 | 4.5898 | 7.0570 | 65.0385%
1,000,000 | 68.7030 | 96.2523 | 71.3780%
The results above demonstrate that, as one would expect, increasing the star count increases runtime. Our implementation exhibits good speedup compared to our serial implementation for large numbers of stars. In practice, it doesn't make sense to use small numbers of stars on GPUs, since GPUs can launch thousands of threads.
We found it interesting that on a log scale for both time and star count, the CUDA implementation and serial implementation had very similar slopes; the curves were just offset by a consistent factor of 10.
We also looked at the impact that star count has on the speedup of our force calculation kernel. We found a clear trend: speedup increases as the number of stars (i.e., the problem size) increases. Our implementation was able to reach up to a 44.9x speedup at 1,000,000 stars. Although running a recursive, non-uniform algorithm on a GPU presents difficulties, the high core count and SIMD parallelism make the tradeoffs worth it, especially for large problem sizes. It is definitely viable to do Barnes-Hut on a GPU, provided the algorithm is optimized enough. Below is a table and a graph of our data.
Force Calculation Speedup vs Star Count

Stars | Speedup
100 | 1.5774
1,000 | 16.1950
10,000 | 25.1280
100,000 | 30.9164
1,000,000 | 44.9868
The main limitations to speedup are thread divergence, memory bandwidth, and inefficient use of the GPU streaming multiprocessors. Not every thread in a warp is always doing “useful” work. For example, all threads in a warp traverse all parts of the tree that only a subset of threads actually need to traverse. In this situation, the other threads in that warp could have just used aggregate calculations if they didn’t have to continue executing in lockstep. Since our implementation does not explicitly disable these threads, the Nsight profiler will report them as doing work, but in actuality they could have been doing something more useful for the progress of the simulation (such as calculating forces on another star).
As discussed in earlier sections, force calculation has low arithmetic intensity. We increased memory reuse in the force calculation kernel, and this helped speedup, but it is still a limiting factor. The following Nsight output shows our final values for throughput.
Finally, our implementation is limited by poor SM occupancy. As discussed in earlier sections, occupancy relates to the number of active warps. Our stack implementation limited our block size to 32 (the size of one warp), and this reduced our maximum possible occupancy. The Nsight output is repeated again here for convenience.
When computing the net force exerted on a star by the other stars in the system, the O(n²) N-body problem becomes an O(n log n) problem because a star cluster that is sufficiently far away from a given star can be approximated as a single body exerting a force on that star.
The theta parameter is used to decide whether a quad is sufficiently far away from (and sufficiently small relative to) a given star. Given a star and a quad, if the side length of the quad divided by the distance between the star and the center of the quad is less than theta, then that star treats the quad as a single aggregate body exerting a force on it. If the quotient is not less than theta, the same test is repeated on the quad's sub-quads, until either the quotient is less than theta or the quad's child is a star.
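The opening criterion above is a one-line test; the helper name below is ours, for illustration:

```cpp
// Treat a quad as a single aggregate body when (side length / distance) < theta.
// Larger theta accepts more quads as "far enough", trading accuracy for speed.
bool useAggregate(double quadSide, double distance, double theta) {
    return quadSide / distance < theta;
}
```

For example, with theta = 0.2, a quad with side length 1 is only approximated as an aggregate body once it is more than 5 units away from the star.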
Similar to the previous experiment, all tests were given the same seed for randomness, a timestep size of 0.0001 years, computing 60 time steps, with a star count of 1,000 stars. Below are the results for values of theta ranging from 0.10 to 0.40.
Serial Runtime (in seconds) vs Theta Value

Theta | Force Calc | Total | Percentage
0.10 | 0.5743 | 0.5836 | 98.3970%
0.15 | 0.4154 | 0.4247 | 97.8124%
0.20 | 0.3216 | 0.3319 | 96.8777%
0.25 | 0.2483 | 0.2579 | 96.2422%
0.30 | 0.2059 | 0.2156 | 95.5192%
0.35 | 0.1710 | 0.1806 | 94.7004%
0.40 | 0.1504 | 0.1599 | 94.0419%
CUDA Runtime (in seconds) vs Theta Value

Theta | Force Calc | Total | Percentage
0.10 | 0.0262 | 0.1418 | 18.4882%
0.15 | 0.0226 | 0.1105 | 20.4782%
0.20 | 0.0201 | 0.1219 | 16.4496%
0.25 | 0.0176 | 0.1014 | 17.3407%
0.30 | 0.0153 | 0.0993 | 15.4262%
0.35 | 0.0139 | 0.1250 | 11.1018%
0.40 | 0.0125 | 0.1130 | 11.0289%
As one would expect, increasing theta (which effectively decreases the problem size) decreases overall runtime for both the CUDA and serial implementations. The serial implementation shows a strong downward trend on the time vs. theta graph. The speedup also gets worse because, at smaller problem sizes, the CUDA implementation has more overhead relative to the actual force calculations. Moreover, as theta grows, fewer threads deem it necessary to traverse the tree further, so the amount of necessary parallel work also shrinks. Below is the speedup data we collected, depicting a downward trend as theta increases.
Force Calculation Speedup vs Theta Value

Theta | Speedup
0.10 | 21.8997
0.15 | 18.3483
0.20 | 16.0339
0.25 | 14.1126
0.30 | 13.4460
0.35 | 12.3207
0.40 | 12.0629
Overall, we were pleased with our final implementation and performance results. We showed that an optimized GPU kernel for force calculation is a viable method to implement Barnes-Hut in parallel, with speedup increasing drastically as problem size increases.
In our undertaking of this project, we set out to build and optimize a Barnes-Hut implementation in CUDA. Based on our findings while writing and testing this code, we are still left with ideas on how to further improve our implementation, along with some more questions about the effects certain changes could have on our performance.
If we had more time, we would like to:
Jianqiao Liu, Michael Robson, Thomas Quinn, and Milind Kulkarni. 2019. Efficient GPU tree walks for effective distributed n-body simulations. In Proceedings of the ACM International Conference on Supercomputing (ICS '19). Association for Computing Machinery, New York, NY, USA, 24–34. https://doi.org/10.1145/3330345.3330348
M. Burtscher and K. Pingali. "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm". Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92. January 2011.
https://userweb.cs.txstate.edu/~mb92/research/ECL-BH/ - Barnes-Hut CUDA code
https://arborjs.org/docs/barnes-hut - diagrams and understanding of the Barnes-Hut algorithm
https://beltoforion.de/en/barnes-hut-galaxy-simulator/ - more Barnes-Hut descriptions and pseudocode
Initial Proposal
https://jrohozen.github.io/418-website/
The Barnes-Hut algorithm is an O(n log n) algorithm to simulate galaxy evolution. The algorithm is performed over many time steps. Each time step is done serially, but we will parallelize within time steps. The main data structure for this algorithm is the quadtree. The quadtree divides stars up based on their location. Each leaf node corresponds to exactly one star. The interior nodes of the tree hold the aggregate mass and center of mass of their children.
On each time step, the original Barnes-Hut algorithm builds the quadtree and then fills in aggregate mass and center of mass for each interior node. As a modification, our algorithm will also build an interaction list before iterating over particles. The interaction list will be used to calculate the forces on particles efficiently on a GPU.
The pseudocode for the algorithm we will use follows. This is the pseudocode given in lecture with the modification of the lines in red.
for each time step in simulation:
    build tree structure
    compute (aggregate mass, center-of-mass) for interior nodes
    traverse tree to assemble interactions list
    for each particle:
        use interactions list to accumulate gravitational forces
        update particle position based on gravitational forces
The portion of the algorithm that is of primary interest is the method of traversing the quadtree to compose interaction lists for each star in a way that performs well on a GPU.
It is difficult to efficiently traverse the tree in a GPU-friendly way. The part about Barnes-Hut that is GPU-friendly is when calculating the forces on a given star when the inputs to that calculation are already decided. Shared memory, cache locality, and SIMD registers can be leveraged to accelerate this process.
There will be a high communication-to-computation ratio, especially if you include the time it takes to compose the interactions list. The portion computing the forces on the stars will be relatively quick once an interactions list is composed.
However, optimizing how to get to that point is the tricky part, due to thread divergence when traversing a tree. Recursion and thread divergence are not ideal for a GPU, because GPUs rely on doing the same or similar calculations many times. Interestingly, although the dual-tree walk has a better asymptotic complexity than the single-tree approach for these reasons, Liu et al. found that a single-tree implementation still performed better on a GPU. We want to compare our findings to those in the "Efficient GPU Tree Walk" paper and evaluate any differences we may find.
Jianqiao Liu, Michael Robson, Thomas Quinn, and Milind Kulkarni. 2019. Efficient GPU tree walks for effective distributed n-body simulations. In Proceedings of the ACM International Conference on Supercomputing (ICS '19). Association for Computing Machinery, New York, NY, USA, 24–34. https://doi.org/10.1145/3330345.3330348
M. Burtscher and K. Pingali. "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm". Chapter 6 in GPU Computing Gems Emerald Edition, pp. 75-92. January 2011.
Article about Barnes-Hut: The Barnes-Hut Algorithm
Another article: The Barnes-Hut Galaxy Simulator (beltoforion/Barnes-Hut-Simulator on GitHub)
We plan to achieve the following goals:
We hope to achieve the following goals:
We will use the GPUs in the GHC machines. We also have other (personal) devices with older GPUs, and we may test our code on these if we have time. We are choosing to use GPUs because we think it is an interesting question to push GPU performance for Barnes-Hut. On the one hand, the force calculation portion of Barnes-Hut is excellent for a GPU because it runs a lot of the same calculations on a large set of data at once. However, the preparation of that data is not easily done on a GPU because it involves tree traversal and thread divergence. We want to see if we can reduce this bottleneck by finding a better way to compose the interactions lists on a GPU, or in parallel on a CPU.
Milestone Report
Date | Goal | Assignee
Thursday, April 10, 11:59pm | | Delaynie + Jacob
Friday, April 18, 11:59pm | | Delaynie + Jacob
Tuesday, April 22, 11:59pm | | Delaynie + Jacob
Friday, April 25, 11:59pm | | Delaynie + Jacob
Monday, April 28, 11:59pm | | Delaynie + Jacob
April 29, 5:30-8:30pm | | Delaynie + Jacob
So far, we have a serial implementation of Barnes-Hut and a program for generating inputs for validating our parallel implementations. Additionally, we have a Python program to visualize the simulation over time. We have spent a lot of time reading through several papers and websites, particularly focusing on how to implement tree-building and tree walks in parallel. We've brainstormed a data structure based on Burtscher and Pingali's paper that represents a quadtree as a set of arrays stored in shared memory.
Relative to our original proposal, we are a bit behind on code deliverables. Because of the last exam and Carnival, we didn't initially spend as much time as we needed on background research to be prepared to write implementations of Barnes-Hut. We still plan to deliver both single-tree and dual-tree CUDA implementations of Barnes-Hut, along with runtime breakdowns, synchronization stalls, and speedup data comparing the two.
Our poster will mainly consist of graphs of the data we collected comparing the Barnes-Hut implementations. We also plan to include some images from our simulation visualizer.
Our main concern is our ability to write both implementations in time. Since we’re synthesizing ideas inspired by several different sources, our interfaces and data structures need to be carefully thought out so that they’re both compatible and efficient across sections of our code.