Project URL: https://www.andrew.cmu.edu/user/anaidu/
We plan to design and implement a parallel approach to the global mesh restructuring algorithm Loop Subdivision on an NVIDIA GPU, using CUDA to parallelize mesh geometry and topology updates. Our goal is to produce the same high-quality subdivided meshes in significantly less compute time than a sequential implementation.
Loop Subdivision is an algorithm created by Charles Loop in 1987 for upsampling triangle meshes. It relies only on local mesh operations: each triangle is split into four smaller triangles, and every vertex position is updated from its immediate neighbors. The classical pseudocode is as follows:
function LOOP_SUBDIVIDE(V, F, num_iters):   # V = vertex positions, F = triangle faces
    for iter = 1 to num_iters:
        # 0. Build adjacency (inherently complex and slow on CPU)
        for each vertex i:
            neighbors[i] = empty list
        edge_faces = empty map from undirected edge to list of incident faces
        for each face f = (i, j, k) in F:
            # record i, j, k in each other's neighbor lists
            # record f under edges (i, j), (j, k), (k, i) in edge_faces

        # 1. Reposition old vertices (even vertices)
        V_even = array of same size as V
        for each vertex i:
            n = length of neighbors[i]                  # valence of vertex i
            # compute beta from n and sumN = sum of neighbor positions
            V_even[i] = (1 - n * beta) * V[i] + beta * sumN

        # 2. Create new edge vertices (odd vertices)
        V_new = copy of V_even                          # even vertices keep their indices
        edge_point = empty map from edge -> index in V_new
        for each edge e = (i, j) in edge_faces.keys():
            # compute the new edge point P using the interior or boundary rule
            # append P to V_new and record its index in edge_point[e]

        # 3. Rebuild faces (each triangle -> 4 triangles)
        F_new = empty list
        for each face f = (i, j, k) in F:
            # look up indices of the new edge points e_ij, e_jk, e_ki
            # append (i, e_ij, e_ki), (j, e_jk, e_ij), (k, e_ki, e_jk), (e_ij, e_jk, e_ki) to F_new

        # 4. Prepare for next iteration
        V = V_new
        F = F_new
    return (V, F)
The algorithm not only creates new vertices interpolated from the old ones; it also smooths out the surface as more and more subdivisions are applied. Figure 1 shows this behavior for an icosahedron.
New edge points are similarly computed from the endpoints of the edge and the two opposite vertices of the adjacent triangles. These local rules guarantee that, away from extraordinary vertices, the limit surface is $C^2$-continuous and visually smooth, which is why Loop subdivision shows up everywhere from film VFX and character modeling to academic test models.
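For reference, the standard interior (non-boundary) Loop weights are as follows. For an even vertex $v$ with valence $n$ and 1-ring neighbors $v_1, \dots, v_n$:

$$v' = (1 - n\beta)\,v + \beta \sum_{k=1}^{n} v_k, \qquad \beta = \frac{1}{n}\left(\frac{5}{8} - \left(\frac{3}{8} + \frac{1}{4}\cos\frac{2\pi}{n}\right)^{2}\right)$$

For an odd vertex created on an interior edge $(v_i, v_j)$ whose two incident triangles have opposite vertices $v_a$ and $v_b$:

$$P = \frac{3}{8}(v_i + v_j) + \frac{1}{8}(v_a + v_b)$$

Boundary edges use simpler rules: the new edge point is the edge midpoint, and boundary even vertices are smoothed only along the boundary.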
The main challenge in parallelizing Loop subdivision on the GPU comes from the data dependencies and irregular memory accesses of the mesh structure, which map poorly onto the GPU's SIMT execution model and its reliance on coalesced memory access.
The main geometry refinement involves reading the 1-ring neighborhood of every vertex. Common mesh structures such as adjacency lists and half-edges rely on pointer/index chasing, which results in poor spatial locality. This causes non-coalesced memory accesses that severely limit performance on the GPU.
In the pseudocode for the general Loop subdivision algorithm, we build an adjacency list, which is an inherently complex and slow process on the CPU. A challenge for us is to avoid this costly setup. For example, we could use a flat structure, such as an edge-based or index/vertex buffer, that lets connectivity be derived implicitly or computed in a highly parallel way, as sketched below.
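As a minimal sketch of this idea (assuming a closed, consistently oriented manifold mesh, and using illustrative buffer names such as `d_positions` and `d_indices` rather than our actual layout), the neighbor sums needed by the even-vertex rule can be accumulated directly from the flat index buffer, one thread per face, with no adjacency list at all:

```cuda
// Sketch only: accumulate each vertex's neighbor sum and valence straight from the
// flat index buffer. Assumes a closed, consistently oriented manifold mesh, so each
// directed edge (i -> j) appears in exactly one face and every neighbor is added once.
__global__ void accumulateNeighborSums(const float3* d_positions,
                                       const unsigned int* d_indices,   // 3 per face
                                       float3* d_neighborSum,           // zero-initialized
                                       unsigned int* d_valence,         // zero-initialized
                                       unsigned int numFaces)
{
    unsigned int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= numFaces) return;

    unsigned int v[3] = { d_indices[3 * f + 0], d_indices[3 * f + 1], d_indices[3 * f + 2] };

    // For face (i, j, k), scatter along the directed edges i->j, j->k, k->i.
    for (int e = 0; e < 3; ++e) {
        unsigned int from = v[e];
        unsigned int to   = v[(e + 1) % 3];
        float3 p = d_positions[to];
        atomicAdd(&d_neighborSum[from].x, p.x);
        atomicAdd(&d_neighborSum[from].y, p.y);
        atomicAdd(&d_neighborSum[from].z, p.z);
        atomicAdd(&d_valence[from], 1u);
    }
}
```

A second kernel, one thread per vertex, can then read `d_neighborSum` and `d_valence` and apply the even-vertex rule; if atomic contention becomes a bottleneck, the scatter can be traded for a sort-by-vertex plus segmented reduction.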
When we rebuild faces and create new edges, thousands of threads concurrently define the connectivity of the new mesh. Naively assigning new vertex and face indices would lead to massive write conflicts and race conditions. Dealing with this will require a non-blocking, parallel strategy for dynamically allocating indices, such as a parallel prefix sum; a sketch follows below.
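One way to make this concrete (a sketch only, with hypothetical helper names such as `edgeKey` and `assignEdgeIndices`) is to pack each undirected edge into a 64-bit key, sort the three keys emitted per face, flag the first occurrence of every unique edge, and turn those flags into dense odd-vertex indices with an exclusive prefix sum, e.g. using Thrust:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/scan.h>
#include <thrust/adjacent_difference.h>
#include <thrust/functional.h>

// Pack an undirected edge (i, j) into one 64-bit key so the same edge emitted by two
// different faces compares equal.
__host__ __device__ inline unsigned long long edgeKey(unsigned int i, unsigned int j)
{
    unsigned int lo = i < j ? i : j;
    unsigned int hi = i < j ? j : i;
    return (static_cast<unsigned long long>(lo) << 32) | hi;
}

// Sketch: given one key per directed half-edge (3 per face), assign each unique
// undirected edge a dense index without locks. Returns the number of unique edges;
// denseIndex[k] is the odd-vertex slot for the k-th sorted key.
size_t assignEdgeIndices(thrust::device_vector<unsigned long long>& keys,
                         thrust::device_vector<unsigned int>& denseIndex)
{
    thrust::sort(keys.begin(), keys.end());

    // flags[k] = 1 where keys[k] differs from keys[k-1], i.e. the first copy of an edge.
    thrust::device_vector<unsigned int> flags(keys.size());
    thrust::adjacent_difference(keys.begin(), keys.end(), flags.begin(),
                                thrust::not_equal_to<unsigned long long>());
    flags[0] = 1;   // the first key has no predecessor and is always a first occurrence

    // The exclusive prefix sum of the 0/1 flags is exactly the dense index of each edge.
    denseIndex.resize(keys.size());
    thrust::exclusive_scan(flags.begin(), flags.end(), denseIndex.begin());

    return static_cast<size_t>(denseIndex.back()) + flags.back();
}
```

The odd vertex for an edge then lives at `numOldVertices + denseIndex` of its first occurrence, and the face-rebuilding kernel can look these slots up with a parallel binary search over the sorted keys.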
Our goal is to take the standard sequential implementation of Loop subdivision and create a parallel implementation by leveraging CUDA.
We have chosen an NVIDIA GPU with CUDA because Loop subdivision is fundamentally a data-parallel algorithm whose workload grows geometrically: each iteration splits every triangle into four, so a mesh with $F$ faces has $4^k F$ faces after $k$ iterations. The platform's high-bandwidth memory is well suited to this rapidly increasing mesh size, and its ability to launch many threads can effectively hide the memory latency inherent in mesh traversal.
| Week | Dates | Goal & Tasks |
|---|---|---|
| Week 1 | Nov 16 - Nov 22 | Setup & Baseline: Finalize GPU data structure design. Acquire a basic mesh parser. Implement and optimize a correct, single-threaded C++ CPU baseline implementation. |
| Week 2 | Nov 23 - Nov 29 | Geometry Kernel Implementation: Design and allocate GPU-friendly mesh data structures. Implement the new vertex position calculation kernel. Verify numerical correctness against the CPU baseline. |
| Week 3 | Nov 30 - Dec 6 | Optimization & Core Analysis [Milestone]: Implement the CUDA kernel for new face/connectivity calculation. Integrate all kernels and begin in-depth profiling and optimization. |
| Week 4 | Dec 7 - Dec 8 | Final Submission and Stretch: Run all final performance tests. Generate all required analysis and comparative data (graphs/visualizations). Wrap up the final report. Experiment with alternative data structures and/or per-vertex versus batched per-face implementations. |
We have mostly completed the goals outlined for Weeks 1 and 2 in the initial proposal.
We did not fully complete the mesh parser setup or the correctness verification of the Even Vertex kernel because the latter depends on the former. We have adjusted the remaining schedule accordingly to meet the final deadline.
| Week | Dates | Goal & Tasks |
|---|---|---|
| Week 1 | Nov 16 - Nov 22 | Completed: Baseline CPU implementation. |
| Week 2 | Nov 23 - Nov 29 | Completed: Geometry kernel implementation (even vertex repositioning). Finalized V/I buffer design. |
| Week 3 | Nov 30 - Dec 3 | AN: Complete robust mesh parser and verify Even Vertex kernel correctness. TW: Implement the parallel Odd Vertex Creation kernel. AN: Design and implement the non-blocking index allocation strategy. TW: Implement the Boundary Condition detection kernel. |
| | Dec 4 - Dec 6 | TW: Implement the final Connectivity Update kernel (face rebuilding). AN: Integrate all three kernels (Geometry, Odd Creation, Connectivity). Both: Begin initial end-to-end correctness testing. |
| Week 4 | Dec 7 | Both: Thorough performance profiling and optimization passes. Collect benchmark data across a range of mesh sizes. Generate comparative performance graphs. |
| | Dec 8 | Both: Final report writing, preparing the demo (OBJ export and visualization). |
Our current progress suggests we should be able to meet all "Plan to Achieve" goals, provided we resolve remaining topological challenges efficiently.
We are currently on track, but several challenges related to performance and topology are a concern: