Loop Subdivision on Cuda

Anoushka Naidu (anaidu) and Tim Wang (twang3) - November 18, 2025

Project URL: https://www.andrew.cmu.edu/user/anaidu/

Summary

We plan to design and implement a parallel approach to the global mesh restructuring algorithm Loop Subdivision on an NVIDIA GPU, using CUDA to parallelize mesh geometry and topology updates. Our goal is to create a high quality mesh in less compute time.

Background

Loop Subdivision is an algorithm that was created by Charles Loop in 1987 for upsampling triangle meshes. The algorithm relies on local mesh operations across all vertices by taking a triangle surface and subdividing it into four further triangles. The classical pseudocode is as follows:

Classical Pseudocode

function LOOP_SUBDIVIDE (V, F, num_iters): # where V = vertices, F = faces
  for iter = 1 to num_iters:
    # 0. Build adjacency (inherently complex and slow on CPU)
    for each vertex i:
      neighbors[i] = empty list
    edge_faces = empty map from undirected edge to list of faces
    for each face f in F:
      (i, j, k) = f
      # logic to build neighbors and edge_faces

    # 1. Reposition old vertices (even vertices)
    V_even = array of same size as V
    for each vertex i:
      # calculate beta and sumN using neighbors[i]
      V_even[i] = (1 - n * beta) * V[i] + beta * sumN

    # 2. Create new edge vertices (odd vertices)
    V_new = empty list
    # first put all even vertices into V_new
    edge_point = map from edge -> index in V_new
    for each edge e (i, j) in edge_faces.keys():
      # calculate new edge point P based on interior/boundary rule
      #... append P to V_new and record index in edge_point

    # 3. Rebuild faces (each triangle -> 4 triangles)
    F_new = empty list
    for each face f in F:
      #... look up indices of new edge points e_ij, e_jk, e_ki
      # add 4 new triangles (a, e_ij, e_ki), (b, e_jk, e_ij), etc. to F_new

    # 4. Prepare for next iteration
    V = V_new
    F = F_new

  return (V, F)

This algorithm is not only creating more vertices and interpolating them against the old vertices but it also is meant to smooth out the surface as more and more subdivisions occur. Figure 1 shows an accurate representation for an isocahedron.

Figure 1: Approaching continuous smoothing through repeated subdivision. H. Schumacher, "Loop subdivision on triangle meshes," *Wolfram Community*. [Online]. Available: https://community.wolfram.com/groups/-/m/t/1338790.

New edge points are similarly computed from the endpoints of the edge and the two opposite vertices of the adjacent triangles. These local rules guarantee that, away from extraordinary vertices, the limit surface is $C^2$ -continuous and visually smooth, which is why Loop subdivision shows up everywhere from film VFX and character modeling to academic test models.

Challenges

The main challenge in parallelizing Loop subdivision on GPU comes from dealing with data dependencies and irregular memory access of the mesh structure, which doesn't necessarily go well with the GPU's memory-coalescing execution model.

The main geometry refinement involved reading the 1-ring neighborhood for every vertex. Common mesh structures like adjacency lists and half-edges rely on pointer/index chasing, which would result in poor spatial locality. This would cause non-coalesced memory access which would really limit performance on the GPU.

In the pseudocode for the general Loop subdivision algorithm, we build an adjacency list, and this is an inherently complex and slow process on the CPU. A challenge for us is to avoid this costly setup. For example we could use a flat structure, like an edge-based or index/vertex buffer that allows connectivity to be derived implicitly or calculated in a highly-parallel way.

When we rebuild faces and create new edges, we have thousands of threads concurrently defining the connectivity of the new mesh. Just naively assigning new vertex and face indices would lead to massive write conflicts and race conditions. Dealing this will require a non-blocking, parallel strategy to dynamically allocate indices, like using a parallel prefix sum.

Resources

Platform: NVIDIA GeForce RTX 2080.
Language/API: C++ and CUDA.
Visualization & Demo: OpenGL for rendering results, utilizing the CUDA-OpenGL Interop functionality, or exporting results in OBJ format to view in Blender or another 3D application.

Goals and Deliverables

Our goal is to take the standard sequential implementation of Loop subdivision and create a parallel implementation by leveraging CUDA.

Plan to Achieve (Must-Haves)

Implement all three parallelizable phases of Loop subdivision on CUDA: calculating new vertex positions, creating new edge vertices, and updating the connectivity.
Successfully implement and use a GPU-friendly mesh representation/data structure.
Achieve at least 10x speedup over a single-threading C++ CPU implementation for sizable meshes.
Utilize OpenGL or support OBJ filetype export to visualize results.
Perform a Performance Analysis across Model Size, charting the robustness of the algorithm (we envision speedup improving with more vertices).

Hope to Achieve (Stretch Goals)

Across Vertices Implementation: Implement the subdivision pipeline using a thread-per-vertex launch strategy.
Batched Per-Face Implementation: Implement the subdivision pipeline using a thread-per-face launch strategy.
Perform thorough analysis comparing the performance of the above two approaches against our hybrid primary approach.
Different Representations of Geometry: Experiment with implicit and various explicit representations (half-edge, adjacency list, vertex/index buffer) to explore performance tradeoffs.

Platform Choice

We have chosen the NVIDIA GPU with CUDA because Loop subdivision is fundamentally a data-parallel algorithm whose computational complexity grows exponentially. This platform's high bandwidth memory is perfect for handling the rapidly increasing mesh size, and its ability to launch sufficient threads can effectively hide the memory latency inherent in mesh traversal.

Schedule

Week	Dates	Goal & Tasks
Week 1	Nov 16 - Nov 22	Setup & Baseline: Finalize GPU data structure design. Acquire a basic mesh parser. Implement and optimize a correct, single-threaded C++ CPU baseline implementation.
Week 2	Nov 23 - Nov 29	Geometry Kernel Implementation: Design and allocate GPU-friendly mesh data structures. Implement the new vertex position calculation kernel. Verify numerical correctness against the CPU baseline.
Week 3	Nov 30 - Dec 6	Optimization & Core Analysis [Milestone]: Implement the CUDA kernel for new face/connectivity calculation. Integrate all kernels and begin in-depth profiling and optimization. ]
Week 4	Dec 7 - Dec 8	Final Submission and Stretch: Run all final performance tests. Generate all required analysis and comparative data (graphs/visualizations). Wrap up final report. Experiment with data structures and/or across vertices and batched per-face implementations.

Midpoint Post

Work Completed so Far

We have mostly completed the goals outlined for Week 1 and 2 in the initial proposal.

Sequential Implementation (Baseline): We finished and verified the single-threaded C++ baseline implementation of the Loop Subdivision algorithm, which is necessary for correctness verification and to provide the performance metric for our 10x speedup target. The implementation follows the four main steps: adjacency build up, even vertex repositioning, odd vertex creation, and connectivity update, using a standard adjacency list and a map-based edges-to-faces lookup table.
GPU-Friendly Data Structure Selection: We selected the Vertex and Index Buffer (V/I Buffer) as the primary GPU data structure. This flat, array-based representation is optimal for memory coalescing, which is a key requirement for performance on the NVIDIA GPU. We rejected pointer-based alternatives like Half-Edge and Adjacency List structures due to the highly irregular memory access they cause on the GPU.
Geometry Kernel Implementation: The CUDA kernel for Even (old) vertex repositioning has been implemented.

Updated Schedule and Remaining Work

We did not fully complete the mesh parser setup or the correctness verification of the Even Vertex kernel because the latter depends on the former. We have adjusted the remaining schedule accordingly to meet the final deadline.

Updated Schedule Table

Week	Dates	Goal & Tasks
Week 1	Nov 16 - Nov 22	Completed: Baseline CPU implementation.
Week 2	Nov 23 - Nov 29	Completed: Geometry kernel implementation (even vertex repositioning). Finalized V/I buffer design.
Week 3	Nov 30 - Dec 3	AN: Complete robust mesh parser and verify Even Vertex kernel correctness. TW: Implement the parallel Odd Vertex Creation kernel. AN: Design and implement the non-blocking index allocation strategy. TW: Implement the Boundary Condition detection kernel.
Week 3	Dec 4 - Dec 6	TW: Implement the final Connectivity Update kernel (face rebuilding). AN: Integrate all three kernels (Geometry, Odd Creation, Connectivity). Both: Begin initial end-to-end correctness testing.
Week 4	Dec 7	Both: Thorough performance profiling and optimization passes. Collect benchmark data across a range of mesh sizes. Generate comparative performance graphs.
Week 4	Dec 8	Both: Final report writing, preparing the demo (OBJ export and visualization).

Status of Goals and Deliverables

Our current progress suggests we should be able to meet all "Plan to Achieve" goals, provided we resolve remaining topological challenges efficiently.

CUDA Phases: The Even vertex kernel is implemented, with the two remaining topology kernels (odd vertex creation, connectivity update) being the focus of Week 3.
Data Structure: The GPU-friendly V/I buffer is designed and implemented.
Speedup: The 10x speedup goal is still achievable.
Visualization: We plan to use OBJ file export to demonstrate correctness at the poster session.
Stretch Goals: We prioritize the performance analysis across model size first, followed by the across vertices implementation.

Concerns

We are currently on track, but several challenges related to performance and topology are a concern:

Topological Race Conditions and Index Allocation: We need to find the most efficient non-blocking strategy (using either thrust::sort or CUDA Atomic Operations) to uniquely allocate indices for the new Odd vertices, as the synchronization overhead cost is currently an unknown.
Efficient Boundary Condition Detection: Detecting boundary vertices and edges on the GPU (information not explicit in the V/I buffer) requires a parallel pre-pass kernel. Implementing this without a major performance hit is a critical focus.
Warp Divergence and Memory Bandwidth Limits: The irregular nature of mesh connectivity may cause warp divergence, and the memory-intensive nature of Loop Subdivision could result in the implementation being heavily bottlenecked by global memory bandwidth, limiting the achievable speedup.