Mesh Generation from Point Cloud in Parallel

Adit Namdev


Introduction
Computer graphics is a field that tends to require large amounts of data and extensive computation, so parallelism becomes almost a necessity. We follow Hoppe's thesis on surface reconstruction from unorganized points [1] for a serial algorithm that generates a mesh from a point cloud, then identify its timing bottlenecks and address them through data preprocessing optimizations, data sharing, and parallelization. The basic approach of this method is to use the proximity of points to generate a set of tangent planes from which a signed distance to the 'surface' of the point cloud can be measured. From there, we uniformly sample the bounding box of the point cloud and interpolate where that distance crosses zero, which yields a mesh representing the surface. The algorithm is discussed further under the Design heading below.

The key contribution of our work is that it parallelizes the reconstruction using CPU resources instead of GPU resources. Using OpenMP, we show effective reconstruction of meshes that runs on average five times faster than the serial version. Graphics workloads usually favor GPU parallelism because they perform many repetitive, SIMD-like operations on similar inputs, such as processing textures for every vertex in a mesh. However, we hypothesized that the large amount of 'memory hopping' this algorithm performs while finding nearest neighbors may actually make CPU parallelism perform better. Testing this directly, for instance against a CUDA implementation of the same algorithm, is left as future work.

Design
Algorithm
First, we should make clear that our algorithm takes a list of 3D vertices (the point cloud) along with a ⍴ value and outputs a manifold, or watertight, mesh. The appropriate ⍴ depends on how close the points in the input cloud are on average and on how much detail the mesh has; there is no exact formula, so a few values must be tried to see what looks best. This is necessary because point clouds can be at different coordinate scales, and more detailed meshes require a more precise ⍴ for the signed distance function to be accurate. ⍴ determines how many nearest neighbors are considered and how finely the point cloud's bounding box is sampled when approximating the mesh. If ⍴ is too small, however, too few near neighbors are found and the approximations become inaccurate, so a happy medium between precision and neighbor sampling is needed. The effect of different ⍴ values on the Sphere and Cube meshes is shown in Figure 10 and Figure 11. The goal of Hoppe's algorithm is to approximate the surface's signed distance function; from this, contour tracing can construct edges between points. Step 5 below is not from Hoppe, but it is a necessary step after computing edges. The algorithm is as follows:

  1. Approximate planes tangent to the surface (a sketch of the plane fit for this step follows the list).

  2. Construct a Minimum Spanning Tree (MST) to propagate consistent normal orientations.

  3. Quantize the point cloud's bounding box into a series of cubes and estimate the signed distance function at every cube corner.

  4. Perform 2D interpolation and contour tracing to construct the mesh in parallel.

  5. Vertex/edge deduplication. Because our algorithm processes cubes in parallel, we must delete the duplicates created by neighboring cubes.
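
To make Step 1 concrete, below is a minimal sketch of the per-point plane fit, using Eigen as in our implementation; the struct and function names are illustrative assumptions rather than our exact code. The plane passes through the neighborhood centroid, and its (unoriented) normal is the eigenvector of the covariance matrix with the smallest eigenvalue.

    #include <vector>
    #include <Eigen/Dense>

    struct TangentPlane {
        Eigen::Vector3d centroid;  // point the plane passes through
        Eigen::Vector3d normal;    // unit normal; its orientation is fixed later by the MST step
    };

    // PCA plane fit over one point's neighborhood: the direction of least variance of
    // the covariance matrix is taken as the (unoriented) surface normal.
    TangentPlane fitTangentPlane(const std::vector<Eigen::Vector3d>& neighborhood) {
        Eigen::Vector3d centroid = Eigen::Vector3d::Zero();
        for (const auto& p : neighborhood) centroid += p;
        centroid /= static_cast<double>(neighborhood.size());

        Eigen::Matrix3d cov = Eigen::Matrix3d::Zero();
        for (const auto& p : neighborhood) {
            Eigen::Vector3d d = p - centroid;
            cov += d * d.transpose();
        }
        // SelfAdjointEigenSolver returns eigenvalues in increasing order, so column 0
        // of the eigenvector matrix is the least-variance direction, i.e. the normal.
        Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> solver(cov);
        return {centroid, solver.eigenvectors().col(0).normalized()};
    }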

Bottlenecks and Solutions
Step 1 of the algorithm requires a near-neighbor search to find the nearby points from which a tangent plane is approximated. This was initially an O(n²) algorithm, where n is the total number of points in the point cloud, which proved very expensive on point clouds with 10,000 or more points. We addressed this with a preprocessing step: the space is quantized into a uniform octree in which each cube has a side length of ⍴, and the points are binned into cubes based on their quantized locations. This step is O(n). Now, when searching for neighboring points within a distance ⍴, we only need to search the points within a single cube layer around the current point. The smaller the value of ⍴, the more quantized cubes there are, which in most cases means fewer points to query during the near-neighbor search. This solution led to a considerable reduction in the time spent approximating tangent planes. The direct improvement from incorporating multiple processors here is shown in Figures 6-9 under the tag "Tangent Plane Approximation".
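
A minimal sketch of this binning and the resulting fixed-radius query is shown below; the key-packing scheme and the names are illustrative assumptions, not our exact code. Each point is hashed into a cube of side ⍴, so a ⍴-radius query only needs to inspect the 3x3x3 block of cubes around the query point.

    #include <cmath>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>
    #include <Eigen/Dense>

    // Integer cell coordinate of a point for cubes of side rho.
    static Eigen::Vector3i cellOf(const Eigen::Vector3d& p, double rho) {
        return Eigen::Vector3i(static_cast<int>(std::floor(p.x() / rho)),
                               static_cast<int>(std::floor(p.y() / rho)),
                               static_cast<int>(std::floor(p.z() / rho)));
    }

    // Pack a cell coordinate into one hashable key (21 bits per axis; wrap-around only
    // causes harmless hash collisions for cells 2^21 apart).
    static int64_t keyOf(const Eigen::Vector3i& c) {
        auto m = [](int v) { return static_cast<int64_t>(v) & 0x1FFFFF; };
        return (m(c.x()) << 42) | (m(c.y()) << 21) | m(c.z());
    }

    using Grid = std::unordered_map<int64_t, std::vector<int>>;  // cell key -> point indices

    // O(n) preprocessing: bin every point into its cell.
    Grid buildGrid(const std::vector<Eigen::Vector3d>& pts, double rho) {
        Grid grid;
        for (int i = 0; i < static_cast<int>(pts.size()); ++i)
            grid[keyOf(cellOf(pts[i], rho))].push_back(i);
        return grid;
    }

    // All points within distance rho of pts[q]: only the 3x3x3 block of cells around
    // q's cell needs to be inspected, instead of the whole cloud.
    std::vector<int> neighborsWithinRho(const std::vector<Eigen::Vector3d>& pts,
                                        const Grid& grid, int q, double rho) {
        std::vector<int> result;
        Eigen::Vector3i c = cellOf(pts[q], rho);
        for (int dx = -1; dx <= 1; ++dx)
          for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
                auto it = grid.find(keyOf(c + Eigen::Vector3i(dx, dy, dz)));
                if (it == grid.end()) continue;
                for (int j : it->second)
                    if (j != q && (pts[j] - pts[q]).norm() <= rho) result.push_back(j);
            }
        return result;
    }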

Step 2 of the algorithm includes a substep of finding the k nearest neighbors, which has the same issue as Step 1 discussed above. We employ the same uniform octree method; however, since we are looking for the closest points rather than all points within ⍴, we are not limited to a single cube layer. We therefore perform a BFS over layers of cubes expanding outward from the centroid of interest. This change led to an improvement similar to that of the tangent plane computation. The timing improvement from incorporating multiple processors here is shown in Figures 6-9 under the tag "k-Nearest Neighbor Search".
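
A sketch of this expanding search is shown below, reusing the Grid, cellOf, and keyOf helpers from the previous sketch; the names and the stopping rule (scan one extra shell once k candidates are found) are simplifications, not our exact implementation.

    #include <algorithm>
    #include <cstdlib>
    // Builds on the Grid, cellOf, and keyOf helpers from the previous sketch.

    // k-nearest-neighbor query via an expanding shell of cells (a BFS over cube layers):
    // examine the cells at Chebyshev radius r around the query point's cell, growing r
    // until at least k candidates are seen, scan one extra shell as a safety margin,
    // then sort the candidates by distance and keep the closest k.
    std::vector<int> kNearest(const std::vector<Eigen::Vector3d>& pts,
                              const Grid& grid, int q, double rho, int k) {
        std::vector<int> cand;
        Eigen::Vector3i c = cellOf(pts[q], rho);
        int extraShells = 1;  // shells still to scan after enough candidates are found
        for (int r = 0; extraShells >= 0; ++r) {
            for (int dx = -r; dx <= r; ++dx)
              for (int dy = -r; dy <= r; ++dy)
                for (int dz = -r; dz <= r; ++dz) {
                    // Only cells on the surface of the current shell are new this round.
                    if (std::max({std::abs(dx), std::abs(dy), std::abs(dz)}) != r) continue;
                    auto it = grid.find(keyOf(c + Eigen::Vector3i(dx, dy, dz)));
                    if (it == grid.end()) continue;
                    for (int j : it->second)
                        if (j != q) cand.push_back(j);
                }
            bool enough = static_cast<int>(cand.size()) >= k ||
                          cand.size() + 1 >= pts.size();  // every other point already seen
            if (enough) --extraShells;
        }
        std::sort(cand.begin(), cand.end(), [&](int a, int b) {
            return (pts[a] - pts[q]).squaredNorm() < (pts[b] - pts[q]).squaredNorm();
        });
        if (static_cast<int>(cand.size()) > k) cand.resize(k);
        return cand;
    }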

It is important to note the difference in timing relevance of the Tangent Plane Approximation and the k-Nearest Neighbor Search relative to the other stages across the different meshes (Sphere and Cube vs. Bunny, Cat, and Mannequin). These two stages are the main bottlenecks only for Sphere and Cube, because those two meshes have fewer bins for their points than the others. Fewer bins mean more points per bin, and thus a longer neighbor search time.

Step 3 of the algorithm requires computing the distance from each corner of a cube to the nearest tangent plane centroid. Naively, each cube would search for the nearest centroid for each of its eight corners. Two techniques improve this computation time. First, we continue to use our uniform octree data structure while searching for the nearest tangent plane centroids. Second, we note that each cube shares corners with adjacent cubes (except on the outer layer of the quantized space), so rather than performing a search for every corner of every cube, each cube performs the search for only two corners on one edge and looks to adjacent cubes for the missing values. This reduces the search work by a factor of four. The timing improvement of these optimizations with multiple processors is shown in Figures 6-9 under the tag "Cube Corner Distances Computed".
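
The distance itself is the signed distance from Hoppe's formulation: project the corner onto the normal of the tangent plane whose centroid is nearest. The sketch below shows that projection with a plain linear scan over the planes for brevity; in our implementation the nearest-centroid lookup goes through the uniform octree, and the names here are illustrative.

    #include <limits>
    #include <vector>
    #include <Eigen/Dense>

    struct TangentPlane { Eigen::Vector3d centroid, normal; };

    // Signed distance from a cube corner to the surface: find the tangent plane with the
    // nearest centroid and project (corner - centroid) onto its normal. Assumes `planes`
    // is non-empty. A linear scan is shown for brevity; the real search uses the octree bins.
    double signedDistance(const Eigen::Vector3d& corner,
                          const std::vector<TangentPlane>& planes) {
        const TangentPlane* nearest = nullptr;
        double best = std::numeric_limits<double>::max();
        for (const auto& pl : planes) {
            double d = (corner - pl.centroid).squaredNorm();
            if (d < best) { best = d; nearest = &pl; }
        }
        return (corner - nearest->centroid).dot(nearest->normal);
    }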

Step 5 of the algorithm addresses an issue that arises when computing edges. Since all cubes are processed in parallel, cubes that share faces create duplicate vertices and edges in the final mesh, so a "deduplication" phase is needed to produce a valid mesh. The bottleneck in this step is identifying which vertices are unique, mapping them into an array of only unique vertices, updating the edges to point to these unique vertices, and then discarding redundant edges. Mapping to the unique vertex buffer has to be done serially (to count the number of unique vertices), but the buffer can then be filled in parallel and the edges updated in parallel, since a mapping from old index to new index is created during the serial phase. To parallelize the filtering of duplicate edges, a lock is placed on the lower-indexed vertex of each edge while it is added to the output mesh, which allows atomic updates of an adjacency matrix; the position of each new unique edge is obtained by atomically incrementing a counter. Figures 6-9 show the breakdown of each step of our algorithm; this deduplication phase is one of the longer phases for larger point clouds because of the uniqueness searches it requires. A future step could be to reduce the amount of duplication by having neighboring cubes communicate more.
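
Below is a simplified sketch of the vertex half of this phase (the names are hypothetical, the unique-vertex buffer is built during the serial pass, and the locked adjacency-matrix filtering of duplicate edges is omitted): a serial pass assigns each distinct vertex a new index and records the old-to-new mapping, after which OpenMP threads rewrite the edge endpoints in parallel.

    #include <array>
    #include <map>
    #include <vector>
    #include <Eigen/Dense>
    #include <omp.h>

    struct RawMesh {
        std::vector<Eigen::Vector3d> vertices;   // contains duplicates from neighboring cubes
        std::vector<std::array<int, 2>> edges;   // indices into `vertices`
    };

    // Serial pass: map every distinct coordinate to one new index, recording old -> new.
    // Parallel pass: rewrite edge endpoints with OpenMP using that mapping.
    // Assumes duplicated vertices are bitwise-identical copies; otherwise quantize the key.
    RawMesh deduplicate(const RawMesh& in) {
        std::map<std::array<double, 3>, int> seen;
        std::vector<int> remap(in.vertices.size());
        RawMesh out;
        for (size_t i = 0; i < in.vertices.size(); ++i) {
            std::array<double, 3> key = {in.vertices[i].x(), in.vertices[i].y(), in.vertices[i].z()};
            auto it = seen.find(key);
            if (it == seen.end()) {
                it = seen.emplace(key, static_cast<int>(out.vertices.size())).first;
                out.vertices.push_back(in.vertices[i]);
            }
            remap[i] = it->second;
        }
        out.edges.resize(in.edges.size());
        #pragma omp parallel for
        for (long e = 0; e < static_cast<long>(in.edges.size()); ++e)
            out.edges[e] = { remap[in.edges[e][0]], remap[in.edges[e][1]] };
        return out;
    }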

Skipping deduplication causes no difference in the visualization; however, it can significantly increase the size of the output file. The following table shows the percentage of points that were duplicated and removed for the tested inputs.

Shape       Eliminated Redundancies (%)
Mannequin   86.4
Bunny       58.0
Sphere      86.8
Cube        81.25
Cat         75.9

Experimental Setup
Our program takes a point cloud and a ⍴ value as input, and the choice of ⍴ matters: different ⍴ values lead to different amounts of detail in the output mesh, and the right value depends on the sparseness and scale of the point cloud. The point clouds were processed on the latedays servers, and the output meshes were visualized in Blender, an open-source 3D graphics and animation package. We also used Eigen, a C++ linear algebra library, for vector data structures and the PCA computation.

Results

Timing Results

[Figures: speedup plots for the bunny, cube, and mannequin inputs, plus detailed per-stage speedup breakdowns for each.]

Mesh Results

[Figures: reconstructed meshes for the sphere, cube, bunny, cat, and mannequin inputs.]

Surprises
We originally looked forward to tackling the parallelization of the MST construction. However, timing measurements showed that MST generation and normal propagation account for roughly 5% of total runtime, and this only becomes a problem on very large point clouds (30,000+ points). Because of this, we opted to focus on parallelizing the nearest-neighbor searches instead.

Conclusion and Future Work
In general, our algorithm proved successful on several different meshes, and our optimizations and parallelization reduced runtime considerably. There are, however, still areas in which to improve the current implementation. A next step would be to optimize the mesh itself: currently there is no control over edge lengths or vertex degrees, and optimizing these would lead to a much smoother mesh when visualized.

An improvement to the Tangent Plane Approximation stage could involve a GPU. Since a uniform octree is already built, each point's neighborhood could be offloaded to a GPU for the plane-fitting computations; these small matrix operations take similar amounts of time and are therefore a good fit for the GPU's SIMD-style execution. It would be interesting to see how this affects runtime on large point clouds.

Another improvement that would affect overall runtime is simplifying very dense point clouds before reconstruction (subsampling). Removing unnecessary points would considerably reduce the number of neighbor searches performed.

Lastly, an improvement to our mesh generation would be to populate sparse point clouds with intermediate points (supersampling). This would increase runtime but would produce a more accurate mesh.

References

  1. Hoppe, H. Surface Reconstruction from Unorganized Points (thesis)
  2. Cases for Connecting Intersections
  3. Further Explanation of Hoppe's Thesis