GPU Parallelization, Validation, and Characterization of the Tensor Template Library

Winter, Alexander Cameron

Files

Winter_washington_0250O_21107.pdf (2.46 MB)

Date

2020-02-04

Abstract

Previous work developed the Tensor Template Library (TTL), a tool that uses variadic expression template metaprogramming to express tensor operations clearly, in a form resembling the mathematical notation engineers are familiar with, while concealing the cumbersome looping structures behind an optimized implementation. The TTL is useful for simulating physical systems in materials science via finite element modeling, and more generally for applications that involve large numbers of small, dense tensors. The initial work of this author was to update the TTL to operate on a graphics processing unit (GPU), build a test suite to verify that those updates compiled and generated correct output in a GPU environment, and then analyze performance within a submodule of a finite element solver, the Parallel Generalized Finite Element Solver (PGFEM). In initial characterization in a GPU environment, the TTL used inside a PGFEM submodule, the Generalized Constitutive Model (GCM), was not as performant as the raw loop implementation, nor even as an MPI distributed-memory solution. To determine where the problem lay within the TTL (if at all), microbenchmarks were developed to examine distinct TTL tensor operations over varying expression categories and complexities. The microbenchmark results were contrary to those observed in the GCM and indicated that the TTL was considerably faster than compiler-optimized raw loops. They did, however, isolate a particular class of tensor operation, tensor inner products, as a point of interest for examining the dichotomous TTL behavior. Additional microbenchmarks were developed to examine the assembly code generated by the NVIDIA CUDA Compiler (NVCC). Those microbenchmarks, stripped of any potentially confounding factors that may have cast doubt on the first set, validated the previous microbenchmarking results. Analysis of the assembly indicated that, for low-order tensors, near-identical assembly could be generated through manual intervention over the compiler's optimizations; however, it also revealed that the NVCC compilation pipeline was likely to modify template source code in non-optimal ways. Template specialization of these loop structures should resolve the problem and is currently implemented in the TTL.
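
The closing remark about template specialization can be illustrated with a minimal C++ sketch. It is not taken from the TTL: the names InnerProduct, raw_loop_dot, and self_check are hypothetical, and the code only assumes the general idea that a small, fixed-size inner product can be given an explicitly specialized, fully unrolled form instead of relying on the compiler to simplify a generic template. In a CUDA build the functions would presumably also carry __host__ __device__ qualifiers, which are omitted here so the sketch compiles as plain C++.

```cpp
#include <cassert>
#include <cstddef>

// Generic inner product over N components, written as a compile-time
// recursion. Hypothetical illustration only; this is not the TTL's API.
template <std::size_t N>
struct InnerProduct {
    static constexpr double apply(const double* a, const double* b) {
        return a[N - 1] * b[N - 1] + InnerProduct<N - 1>::apply(a, b);
    }
};

// Base case: an empty product contributes zero and ends the recursion.
template <>
struct InnerProduct<0> {
    static constexpr double apply(const double*, const double*) {
        return 0.0;
    }
};

// Explicit specialization for N = 3: the expression is written out by
// hand, so the emitted code does not depend on how the compiler treats
// the generic template above.
template <>
struct InnerProduct<3> {
    static constexpr double apply(const double* a, const double* b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }
};

// Raw-loop reference of the kind the template versions are compared with.
inline double raw_loop_dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Small usage example: the specialized, generic, and raw-loop versions
// agree on the same three-component input.
inline void self_check() {
    const double a[3] = {1.0, 2.0, 3.0};
    const double b[3] = {4.0, 5.0, 6.0};
    assert(InnerProduct<3>::apply(a, b) == raw_loop_dot(a, b, 3));
    assert(InnerProduct<2>::apply(a, b) == raw_loop_dot(a, b, 2));
}
```

Whether a specialization of this kind actually improves the generated assembly depends on the compiler and target, so any real use would need to be checked against the emitted code, as the abstract describes.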