The finite-difference time-domain (FDTD) method is widely used for electromagnetic (EM) simulations because of its accuracy, flexibility, and simplicity. Those benefits, however, come at the cost of long computation times. Using NVIDIA’s Compute Unified Device Architecture (CUDA) technology, computation time can reportedly be reduced by more than two orders of magnitude compared to conventional CPU-based computing. A paper from Remcom titled “Accelerating the Finite Difference Time Domain (FDTD) Method with CUDA” discusses the migration of a traditional C implementation of a three-dimensional FDTD method to NVIDIA’s CUDA architecture.
The six-page paper covers the challenges and techniques involved in recasting the FDTD algorithm in a form that can leverage modern graphics processing units (GPUs) through NVIDIA’s CUDA framework. With the GPU approach, thousands of threads execute simultaneously, so achieving maximum speed requires special design considerations. According to the paper, a properly designed CUDA implementation can run more than two orders of magnitude faster than one on a traditional central processing unit (CPU).
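To make the starting point of such a migration concrete, the sketch below shows what a conventional serial FDTD update can look like in C. It is not taken from Remcom's paper: it is a one-dimensional, normalized-unit simplification of the three-dimensional method, and the function name fdtd1d_step and its parameters are hypothetical. The point to notice is that each sample is updated only from its immediate neighbors in the other field, so the iterations of each loop are independent of one another. That independence is what allows the work to be spread across many threads.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical names, not Remcom's code).
 *
 * One leapfrog time step of a 1D FDTD update in normalized units.
 *   ez      : electric field, n samples
 *   hy      : magnetic field, n - 1 samples, staggered half a cell from ez
 *   courant : Courant number c*dt/dx (the 1D scheme is stable for <= 1)
 * The end samples ez[0] and ez[n-1] are never written, so they keep their
 * initial values; if those are zero they act as perfectly conducting walls.
 */
void fdtd1d_step(double *ez, double *hy, size_t n, double courant)
{
    assert(n >= 2);

    /* Magnetic field update: each hy[i] reads only ez[i] and ez[i + 1]. */
    for (size_t i = 0; i + 1 < n; ++i)
        hy[i] += courant * (ez[i + 1] - ez[i]);

    /* Electric field update: each ez[i] reads only hy[i - 1] and hy[i]. */
    for (size_t i = 1; i + 1 < n; ++i)
        ez[i] += courant * (hy[i] - hy[i - 1]);
}
```

Calling this function repeatedly advances the simulation in time. In a three-dimensional solver the same pattern applies to six field components, which is where the heavy computational cost mentioned above comes from.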
Although GPUs were originally used solely to drive graphical displays, they have evolved into powerful computational devices. The Tesla C1060 GPU, for example, can yield significant performance gains over a 2.66-GHz Intel Core 2 Quad processor. The paper provides additional details on CUDA GPUs, including their architecture and the several types of memory that CUDA devices offer. Functions targeted for the GPU are implemented in CUDA as kernels, which are written in a manner similar to the C programming language. The paper also describes the optimizations that were implemented, which reduced memory operations to less than 14% of the original total. These and other optimizations were applied to the FDTD algorithm, which was integrated into Remcom’s XFdtd software.
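The kernel idea can be illustrated in plain C, without any CUDA-specific syntax. The sketch below is an assumption-laden illustration rather than code from Remcom's paper or from XFdtd: it uses a one-dimensional, normalized-unit field update, and every name in it (update_hy_cell, update_ez_cell, fdtd1d_step_per_cell) is hypothetical. Each kernel-style function handles exactly one grid index and returns early if that index has no work to do. An ordinary host loop stands in for the GPU threads; in real CUDA code the index would instead be derived from the built-in blockIdx, blockDim, and threadIdx variables, and the loop would be replaced by a kernel launch.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical names, not Remcom's code). */

/* Kernel-style body: updates the single magnetic-field sample at index i.
 * hy has n - 1 samples, so indices from n - 1 upward have nothing to do. */
static void update_hy_cell(const double *ez, double *hy, size_t n,
                           double courant, size_t i)
{
    if (i + 1 >= n)
        return;                 /* surplus "thread": no work */
    hy[i] += courant * (ez[i + 1] - ez[i]);
}

/* Kernel-style body: updates the single electric-field sample at index i.
 * The two end samples are boundaries and are left untouched. */
static void update_ez_cell(double *ez, const double *hy, size_t n,
                           double courant, size_t i)
{
    if (i == 0 || i + 1 >= n)
        return;                 /* boundary or surplus "thread": no work */
    ez[i] += courant * (hy[i] - hy[i - 1]);
}

/* Host-side stand-in for two consecutive kernel launches. Each loop
 * iteration plays the role of one thread. The whole magnetic-field pass
 * finishes before the electric-field pass begins. */
void fdtd1d_step_per_cell(double *ez, double *hy, size_t n,
                          double courant, size_t n_threads)
{
    assert(n >= 2);
    assert(n_threads >= n);     /* at least one "thread" per sample */

    for (size_t i = 0; i < n_threads; ++i)
        update_hy_cell(ez, hy, n, courant, i);

    for (size_t i = 0; i < n_threads; ++i)
        update_ez_cell(ez, hy, n, courant, i);
}
```

The early-return guards matter because GPU threads are launched in fixed-size blocks, so the thread count is usually rounded up past the number of grid cells. Keeping the two field updates in separate passes reflects the dependency between them: every electric-field sample needs the already-updated magnetic field of its neighbors.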
A modern cellular phone design, including all major device components, was chosen for the final test simulation. The CPU baseline ran on a single thread, while the GPU implementation used all four Tesla C1060s. In the test results, the GPU implementation consistently ran more than two orders of magnitude faster than the CPU implementation.
Remcom, Inc., 315 S. Allen St., Ste. 416, State College, PA 16801; (814) 861-1299.