The basic equations in cardiac electrophysiology are the bidomain equations describing the intercellular and the extracellular electrical potential via a system of two PDEs coupled non-linearly by a bunch of ODEs. Its difference, the trans-membrane potential, is responsible for the excitation of the heart and its steepest gradients form an excitation wavefront propagating in time. This arrival time of the wavefront can be approximated by the Eikonal equation with given heterogeneous and anisotropic velocity information.
The numerical solution of the Eikonal equation follows the fast iterative method with its application for tetrahedral meshes. The accuracy of these simulations is limited by unavailable patient specific conductivity data.
The human heart consists of various tissues with different conductivity parameters.
We group these tissues into m different classes with its individual scaling parameter gamma_k yielding to the modified Eikonal equation with material domains where the velocity information in each material domain omega_k is scaled by the parameter gamma_k. Now the activation time depends also on the scaling parameters gamma.
One chance to determine the scaling parameters suitably consists in comparing the Eikonal computed activation sequence on the heart surface with the measured ECG on the torso mapped onto this surface. The performance of the solver is crucial for the optimization performance especially when using an automatic differentiation approach. Calculating the gradient via an analytic adjoint-state method is ongoing work.
Currently we have very performant parallel Eikonal solvers which include shared memory parallelization and GPU parallelization using CUDA. A low memory footprint solver where the memory access patterns are optimized is implemented on both solvers. The efficient implementation requires a local Gray-code numbering of edges in the tetrahedron and a bunch
of bit shifts to assign the appropriate data. Numerical experiments on CPUs and GPUs show that the reduced memory footprint approach is faster by 40\% than the original implementation.
For large scale problems, the task based parallel model will run into difficulties: There might be not enough (shared) memory on a single host or on a GPU, the computing power of a single compute unit is not sufficient, or the parallel efficiency is not satisfactory. In all cases, a distributed memory model is needed. Hence a coarser decomposition of the algorithm is needed, namely a domain decomposition approach.
We present two different strategies on load balancing in CUDA in order to achieve to run the domain decomposition approach in one GPU.
The first approach maps simply one sub-domain to one thread block. Its scaling improves with an increased number of sub-domains by reducing the overall host synchronization together with the preallocation of the global memory. The second approach takes better advantage of the GPU shared memory since it shares the workload of one sub-domain between many thread blocks exploiting in this way the total shared memory space.
The domain decomposition approach is the first step towards the inter-process communication implementation where the limitation of the global memory will be overcome completely by using multiple accelerator cards and cluster computing.