Recurrent neural networks (RNNs) are widely used for natural language processing, time-series prediction, and text analysis tasks. The data and control dependencies across the fundamental numerical kernels of RNN inference and training complicate the exploitation of model parallelism, which is why only data parallelism has traditionally been applied to accelerate RNNs.
This paper presents W-Par (Wavefront-Parallelization), a comprehensive approach for RNN inference and training on CPUs that applies model parallelism to RNN models. We use fine-grained pipeline parallelism, expressed as wavefront computations, to accelerate multi-layer RNNs running on multi-core CPUs. Wavefront computations have been widely applied in scientific computing domains such as stencil kernels and dynamic programming. W-Par divides RNN workloads across parallel tasks by defining input and output dependencies for each RNN cell. Our experiments on different RNN models demonstrate that W-Par achieves up to 6.6X speed-up for RNN inference and training compared to current state-of-the-art implementations on modern multi-core CPU architectures. Importantly, W-Par maximizes performance across a wide range of scenarios, including different core counts and memory hierarchy configurations, without requiring any change at the source code level.
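To illustrate the wavefront idea, the following minimal sketch groups the cells of a multi-layer RNN into anti-diagonal wavefronts. It is not W-Par's implementation. It assumes a stacked, unidirectional RNN in which cell (layer, step) needs the output of cell (layer - 1, step) and the hidden state of cell (layer, step - 1); the function names `wavefront_schedule` and `check_dependencies` are hypothetical. Under that assumption, all cells with the same value of layer + step are mutually independent and could run as parallel tasks.

```python
def wavefront_schedule(num_layers, num_steps):
    """Group RNN cells (layer, step) into anti-diagonal wavefronts.

    Assumption (not taken from the paper): cell (layer, step) depends only on
    cells (layer - 1, step) and (layer, step - 1).
    """
    waves = []
    for diagonal in range(num_layers + num_steps - 1):
        # Clamp the layer range so that step = diagonal - layer stays valid.
        first_layer = max(0, diagonal - num_steps + 1)
        last_layer = min(num_layers - 1, diagonal)
        waves.append([(layer, diagonal - layer)
                      for layer in range(first_layer, last_layer + 1)])
    return waves


def check_dependencies(waves):
    """Return True if every cell's inputs were produced by an earlier wave."""
    done = set()
    for wave in waves:
        for layer, step in wave:
            if layer > 0 and (layer - 1, step) not in done:
                return False
            if step > 0 and (layer, step - 1) not in done:
                return False
        # Cells become visible to later waves only after the whole wave ends.
        done.update(wave)
    return True


waves = wavefront_schedule(num_layers=3, num_steps=4)
for index, wave in enumerate(waves):
    print(index, wave)
```

This sketch synchronizes at the end of each wavefront, which is the simplest way to show the available parallelism. A scheme that defines input and output dependencies per cell, as the abstract describes, would not need such a global barrier, since each cell could start as soon as its own two inputs are ready.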