“A Comprehensive Performance Analysis of HSA and OpenCL 2.0” (2016)
Summary: Most modern systems combine CPUs and GPUs that can both perform general-purpose computation, so heterogeneous programming frameworks that use both components are a crucial part of the development process. Besides the widely known OpenCL and CUDA frameworks, the HSA Foundation has introduced a lower-level programming specification (HSA 1.0). This paper presents multiple benchmarks that evaluate the performance of OpenCL 2.0 and HSA 1.0 on AMD’s APUs with regard to shared virtual memory (SVM), kernel launch, and device communication. One of the main results is that HSA 1.0 can reduce communication overhead between CPU and GPU through persistent kernels, which are launched on the GPU just once and remain active while potentially waiting for new data to arrive.
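The persistent-kernel idea can be illustrated with a CPU-only analogy. The sketch below is not HSA or OpenCL code and does not come from the paper: Python threads stand in for GPU kernels, and the names `launch_per_task` and `persistent_worker` are hypothetical. It only contrasts paying a launch for every work item with launching a single worker once and feeding it data through a queue.

```python
import queue
import threading


def launch_per_task(tasks):
    """One launch per work item: start a worker, wait for it, repeat."""
    results = []
    for value in tasks:
        out = []
        worker = threading.Thread(target=lambda v=value: out.append(v * v))
        worker.start()   # launch cost is paid for every item
        worker.join()
        results.append(out[0])
    return results


def persistent_worker(tasks):
    """One launch in total: the worker stays alive and waits for new data."""
    inbox = queue.Queue()
    outbox = queue.Queue()
    stop = object()      # sentinel that tells the worker to shut down

    def kernel():
        while True:
            item = inbox.get()   # blocks until new data arrives
            if item is stop:
                break
            outbox.put(item * item)

    worker = threading.Thread(target=kernel)
    worker.start()       # launch cost is paid once
    for value in tasks:
        inbox.put(value)
    inbox.put(stop)
    worker.join()
    # A single worker processes items in arrival order, so order is preserved.
    return [outbox.get() for _ in tasks]
```

Both functions return the same results; the difference is only in how often a worker is started, which is the overhead the persistent variant is meant to avoid.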
“Low Latency Complex Event Processing on Parallel Hardware” (2012)
Summary: One of the most expensive operations in event processing systems is the sequential pattern matching operator. Pattern matching allows the user to specify a sequence of events (e.g., first smoke (S), then rising temperature (T)) that signals a complex event (e.g., fire (F)) within a stream. The sequence can often be specified via a regular expression (e.g., F=ST), and processing usually relies on an automaton derived from this specification, whose states advance with each newly arriving event. To improve query throughput, the authors of this paper propose an alternative approach that delays the processing of some events (e.g., ignoring S events until a T arrives). This creates a natural batching mechanism, which is used to evaluate parts of the pattern matching process on the GPU via CUDA. In experiments, the new approach outperforms the traditional evaluation method when the number of states is large or the windows are large (a window being the allowed time gap between the first and last event of a pattern).
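The contrast between automaton-style and delayed evaluation can be sketched for the pattern F=ST. This is a minimal CPU-only illustration under stated assumptions, not the paper's algorithm or its CUDA implementation: events are hypothetical `(type, timestamp)` tuples with non-decreasing timestamps, every S followed by a T within the window counts as a match, and the function names are invented for the example.

```python
from typing import List, Tuple

Event = Tuple[str, int]   # (event type, timestamp); hypothetical format
Match = Tuple[int, int]   # (timestamp of S, timestamp of T)


def eager_matches(events: List[Event], window: int) -> List[Match]:
    """Automaton-style: partial matches are updated on every arriving event."""
    partial: List[int] = []   # timestamps of S events still inside the window
    found: List[Match] = []
    for etype, ts in events:
        # State maintenance happens for each event, whatever its type.
        partial = [s for s in partial if ts - s <= window]
        if etype == "S":
            partial.append(ts)
        elif etype == "T":
            found.extend((s, ts) for s in partial)
    return found


def lazy_matches(events: List[Event], window: int) -> List[Match]:
    """Delayed: S events are only stored; work happens when a T arrives."""
    buffered: List[int] = []  # S timestamps, untouched until needed
    found: List[Match] = []
    for etype, ts in events:
        if etype == "S":
            buffered.append(ts)
        elif etype == "T":
            # The whole buffer is checked in one batch; a step of this shape
            # is what could be handed to a parallel device.
            batch = [s for s in buffered if ts - s <= window]
            found.extend((s, ts) for s in batch)
            buffered = batch  # older S events cannot match any later T
    return found
```

For example, with the stream `[("S", 1), ("S", 2), ("X", 3), ("T", 4), ("S", 9), ("T", 10)]` and a window of 3, both functions report the matches `(1, 4)`, `(2, 4)`, and `(9, 10)`; they differ only in when the work is done.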