An Approach to Enhance Loop Performance for Multicluster VLIW DSP Processor

Conference: ARCS 2014 - 27th International Conference on Architecture of Computing Systems
02/25/2014 - 02/28/2014 at Luebeck, Germany

Proceedings: ARCS 2014

Pages: 8
Language: English
Type: PDF

Yang, Yangzhao; Gu, Naijie; Ren, Kaixin (School of Computer Science and Technology, University of Science and Technology of China, Hefei, 230027, China)
Hu, Bingqing (School of Management, University of Science and Technology of China, Hefei, 230027, China)

Modern Very Long Instruction Word (VLIW) DSPs improve performance by exploiting instruction-level parallelism (ILP). Program code generally divides into straight-line code and loops, and loops are often the critical factor in program performance. Software pipelining and vectorization are the common methods for improving loop performance, in which Single Instruction Multiple Data (SIMD) instructions can be used efficiently. Software pipelining finds parallelism among instructions from different loop iterations so that they can execute in parallel, enhancing loop performance. To improve the efficiency of software pipelining, loop unrolling is frequently used, which also affects vectorization. In this situation, the loop unrolling factor should be considered first. This paper presents a new heuristic optimization approach called SLUS for determining the loop unrolling factor dynamically, so that loop unrolling explicitly manages the communication of operands between scalar and SIMD instructions, naturally coordinates the relationship between software pipelining and vectorization, partially (or fully) inhibits the optimization when vectorization would decrease performance, and fully uses the resources and processing capacity of the DSP processor. SLUS results in better resource utilization, and it is applied in the BWDSP compiler backend, where vectorization decisions are more amenable to cost analysis. According to the experiments, SLUS achieves an average performance improvement of 5% to 15% on our BWDSP platform compared with traditional methods.
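
To make the role of the unrolling factor concrete, the following is a minimal sketch in C of a dot-product loop unrolled by a fixed factor of 4. It is not SLUS and not BWDSP compiler code: the factor is hard-coded rather than chosen heuristically, and the names `dot_scalar`, `dot_unrolled`, and `UNROLL` are hypothetical, introduced only for illustration. The four independent accumulators stand in for the kind of parallelism that a software pipeliner or a SIMD unit could exploit.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline loop: one multiply-accumulate per iteration, with every
 * iteration depending on the previous value of `sum`. */
int dot_scalar(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Assumed, fixed unrolling factor (illustrative only). */
#define UNROLL 4

/* Same computation unrolled by UNROLL. The four partial sums carry no
 * dependences on each other, so a scheduler may overlap them or map
 * them onto SIMD lanes. */
int dot_unrolled(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main body: processes UNROLL elements per iteration. */
    for (; i + UNROLL <= n; i += UNROLL) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    int sum = s0 + s1 + s2 + s3;

    /* Scalar remainder: handles the last n % UNROLL elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

A larger factor exposes more independent operations per iteration but also raises register pressure and code size, which is one reason the factor might be chosen per loop rather than fixed as it is here.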