ECE734 Project
Ilhyun Kim and Donghyun Baik

Title: Mapping DSP algorithms to a general purpose out-of-order processor

Background

It is becoming popular to implement DSP algorithms on general purpose processors. One reason is that many applications put more emphasis on the cost of developing and maintaining DSP software than on the lower hardware cost in mass production. Another is that a commodity general purpose processor can now easily achieve the performance a given application requires, performance that previously might have been available only from a special purpose DSP processor. Given these trends, it is becoming more important to find good ways to map DSP algorithms onto a general purpose microprocessor.

Many existing DSP transformations try to extract as much parallelism as possible from the algorithm, achieving high performance by mapping multiple independent computations to different processing elements. This is also a goal of high-performance compilers, which exploit ILP (instruction-level parallelism) to hide operation latency by overlapping instruction execution. At the same time, more and more general purpose microprocessors are built with out-of-order execution cores that dynamically reorder instructions: they keep the available hardware resources busy by selecting independent operations from within a limited scope of instructions (i.e., the instruction window). Because these efforts to extract parallelism occur at different layers (high-level language, compiled binary, and inside the processor core) independently and without collaboration, we believe that many of them are wasteful, and that some even hurt performance because the duplicated effort makes the code inefficient, e.g. by enlarging the instruction footprint.

Proposal

We will study the effect of algorithm transformations and compiler optimizations implemented for DSP applications running on a general purpose out-of-order processor, with and without several real-world constraints.
Specifically, among the many transformation techniques we focus on single assignment and loop unrolling, since they are also popular optimizations in compilers and out-of-order processors. We first determine the ideal instruction-level parallelism of binaries compiled with and without each transformation technique, with no compiler optimization enabled. We also study the effect of the compiler optimizations that resemble these techniques, applied to high-level code with and without the transformations. Later, the binaries compiled with and without transformations and optimizations will be run on realistic and ideal out-of-order machines, and the effect on actual performance will be measured.

Based on the characterization results, we want to answer the following questions:

Are transformation techniques really critical in mapping an algorithm to a general purpose out-of-order processor?
What is the best way to map an algorithm to a general purpose out-of-order processor, based on the given results?

The simulator will be built on the SimpleScalar suite, an execution-driven simulator that models a detailed out-of-order processor based on the Alpha instruction set architecture. This project includes: implementing DSP algorithms in a high-level language such as C, augmenting the simulator to measure the ILP-related numbers that we are interested in, characterizing existing implementations of DSP algorithms (e.g. MPEG, JPEG), and running a large number of simulations that vary the real-world processor constraints.
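
As an illustration of the two source-level transformations named above, the following is a minimal sketch in C on a dot product, the inner kernel of an FIR filter. The function names dot_baseline and dot_unrolled, the unroll factor of 4, and the use of integer data are assumptions made for this example and are not taken from the project code. In the unrolled version each product is written exactly once per loop body (single-assignment style), while the four partial sums remain loop-carried.

```c
#include <assert.h>

/* Baseline: a single accumulator, so every multiply-add
 * depends on the result of the previous iteration. */
static int dot_baseline(const int *a, const int *b, int n) {
    assert(n >= 0);
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Unrolled by 4 (hypothetical factor). Each product p0..p3 is
 * assigned once per loop body, and four independent partial sums
 * replace the single accumulator, shortening the dependence chain. */
static int dot_unrolled(const int *a, const int *b, int n) {
    assert(n >= 0);
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        const int p0 = a[i]     * b[i];
        const int p1 = a[i + 1] * b[i + 1];
        const int p2 = a[i + 2] * b[i + 2];
        const int p3 = a[i + 3] * b[i + 3];
        s0 += p0;
        s1 += p1;
        s2 += p2;
        s3 += p3;
    }
    /* Remainder loop for lengths that are not a multiple of 4. */
    for (; i < n; i++) {
        s0 += a[i] * b[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Integer data is used so that both versions return identical results; with floating-point data the reassociated sums could differ in rounding. Whether the unrolled form actually runs faster on an out-of-order core, which can already overlap iterations of the baseline loop within its instruction window, is the kind of question the proposal sets out to measure.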
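
The distinction between ideal ILP and ILP under a finite instruction window can be sketched with a small dataflow model. This is an assumed, simplified model written for illustration, not SimpleScalar code: the Insn type, the functions schedule_cycles and ilp, the unit latency for every instruction, unlimited issue width, and in-order retirement are all assumptions of this sketch. An instruction may start once its producers have finished and once the instruction `window` positions earlier has retired.

```c
#include <assert.h>

#define MAX_TRACE 1024  /* hypothetical upper bound on trace length */

/* One dynamic instruction: indices of its producers in the trace,
 * or -1 when an operand has no in-trace producer. */
typedef struct {
    int src1;
    int src2;
} Insn;

/* Returns the number of cycles needed to execute the trace,
 * or -1 if the arguments are invalid. */
static int schedule_cycles(const Insn *trace, int n, int window) {
    int done[MAX_TRACE];    /* cycle at which instruction i finishes */
    int retire[MAX_TRACE];  /* cycle at which instruction i retires  */
    int last = 0;

    if (n <= 0 || n > MAX_TRACE || window <= 0) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        int start = 0;
        /* Producers must appear earlier in the trace. */
        if (trace[i].src1 >= i || trace[i].src2 >= i) {
            return -1;
        }
        if (trace[i].src1 >= 0 && done[trace[i].src1] > start) {
            start = done[trace[i].src1];
        }
        if (trace[i].src2 >= 0 && done[trace[i].src2] > start) {
            start = done[trace[i].src2];
        }
        /* Window constraint: slot frees when insn i-window retires. */
        if (i >= window && retire[i - window] > start) {
            start = retire[i - window];
        }
        done[i] = start + 1;  /* unit latency (assumption) */
        /* In-order retirement: cannot retire before the previous insn. */
        retire[i] = (i > 0 && retire[i - 1] > done[i]) ? retire[i - 1]
                                                       : done[i];
        if (done[i] > last) {
            last = done[i];
        }
    }
    return last;
}

/* ILP = instructions per cycle under the model above. */
static double ilp(const Insn *trace, int n, int window) {
    int cycles = schedule_cycles(trace, n, window);
    return cycles > 0 ? (double)n / cycles : 0.0;
}
```

In this model, four mutually independent instructions finish in one cycle with a window of four but need two cycles with a window of two, while a chain of four dependent instructions needs four cycles regardless of window size. Setting the window at least as large as the trace gives the ideal (dataflow-limited) ILP.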