[Paper Walkthrough] A Note on Auto-tuning GEMM for GPUs #54
A Note on Auto-tuning GEMM for GPUs was published on January 12, 2009, by Yinan Li, Jack Dongarra, and Stanimire Tomov. Inspired by auto-tuning work on CPUs, the authors propose templated (parameterized) kernels as the basis for automated performance tuning on GPUs (a minimal sketch of this idea follows the list below).

Tuning GPU performance, for GEMM in particular, requires substantial GPU expertise and a deep understanding of architectural details that often cannot be found in the documentation. To address this, the authors introduce auto-tuning on the GPUs of the time.

Before this work, AMD, Intel, IBM, and NVIDIA had all invested heavily in future microprocessor architectures and large-scale HPC systems. Generally speaking, a large-scale HPC system has two kinds of components:

- homogeneous multi-core/many-core processors;
- special-purpose hardware and accelerators, such as GPUs.
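As a concrete illustration of the templated-kernel idea, here is a minimal CUDA sketch written for this note rather than taken from the paper: a shared-memory tiled SGEMM whose tile size is a compile-time template parameter, so that an auto-tuner can instantiate and benchmark many variants. The kernel name, the single square TILE parameter, and the blocking scheme are all simplifying assumptions; the paper's generated kernels explore a richer parameter space.

```cuda
// Illustrative sketch only, not the paper's actual kernel.
// A is M x K, B is K x N, C is M x N, all row-major.
template <int TILE>
__global__ void sgemm_tiled(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Stage one tile of A and one tile of B in shared memory,
        // padding with zeros at the matrix edges.
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Multiply the staged tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
```

Each instantiation, e.g. sgemm_tiled<8>, sgemm_tiled<16>, sgemm_tiled<32>, is one point in the tuner's search space.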
How these two kinds of components will be balanced in future high-performance computing architectures is not yet clear either; future systems may well contain even more heterogeneous hardware. What is clear is that emerging architectures drive the development of new algorithms, and those algorithms have much larger design spaces than before. For example, early autotuners were limited to BLAS and built their search space from a small number of parameters such as blocking sizes, which was indeed enough to reach respectable performance before multi-core arrived. In today's environment, where new hardware and new architectures appear constantly, this paper uses GEMM auto-tuning to quickly pick up new GPU features, such as double-precision arithmetic support, and to speed up the initial design of algorithms. New hardware and architectures are one side of it; different sizes of the same problem also call for corresponding adjustments, which is why automatic tuning is the best option.

Auto-tuning for CPUs

Before turning to GPUs: automatic performance tuning (optimization) began on CPUs. It was first applied in compute-intensive linear algebra libraries, such as the BLAS implementations ATLAS and PHiPAC, and also in FFTW for digital signal processing. There are two approaches to implementing auto-tuning, model-driven optimization and empirical optimization:

- Model-driven optimization determines the tuning parameters (e.g., blocking sizes) from an analytical model of the architecture, so no search on the hardware is needed; the drawback is that such models rarely capture undocumented architectural behavior, so the generated code can be suboptimal.
- Empirical optimization generates many parameterized code variants and runs them on the target machine to pick the best one; this yields the most accurate tuning at the cost of an expensive search.
In other words, model-driven optimization takes only O(1) time, because the parameters are determined from an analytical model. A natural idea is therefore to combine the strengths of the two approaches: use the analytical model in a first stage to constrain the enormous parameterized search that the empirical second stage would otherwise have to perform (a minimal sketch of such a two-stage search follows below).

Beyond that, the adaptivity of auto-tuning deserves emphasis: the optimal compute kernels can be generated at installation time, giving platform adaptivity. During installation the software queries the hardware's characteristics and, based on that information, tries and compiles the corresponding variants, so that once installation finishes, the optimal kernel implementation for that platform is in place.
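To make the two-stage idea concrete, here is a minimal, hypothetical host-side tuning driver, written for this note and not taken from the paper. The "model" stage prunes tile sizes that violate hardware limits queried from the device (the same query could run at installation time for platform adaptivity), and the "empirical" stage times the surviving instantiations of the sgemm_tiled template from the sketch above. The matrix sizes and the candidate tile list are arbitrary assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
// Assumes the sgemm_tiled<TILE> template from the previous sketch is in scope.

// Empirical stage: time one kernel instantiation with CUDA events.
template <int TILE>
float time_candidate(int M, int N, int K,
                     const float* dA, const float* dB, float* dC) {
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    cudaEventRecord(beg);
    sgemm_tiled<TILE><<<grid, block>>>(M, N, K, 1.0f, dA, dB, 0.0f, dC);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    return ms;
}

// Model stage: reject configurations the device cannot run at all, using
// limits queried from the hardware (the "analytical" constraints here are
// deliberately simple: shared memory and threads per block).
bool feasible(int tile, const cudaDeviceProp& p) {
    size_t smem = 2ull * (size_t)tile * tile * sizeof(float); // As + Bs tiles
    return smem <= p.sharedMemPerBlock && tile * tile <= p.maxThreadsPerBlock;
}

int main() {
    const int M = 1024, N = 1024, K = 1024;
    float *dA, *dB, *dC;
    // Inputs are left uninitialized: this kernel's run time does not
    // depend on the data values, only on the problem size.
    cudaMalloc(&dA, (size_t)M * K * sizeof(float));
    cudaMalloc(&dB, (size_t)K * N * sizeof(float));
    cudaMalloc(&dC, (size_t)M * N * sizeof(float));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Stage 1 (model) filters; stage 2 (empirical) times the survivors.
    if (feasible(8,  prop)) printf("TILE=8:  %.3f ms\n", time_candidate<8>(M, N, K, dA, dB, dC));
    if (feasible(16, prop)) printf("TILE=16: %.3f ms\n", time_candidate<16>(M, N, K, dA, dB, dC));
    if (feasible(32, prop)) printf("TILE=32: %.3f ms\n", time_candidate<32>(M, N, K, dA, dB, dC));

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Running such a driver once per installation, and caching the winning configuration, is one simple way to realize the install-time adaptivity described above.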
https://www.researchgate.net/profile/Stanimire-Tomov/publication/262407593_A_note_on_auto-tuning_GEMM_for_GPUs/links/0c9605283fb55e1dd0000000/A-note-on-auto-tuning-GEMM-for-GPUs.pdf
A Note on Auto-tuning GEMM for GPUs
Authors: Yinan Li, Jack Dongarra, Stanimire Tomov
Abstract: The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in single precision and of up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is difficult to achieve. The development involves extensive GPU knowledge and even backward engineering to understand some undocumented insides about the architecture that have been of key importance in the development. In this paper, we describe some GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, the existing ideas. Auto-tuning, as we show in this paper, is a very practical solution where in addition to getting an easy portability, we can often get substantial speedups even on current GPUs (e.g. up to 27% in certain cases for both single and double precision GEMMs on the GTX 280).
Keywords: Auto-tuning, matrix multiply, dense linear algebra, GPUs