Heterogeneous Parallel Programming—Week one part one
Published: 2019-06-14

Heterogeneous Parallel Programming

Wen-mei Hwu (instructor), Gang Liao (editor, greenhat1016@gmail.com)

Lecture 0: Course Overview

Course Overview


Learn how to program heterogeneous parallel computing systems and achieve

  • high performance and energy-efficiency

  • functionality and maintainability

  • scalability across future generations

Technical subjects

  • principles and patterns of parallel algorithms

  • processor architecture features and constraints

  • programming API, tools and techniques

People

Instructor: Wen-mei Hwu (w-hwu@illinois.edu); start your e-mail subject line with [Coursera]

Teaching Assistants: John Stratton, I-Jui (Ray) Sung, Xiao-Long Wu, Hee-Seok Kim, Liwen Chang, Nasser Anssari, Izzat El Hajj, Abdul Dakkak, Steven Wu, Tom Jablin

Contributors: David Kirk, John Stratton, Isaac Gelado, John Stone, Javier Cabezas, Michael Garland

Web Resources

Website: https://www.coursera.org/course/hetero

  • Handouts and lecture slides/recordings

  • Sample textbook chapters, documentation, software resources

Web board discussions

  • Channel for electronic announcements

  • Forum for Q&A - the TAs and Professors read the board, and your classmates often have answers

Grading

  • Quizzes: 50%

  • Labs (Machine Problems): 50%

Academic Honesty

  • You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people who've already taken the course is also fine.

  • Any copying of code is unacceptable

    • Includes reading someone else's code and then going off to write your own.

  • Giving/receiving help on a quiz is unacceptable

Recommended Textbook/Notes

  • D. Kirk and W. Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Morgan Kaufmann Publishers, 2010, ISBN 978-0123814722

    • We will be using a pre-public release of the 2nd Edition, made available to Coursera students at a special discount: http://store.elsevier.com/specialOffer.jsp?offerId=EST_PROG

  • Lab assignments will have accompanying notes

  • NVIDIA, NVIDIA CUDA C Programming Guide, Version 4.0, NVIDIA, 2011 (reference)

ECE498AL → ECE408/CS483 → Coursera

Tentative Schedule

Week 1
  • Lecture 0: Course Overview
  • Lecture 1: Intro to Hetero Computing
  • Lecture 2: Intro to CUDA C
  • Lab-1: installation, vector addition

Week 2
  • Lecture 3: Data Parallelism Model
  • Lecture 4: CUDA Memory Model
  • Lab-2: simple matrix multiplication

Week 3
  • Lecture 5: Tiling and Locality
  • Lecture 6: Convolution
  • Lab-3: Tiled matrix multiplication

Week 4
  • Lecture 7: Tiled Convolution
  • Lecture 8: Reduction Trees
  • Lab-3: Tiled matrix multiplication

Week 5
  • Lecture 9: Streams and Contexts
  • Lecture 10: Hetero Clusters
  • Lab 4: Tiled convolution

Week 6
  • Lecture 11: OpenCL, OpenACC
  • Lecture 12: Thrust, C++AMP
  • Lecture 13: Summary
  • Lab 4: Tiled convolution

Lecture 1.1: Introduction to Heterogeneous Parallel Computing

Heterogeneous Parallel Computing

Use the best match for the job (heterogeneity in a mobile SoC)

UIUC Blue Waters Supercomputer

  • Cray system and storage cabinets: >300
  • Compute nodes: >25,000
  • Usable storage bandwidth: >1 TB/s
  • System memory: >1.5 petabytes
  • Memory per core module: 4 GB
  • Gemini interconnect topology: 3D torus
  • Usable storage: >25 petabytes
  • Peak performance: >11.5 petaflops
  • Number of AMD Interlagos processors: >49,000
  • Number of AMD x86 core modules: >380,000
  • Number of NVIDIA Kepler GPUs: >3,000

CPUs and GPUs have very different design philosophies

CPUs: Latency Oriented Design

  • Large caches: Convert long latency memory accesses to short latency cache accesses

  • Sophisticated control

    • Branch prediction for reduced branch latency

    • Data forwarding for reduced data latency

  • Powerful ALU

    • Reduced operation latency

GPUs: Throughput Oriented Design

  • Small caches

    • To boost memory throughput

  • Simple control

    • No branch prediction

    • No data forwarding

  • Energy-efficient ALUs

    • Many, with long latency but heavily pipelined for high throughput

  • Require a massive number of threads to tolerate latencies (see the vector-addition sketch below)
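
As a concrete illustration of this throughput-oriented design (and a preview of Lab-1's vector addition), here is a minimal CUDA C kernel sketch in which every array element gets its own thread; the kernel and variable names are illustrative assumptions rather than course-provided code.

    // Each thread computes one element of C = A + B.
    // Launching far more threads than there are cores is what lets the GPU
    // hide long memory latencies, as described above.
    __global__ void vecAdd(const float *A, const float *B, float *C, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard threads past the end of the array
            C[i] = A[i] + B[i];
    }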

Winning Applications Use Both CPU and GPU

CPUs for sequential parts where latency matters

  • CPUs can be 10+X faster than GPUs for sequential code

GPUs for parallel parts where throughput wins

  • GPUs can be 10+X faster than CPUs for parallel code
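
A hedged sketch of how the two sides typically cooperate: the CPU handles the sequential setup (allocation and data transfer) and then launches the vecAdd kernel sketched above for the data-parallel work. The helper name and block size below are illustrative assumptions, not course code.

    #include <cuda_runtime.h>

    // Host (CPU) side: sequential setup, then offload the parallel part to the GPU.
    void hostVecAdd(const float *h_A, const float *h_B, float *h_C, int n)
    {
        size_t bytes = n * sizeof(float);
        float *d_A, *d_B, *d_C;

        // Sequential work on the CPU: allocate device buffers and copy the inputs over.
        cudaMalloc((void **)&d_A, bytes);
        cudaMalloc((void **)&d_B, bytes);
        cudaMalloc((void **)&d_C, bytes);
        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

        // Parallel work on the GPU: one thread per element.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, n);

        // Copy the result back and release device memory.
        cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }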

Heterogeneous parallel computing is catching on

GPU Computing Gems received 280 submissions; 90 articles were included in two volumes.

  • Financial Analysis

  • Scientific Simulation

  • Engineering Simulation

  • Data Intensive Analytics

  • Medical Imaging

  • Digital Audio Processing

  • Computer Vision

  • Digital Video Processing

  • Biomedical Informatics

  • Electronic Design Automation

  • Statistical Modeling

  • Ray Tracing Rendering

  • Interactive Physics

  • Numerical Methods

Lecture 1.2: Software Cost in Heterogeneous Parallel Computing

Software Dominates System Cost

  • SW lines per chip increase at 2x per 10 months

  • HW gates per chip increase at 2x per 18 months

  • Future systems must minimize software redevelopment

(Figure source: IBM, 2010)

Keys to Software Cost Control

  • Scalability

    • The same application runs efficiently on new generations of cores

    • The same application runs efficiently on more of the same cores

  • Portability

    • The same application runs efficiently on different types of cores

    • The same application runs efficiently on systems with different organizations and interfaces

Scalability and Portability

  • Performance growth with HW generations

    • Increasing number of compute units

    • Increasing number of threads

    • Increasing vector length

    • Increasing pipeline depth

    • Increasing DRAM burst size

    • Increasing number of DRAM channels

    • Increasing data movement latency

  • Portability across many different HW types

    • Multi-core CPUs vs. many-core GPUs

    • VLIW vs. SIMD vs. threading

    • Shared memory vs. distributed memory

The programming style we use in this course supports both scalability and portability through advanced tools.
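
As one hedged illustration of scalable, portable kernel code (an assumption about style, not necessarily the exact idiom the course teaches), a grid-stride loop lets the same CUDA kernel run correctly with any grid size, so the launch configuration can grow with future hardware while the source stays unchanged.

    // Grid-stride loop: correct for any grid size. On a bigger GPU, launch more
    // blocks and each thread simply performs fewer loop iterations.
    __global__ void scaleArray(float *data, float alpha, int n)
    {
        int stride = blockDim.x * gridDim.x;  // total number of threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] *= alpha;                 // each thread strides through the array
    }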

Reposted from: https://www.cnblogs.com/greenhat/archive/2012/11/29/2795037.html
