Languages and Abstractions for High-Performance Scientific Computing (CS 598APK) Spring 2025
What | Where |
---|---|
Time/place | Tue/Thu 11:00-12:15, 106B3 Engineering Hall / Catalog |
Class URL | https://bit.ly/hpcabstr-s25 |
Class recordings | Illinois Mediaspace |
Discussion forum | Discuss » |
Administrative Help | Help Desk (click "Message" on the top right) |
Calendar | View » |
Assignments
- SSH Public Key Submission (Due: Fri Jan 31)
- Talk topic selection (Due: Fri Jan 31)
- HW1 (Due: Fri Feb 12 10pm)
Schedule of Talks
Date | Slot | Who | Title | Link |
---|---|---|---|---|
February 20th, 2025 | 1 | Jacob Xianyu | Sorensen 2016: Portable inter-workgroup barrier synchronisation for GPUs | Link |
February 20th, 2025 | 2 | Ananth Madan | Baghdadi 2018: Tiramisu: A Code Optimization Framework for High Performance Systems | Link |
February 27th, 2025 | 1 | Dhruv Baronia | Kothari 2024: Hydride: A Retargetable and Extensible Synthesis-based Compiler for Modern Hardware Architectures | Link |
February 27th, 2025 | 2 | Jiatong Li | Jääskeläinen 2015: pocl: A performance-portable OpenCL implementation | Link |
March 6th, 2025 | 1 | Krut Patel | Pharr 2012: ispc: A SPMD compiler for high-performance CPU programming | Link |
March 6th, 2025 | 2 | Addison Jordan Alvey-Blanco | Ragan-Kelley 2012: Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines | Link |
March 13th, 2025 | 1 | Relena Zhao Li | Tiwari 2009: A scalable auto-tuning framework for compiler optimization | Link |
March 13th, 2025 | 2 | | | |
March 27th, 2025 | 1 | Wei-Lun Wu | Vasilache 2018: Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions | Link |
March 27th, 2025 | 2 | Ryan Cheng Thammakhoune | Heinecke 2016: LIBXSMM: accelerating small matrix multiplications by runtime code generation | Link |
April 3rd, 2025 | 1 | Sharvil Garg | Osama 2023: Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU | Link |
April 3rd, 2025 | 2 | Zhan Jin | Pai 2016: A Compiler for Throughput Optimization of Graph Algorithms on GPUs | Link |
April 10th, 2025 | 1 | Chia-Lun Tsai | Bezanson 2017: Julia: A fresh approach to numerical computing | Link |
April 10th, 2025 | 2 | Nick Koskelo | Leißa 2015: Shallow Embedding of DSLs via Online Partial Evaluation | Link |
April 17th, 2025 | 1 | Jonathan Wang | Bikshandi 2006: Programming for Parallelism and Locality with Hierarchically Tiled Arrays | Link |
April 17th, 2025 | 2 | Songyuan Cui | Blelloch 1990: Prefix sums and their applications | Link |
April 24th, 2025 | 1 | Haocheng Xia | Bauer 2012: Legion: Expressing locality and independence with logical regions | Link |
April 24th, 2025 | 2 | Shawn Lin | Catanzaro 2014: A Decomposition for In-place Matrix Transposition | Link |
April 29th, 2025 | 1 | Aditya Bhosale | Edwards 2014: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns | Link |
April 29th, 2025 | 2 | Hunter Joseph Demeyer | Ben-Nun 2019: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs | Link |
May 1st, 2025 | 1 | Po-Hao Huang | Higuchi 2019: ClPy: A NumPy-Compatible Library Accelerated with OpenCL | Link |
May 1st, 2025 | 2 | Ahan Gupta | Spampinato 2016: A Basic Linear Algebra Compiler for Structured Matrices | Link |
May 6th, 2025 | 1 | Rishabh Patrick Menezes | Luporini 2015: Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly | Link |
May 6th, 2025 | 2 | Dohyun Park | Abadi 2016: Tensorflow: a system for large-scale machine learning. | Link |
Why You Should Take this Class
Software for large-scale problems is stretched between three key requirements: high-performance (typically parallel) implementation, asymptotically optimal algorithms, and often highly technical application domains. This tension contributes considerably to making HPC software difficult to write and hard to maintain. If you are faced with this problem, this class can help you find and develop possible solution approaches.
Abstractions, tools, and languages can help restore separation of concerns and ease the creation and maintenance of such software. Proven approaches to this problem include domain-specific mini-languages ('DSLs'), code generation, and 'active' libraries.
This class begins with a quick but thorough examination of the problem setting: What machines are we realistically confronted with now and in the foreseeable future? What are the determinants of performance? How can we measure and understand performance?
From the hardware level, we will then move towards a view of abstractions: concepts and simplifications of complex hardware realities that are sufficiently thin to allow the user/developer to reason about expected performance while achieving a substantial simplification of the programming task.
We will discuss, design, and evaluate a number of such systems, with the goal of putting you in a position to
- know the available landscape of tools
- make informed choices among them for a computational task at hand
- design your own abstractions and place them in context with respect to the state of the art
As we progress, we will examine a number of increasingly high-level program representations, ranging from instruction sets to compiler IRs, through CUDA/OpenCL/'SIMT' models, to polyhedral and other higher-level representations. Along the way, you will design toy program representations and transformations for limited-scale tasks and examine your achievable and practically achieved performance.
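To give a flavor of what a "toy program representation and transformation" might look like, here is a minimal sketch in Python. All names here are illustrative, not taken from the class materials: a tiny expression tree with a bottom-up constant-folding pass.

```python
from dataclasses import dataclass

# A toy expression IR: constants, variables, sums, and products.
@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Sum:
    left: object
    right: object

@dataclass(frozen=True)
class Product:
    left: object
    right: object

def fold_constants(expr):
    """A simple bottom-up transformation: replace any subtree
    whose children are all constants with a single constant."""
    if isinstance(expr, (Const, Var)):
        return expr
    left = fold_constants(expr.left)
    right = fold_constants(expr.right)
    if isinstance(left, Const) and isinstance(right, Const):
        if isinstance(expr, Sum):
            return Const(left.value + right.value)
        return Const(left.value * right.value)
    # Rebuild the node with (possibly) simplified children.
    return type(expr)(left, right)

# (2 * 3) + x  simplifies to  6 + x
expr = Sum(Product(Const(2), Const(3)), Var("x"))
print(fold_constants(expr))  # Sum(left=Const(value=6), right=Var(name='x'))
```

Even at this scale, the example exhibits the shape of the systems we will study: a data structure representing programs, plus semantics-preserving rewrites over it.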
We will pay careful attention to semantics and correctness in the context of program representation and ultimate code generation, but we will prefer the definition of specialized or simplified semantics over extensive compiler analyses that might help prove the validity of transformations.
To complement the many excellent distributed-memory offerings in the department (e.g. CS 484, CS 554), this class focuses more on 'on-node'/shared-memory performance.
Prerequisites / What You Should Already Know
- C
- Python
- Familiarity with parallel programming: You should have measured and wondered about the performance of a code you wrote
- Having taken a class on compiler construction (such as CS 426) will be helpful, but is not required.
While this class is being offered in a CS department, it is deliberately open to graduate students in the engineering disciplines who do extensive computational work and face this range of problems every day.
Suggested Papers for Student Presentations
See this list for an idea of the focus of this class. These papers can also serve as the basis for mid-semester paper presentations.
Instructor
Course Outline
I will insert links to class material, books, and papers into this tree as time goes on.
Note: the section headings in this tree are clickable to reveal more detail.
- Introduction
- Notes
- Notes (unfilled, with empty boxes)
- Notes (source code on Github)
- About This Class
- Why Bother with Parallel Computers?
- Lowest Accessible Abstraction: Assembly
- Architecture of an Execution Pipeline
- Architecture of a Memory System
- Shared-Memory Multiprocessors
- Demo: Assembly Reading Comprehension
- Demo: Cache Organization on Your Machine
- Demo: DGEMM Performance
- Demo: Lock Contention
- Demo: More Pipeline Performance Mysteries
- Demo: NUMA and Bandwidths
- Demo: Pipeline Performance Mysteries
- Demo: Talk Time Assignment
- Demo: Talk Topic Assignment
- Demo: Threads vs Cache
- Machine Abstractions
- C
- OpenCL/CUDA
- Convergence, Differences in Machine Mapping
- Lower-Level Abstractions: SPIR-V, PTX
- Demo: AVX Playground
- Demo: Hello GPU
- Demo: Object Orientation vs Performance
- Demo: PTX and SASS
- Demo: Pointer Aliasing
- Demo: Register Pressure
- Demo: Ways to SIMD
- Performance: Expectation, Experiment, Observation
- Forming Expectations of Performance
- Timing Experiments and Potential Issues
- Profiling and Observable Quantities
- Practical Tools: perf, toplev, likwid
- Demo: Forming Architectural Performance Expectations
- Demo: Using Performance Counters
- Performance-Oriented Languages and Abstractions
- Expression Trees
- Parallel Patterns and Array Languages
- Demo: 01 Expression Trees
- Demo: 02 Traversing Trees
- Demo: 03 Defining Custom Node Types
- Demo: 04 Common Operations
- Demo: 05 Reflection in Python
- Demo: 06 Towards Execution
- Demo: Expression Templates
- Polyhedral Representation and Transformation
- Polyhedral Model: What?
- Demo: Diamond tiling
- Demo: Operating on Presburger Sets
CAUTION!
These scribbled PDFs are an unedited reflection of what we wrote during class. They need to be viewed in the context of the class discussion that led to them. See the lecture videos for that.
If you would like actual, self-contained class notes, look in the outline above.
These scribbles are provided here as a record of our class discussion, to be used in perhaps the following ways:
- as a way to cross-check your own notes
- to look up a formula that you know was shown in a certain class
- to remind yourself of what exactly was covered on a given day
By continuing to read them, you acknowledge that these files are provided as supplementary material on an as-is basis.
Computing
If you have submitted your SSH key, an account has been created for you on a family of machines managed by members of the scientific computing area. See the linked page for more information on access and usage.