Languages and Abstractions for High-Performance Scientific Computing (CS 598APK) Spring 2025
What | Where |
---|---|
Time/place | Tue/Thu 11:00-12:15, 106B3 Engineering Hall / Catalog |
Class URL | https://bit.ly/hpcabstr-s25 |
Class recordings | Illinois Mediaspace |
Discussion forum | Discuss » |
Administrative Help | Help Desk (click "Message" on the top right) |
Calendar | View » |
Assignments
- SSH Public Key Submission (Due: Fri Jan 31)
- Talk topic selection (Due: Fri Jan 31)
- HW1 (Due: Fri Feb 12 10pm)
Schedule of Talks
Date | Slot | Who | Title | Link |
---|---|---|---|---|
February 20th, 2025 | 1 | Jacob Xianyu | Sorensen 2016: Portable inter-workgroup barrier synchronisation for GPUs | Link |
February 20th, 2025 | 2 | Ananth Madan | Baghdadi 2018: Tiramisu: A Code Optimization Framework for High Performance Systems | Link |
February 27th, 2025 | 1 | Dhruv Baronia | Kothari 2024: Hydride: A Retargetable and Extensible Synthesis-based Compiler for Modern Hardware Architectures | Link |
February 27th, 2025 | 2 | Jiatong Li | Jääskeläinen 2015: pocl: A performance-portable OpenCL implementation | Link |
March 6th, 2025 | 1 | Krut Patel | Pharr 2012: ispc: A SPMD compiler for high-performance CPU programming | Link |
March 6th, 2025 | 2 | Addison Jordan Alvey-Blanco | Ragan-Kelley 2012: Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines | Link |
March 13th, 2025 | 1 | Relena Zhao Li | Tiwari 2009: A scalable auto-tuning framework for compiler optimization | Link |
March 13th, 2025 | 2 | | | |
March 27th, 2025 | 1 | Wei-Lun Wu | Vasilache 2018: Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions | Link |
March 27th, 2025 | 2 | Ryan Cheng Thammakhoune | Heinecke 2016: LIBXSMM: accelerating small matrix multiplications by runtime code generation | Link |
April 3rd, 2025 | 1 | Sharvil Garg | Osama 2023: Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU | Link |
April 3rd, 2025 | 2 | Zhan Jin | Pai 2016: A Compiler for Throughput Optimization of Graph Algorithms on GPUs | Link |
April 10th, 2025 | 1 | Chia-Lun Tsai | Bezanson 2017: Julia: A fresh approach to numerical computing | Link |
April 10th, 2025 | 2 | Nick Koskelo | Leißa 2015: Shallow Embedding of DSLs via Online Partial Evaluation | Link |
April 17th, 2025 | 1 | Jonathan Wang | Bikshandi 2006: Programming for Parallelism and Locality with Hierarchically Tiled Arrays | Link |
April 17th, 2025 | 2 | Songyuan Cui | Blelloch 1990: Prefix sums and their applications | Link |
April 24th, 2025 | 1 | Haocheng Xia | Bauer 2012: Legion: Expressing locality and independence with logical regions | Link |
April 24th, 2025 | 2 | Shawn Lin | Catanzaro 2014: A Decomposition for In-place Matrix Transposition | Link |
April 29th, 2025 | 1 | Aditya Bhosale | Edwards 2014: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns | Link |
April 29th, 2025 | 2 | Hunter Joseph Demeyer | Ben-Nun 2019: Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs | Link |
May 1st, 2025 | 1 | Po-Hao Huang | Higuchi 2019: ClPy: A NumPy-Compatible Library Accelerated with OpenCL | Link |
May 1st, 2025 | 2 | Ahan Gupta | Spampinato 2016: A Basic Linear Algebra Compiler for Structured Matrices | Link |
May 6th, 2025 | 1 | Rishabh Patrick Menezes | Luporini 2015: Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly | Link |
May 6th, 2025 | 2 | Dohyun Park | Abadi 2016: Tensorflow: a system for large-scale machine learning. | Link |
Why You Should Take this Class
Software for large-scale problems is stretched between three key requirements: high-performance (typically parallel) implementation, asymptotically optimal algorithms, and often highly technical application domains. This tension contributes considerably to making HPC software difficult to write and hard to maintain. If you are faced with this problem, this class can help you find and develop possible solution approaches.
Abstractions, tools, and languages can help restore separation of concerns and ease the creation and maintenance of such software. Proven approaches to this problem include domain-specific mini-languages ('DSLs'), code generation, and 'active' libraries.
This class begins with a quick but thorough examination of the problem setting: What machines are we realistically confronted with now and in the foreseeable future? What are the determinants of performance? How can we measure and understand performance?
From the hardware level, we will then move towards a view of abstractions: concepts and simplifications of complex hardware realities that are sufficiently thin to allow the user/developer to reason about expected performance while achieving a substantial simplification of the programming task.
We will discuss, design, and evaluate a number of such systems, with the goal of putting you in a position to
- know the available landscape of tools
- make informed choices among them for a computational task at hand
- design your own abstractions and place them in context with respect to the state of the art
As we progress, we will examine a number of increasingly high-level program representations, ranging from instruction sets to compiler IRs, through CUDA/OpenCL/'SIMT' models, to polyhedral and other higher-level representations. Along the way, you will design toy program representations and transformations for limited-scale tasks and examine your achievable and practically achieved performance.
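To give a flavor of what a "toy program representation and transformation" might look like, here is a minimal sketch in Python. All names here are illustrative, not taken from the class materials: a tiny expression tree with a bottom-up constant-folding pass.

```python
from dataclasses import dataclass

# A toy expression IR: constants, variables, sums, and products.
@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Sum:
    left: object
    right: object

@dataclass(frozen=True)
class Product:
    left: object
    right: object

def fold_constants(expr):
    """A simple bottom-up transformation: replace any subtree
    whose children are all constants with a single constant."""
    if isinstance(expr, (Const, Var)):
        return expr
    left = fold_constants(expr.left)
    right = fold_constants(expr.right)
    if isinstance(left, Const) and isinstance(right, Const):
        if isinstance(expr, Sum):
            return Const(left.value + right.value)
        return Const(left.value * right.value)
    # Rebuild the node with (possibly) simplified children.
    return type(expr)(left, right)

# (2 * 3) + x  simplifies to  6 + x
expr = Sum(Product(Const(2), Const(3)), Var("x"))
print(fold_constants(expr))  # Sum(left=Const(value=6), right=Var(name='x'))
```

Even at this scale, the example exhibits the shape of the systems we will study: a data structure representing programs, plus semantics-preserving rewrites over it.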
We will pay careful attention to semantics and correctness in the context of program representation and ultimate code generation, but we will prefer the definition of specialized or simplified semantics over extensive compiler analyses that might help prove the validity of transformations.
To complement the many excellent distributed-memory offerings in the department (e.g. CS 484, CS 554), this class focuses more on 'on-node'/shared-memory performance.
Prerequisites / What You Should Already Know
- C
- Python
- Familiarity with parallel programming: You should have measured and wondered about the performance of a code you wrote
- Having taken a class on compiler construction (such as CS 426) will be helpful, but is not required.
While this class is being offered in a CS department, it is deliberately open to graduate students in the engineering disciplines who do extensive computational work and face this range of problems every day.
Suggested Papers for Student Presentations
See this list for an idea of the focus of this class. These papers can also serve as the basis for mid-semester paper presentations.
Instructor
Course Outline
I will insert links to class material, books, and papers into this tree as time goes on.
Note: the section headings in this tree are clickable to reveal more detail.
- Introduction
- Notes
- Notes (unfilled, with empty boxes)
- Notes (source code on Github)
- About This Class
- Why Bother with Parallel Computers?
- Lowest Accessible Abstraction: Assembly
- Architecture of an Execution Pipeline
- Architecture of a Memory System
- Shared-Memory Multiprocessors
- Demo: Assembly Reading Comprehension
- Demo: Cache Organization on Your Machine
- Demo: DGEMM Performance
- Demo: Lock Contention
- Demo: More Pipeline Performance Mysteries
- Demo: NUMA and Bandwidths
- Demo: Pipeline Performance Mysteries
- Demo: Talk Time Assignment
- Demo: Talk Topic Assignment
- Demo: Threads vs Cache
- Machine Abstractions
- C
- OpenCL/CUDA
- Convergence, Differences in Machine Mapping
- Lower-Level Abstractions: SPIR-V, PTX
- Demo: AVX Playground
- Demo: Hello GPU
- Demo: Object Orientation vs Performance
- Demo: PTX and SASS
- Demo: Pointer Aliasing
- Demo: Register Pressure
- Demo: Ways to SIMD
- Performance: Expectation, Experiment, Observation
- Forming Expectations of Performance
- Timing Experiments and Potential Issues
- Profiling and Observable Quantities
- Practical Tools: perf, toplev, likwid
- Demo: Forming Architectural Performance Expectations
- Demo: Using Performance Counters
- Performance-Oriented Languages and Abstractions
- Expression Trees
- Parallel Patterns and Array Languages
- Demo: 01 Expression Trees
- Demo: 02 Traversing Trees
- Demo: 03 Defining Custom Node Types
- Demo: 04 Common Operations
- Demo: 05 Reflection in Python
- Demo: 06 Towards Execution
- Demo: Expression Templates
- Polyhedral Representation and Transformation
- Polyhedral Model: What?
- Demo: Diamond tiling
- Demo: Operating on Presburger Sets
CAUTION!
These scribbled PDFs are an unedited reflection of what we wrote during class. They need to be viewed in the context of the class discussion that led to them. See the lecture videos for that.
If you would like actual, self-contained class notes, look in the outline above.
These scribbles are provided here as a record of our class discussion, to be used in perhaps the following ways:
- as a way to cross-check your own notes
- to look up a formula that you know was shown in a certain class
- to remind yourself of what exactly was covered on a given day
By continuing to read them, you acknowledge that these files are provided as supplementary material on an as-is basis.
Computing
If you have submitted your SSH key, an account has been created for you on a family of machines managed by members of the scientific computing area. See the linked page for more information on access and usage.