ECE 587 Spring 2026

GEMM Project 3 - CUDA Optimizations

Report Due: 04/12 (Sun.), by the end of the day (Chicago time)
Late submissions will NOT be graded

In Projects 1 and 2, we established CPU baselines for General Matrix Multiplication (GEMM) and explored optimizations using loop ordering, OpenMP, and tiling. In this project, we move to CUDA and study how to increase data reuse inside a tiled GPU kernel.

II. CUDA Development on Google Colab

The CUDA sample code for this project, gemm_cuda.cu, is available on GitHub. A Colab notebook version is also provided. Use whichever version you prefer.

First, as in Projects 1 and 2, update the code to include your CWID. Then, without making any further changes, build and run the program as described below to confirm that CUDA works properly and that your CWID appears in the output.

Google Colab provides a convenient way to develop and run CUDA code without acquiring CUDA hardware or installing CUDA software on your own computer. Please follow the steps below to set up your environment for this project.

Use or create a Google account so that you can access Google Colab for free.
Open the notebook and click 'Connect' at the top of the page. Make sure you are using a Python 3 runtime with a T4 GPU.

Click 'Terminal' at the bottom of the page to open a terminal window. In the terminal, run 'nvidia-smi' and then 'nvcc --version' to validate CUDA availability. The CUDA version reported by nvidia-smi should be the same as or higher than that of nvcc. Otherwise, you will need to add '-arch=sm_75' to nvcc when building the programs.

/content# nvidia-smi
...
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
...
/content# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Create a notebook cell starting with "%%writefile gemm_cuda.cu" and copy your source code inside. Running the cell will write the source code into the file specified by the first line. Now run 'ls' in the terminal to confirm that the file is there.

Build the program in the terminal and run it from there.

/content# nvcc -O2 gemm_cuda.cu -o gemm_cuda
...
/content# ./gemm_cuda 8192 32
CWID A12345678 1774383772 N = 8192, M = 32, 256 MiB per matrix
CWID A12345678 1774383779 gemm_tiled : min 1635.271 ms, avg 1710.232 ms, max 1856.612 ms
CWID A12345678 1774383779 gemm_tiled_2x2 : min 1.222 ms, avg 1.224 ms, max 1.225 ms
CWID A12345678 1774383779 sums: min -264050.719, max 0.000, failed

Don't worry about the compiler warnings or the 'failed' on the last line. Your goal in this project is to make that check pass, as discussed in the next section.

III. CUDA Kernel Optimizations

Please review Lecture 19 and the associated notebook ece587-lec19-lec20.ipynb. The key idea is to study the tiled CUDA kernel carefully and then implement the '2x2' optimization in 'gemm_tiled_2x2' as indicated by the 'TODO' comments in the code provided.

Here are a few hints for the implementation.

Do not change the kernel launch code.
Each block of M by M threads now works on a 2M by 2M tile of the matrices A, B, and C.
Each thread loads 4 elements from A and 4 from B, so the shared memory holds 8*M*M elements in total, i.e. a 2M by 2M tile for each of A and B.
Each thread computes 4 elements of C and keeps them in 4 registers.
In the inner loop, each thread reads 2 elements a0 and a1 from tile A and 2 elements b0 and b1 from tile B into 4 registers, and then updates the 4 elements of C with a0*b0, a0*b1, a1*b0, and a1*b1.
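As a hedged illustration of the last two hints only, the sketch below reproduces the four-accumulator update on the host in plain C++ (which also compiles under nvcc). It is not the kernel. The names Acc2x2, accumulate_2x2, and shared_elems_2x2 are hypothetical, the input arrays merely stand in for the values a thread would read from the shared tiles, and the mapping from thread indices to tile positions is left to your implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container for the 4 per-thread accumulators (registers in the kernel).
struct Acc2x2 {
    float c00, c01, c10, c11;
};

// Element count of shared memory for one 2M x 2M tile of A plus one of B.
constexpr std::size_t shared_elems_2x2(std::size_t M) {
    return 2 * (2 * M) * (2 * M);  // equals 8*M*M
}

// Host-side stand-in for the inner loop of one thread.
// a0/a1 play the role of the two values read from tile A at step k,
// b0/b1 the two values read from tile B at step k.
Acc2x2 accumulate_2x2(const float* a0, const float* a1,
                      const float* b0, const float* b1, std::size_t K) {
    assert(a0 && a1 && b0 && b1);
    Acc2x2 c = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < K; ++k) {
        // 4 reads feed 4 multiply-adds, instead of 2 reads per multiply-add.
        const float x0 = a0[k], x1 = a1[k];
        const float y0 = b0[k], y1 = b1[k];
        c.c00 += x0 * y0;
        c.c01 += x0 * y1;
        c.c10 += x1 * y0;
        c.c11 += x1 * y1;
    }
    return c;
}
```

The point of the pattern is the ratio of reads to arithmetic: each value held in a register is reused for two products, which is the extra data reuse this project studies.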

Run the program with matrix sizes of 4096, 8192, and 16384, and tile sizes of 8, 16, 32, and 64. Collect data from the output. Feel free to structure your code and experiments as you see fit, e.g. by introducing loops in 'main()' to automate runs for different N and M values, but make sure to validate your results and collect data as needed.
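If you choose to automate the runs, one possible way to enumerate the 12 combinations is sketched below in host-side C++. RunConfig, sweep_configs, and matrix_mib are hypothetical helpers, not part of the provided code. The footprint helper assumes 4-byte float elements, which is inferred from the "256 MiB per matrix" line printed for N = 8192.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pair of parameters for one run: matrix size N and tile size M.
struct RunConfig {
    std::size_t n;
    std::size_t m;
};

// Enumerate every (N, M) combination requested in the handout.
std::vector<RunConfig> sweep_configs() {
    const std::size_t sizes[] = {4096, 8192, 16384};
    const std::size_t tiles[] = {8, 16, 32, 64};
    std::vector<RunConfig> out;
    for (std::size_t n : sizes) {
        for (std::size_t m : tiles) {
            out.push_back(RunConfig{n, m});
        }
    }
    return out;
}

// Memory footprint of one N x N matrix in MiB, assuming float elements.
double matrix_mib(std::size_t n) {
    assert(n > 0);
    return static_cast<double>(n) * static_cast<double>(n) * sizeof(float)
           / (1024.0 * 1024.0);
}
```

A loop in 'main()' could iterate over sweep_configs() and call the existing timing code for each entry. Under the float assumption, the N = 16384 runs need 1024 MiB per matrix, so it is worth checking the footprint before starting a long sweep.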

IV. Project Deliverables

Submit a project report in .doc/.docx or .pdf format to Canvas before the deadline for a total of 15 points. Your project report should include the following.

(5 points) Show your implementation of 'gemm_tiled_2x2'. Explain what changed compared to 'gemm_tiled_1x2' from Lecture 19 and discuss any difficulties you encountered. If you used an AI assistant to generate the code, discuss the prompts used.
(7 points) Organize the results in tables to compare running times and GFLOPS between 'gemm_tiled' and 'gemm_tiled_2x2'. Visualize the trends and discuss the results. Discuss whether the 2x2 version improves performance and why. In particular, do you expect 'gemm_tiled' to perform better when its tile size increases from 32 to 64? What actually happens? How is this related to 'gemm_tiled_2x2' with a tile size of 32?
(3 points) Include screenshots of your program outputs for different matrix and tile sizes. Make sure your CWID is clearly shown in the screenshots.
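
For the GFLOPS columns, a minimal sketch of the conversion from a measured time is given below. It assumes the common convention of counting 2*N^3 floating-point operations for an N by N GEMM (one multiply and one add per inner-product term). The handout does not define the convention, so state the one you use in your report. The function name gemm_gflops is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Convert a running time in milliseconds to GFLOPS for an N x N GEMM,
// assuming 2*N^3 floating-point operations in total.
double gemm_gflops(std::size_t n, double ms) {
    assert(ms > 0.0);
    const double nd = static_cast<double>(n);
    const double flops = 2.0 * nd * nd * nd;
    // flops / (ms * 1e-3 seconds) / 1e9 simplifies to flops / (ms * 1e6).
    return flops / (ms * 1.0e6);
}
```

For example, under this convention the sample 'gemm_tiled' line above (N = 8192, min 1635.271 ms) works out to roughly 672 GFLOPS. Only compute GFLOPS from runs whose result check passes.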