ECE 587 Spring 2026

GEMM Project 3 - CUDA Optimizations



Report Due: 04/12 (Sun.), by the end of the day (Chicago time)
Late submissions will NOT be graded

I. Objective

In Projects 1 and 2, we established CPU baselines for General Matrix Multiplication (GEMM) and explored optimizations using loop ordering, OpenMP, and tiling. In this project, we move to CUDA and study how to increase data reuse inside a tiled GPU kernel.


II. CUDA Development on Google Colab

The CUDA sample code for this project, gemm_cuda.cu, is available on GitHub. A Colab notebook version is also available. Use whichever version you prefer.

First, as in Projects 1 and 2, update the code to include your CWID. Then, without making any further changes to the code, build and run the program as described below to confirm that CUDA works properly and that your CWID appears in the output.

Google Colab provides a convenient way to develop and run CUDA code without acquiring CUDA hardware or installing CUDA software on your own computer. Please follow the steps below to set up your environment for this project.

  1. Use or create a Google account so that you can access Google Colab for free.
  2. Open the notebook and click 'Connect' at the top of the page. Make sure you are using a Python 3 runtime with a T4 GPU.

  3. Click 'Terminal' at the bottom of the page to open a terminal window. In the terminal, run 'nvidia-smi' and then 'nvcc --version' to verify that CUDA is available. The CUDA version reported by nvidia-smi should be the same as or higher than the one reported by nvcc. Otherwise, you will need to add '-arch=sm_75' to the nvcc command when building the program.
    /content# nvidia-smi
    ...
    | NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
    ...
    /content# nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2025 NVIDIA Corporation
    Built on Fri_Feb_21_20:23:50_PST_2025
    Cuda compilation tools, release 12.8, V12.8.93
    Build cuda_12.8.r12.8/compiler.35583870_0
    
  4. Create a notebook cell whose first line is '%%writefile gemm_cuda.cu' and paste your source code below it. Running the cell writes the source code to the file named on the first line. Then run 'ls' in the terminal to confirm that the file exists.

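  5. Build the program in the terminal and run it from there. The two command-line arguments are the matrix size N and the tile size M, as echoed in the first line of the output.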
    /content# nvcc -O2 gemm_cuda.cu -o gemm_cuda
    ...
    /content# ./gemm_cuda 8192 32
    CWID A12345678 1774383772 N = 8192, M = 32, 256 MiB per matrix
    CWID A12345678 1774383779 gemm_tiled : min 1635.271 ms, avg 1710.232 ms, max 1856.612 ms
    CWID A12345678 1774383779 gemm_tiled_2x2 : min 1.222 ms, avg 1.224 ms, max 1.225 ms
    CWID A12345678 1774383779 sums: min -264050.719, max 0.000, failed
    
    Don't worry about the compiler warnings or the last line showing 'failed'. Your goal in this project is to make that check pass, as discussed in the next section.
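
If the 'failed' line persists while you work on the kernel, an independent check on a small matrix may help narrow down the bug. The sketch below is one possible approach and is not the check used by the sample code: 'gemm_reference' and 'max_abs_diff' are hypothetical helpers, and square row-major matrices of floats are assumed. It is host-only C++, so it should compile unchanged inside a .cu file.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not in gemm_cuda.cu): reference C = A * B on the CPU
// for row-major n x n matrices, using a plain triple loop in i-k-j order.
std::vector<float> gemm_reference(const std::vector<float>& a,
                                  const std::vector<float>& b, int n) {
    std::vector<float> c(static_cast<std::size_t>(n) * n, 0.0f);
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < n; ++k) {
            const float aik = a[static_cast<std::size_t>(i) * n + k];
            for (int j = 0; j < n; ++j) {
                c[static_cast<std::size_t>(i) * n + j] +=
                    aik * b[static_cast<std::size_t>(k) * n + j];
            }
        }
    }
    return c;
}

// Hypothetical helper: largest absolute element-wise difference between
// two result matrices (compared over their common length).
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
    float worst = 0.0f;
    const std::size_t count = std::min(x.size(), y.size());
    for (std::size_t i = 0; i < count; ++i) {
        worst = std::max(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}
```

For a small N such as 64, the GPU result can be copied back to the host and compared against 'gemm_reference' with a tolerance suited to float accumulation. An exact match is generally not expected because the summation order differs.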

III. CUDA Kernel Optimizations

Please review Lecture 19 and the associated notebook ece587-lec19-lec20.ipynb. Your task is to study the tiled CUDA kernel carefully and then implement the '2x2' optimization in 'gemm_tiled_2x2', as indicated by the 'TODO' comments in the provided code.
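
As a rough illustration of why producing more than one output per unit of work increases data reuse, the host-only C++ sketch below is a CPU analogue, not the CUDA kernel itself. It assumes that '2x2' means each unit of work produces a 2x2 block of C (see Lecture 19 for the actual definition), that the matrices are row-major, and that N is even. The name 'gemm_2x2_cpu' is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU analogue (not in gemm_cuda.cu): each (i, j) step
// accumulates a 2x2 block of C in local variables.
// Assumes row-major n x n matrices with n even.
std::vector<float> gemm_2x2_cpu(const std::vector<float>& a,
                                const std::vector<float>& b, int n) {
    const std::size_t ld = static_cast<std::size_t>(n);  // row stride
    std::vector<float> c(ld * ld, 0.0f);
    for (int i = 0; i < n; i += 2) {
        for (int j = 0; j < n; j += 2) {
            float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
            for (int k = 0; k < n; ++k) {
                // Four loads (two from A, two from B) ...
                const float a0 = a[i * ld + k];
                const float a1 = a[(i + 1) * ld + k];
                const float b0 = b[k * ld + j];
                const float b1 = b[k * ld + j + 1];
                // ... feed four multiply-adds.
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            c[i * ld + j] = c00;
            c[i * ld + j + 1] = c01;
            c[(i + 1) * ld + j] = c10;
            c[(i + 1) * ld + j + 1] = c11;
        }
    }
    return c;
}
```

In the inner loop, four loaded values feed four multiply-adds, whereas computing a single output per iteration needs two loads for each multiply-add.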

Here are a few hints for the implementation.

Run the program with matrix sizes of 4096, 8192, and 16384 and tile sizes of 8, 16, 32, and 64, and collect the timing data from the output. Feel free to structure your code and experiments as you see fit, e.g., by introducing loops in 'main()' to automate runs for different N and M values, but make sure to validate your results and collect data as needed.
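
One way to automate the runs is to build the list of (N, M) pairs up front and loop over it. The sketch below is a minimal host-only example under that assumption. 'RunConfig' and 'make_sweep' are hypothetical names and are not part of gemm_cuda.cu.

```cpp
#include <vector>

// Hypothetical description of one experiment (not in gemm_cuda.cu).
struct RunConfig {
    int n;  // matrix size N
    int m;  // tile size M
};

// Build the full sweep: every matrix size paired with every tile size,
// ordered by N first and then by M.
std::vector<RunConfig> make_sweep() {
    const int sizes[] = {4096, 8192, 16384};
    const int tiles[] = {8, 16, 32, 64};
    std::vector<RunConfig> sweep;
    for (int n : sizes) {
        for (int m : tiles) {
            sweep.push_back(RunConfig{n, m});
        }
    }
    return sweep;
}
```

In 'main()', a loop such as 'for (const RunConfig& cfg : make_sweep())' could then call whatever function runs and validates a single configuration, printing one result line per kernel as the sample program does.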


IV. Project Deliverables

Submit a project report in .doc/.docx or .pdf format to Canvas before the deadline for a total of 15 points. Your project report should include the following.