ECE 587 Spring 2026

GEMM Project 4 - Hardware Acceleration

Report Due: 05/03 (Sun.), by the end of the day (Chicago time)
Late submissions will NOT be graded

In this project, you will be introduced to the RISC-V ecosystem in Chipyard and to Gemmini, a RoCC (Rocket Custom Coprocessor) accelerator for General Matrix Multiplication (GEMM). We use simulation tools to evaluate and model system performance without needing access to an actual RISC-V chip. You will also learn to navigate complex open-source projects and to focus on the parts that interest you.

Gemmini executes custom instructions sent by the RISC-V processor and accelerates matrix multiplication with a systolic array. Although these instructions expose the full potential for optimizing computations, they are quite complicated to use directly: we would need to coordinate data movement between main memory (actually the L2 cache) and the Gemmini scratchpad and accumulator SRAMs, as well as feed the systolic array. For this project, we will focus on the helper functions defined by the Gemmini software, in particular tiled_matmul_auto, which takes care of moving matrices into Gemmini, performing the multiplication, and moving the results out.
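To make the move-in, multiply, move-out pattern concrete, here is a minimal CPU-only sketch of a tiled matrix multiplication. It is not Gemmini's implementation and does not use the Gemmini API: the names matmul_naive and matmul_tiled, the TILE constant, and the square row-major int8 layout are all assumptions made for illustration. The local arrays a, b, and acc only stand in for the scratchpad and accumulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE 16 /* assumed tile edge, chosen for illustration only */

/* Reference: plain triple loop, C = A * B, for n x n row-major matrices. */
static void matmul_naive(size_t n, const int8_t *A, const int8_t *B, int32_t *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            int32_t sum = 0;
            for (size_t k = 0; k < n; k++)
                sum += (int32_t)A[i * n + k] * (int32_t)B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Tiled version: same result, computed one TILE x TILE block at a time. */
static void matmul_tiled(size_t n, const int8_t *A, const int8_t *B, int32_t *C) {
    assert(n % TILE == 0); /* this sketch only handles whole tiles */
    for (size_t i0 = 0; i0 < n; i0 += TILE)
        for (size_t j0 = 0; j0 < n; j0 += TILE) {
            int32_t acc[TILE][TILE] = {{0}}; /* stand-in for the accumulator */
            for (size_t k0 = 0; k0 < n; k0 += TILE) {
                int8_t a[TILE][TILE], b[TILE][TILE]; /* stand-in for the scratchpad */
                /* "Move in": copy one tile of A and one tile of B. */
                for (size_t r = 0; r < TILE; r++)
                    for (size_t c = 0; c < TILE; c++) {
                        a[r][c] = A[(i0 + r) * n + (k0 + c)];
                        b[r][c] = B[(k0 + r) * n + (j0 + c)];
                    }
                /* Multiply the two tiles and accumulate the partial sums. */
                for (size_t i = 0; i < TILE; i++)
                    for (size_t j = 0; j < TILE; j++)
                        for (size_t k = 0; k < TILE; k++)
                            acc[i][j] += (int32_t)a[i][k] * (int32_t)b[k][j];
            }
            /* "Move out": write the finished output tile back to C. */
            for (size_t r = 0; r < TILE; r++)
                for (size_t c = 0; c < TILE; c++)
                    C[(i0 + r) * n + (j0 + c)] = acc[r][c];
        }
}
```

Both functions produce the same output for any n that is a multiple of TILE; the tiled version only reorders the work so that each step touches a small block of data.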

II. Project Setup

We rely on the same project setup as Project 1 for the VM and VS Code. Please refer to Sections II and III in the Project 1 instructions for details.

Alternatively, if you are familiar with Linux server administration and cloud computing, you may use an Ubuntu server instance rented from a cloud provider. Please follow the instructions below, but be advised that we are not responsible for any cost incurred and cannot provide support for technical issues. In addition, you will be required to use our ECE 587 VM Appliance if issues persist.

Prepare your server to have at least 4 CPU cores, 8GB memory, and 100GB storage.
Pick and install Ubuntu 24.04 LTS x64: please make sure this exact version is used.
DO NOT use the root user directly. Create a user with a strong password (DO NOT use 'iitece' as the password). You may further disable the password and use SSH key pairs for best security.
Log in again with your new username and password.

Download and execute our provisioning script as follows.

ubuntu@ubuntu24:~$ git clone https://github.com/wngjia/ece587-setup
...
Unpacking objects: 100% (5/5), 839 bytes | 839.00 KiB/s, done.
ubuntu@ubuntu24:~$ ece587-setup/setup_chipyard.sh
...
make[1]: Leaving directory '/home/ubuntu/chipyard/generators/gemmini/software/gemmini-rocc-tests/build/imagenet'

It may take anywhere from 30 minutes to more than an hour to finish provisioning. You can work on something else in the meantime, but please keep the instance running. Make sure that the script completes successfully, ending with output like the above.

You should now be able to follow the Project 1 instructions to access the server via SSH in Visual Studio Code.

III. Gemmini

Since Gemmini makes heavy use of the Chipyard framework, we need to initialize the Chipyard environment and go to the directory 'chipyard/generators/gemmini' using a terminal in VS Code. Then, let's make sure that everything works properly. Run the following commands in the terminal.

source chipyard/env.sh
cd chipyard/generators/gemmini
./scripts/run-spike.sh template
./scripts/run-verilator.sh template

The output should be similar to the following, with no error messages.

ubuntu@ubuntu24:~$ source chipyard/env.sh 
(/home/ubuntu24/chipyard/.conda-env) ubuntu@ubuntu24:~$ cd chipyard/generators/gemmini
(/home/ubuntu24/chipyard/.conda-env) ubuntu@ubuntu24:~/chipyard/generators/gemmini$ ./scripts/run-spike.sh template
...
Input and output matrices are identical, as expected
(/home/ubuntu24/chipyard/.conda-env) ubuntu@ubuntu24:~/chipyard/generators/gemmini$ ./scripts/run-verilator.sh template
...
Input and output matrices are identical, as expected

The above commands demonstrate the use of two tools, Spike and Verilator. Spike is an ISA simulator, while Verilator simulates an RTL implementation of the underlying hardware. Both tools need a RISC-V program to simulate, and here we use a sample program called "template". You should use VS Code to browse the programs in 'chipyard/generators/gemmini/software/gemmini-rocc-tests/bareMetalC', as we will need to work with them, in particular 'tiled_matmul_ws.c'.

To measure and report the performance of a piece of code, read the cycle counter with the RISC-V instruction 'rdcycle' immediately before and after the code, and compute the number of cycles spent as the difference between the two readings. Even better, Gemmini provides access to 'rdcycle' via the function read_cycles, so we don't need to write assembly in our C program. Check 'conv.c' if you would like to see an example of calling read_cycles. However, since Spike does not model any RTL implementation, its reported cycle counts should be treated with caution in performance analysis.
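The before-and-after measurement pattern can be sketched as follows. This is a minimal illustration, not Gemmini's code: cycles_now is a hypothetical stand-in for read_cycles, and sum_to and time_sum_to are made-up names for the work being timed. On a 64-bit RISC-V target the helper executes 'rdcycle'; elsewhere it falls back to clock() from the C library so that the sketch still builds on a host machine, where the returned value is in clock ticks rather than cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for read_cycles (assumption, not Gemmini's code). */
static uint64_t cycles_now(void) {
#if defined(__riscv) && (__riscv_xlen == 64)
    uint64_t c;
    __asm__ volatile ("rdcycle %0" : "=r"(c)); /* read the cycle counter */
    return c;
#else
    return (uint64_t)clock(); /* host fallback: clock ticks, not cycles */
#endif
}

/* Some work to time: sum of 1..n. 'volatile' keeps the loop from being
   optimized away. */
static uint32_t sum_to(uint32_t n) {
    volatile uint32_t s = 0;
    for (uint32_t i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Returns the counter difference around one call of sum_to(n) and stores
   the computed sum in *result. */
static uint64_t time_sum_to(uint32_t n, uint32_t *result) {
    assert(result != NULL);
    uint64_t start = cycles_now();
    *result = sum_to(n);
    uint64_t end = cycles_now();
    return end - start;
}
```

In the Gemmini test programs, the same two-reading pattern would use read_cycles in place of cycles_now, with the code of interest placed between the two readings.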

Quite a few of the programs demonstrate how to use tiled_matmul_auto, the function we are interested in. Let's focus on 'tiled_matmul_ws.c', where the weight-stationary dataflow is used. Read this file and run the Spike simulation to understand the structure of the code.

... ubuntu@ubuntu24:~/chipyard/generators/gemmini$ ./scripts/run-spike.sh tiled_matmul_ws
MAT_DIM_I: 64
MAT_DIM_J: 64
MAT_DIM_K: 64
Gemmini extension configured with:
    dim = 16
Starting slow CPU matmul
Cycles taken: 2130523
Starting gemmini matmul
Cycles taken: 96

The huge number of cycles the CPU needs to compute the matrix multiplication means that Verilator simulation would take too long to complete. On the other hand, the 96 cycles Gemmini needs to complete the computation seem too good to be true (Why? You will need to answer this in your report).

To make Verilator simulation possible, we need to turn off CPU computation by changing the line '#define CHECK_RESULT 1' into '#define CHECK_RESULT 0' in 'tiled_matmul_ws.c'. Then, it is necessary to rebuild the program as follows.

... ubuntu@ubuntu24:~/chipyard/generators/gemmini$ pushd software/gemmini-rocc-tests; ./build.sh; popd
...
~/chipyard/generators/gemmini
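The rebuild is needed because CHECK_RESULT is a preprocessor macro: the choice is made at compile time, so the CPU code path is either compiled into the program or left out entirely. A minimal sketch of this kind of switch is shown below. It is an illustration only; run_paths is a made-up name, and the actual guard in 'tiled_matmul_ws.c' may be spelled differently.

```c
#include <assert.h>
#include <stdio.h>

#ifndef CHECK_RESULT
#define CHECK_RESULT 0 /* set to 1 to compile the CPU reference path in */
#endif

static_assert(CHECK_RESULT == 0 || CHECK_RESULT == 1,
              "CHECK_RESULT must be 0 or 1");

/* Returns how many matmul paths were compiled into this build. */
static int run_paths(void) {
    int paths = 0;
#if CHECK_RESULT
    /* Present only when CHECK_RESULT is 1 at compile time. */
    puts("Starting slow CPU matmul");
    paths++;
#endif
    puts("Starting gemmini matmul");
    paths++;
    return paths;
}
```

With CHECK_RESULT set to 0, the first message never appears, no matter what happens at run time, until the program is compiled again with the macro set to 1.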

Make sure to stay in 'chipyard/generators/gemmini' and use pushd/popd commands to move between these directories when necessary. Run Verilator simulation to get the actual number of cycles needed.

... ubuntu@ubuntu24:~/chipyard/generators/gemmini$ ./scripts/run-verilator.sh tiled_matmul_ws
...

Next, you will need to modify 'tiled_matmul_ws.c' to change the sizes of the matrices to 128 by 128, rebuild the program, and run Verilator again. In particular, don't forget to run "pushd software/gemmini-rocc-tests; ./build.sh; popd" again since you have modified the program.

IV. Project Deliverables

Submit a project report in .doc/.docx or .pdf format to Canvas before the deadline for a total of 15 points. Your project report should include the following, and you may need to perform additional searches online.

(4 points) What is the size of the systolic array and how many multiply-accumulate (MAC) operations can be completed per cycle? What are the sizes of the matrices and how many MAC operations are needed to complete the matrix multiplication? Calculate the minimum number of cycles required to complete the matrix multiplication on the systolic array to explain why the cycles reported by Spike are not reasonable.
(4 points) What is the number of cycles reported by Verilator? Are there a lot of cycles not spent on MAC operations? Make a reasonable guess on where those cycles are spent.
(4 points) After modifying the matrix sizes to 128 by 128, what is the number of cycles reported by Verilator? If we define the utilization to be the ratio of the minimum number of cycles to the actual number of cycles, do we get better utilization? Make a reasonable guess on why.
(3 points) Include screenshots of the outputs from Spike and Verilator for different matrix sizes.
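For reference, the utilization defined above is a simple ratio and can be written as a one-line helper. The function name utilization is an assumption for illustration, and the cycle counts you pass in must come from your own calculation and your own Verilator runs.

```c
#include <assert.h>
#include <stdint.h>

/* Utilization = minimum number of cycles / actual number of cycles. */
static double utilization(uint64_t min_cycles, uint64_t actual_cycles) {
    assert(actual_cycles > 0); /* avoid division by zero */
    return (double)min_cycles / (double)actual_cycles;
}
```

For example, with placeholder values (not measurements), utilization(50, 200) evaluates to 0.25, meaning a quarter of the cycles would be accounted for by the minimum.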