In this project, you will be introduced to the RISC-V ecosystem in Chipyard and to Gemmini, a RoCC (Rocket Custom Coprocessor) accelerator for General Matrix Multiplication (GEMM). We use simulation tools to evaluate and model system performance without needing access to an actual RISC-V chip. You will also learn how to navigate complex open-source projects and how to focus on the parts that interest you.
Gemmini executes custom instructions sent by the RISC-V processor and accelerates matrix multiplication with a systolic array. Although these instructions expose the full potential for optimizing computations, they are quite complicated to use: we need to coordinate data movement between the main memory (actually the L2 cache) and the Gemmini scratchpad and accumulator SRAMs, and also feed the systolic array. For this project, we will focus on the helper functions defined by the Gemmini software, in particular tiled_matmul_auto, which takes care of moving matrices into Gemmini, performing the multiplication, and moving the results out.
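To make the idea of tiling concrete, here is a minimal CPU-only sketch of a blocked matrix multiplication. It is not Gemmini's implementation of tiled_matmul_auto and issues no Gemmini instructions; the function name tiled_matmul_sketch, the constant TILE_DIM, and the int8_t/int32_t element types are assumptions chosen for illustration. The comment inside the loop nest marks where an accelerator would move tiles in, multiply them, and accumulate partial sums.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tile edge length, chosen for illustration only. */
#define TILE_DIM 16

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/*
 * Blocked matmul on the CPU: C (I x J) = A (I x K) * B (K x J), row-major.
 * This only illustrates the loop structure of tiling.
 */
static void tiled_matmul_sketch(size_t I, size_t J, size_t K,
                                const int8_t *A, const int8_t *B, int32_t *C)
{
    for (size_t n = 0; n < I * J; n++) {
        C[n] = 0;
    }

    for (size_t i0 = 0; i0 < I; i0 += TILE_DIM) {
        for (size_t j0 = 0; j0 < J; j0 += TILE_DIM) {
            for (size_t k0 = 0; k0 < K; k0 += TILE_DIM) {
                /* Clamp tile bounds so sizes need not be multiples of TILE_DIM. */
                size_t i1 = min_size(i0 + TILE_DIM, I);
                size_t j1 = min_size(j0 + TILE_DIM, J);
                size_t k1 = min_size(k0 + TILE_DIM, K);

                /* An accelerator would load the A and B tiles here, multiply
                 * them, and accumulate the partial result for the C tile. */
                for (size_t i = i0; i < i1; i++) {
                    for (size_t j = j0; j < j1; j++) {
                        int32_t acc = C[i * J + j];
                        for (size_t k = k0; k < k1; k++) {
                            acc += (int32_t)A[i * K + k] * (int32_t)B[k * J + j];
                        }
                        C[i * J + j] = acc;
                    }
                }
            }
        }
    }
}
```

The three outer loops walk over tiles and the three inner loops work within one tile, so each tile of the output is built up from several partial products along the K dimension.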
We rely on the same project setup as Project 1 for VM and VS Code. Please refer to Sections II and III in Project 1 instructions for details.
Alternatively, if you are familiar with Linux server administration and cloud computing, you may use an Ubuntu server instance rented from a cloud provider. Please follow the instructions below, but be advised that we are not responsible for any cost incurred and cannot provide support for any technical issues. In addition, you will be required to use our ECE 587 VM Appliance if issues persist.
ubuntu@ubuntu24:~$ git clone https://github.com/wngjia/ece587-setup
...
Unpacking objects: 100% (5/5), 839 bytes | 839.00 KiB/s, done.
ubuntu@ubuntu24:~$ ece587-setup/setup_chipyard.sh
...
make[1]: Leaving directory '/home/ubuntu/chipyard/generators/gemmini/software/gemmini-rocc-tests/build/imagenet'
It may take from 30 minutes to more than an hour to finish the provisioning. You can work on something else in the meantime, but please keep the instance running. Also make sure that the script completes successfully, as shown above.
Since Gemmini makes heavy use of the Chipyard framework, we need to initialize the Chipyard environment and go to the directory 'chipyard/generators/gemmini' using a terminal in VS Code. Then, let's make sure that everything works properly. Run the following commands in the terminal.
The output should be similar to the following and there should be no error message.
ubuntu@ubuntu24:~$ source chipyard/env.sh
(/home/ubuntu24/chipyard/.conda-env) ubuntu@ubuntu24:~$ cd chipyard/generators/gemmini
(/home/ubuntu24/chipyard/.conda-env) ubuntu@ubuntu24:~/chipyard/generators/gemmini$ ./scripts/run-spike.sh template
...
Input and output matrices are identical, as expected
(/home/ubuntu24/chipyard/.conda-env) ubuntu@ubuntu24:~/chipyard/generators/gemmini$ ./scripts/run-verilator.sh template
...
Input and output matrices are identical, as expected
The above commands demonstrate the use of two tools, Spike and Verilator. Spike is an ISA simulator, while Verilator simulates an RTL implementation of the underlying hardware. Both tools require a RISC-V program to simulate, and here we use a sample program called "template". You should use VS Code to browse the programs in 'chipyard/generators/gemmini/software/gemmini-rocc-tests/bareMetalC', as we will need to work with them, in particular 'tiled_matmul_ws.c'.
To measure and report the performance of a piece of code, read the cycle counter with the RISC-V instruction 'rdcycle' immediately before and after the code, and take the difference between the two readings as the number of cycles spent. Even better, Gemmini provides access to 'rdcycle' via the function read_cycles, so we don't need to write assembly code in our C program. Check 'conv.c' if you would like to see an example of calling read_cycles. However, you should realize that since Spike does not use any knowledge of RTL implementations, its reported cycle counts should be treated with caution in performance analysis.
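The before/after pattern can be sketched as follows. This is a self-contained illustration, not code from the Gemmini tests: read_cycles_standin is a hypothetical stand-in for Gemmini's read_cycles that uses 'rdcycle' when compiled for RISC-V and falls back to the C library clock() elsewhere, and sum_below and measure_sum are made-up names for the code under measurement and the wrapper around it.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for Gemmini's read_cycles (an assumption for this
 * sketch): reads the cycle CSR on RISC-V, otherwise falls back to clock(). */
static uint64_t read_cycles_standin(void)
{
#if defined(__riscv)
    uint64_t cycles;
    __asm__ volatile ("rdcycle %0" : "=r" (cycles));
    return cycles;
#else
    return (uint64_t)clock();
#endif
}

/* Example workload: adds the integers 0 .. n-1. The volatile accumulator
 * keeps the compiler from optimizing the loop away. */
static uint64_t sum_below(uint64_t n)
{
    volatile uint64_t acc = 0;
    for (uint64_t i = 0; i < n; i++) {
        acc += i;
    }
    return acc;
}

/* Runs the workload between two counter readings and returns the
 * difference; the workload's result is stored through *result. */
static uint64_t measure_sum(uint64_t n, uint64_t *result)
{
    uint64_t start = read_cycles_standin();
    *result = sum_below(n);
    uint64_t end = read_cycles_standin();
    return end - start;
}
```

In the Gemmini test programs you would call read_cycles directly in place of the stand-in and print the difference.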
Quite a few of the programs demonstrate how to use tiled_matmul_auto, the function we are interested in. Let's focus on 'tiled_matmul_ws.c', where the weight-stationary dataflow is used. Read this file and run a Spike simulation to understand the structure of the code.
... ubuntu@ubuntu24:~/chipyard/generators/gemmini$ ./scripts/run-spike.sh tiled_matmul_ws
MAT_DIM_I: 64
MAT_DIM_J: 64
MAT_DIM_K: 64
Gemmini extension configured with:
dim = 16
Starting slow CPU matmul
Cycles taken: 2130523
Starting gemmini matmul
Cycles taken: 96
The huge number of cycles required to compute the matrix multiplication on the CPU means that a Verilator simulation would take too long to complete. On the other hand, the 96 cycles required to complete the computation in Gemmini seem too good to be true (Why? You will need to answer this in your report).
To make Verilator simulation possible, we need to turn off CPU computation by changing the line '#define CHECK_RESULT 1' into '#define CHECK_RESULT 0' in 'tiled_matmul_ws.c'. Then, it is necessary to rebuild the program as follows.
... ubuntu@ubuntu24:~/chipyard/generators/gemmini$ pushd software/gemmini-rocc-tests; ./build.sh; popd
...
~/chipyard/generators/gemmini
Make sure to stay in 'chipyard/generators/gemmini' and use pushd/popd commands to move between these directories when necessary. Run Verilator simulation to get the actual number of cycles needed.
... ubuntu@ubuntu24:~/chipyard/generators/gemmini$ ./scripts/run-verilator.sh tiled_matmul_ws
...
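A compile-time switch such as CHECK_RESULT works through the C preprocessor: when the macro is 0, the slow CPU reference is not compiled into the program at all, which is why the program must be rebuilt after changing it. The following is a minimal sketch of that pattern and not the actual code in 'tiled_matmul_ws.c'; the function name check_result, its flat-array arguments, and its return codes are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Same macro name as in this handout; 0 skips the slow CPU reference.
 * The default of 0 here is a choice made for this sketch. */
#ifndef CHECK_RESULT
#define CHECK_RESULT 0
#endif

/*
 * Compares an n x n result C against a CPU reference computed from A and B
 * (all row-major). Hypothetical return codes: 1 = match, 0 = mismatch,
 * -1 = check skipped because CHECK_RESULT is not 1.
 */
static int check_result(size_t n, const int8_t *A, const int8_t *B,
                        const int32_t *C)
{
#if CHECK_RESULT == 1
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t gold = 0;
            for (size_t k = 0; k < n; k++) {
                gold += (int32_t)A[i * n + k] * (int32_t)B[k * n + j];
            }
            if (gold != C[i * n + j]) {
                return 0;
            }
        }
    }
    return 1;
#else
    /* The reference loops above are not compiled in at all. */
    (void)n; (void)A; (void)B; (void)C;
    return -1;
#endif
}
```

With the check compiled out, only the accelerated path remains in the program, so the simulator has far fewer CPU instructions to execute.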
Next, you will need to modify 'tiled_matmul_ws.c' to change the sizes of the matrices to 128 by 128, build the program and run Verilator again. In particular, don't forget to run "pushd software/gemmini-rocc-tests; ./build.sh; popd" again since you have modified the program.
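When comparing the two sizes in your report, it may help to know how much the amount of work grows. The sketch below counts multiply-accumulate (MAC) operations and input elements for an I x K by K x J multiplication; the function names matmul_macs and matmul_input_elems are made up for illustration, and the parameter names follow the MAT_DIM_I, MAT_DIM_J, and MAT_DIM_K values printed by the program.

```c
#include <stdint.h>

/* Number of multiply-accumulate operations for C (I x J) = A (I x K) * B (K x J). */
static uint64_t matmul_macs(uint64_t dim_i, uint64_t dim_j, uint64_t dim_k)
{
    return dim_i * dim_j * dim_k;
}

/* Number of input elements that must be read: all of A plus all of B. */
static uint64_t matmul_input_elems(uint64_t dim_i, uint64_t dim_j, uint64_t dim_k)
{
    return dim_i * dim_k + dim_k * dim_j;
}
```

For example, matmul_macs(64, 64, 64) is 262144 and matmul_macs(128, 128, 128) is 2097152, so doubling every dimension multiplies the arithmetic by 8 while the input data only grows by 4 (8192 versus 32768 elements).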
Submit a project report in .doc/.docx or .pdf format to Canvas before the deadline for a total of 15 points. Your project report should include the following, and you may need to perform additional searches online.