Note for examiner
- folder rtl contains our failed attempt at implementing RV32I for our own machine code
- folder rtl2 contains the working files for single-cycle RV32I running our own machine code
- folder pipelining contains the working files for the pipelined CPU running our own machine code
- folder rtl3 contains the working files for single-cycle RV32I running the reference program
- folder pipelining2 contains the working files for the pipelined CPU running the reference program
Within test folder:
- Machine code for f1 and reference programs for single-cycle and pipelining can be found in the machineCode folder
- Results for f1 and reference programs for single-cycle and pipelining can be found in the results folder
The team decided to make commits directly to the main branch of the repo instead of creating individual branches for several reasons:
- Simplicity and convenience of operating with only one branch, without having to worry about versioning.
- Easier collaboration and coordination, as it is easy to see what other team members are working on and recent changes.
- Using one branch allows for rapid iteration and deployment, which speeds up processes such as debugging.
- Lastly, using one branch allows for flexibility and adaptability to changes such as new instructions.
To facilitate concurrent work on different versions of the CPU, we created multiple folders. This also helped to differentiate between the different CPU features (e.g. pipelined vs not pipelined).
To avoid conflicts in versions and issues with pushing and pulling, we notified other team members when we were working on a folder. Additionally, we frequently met in person to work on the project together, and in most cases, changes were pushed from a single team member's computer (usually Anish's or Pengyuan's).
The drawback of this approach is that work might be incorrectly attributed to a different team member (because of the commit history). To resolve this, we added an Excel table to track each teammate's tasks and contributions. There are also further explanations embedded throughout this document, and the personal statements include detailed explanations of each individual's role.
However, note that since the team members frequently met up to work together, and to further avoid issues with git, some changes and files were pushed from another team member's computer.
- Planning Single-Cycle RV32I Design: We looked at our working Lab 4 design and noted how it differs from the final project design. We then wrote down all the changes we needed to make and assigned them.
- Implementing RV32I Design (For our own machine code): We then documented the changes we made, each of us contributing to the write-up of the task we completed.
- Implementing Pipelining (For our own machine code): Made initial changes (and planned diagram), added 4 flip-flops, debugged and tested for our own machine code.
- Implementing RV32I Design (For Reference Program)
- Implementing Pipelining (For Reference Program)
- Implementing Caching
The planning section was done together as a group.
Looking at the brief and comparing our Lab 4 design with the required project design, we have the following requirements:
- Coming up with a machine code
- Add Jump Multiplexer: with inputs PC+4 and Result which outputs WD3 (write data of register file). This will be used to write the return address after a jump to the register file, when a jump takes place
- Add Trigger Multiplexer: multiplexer with select line based on trigger
- Add Return Multiplexer
- Changes in Control Unit: for implementing JAL, Load and Store
- Adding Data Memory and Multiplexer
For this stage of the project, we are all committing directly to the main branch. Since we are all working on individual modules, there won't be conflicts.
We delegated the tasks modularly:
- Creating Machine Code
- ALU: adding load immediate instruction into ALU
- Control Unit: implementing control signals for JAL and Load/Store instructions plus jump and return multiplexers
- Data Memory and Trigger Multiplexer: created data memory module and trigger multiplexer
- Top-level module checks and Testing: ensuring signal names are consistent, debugging, simulating in GTKWave and checking that the machine code works
Working together, we devised the following machine code:
Understanding value in s10 (affected by TRIGGERSEL):
Visualising the code:
EDIT: after debugging, we found that adding the Trigger Multiplexer causes issues with our load/store instructions. As the diagram shows, when the button has not been pressed, TRIGGERSEL = 0, which bypasses the entire data memory. To quickly check that our architecture still works without the Trigger, we swapped the connections in the trigger multiplexer and used the following machine code:
FINAL NOTE: All the images of our altered diagram should have the Trigger Multiplexer's connections swapped.
Here is the work done to implement the instructions for ALU:
graph note: X represents don't care.
TRIGGER MUX IS NOT USED IN FINAL VERSION
An I-type instruction should perform an ALU operation on rs1 and the sign-extended immediate and store the result in the destination register rd. Next PC = PC + 4.
The following diagram shows how the control unit signals steer the datapath to perform an I-type instruction. (Orange Line)
The Load Word instruction should load the value at memory address [value in rs1 + Immediate] into the destination register rd. Next PC = PC + 4.
The following diagram shows how the control unit signals steer the datapath to perform load word. (Orange Line)
The Store Word instruction should store the value in register rs2 to memory address [value in rs1 + Immediate]. Next PC = PC + 4.
The following diagram shows how the control unit signals steer the datapath to perform store word. (Orange Line)
An R-type instruction should perform an ALU operation on rs1 and rs2 and store the result in the destination register rd. Next PC = PC + 4.
The following diagram shows how the control unit signals steer the datapath to perform an R-type instruction. (Orange Line)
The Branch Equal instruction should compare the values in registers rs1 and rs2. If they are equal, the next PC is PC + Immediate; otherwise it is PC + 4.
The following diagram shows how the control unit signals steer the datapath to perform BEQ. (Equal ? Orange Line : Purple Line)
The Jump and Link Register instruction should achieve two goals.
- The program counter should jump to the value in rs1 + Immediate (normally the return address in ra + Immediate).
- The return address register ra (0x01) should store the address of the next instruction (current PC + 4).
The following diagram shows how the control unit signals steer the datapath to perform jump and link register. The orange line performs the first goal. The purple line performs the second goal.
The Jump and Link instruction should achieve two goals.
- The program counter should jump to the current PC value + Immediate.
- The return address register ra (0x01) should store the address of the next instruction (current PC + 4).
The following diagram shows how the control unit signals steer the datapath to perform jump and link. The orange line performs the first goal. The purple line performs the second goal.
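The two goals of each jump instruction can be sketched as a small behavioural model. This is Python rather than our SystemVerilog, and the function names `jal` and `jalr` are hypothetical helpers used only for illustration. It follows the description above; the RISC-V specification additionally clears bit 0 of the JALR target, which is left out here.

```python
MASK32 = 0xFFFFFFFF  # keep every result within 32 bits

def jal(pc, imm):
    """Return (next_pc, value_written_to_ra) for jump and link."""
    return (pc + imm) & MASK32, (pc + 4) & MASK32

def jalr(pc, rs1_value, imm):
    """Return (next_pc, value_written_to_ra) for jump and link register."""
    return (rs1_value + imm) & MASK32, (pc + 4) & MASK32

# A call at PC = 0x10 jumping forward by 0x20, then a return through ra.
target, ra = jal(0x10, 0x20)
print(hex(target), hex(ra))      # 0x30 0x14
print(jalr(0x40, ra, 0))         # (20, 68), i.e. back to 0x14
```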
The Load Upper Immediate instruction should load {20-bit immediate, 12'b0} into the destination register rd. Next PC = PC + 4.
The following diagram shows how the control unit signals steer the datapath to perform LUI. (Orange Line)
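As a quick check of the LUI concatenation, the value written to rd can be modelled in Python. The helper name `lui` is hypothetical, and this is a behavioural sketch, not our RTL.

```python
def lui(imm20):
    """Value written to rd: the 20-bit immediate followed by twelve zero bits."""
    return (imm20 & 0xFFFFF) << 12

print(hex(lui(0x12345)))  # 0x12345000
```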
To implement this part of the diagram:
I created the data memory module:
And made the relevant changes to the top-level module:
Changing number of addresses in instruction memory and data memory:
This tells us:
- Data memory is (00000 to 1FFFF) = 131072 addresses = 2^17 addresses
- Instruction memory is (000 to FFF) = 4096 addresses = 2^12 addresses
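The address counts can be double-checked with a few lines of Python. The helper `address_count` is a hypothetical name; the ranges are the ones listed above.

```python
def address_count(first, last):
    """Number of addresses in an inclusive range."""
    return last - first + 1

data_addresses = address_count(0x00000, 0x1FFFF)
instr_addresses = address_count(0x000, 0xFFF)

# bit_length() - 1 gives the exponent for an exact power of two.
print(data_addresses, data_addresses.bit_length() - 1)    # 131072 17
print(instr_addresses, instr_addresses.bit_length() - 1)  # 4096 12
```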
While debugging we identified the following issues with our code:
- Some minor syntax errors
- For Jump/Branch type instructions the concatenation in the sign-extend module was incorrect
- We had some inconsistencies with bit widths between the top-level module and the control unit
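The sign-extend concatenation for branch and jump instructions is easy to get wrong because the immediate bits are scattered across the instruction word. The Python sketch below follows the B-type and J-type layouts from the RISC-V specification. It is a reference model for checking, not our sign-extend module, and the helper names are hypothetical.

```python
def bits(word, hi, lo):
    """Extract word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret a width-bit value as a two's complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign

def imm_b(instr):
    """B-type: {instr[31], instr[7], instr[30:25], instr[11:8], 1'b0}."""
    raw = (bits(instr, 31, 31) << 12) | (bits(instr, 7, 7) << 11) \
        | (bits(instr, 30, 25) << 5) | (bits(instr, 11, 8) << 1)
    return sign_extend(raw, 13)

def imm_j(instr):
    """J-type: {instr[31], instr[19:12], instr[20], instr[30:21], 1'b0}."""
    raw = (bits(instr, 31, 31) << 20) | (bits(instr, 19, 12) << 12) \
        | (bits(instr, 20, 20) << 11) | (bits(instr, 30, 21) << 1)
    return sign_extend(raw, 21)

print(imm_b(0xFE000EE3))  # beq x0, x0, -4  -> -4
print(imm_j(0xFFDFF06F))  # jal x0, -4      -> -4
```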
We have successfully implemented RV32I with our own program. All the files for this are in folder rtl2.
- In order to implement the first flip-flop (connected to instruction memory), we had to change our architecture so that it no longer has a top-level module for the program counter components. File PC_top had to be removed. Instead we added that code directly to the top-level module:
- Then we realised there was a disparity between our control signals and the diagram provided, so we also changed our control unit.
  - The control unit in the single-cycle version controls PCsrc directly. In the pipelined version, PCsrc is driven by separate branch and jump logic. This is because the Zero signal can no longer be fed back to the control unit: the control unit and the ALU now work on different instructions in different pipeline stages. Thus, we added two new signals, Branch and Jump.
  - To perform jump instructions, the previous version used MUXJUMP and JUMPRT. In the new version, we keep these two control signals and pass them through all the flip-flops, because they are used in the last stage of the machine.
  - The diagram provided can only implement branch equal. In our new version, we added another control signal called BranchMUX, which sets PCsrc to 1 or 0 depending on funct3 (beq or bne).
- Finally we realised our architecture had the "Return multiplexer", which connects Result to PC Target. With this new pipelined version, we realised PCTarget needs to be extended through two flip-flops to get PCTargetW.
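The branch and jump logic that now drives PCsrc can be summarised with a small truth-table model in Python. The polarity of BranchMUX is an assumption here (0 selects beq, 1 selects bne), and the function name `pc_src` is hypothetical; our actual encoding lives in the control unit.

```python
def pc_src(jump, branch, zero, branch_mux):
    """Return 1 when the next PC should be the target address, else 0.

    Assumed encoding: branch_mux = 0 means beq (taken when zero = 1),
    branch_mux = 1 means bne (taken when zero = 0).
    """
    branch_taken = branch and (zero != branch_mux)
    return int(bool(jump or branch_taken))

# beq with equal operands is taken; bne with equal operands is not.
print(pc_src(jump=0, branch=1, zero=1, branch_mux=0))  # 1
print(pc_src(jump=0, branch=1, zero=1, branch_mux=1))  # 0
```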
We settled on this diagram to implement pipelining:
For the 4 flip-flops, we created four separate modules. Each has the set of inputs and outputs shown in the architecture diagram, and an always_ff @(posedge clk) block that copies each input to its output on the positive clock edge. A demonstration of one of the flip-flops is given below.
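The behaviour of a chain of such flip-flops can be illustrated with a Python model. This is only a sketch of the timing, assuming every register samples its input on the same edge; `PipelineRegister` and `clock_edge` are hypothetical names, and the real modules carry many named signals instead of a single value.

```python
class PipelineRegister:
    """Holds a bundle of signals between two pipeline stages."""
    def __init__(self, **reset_values):
        self.q = dict(reset_values)  # current outputs

def clock_edge(stages, first_input):
    """Model one rising edge: every register loads its input simultaneously."""
    # Sample all inputs before updating anything, as real flip-flops do.
    inputs = [first_input] + [stage.q for stage in stages[:-1]]
    for stage, d in zip(stages, inputs):
        stage.q = dict(d)

stages = [PipelineRegister(value=0) for _ in range(4)]
for cycle in range(1, 5):
    clock_edge(stages, {"value": cycle})

# After four edges the first value has reached the last register.
print([stage.q["value"] for stage in stages])  # [4, 3, 2, 1]
```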
To make the top-level module:
- We added all the new internal signals, for example:
- We also added the new modules and matched them with the relevant signals, for example:

We tried the following machine code:
- ALU and Load/Store
- Branch
- Jumps
While testing we uncovered the following errors
- We had not updated the select lines in the multiplexers to the delayed signals
- Our machine code had 0x10 immediate when it should have been 0xa
- Our machine code did not have NOPs
- We had not updated the inputs to modules like ALU and Data Memory with the new input signals
- Our original diagram was incorrect. Branch and Jump both require PCTarget, but Branch uses PCTargetE while Jump uses PCTargetW. This caused issues, so we changed our diagram to:
And our code worked!
To implement the reference program, the only new instructions we need to support are load byte unsigned (LBU) and store byte (SB).
To do so we needed to make changes to our data memory. We needed to:
- Add a new control signal which determines whether the instruction requires word or byte addressing
- Change the widths stored in each Data Memory element from 32 to 8 bits. Therefore, 4 addresses make up one 32-bit word.
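The effect of storing 8 bits per address can be sketched with a Python model of a byte-addressed memory. It assumes little-endian byte order, as RISC-V uses, and the class and method names are hypothetical; our DataMemory.sv selects between the word and byte paths with the new control signal.

```python
class ByteMemory:
    """Byte-addressed memory: four consecutive addresses form one 32-bit word."""
    def __init__(self, size):
        self.cells = bytearray(size)

    def store_word(self, addr, value):
        self.cells[addr:addr + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

    def load_word(self, addr):
        return int.from_bytes(self.cells[addr:addr + 4], "little")

    def store_byte(self, addr, value):
        self.cells[addr] = value & 0xFF  # SB keeps only the low 8 bits

    def load_byte_unsigned(self, addr):
        return self.cells[addr]  # LBU zero-extends, so the value is 0..255

mem = ByteMemory(16)
mem.store_word(0, 0x12345678)
print(hex(mem.load_byte_unsigned(0)))  # 0x78, the least significant byte
mem.store_byte(1, 0xFF)
print(hex(mem.load_word(0)))           # 0x1234ff78
```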
Implementation of Byte Addressing in Data Memory
- We made changes in our control unit to identify which address mode the load/store is in (example for load):
- We also added addr_mode to data memory and the top-level module.
- We then changed our DataMemory.sv module to implement byte addressing:
- We also pre-loaded Data Memory with data_array.
- Testing with the following machine code worked successfully. 1F was shown on the VBuddy display.
Our final diagram:
The machine code for the reference program is almost the same as what is provided. However, we discovered an issue in the initialization stage, where the line that decrements a1 should be addi a1,a1,1. In addition, we changed RET to JALR ra,ra because the RET instruction interferes with the ZERO register. The adjusted machine code is shown below:
Things Learned/Errors Fixed whilst Debugging
- Our main error was not having begin and end statements within if-else. Without them, only the first statement after each condition is guarded, so later statements ran even when the condition was not met.
- We learned that the $display() function is useful for displaying the value in RegFile or DataMemory at a specific address.
- We learned that using the following statements solved our "bits of [blank] not used" warning (here, [blank] is instr):
For pipelining, we also had to implement byte addressing in Data Memory. The process was virtually the same as for single-cycle, except that we had to remember to carry addr_mode across two flip-flops. Adding NOPs to the earlier test machine code yielded successful results. Our final diagram:
The machine code for the pipelined reference program is similar to the single-cycle one, but we add 5 NOPs between each pair of instructions. The a0 output on the display is now stretched horizontally because of the NOPs.
The same errors from 4. carried over to 5., and we fixed them there as well.
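Padding a program with NOPs can be automated. The sketch below is a Python helper, not part of our toolchain: `pad_with_nops` is a hypothetical name, and the NOP encoding used is the standard RISC-V `addi x0, x0, 0`.

```python
NOP = 0x00000013  # addi x0, x0, 0

def pad_with_nops(program, nops_between=5):
    """Insert nops_between NOPs between consecutive instruction words."""
    padded = []
    for index, word in enumerate(program):
        if index > 0:
            padded.extend([NOP] * nops_between)
        padded.append(word)
    return padded

# Three instructions become 3 + 2 * 5 = 13 words.
program = [0x0FF00313, 0x00000513, 0x00000593]
print(len(pad_with_nops(program)))  # 13
```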
We designed our data cache system for spatial locality.
- Capacity: C = 8 words
- Block size: 4 words
- Blocks needed: 2

To accommodate temporal locality, the most recently used value is stored in the data cache.
Miss rate: 2/8 = 25% for each subroutine. Both misses are compulsory misses caused by first-time fetches.
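The 25% figure can be reproduced with a short Python model. It assumes a subroutine that reads 8 consecutive words and a cache large enough that only compulsory misses occur; `count_misses` is a hypothetical helper.

```python
def count_misses(word_addresses, block_words=4):
    """Count compulsory misses: the first access to each block misses."""
    loaded_blocks = set()
    misses = 0
    for addr in word_addresses:
        block = addr // block_words
        if block not in loaded_blocks:
            misses += 1
            loaded_blocks.add(block)  # the whole block is fetched at once
    return misses

accesses = list(range(8))          # 8 consecutive word addresses
misses = count_misses(accesses)
print(misses, misses / len(accesses))  # 2 0.25
```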
We first check if there is desired cached data in our desired memory address, if so, load cached data into register file. If not, we store the current value along with 3 other spatially related values in a set in data cache.
To implement data caching in SystemVerilog, we would first need to design the cache memory itself. This would involve deciding on the number of sets and the number of blocks in each set, as well as the size of each block in words. We chose to use direct mapping with 1 block per set, where the block size is 4 words and there are 4 sets. In total it would store 16 words.
Next, I would need to implement the logic for storing and retrieving data from the cache. To store data in the cache, I would first need to determine which set and block the data should be stored in, based on the memory address of the data. If the desired set and block are currently occupied, I would need to implement a replacement policy to decide which data to evict from the cache in order to make room for the new data. Once the data has been stored in the cache, the corresponding memory address would also need to be updated in the cache's tag array to keep track of which data is currently stored in the cache.
To retrieve data from the cache, I would need to use the memory address of the data to determine which set and block the data should be in. If the data is not found in the cache, a cache miss would occur, and the data would need to be fetched from main memory and stored in the cache. If the data is found in the cache, a cache hit would occur, and the data can be quickly retrieved from the cache.
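The lookup described above can be sketched as a Python model of a direct-mapped cache with 4 sets and 4 words per block. It uses word addresses and handles reads only, both simplifying assumptions, and the class name is hypothetical. With one block per set, the replacement policy reduces to overwriting whatever block currently occupies the set.

```python
class DirectMappedCache:
    """Read-only model: 4 sets, 1 block per set, 4 words per block."""
    def __init__(self, memory, sets=4, block_words=4):
        self.memory = memory              # backing store, a list of words
        self.sets = sets
        self.block_words = block_words
        self.lines = [None] * sets        # each entry is (tag, words) or None
        self.hits = 0
        self.misses = 0

    def read(self, word_addr):
        offset = word_addr % self.block_words
        block_number = word_addr // self.block_words
        index = block_number % self.sets  # which set the block maps to
        tag = block_number // self.sets   # identifies the block within the set
        line = self.lines[index]
        if line is not None and line[0] == tag:
            self.hits += 1
        else:
            self.misses += 1              # fetch the whole block from memory
            base = block_number * self.block_words
            line = (tag, self.memory[base:base + self.block_words])
            self.lines[index] = line
        return line[1][offset]

memory = list(range(100, 164))            # 64 words of example data
cache = DirectMappedCache(memory)
values = [cache.read(a) for a in range(8)]
print(cache.hits, cache.misses)           # 6 2
```

Reading word address 16 afterwards maps to set 0 with a different tag, so it evicts the block holding addresses 0 to 3.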