Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding the Dromajo Platform #721

Open
wants to merge 70 commits into
base: main
Choose a base branch
from

Conversation

sripathi-muralitharan
Copy link
Contributor

This PR adds a platform for Dromajo - a 64-bit RISC-V emulator that acts as a host for the manycore. It replaces the x86 host and performs all the host operations. The x86 host only runs VCS, controls time, manages the DPI and forwards packets between Dromajo and the manycore.

@dpetrisko
Copy link
Collaborator

The x86 host only runs VCS, controls time, manages the DPI and forwards packets between Dromajo and the manycore.

it still does the BP program loading currently, right?

@sripathi-muralitharan
Copy link
Contributor Author

The x86 host only runs VCS, controls time, manages the DPI and forwards packets between Dromajo and the manycore.

it still does the BP program loading currently, right?

If you are asking about loading the BP ELF into DRAM, that is not implemented yet. I am directly passing in the program as a argument to Dromajo. But @drichmond said we could use the bsg_manycore_loader.cpp library file to load the binary.

@dpetrisko
Copy link
Collaborator

Got it!

@mrutt92
Copy link
Collaborator

mrutt92 commented Apr 30, 2021

The x86 host only runs VCS, controls time, manages the DPI and forwards packets between Dromajo and the manycore.

it still does the BP program loading currently, right?

If you are asking about loading the BP ELF into DRAM, that is not implemented yet. I am directly passing in the program as a argument to Dromajo. But @drichmond said we could use the bsg_manycore_loader.cpp library file to load the binary.

This would require some modifications. That loader is for a rv32 binary and has some other vanilla core specific shiz in it.

@sripathi-muralitharan
Copy link
Contributor Author

The x86 host only runs VCS, controls time, manages the DPI and forwards packets between Dromajo and the manycore.

it still does the BP program loading currently, right?

If you are asking about loading the BP ELF into DRAM, that is not implemented yet. I am directly passing in the program as a argument to Dromajo. But @drichmond said we could use the bsg_manycore_loader.cpp library file to load the binary.

This would require some modifications. That loader is for a rv32 binary and has some other vanilla core specific shiz in it.

Well, that is atleast a base to start off with rather than writing one from scratch :)

@dpetrisko
Copy link
Collaborator

Yeah I would imagine the BP loader is much simpler — our linker script has like 4 sections

@mrutt92
Copy link
Collaborator

mrutt92 commented Apr 30, 2021

It should be so trivial, even a parrot could do it.

libraries/libraries.mk Outdated Show resolved Hide resolved
libraries/libraries.mk Outdated Show resolved Hide resolved
@sripathi-muralitharan sripathi-muralitharan force-pushed the bigblade-dromajo branch 2 times, most recently from 0fe92b4 to cf7e823 Compare June 22, 2021 00:46
@sripathi-muralitharan
Copy link
Contributor Author

I think I resolved most of the issues I was facing barring the ones in discussion offline. Can we get started on the review process @dpetrisko @drichmond @mrutt92 ?

@dpetrisko
Copy link
Collaborator

dpetrisko commented Jun 26, 2021

Yes @mrutt92 @drichmond , @sripathi-muralitharan will be leaving soon so it’s important we get this buttoned up by let’s say end of 1st week of july

libraries/libraries.mk Outdated Show resolved Hide resolved
@sripathi-muralitharan sripathi-muralitharan changed the base branch from bigblade to main July 17, 2021 19:40
@sripathi-muralitharan sripathi-muralitharan changed the title [WIP] Adding the Dromajo Platform Adding the Dromajo Platform Jul 17, 2021

err |= hb_bp_read_from_mc_bridge(&fence_resp_pkt, HB_MC_FIFO_RX_RSP);
is_vacant = fence_resp_pkt.response.data;
} while ((err == HB_MC_SUCCESS) && !((credits_used == 0) && is_vacant));
Copy link
Collaborator

@drichmond drichmond Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So BP counts down to 0, not up to the credit limit (as tiles do in the manycore), is that correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move error checking out of the while loop condition and make them discrete. It makes the termination condition hard to think about, and makes debugging later easier. (This would solve the |= comment above, too)

Then, I would make it: (credits_used != 0) || !is_vacant

This makes it clearer that the while loop will not terminate if either condition holds. Having the ! on the outside means I have to think about it and apply boolean algebra.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, BP counts down to 0 because the DPI module in the testbench returns the number of credits currently used (https://github.com/bespoke-silicon-group/bsg_manycore/blob/d7ab95c470bc3ade5dce2a18046339ea211f6a8c/testbenches/dpi/bsg_nonsynth_dpi_manycore.hpp#L116)

err = hb_bp_get_credits_used(&credits_used);
// In a real system, this function call makes no sense since we will be sending packets to the
// host on the network and are trying to check for credits to be zero and complete the fence.
// It is fine here because of the system setup.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more explicit about why it's fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment. It is fine in this setup because host packets do not go over the network. While checking for TX vacancy I don't think you want to send a packet over the network to do that.

// Intercept packets that are for the host and generate appropriate responses
// TODO: Currently, these packets don't go over the network. In the real system, they will and the simulation infrastructure
// must emulate that as best as possible.
if ((dromajo_to_mc_packet.request.x_dst == HOST_X_COORD) && (dromajo_to_mc_packet.request.y_dst == HOST_Y_COORD)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future I would factor this into two separate. methods, one for sending to the manycore and one for sending to the host. The if statement is fine, but there's a lot of code in a single method for transmitting packets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the packet transmission to the manycore in the same function and moved the host packet handling to another function. Can break it down further if needed.

Comment on lines 407 to 442
pkt = reinterpret_cast<__m128i *>(&dromajo_to_mc_packet);

// Allows the DPI interface to track response FIFO capacity
bool expect_response =
(dromajo_to_mc_packet.request.op_v2 != HB_MC_PACKET_OP_REMOTE_STORE) &&
(dromajo_to_mc_packet.request.op_v2 != HB_MC_PACKET_OP_REMOTE_SW) &&
(dromajo_to_mc_packet.request.op_v2 != HB_MC_PACKET_OP_CACHE_OP);

// Attempt packet transmission
// Since we trigger a call to the transmit FIFOs only when all the 32-bit Dromajo
// FIFOs are full (i.e. a single 128-bit packet is ready), we need to wait until
// the DPI FIFOs are ready to receive before advancing to the next operation.
// This can prevent filling up of the FIFOs. However, not doing this can help in
// identifying situations that might create backpressure in actual hardware and
// provision for it.
do {
advance_time();
err = dpi->tx_req(*pkt, expect_response);
} while (err != BSG_NONSYNTH_DPI_SUCCESS &&
(err == BSG_NONSYNTH_DPI_NO_CREDITS ||
err == BSG_NONSYNTH_DPI_NO_CAPACITY ||
err == BSG_NONSYNTH_DPI_NOT_WINDOW ||
err == BSG_NONSYNTH_DPI_BUSY ||
err == BSG_NONSYNTH_DPI_NOT_READY));

if (err == BSG_NONSYNTH_DPI_SUCCESS) {
for (int i = 0;i < 4; i++)
host_to_mc_req_fifo->fifo[i].pop();
}
else {
bsg_pr_err("%s: Packet transmission failed\n", __func__);
return HB_MC_FAIL;
}
}
is_empty = mc_is_fifo_empty(type);
} while (!is_empty);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like above, I would make this transmit_device_packet. Then it's clear it's host OR device.

But it's much easier to understand now.

  * Adding missing copyright headers
  * Bug fixes in passing arguments from x86 land to RISC-V land
  * Compiling manycore binary first in preparation to run SPMD/CUDA-lite programs on Dromajo/BP
  * Simulator updates for argument retrieval
  * Simulator updated for correct updates during fence (hopefully)
  * Adding dromajo_transmit_device_packet()
  * Updating address map, adding hb_bp_get_?x_fifo_entries
  * Adding check for entries in Dromajo TX FIFO during fence
  * Putting in newlines where required
  * Using a new linker script
  * CRT modifications for new linker script
  * Changing DRAM base for Dromajo
  * Argument parsing hacky fix to ignore the first argument; ARGP is just ughhhh
  * Copyright headers in some files
  * Refactoring placement of some make targets
  * Updates to comments
Comment on lines +43 to +67
/*
* Reads manycore specific symbols from the program binary
* @param[out] program: Pointer to a memory location containing the address of the manycore binary
* @param[out] program_size: Pointer to the manycore binary size
* @returns HB_MC_SUCCESS
*/
static int read_symbols(unsigned char **program, size_t *program_size) {
unsigned char *address;
uint64_t size;

// Read the address from the symbol table in the binary
__asm__ __volatile__ ("la %[manycore_start_addr], manycore"
: [manycore_start_addr] "=r" (address));

// The program size is already stored at a predefined location in the binary
__asm__ __volatile__ ("la t0, manycore_size\n\t"
"ld %[manycore_size], 0(t0)"
: [manycore_size] "=r" (size));

*program_size = size;
*program = address;

return HB_MC_SUCCESS;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function contains the C code to read the symbols from the program binary. Let me know how you guys want to go ahead (i.e. leave it and make it BP-specific somehow or throw it or keep this PR open until the FS stuff is figured out)

Comment on lines +28 to +42
# This assembly file includes the manycore binary if there is one

.section .rodata.manycore

.globl manycore
.align 4
.type manycore, @object
manycore:
.incbin BSG_MANYCORE_KERNELS

.globl manycore_size
.type manycore_size, @object
.align 4
manycore_size:
.int manycore_size - manycore
Copy link
Contributor Author

@sripathi-muralitharan sripathi-muralitharan Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is used to generate another assembly file where the BSG_MANYCORE_KERNELS flow variable is replaced with its value here during compilation. One hitch with running SPMD and CUDA tests is that you have to run the *.log target directly (i.e. you can't run main.exec and then run main.exe.log). This has to do with the visibility of the BSG_MANYCORE_KERNELS variable in the flow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants