Query: (1) why throughput higher than theoretical, (2) why lower throughput with 128B payload. #81

Open
108anup opened this issue Jul 29, 2022 · 13 comments
Labels
question Question in regards the design or implementation

Comments

@108anup
Contributor

108anup commented Jul 29, 2022

I am not sure what the best place is to ask this question, hence asking here as an issue. Please let me know if some other place is preferred.

Q1: I get cases where the observed throughput is higher than the theoretical maximum, e.g., 97 Gbps throughput with a 1472B payload where the theoretical is around 95 Gbps. 2 Gbps is a large enough difference that I can't explain it with a measurement mismatch of a few cycles. I noticed the recently changed Jupyter notebook also shows this, but there the mismatch is only 0.0005 Gbps or so. What might be the reasons for a higher-than-theoretical throughput measurement? If we under-measure the time to receive packets by even 100 cycles, that may explain a 0.0005 Gbps mismatch but not a 2 Gbps mismatch.

Q2: I observe relatively lower throughput with a 128B payload. Why might that be? Is it related to the bytes transmitted per flit? If so, a 64B payload should have a similar issue (55 bytes per flit). But we see that only the 128B payload has a ~5 Gbps gap to the theoretical maximum, while other payload sizes don't have much of a gap.

Calculation:
A flit is the AXI transfer data width (64B). For a 128B payload, frame size = 128 + UDP (8) + IP (20) + Ethernet (14) + FCS (4) = 174B, which requires 3 flits (3 * 64B >= 174B), i.e., 174/3 = 58 bytes sent per flit.
Similarly, for a 64B payload, bytes per flit is 55. For the other payload sizes, bytes per flit is >= 60.
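For reference, a minimal sketch of this calculation (assuming a 64B AXI flit and the header sizes listed above):

import math

FLIT_BYTES = 64                    # AXI transfer data width
FRAME_HEADERS = 8 + 20 + 14 + 4    # UDP + IP + Ethernet + FCS = 46B

def bytes_per_flit(payload_size: int) -> float:
    frame_size = payload_size + FRAME_HEADERS
    flits = math.ceil(frame_size / FLIT_BYTES)
    return frame_size / flits

print(bytes_per_flit(64))    # 110B frame, 2 flits -> 55.0
print(bytes_per_flit(128))   # 174B frame, 3 flits -> 58.0
print(bytes_per_flit(1472))  # 1518B frame, 24 flits -> 63.25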

@mariodruiz
Collaborator

Hi,

  1. Under what conditions are you getting this large mismatch? The CMAC should limit the maximum tx/rx bandwidth. However, if the burst is short, the FIFOs in the datapath can absorb this difference.
  2. There's an overhead in the HLS IP which significantly impacts segments in the range of 65 to 160 bytes.

In the throughput notebooks the overhead is shown:

# overhead is UDP (8), IP (20), Ethernet(14), FCS (4), IFG (12), preamble (7), start frame delimiter (1)
overhead = 8 + 20 + 14 + 4 + 12 + 7 + 1

@mariodruiz mariodruiz added the question Question in regards the design or implementation label Aug 1, 2022
@108anup
Contributor Author

108anup commented Aug 8, 2022

  1. I am using the provided notebook. I change the MAC/IP address configuration and the kernel configuration. I don't make any changes to the FPGA code, i.e., I use the vanilla xclbin. The TX/RX measurement is done on a single FPGA instead of 2 different FPGAs. The benchmarking FPGA sends packets to a device-under-test (DUT, running OpenNIC). The DUT forwards packets back to the benchmarking FPGA. The benchmarking FPGA has 2 kernels started (one in PRODUCER and the other in CONSUMER mode). The PRODUCER kernel continuously sends packets (not in bursts).
    1.1. For a 64B UDP payload (174B frame or packet), the TX throughput itself is around 1.163 Gbps above theoretical. This difference is at the frame level. At the application (payload) level, the difference is 0.67 Gbps above theoretical.
    1.2. (frame_level_throughput = application_level_throughput * (payload + 46) / payload; see the sketch after this list.)
    1.3. I.e., for a 64B payload, the throughput reported by the notebook = 49.907 Gbps, frame_level_throughput = 85.777 Gbps. The theoretical maximum throughput = 49.231 Gbps at the application level and 84.651 Gbps at the frame level.
  2. A 64-byte payload (frame size = 110B) should also be impacted by this overhead, right? But only the 128-byte payload (frame size = 174B) is impacted. From the notebook in the upstream code, the throughput difference to the theoretical max is 4.6 Gbps for the 128B payload (174B segment or frame) while it is 0 for the 64B payload (110B segment or frame). Segment or frame refers to the bits transferred in one AXI stream transaction.
    2.1. From what I understand, based on the reasoning you give, the impact should be seen for the 64B payload, not the 128B payload, as the 64B payload creates segments of size 110B, which lie in the range of 65 to 160 bytes.
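For reference, a minimal sketch of the conversion in 1.2 (assuming the 46B of headers are UDP 8 + IP 20 + Ethernet 14 + FCS 4):

HEADERS = 46  # UDP (8) + IP (20) + Ethernet (14) + FCS (4)

def frame_level_throughput(app_throughput_gbps: float, payload_size: int) -> float:
    return app_throughput_gbps * (payload_size + HEADERS) / payload_size

print(frame_level_throughput(49.907, 64))  # ~85.78 Gbps, the 64B example above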

@108anup
Contributor Author

108anup commented Aug 8, 2022

To double-check, I can run the vanilla, unmodified VNx notebook to see if the difference is still there. Since the difference is in the TX throughput itself, I would expect the same difference to still be there, as TX should not be affected much by the DUT. An increase in RX throughput could maybe come from packet duplication (though I have no reason to believe that packets would be duplicated).

@mariodruiz
Collaborator

1.1. For 64B UDP payload (174B frame or packet)

How do you get the 174B in here?

Each of the individual IPs that compose the network layer needs one or two extra clock cycles to process each packet/segment. I suppose that for 128-byte payloads the extra cycles stack up, impacting the throughput. This is something I haven't profiled, because for bulk data transfer you will not use small packet sizes.

The low performance for small packets is known. Given the current design, this is unavoidable.

Mario

@108anup
Contributor Author

108anup commented Aug 9, 2022

Sorry, typo there. For a 64B payload (64 + 46 = 110B frame).

@108anup
Contributor Author

108anup commented Aug 9, 2022

I understand that performance will be low for small packets. What I don't understand is why the performance for a 64B payload is better than the performance for a 128B payload; I get better throughput with 64B payloads.

The throughput I measure at the payload level using the provided notebook is 49.907 Gbps for 64B and 60.21 Gbps for 128B. If we translate these to frame level, using frame_level_throughput = payload_level_throughput * (payload size + header size) / (payload size), we get 85.77 Gbps for the 64B payload (110B frame) and 81.84 Gbps for the 128B payload, i.e., the packet-level throughput is higher for 64B payloads.

Even in the provided notebook, the throughput with 128B payloads is 5 Gbps lower than theoretical, while for 64B payloads it is close to theoretical, i.e., the efficiency for 64B payloads (the smaller payload) is better than that with 128B payloads (the larger payload).

@108anup
Contributor Author

108anup commented Aug 9, 2022

Also any thoughts on why TX throughput might be higher than theoretical?

@mariodruiz
Collaborator

This is how I compute these throughputs

udp = 8
ip = 20
eth = 14
fcs = 4
ifg = 12
pp_amble = 8  # preamble (7) + start frame delimiter (1)

def thr(payload_size: int):
    # fraction of the 100 Gbps line rate at payload level and frame level
    total_bytes = payload_size + udp + ip + eth + fcs + ifg + pp_amble
    payload_thr = payload_size / total_bytes
    frame_thr = (payload_size + ip + udp + eth) / total_bytes
    return payload_thr * 100.0, frame_thr * 100.0

So, thr(64) = (49.23076923076923, 81.53846153846153) and thr(128) = (65.97938144329896, 87.62886597938144)

I think your theoretical equation does not look right: as the payload (segment) size increases, the relative overhead decreases (the efficiency increases).

Correct me if I am missing something.

@mariodruiz
Collaborator

mariodruiz commented Aug 9, 2022

Someone already asked about this here. I suppose your theoretical throughput is what this person calls naked cmac; it still does not match the numbers I showed above. But I believe the Python snippet is the correct way to compute this.

@108anup
Contributor Author

108anup commented Aug 9, 2022

My definitions are the same except: frame_thr = (payload_size + ip + udp + eth + *fcs*) / total_bytes
The frame is what goes into tdata of the AXIS interfaces, i.e., payload + 46 bytes. This is also why tkeep in the LATENCY kernel is 18 = 64 - 46, i.e., an 18B payload gives a frame of size 1 flit (TDATA width).
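As a small sketch (same constants as your snippet, just counting FCS in the frame, plus the tkeep check):

udp, ip, eth, fcs, ifg, pp_amble = 8, 20, 14, 4, 12, 8
TDATA_BYTES = 64

def frame_thr(payload_size: int) -> float:
    total_bytes = payload_size + udp + ip + eth + fcs + ifg + pp_amble
    return 100.0 * (payload_size + udp + ip + eth + fcs) / total_bytes

assert 18 + udp + ip + eth + fcs == TDATA_BYTES  # 18B payload -> exactly one flit
print(frame_thr(64))  # ~84.6, vs ~81.5 when FCS is excluded from the frame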

  1. For the small packets, the issue is not in the theoretical computation. The issue is that the measured frame throughput for a 128B payload is smaller than for a 64B payload (in my setup). Even in the upstream notebook, the measured payload throughput for the 128B payload is further from theoretical (4 Gbps away: 65 Gbps theoretical, 61 Gbps measured) than for the 64B payload (0 Gbps difference: measured and theoretical = 49.231 Gbps). Even here, the measured L2 bandwidth is higher for the 110B packet (95.45) than for the 174B packet (93.84).

  2. For the theoretical calculations (at least the theoretical throughput at the payload level), I use exactly the same calculation as yours. I am seeing measured throughput higher than theoretical throughput at the payload level, by around 0.6 to 1.3 Gbps.

| Payload (B) | Theoretical Gbps at payload level | Measured Gbps at payload level | Difference (theoretical - measured) |
| --- | --- | --- | --- |
| 64 | 49.23076923 | 49.90756385 | -0.6767946191 |
| 128 | 65.97938144 | 60.21115183 | 5.768229612 |
| 192 | 74.41860465 | 75.26383693 | -0.8452322772 |
| 256 | 79.50310559 | 80.59648509 | -1.093379504 |
| 320 | 82.9015544 | 84.04169256 | -1.140138154 |
| 384 | 85.33333333 | 86.50693924 | -1.173605905 |
| 448 | 87.15953307 | 88.35825842 | -1.198725344 |
| 512 | 88.58131488 | 89.79962376 | -1.218308883 |
| 576 | 89.71962617 | 90.95361477 | -1.233988597 |
| 640 | 90.65155807 | 91.89833917 | -1.246781098 |
| 704 | 91.42857143 | 92.68604922 | -1.257477787 |
| 768 | 92.08633094 | 93.35285902 | -1.266528087 |
| 832 | 92.65033408 | 93.92464052 | -1.274306441 |
| 896 | 93.13929314 | 94.42033049 | -1.281037355 |
| 960 | 93.56725146 | 94.85417209 | -1.286920625 |
| 1024 | 93.94495413 | 95.23706712 | -1.292112994 |
| 1088 | 94.28076256 | 95.57749555 | -1.296732985 |
| 1152 | 94.58128079 | 95.88214486 | -1.300864075 |
| 1216 | 94.85179407 | 96.15638323 | -1.304589155 |
| 1280 | 95.09658247 | 96.40453013 | -1.307947665 |
| 1344 | 95.31914894 | 96.630162 | -1.311013068 |
| 1408 | 95.52238806 | 96.83620318 | -1.313815124 |
| 1472 | 95.70871261 | 97.02508574 | -1.316373129 |

In the above table, the measured throughput at the payload level is directly reported by the notebook, i.e., from ol_w0_tg.compute_app_throughput('tx'). The theoretical value is taken from what you described here, i.e., payload_thr = 100 * payload_size / (payload_size + udp + ip + eth + fcs + ifg + pp_amble).
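To make the comparison concrete, this is roughly how one row of the table is produced (a sketch; the measured number is simply what the notebook reports):

udp, ip, eth, fcs, ifg, pp_amble = 8, 20, 14, 4, 12, 8

def theoretical_payload_thr(payload_size: int) -> float:
    return 100.0 * payload_size / (payload_size + udp + ip + eth + fcs + ifg + pp_amble)

payload_size = 1472
measured = 97.02508574                               # from ol_w0_tg.compute_app_throughput('tx')
theoretical = theoretical_payload_thr(payload_size)  # ~95.71
print(theoretical - measured)                        # ~ -1.32, i.e., measured above theoretical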

In this table, there are 2 weird things:

  1. The behavior for the 128B payload is different from that of smaller and larger payloads (difference = +5.76 Gbps, vs. -0.67 Gbps for 64B and -0.84 Gbps for 192B).
  2. The measured throughput is significantly larger than theoretical (e.g., 97.02 Gbps measured vs. 95.708 Gbps theoretical for a 1472B payload).

@mariodruiz
Collaborator

mariodruiz commented Aug 9, 2022

How many packets are you sending for each payload size?

  1. 128-Byte is always going to give the worst efficiency. Without redesigning the whole network layer, this cannot be solved.
  2. There's a small chance that the CMAC is overclocked. This could only happen when connected to another Alveo card. You can try to run the same experiments when VNx is connected to different network equipment (no Alveo).

@108anup
Contributor Author

108anup commented Aug 9, 2022

For the above table, I sent 1 billion packets (basically the PRODUCER is set up as in the notebook). The results are similar for 1 million packets as well.

  1. Could you please elaborate on why 128B is worse than a 64B payload?
  2. Hmm, so the QSFP and the wire support more than 100 Gbps? In my setup, the FPGAs are connected to a 100G switch.

@mariodruiz
Collaborator

  1. As I mentioned above, each IP needs at least one more cycle to process each segment. For 128-Byte, these stack up, creating the highest overhead.
  2. The QSFP28 has four lanes that can each run at up to 28 Gbps.
