Query: (1) why throughput higher than theoretical, (2) why lower throughput with 128B payload. #81
Comments
Hi,
In the throughput notebooks the overhead is shown:
[overhead table from the notebook not captured in this export]
To double check, I can run the vanilla, unmodified VNx notebook to see if the difference is still there. Since the difference is in the TX throughput itself, I would expect the same difference to still be there, as TX should not be affected much by the DUT. The increase in RX throughput could perhaps come from packet duplication (though I have no reason to believe that packets would be duplicated).
How do you get the 174B in here? Each of the individual IPs that compose the network layer needs one or two extra clock cycles to process each packet/segment. I suppose that for 128-byte payloads the extra cycles stack up, impacting the throughput. This is something I haven't profiled, because for bulk data transfer you will not use a small packet size. The low performance for small packets is known; given the current design, it is unavoidable.
Mario
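As a rough illustration of this effect, here is a toy model (not from the VNx sources): it assumes a 64B (512-bit) AXI4-Stream data path, an assumed ~322 MHz CMAC user clock, and a hypothetical `extra_cycles` of per-packet overhead, then takes the minimum of the pipeline-limited and wire-limited rates. It shows that a fixed per-packet cost hits short frames hardest, although it does not by itself reproduce the specific 128B dip discussed below.

```python
# Toy model only -- data width, clock frequency and per-packet overhead
# are assumptions, not profiled VNx numbers.
import math

FLIT_BYTES = 64               # 512-bit AXI4-Stream beat
CLOCK_HZ = 322.266e6          # assumed CMAC user-clock frequency
HDRS = 8 + 20 + 14 + 4        # UDP + IP + Ethernet + FCS
IFG_PREAMBLE = 12 + 8         # inter-frame gap + preamble on the wire

def modelled_payload_gbps(payload: int, extra_cycles: int) -> float:
    frame = payload + HDRS
    flits = math.ceil(frame / FLIT_BYTES)
    # pipeline-limited: one packet every (flits + extra_cycles) cycles
    pipeline = payload * 8 * CLOCK_HZ / (flits + extra_cycles)
    # wire-limited: payload share of the 100G line
    wire = 100e9 * payload / (frame + IFG_PREAMBLE)
    return min(pipeline, wire) / 1e9

for payload in (64, 128, 256, 1472):
    print(payload, [round(modelled_payload_gbps(payload, e), 1) for e in (0, 2, 4)])
# Small payloads lose the most as extra_cycles grows; large payloads stay
# wire-limited (e.g. 1472 B remains ~95.7 Gbps for all three cases).
```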
Sorry, typo there. For a 64B payload (64 + 46 = 110B frame).
I understand that performance will be low for small packets. What I don't understand is why performance for a 64B payload is better than performance for a 128B payload; I get better throughput with 64B payloads. The throughput I measure at payload level using the provided notebook is [table not captured in this export]. Even in the provided notebook, the throughput with 128B payloads is 5 Gbps lower than theoretical, while for 64B payloads it is close to theoretical, i.e., the efficiency for 64B payloads (the smaller payload) is better than that for 128B payloads (the larger payload).
Also, any thoughts on why TX throughput might be higher than theoretical?
This is how I compute these throughputs:

```python
# Header/overhead sizes in bytes
udp = 8
ip = 20
eth = 14
fcs = 4
ifg = 12        # inter-frame gap
pp_amble = 8    # preamble + start-of-frame delimiter

def thr(payload_size: int):
    total_bytes = payload_size + udp + ip + eth + fcs + ifg + pp_amble
    payload_thr = payload_size / total_bytes
    frame_thr = (payload_size + ip + udp + eth) / total_bytes
    # x100 gives percent of line rate, which equals Gbps on a 100G link
    return payload_thr * 100.0, frame_thr * 100.0
```

So, I think your theoretical equation does not look right: as the payload (segment) size increases, the overhead decreases (the efficiency increases). Correct me if I am missing something.
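For reference, evaluating thr() for the payload sizes discussed in this thread (these are just the outputs of the formula above, interpreted as Gbps at a 100G line rate):

```python
# Uses thr() from the snippet above.
for p in (64, 128, 1472):
    payload_gbps, frame_gbps = thr(p)
    print(f"{p:5d} B payload: payload-level {payload_gbps:4.1f} Gbps, "
          f"frame-level {frame_gbps:4.1f} Gbps")
# 64 B   -> 49.2 / 81.5
# 128 B  -> 66.0 / 87.6
# 1472 B -> 95.7 / 98.4
```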
Someone already asked about this here; I suppose your theoretical throughput is what this person calls "naked CMAC", but it still does not match the numbers I showed above. However, I believe the Python snippet is the correct way to compute this.
My defs are the same except: […]

In the above table, the measured throughput at payload level is directly reported by the notebook, i.e., from […]. In this table, there are 2 weird things: […]
How many packets are you sending for each payload size?
For the above table, I sent 1 billion packets (basically the PRODUCER is configured as in the notebook). Results are similar for 1 million packets as well.
I am not sure what the best place to ask this question is, hence I am asking here as an issue. Please let me know if some other place is preferred.
Q1: I get cases where the observed throughput is higher than the theoretical maximum, e.g., 97 Gbps throughput with a 1472B payload, where the theoretical is around 95 Gbps. 2 Gbps is a large enough difference that I can't explain it by a measurement mismatch of a few cycles. I noticed the recently changed Jupyter notebook also shows this, but the mismatch there is only about 0.0005 Gbps. What might be the reasons for a higher-than-theoretical throughput measurement? If we under-measure the time to receive packets by even 100 cycles, that may explain a 0.0005 Gbps mismatch, but not a 2 Gbps mismatch.
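As a quick sanity check of that last point, here is a sketch of how a fixed cycle-count error would inflate the reported number. It assumes the notebook derives throughput as bits divided by (measured cycles × clock period), and the ~322 MHz clock is an assumption:

```python
# Sketch: effect of under-measuring the receive window by a fixed number
# of cycles.  CLOCK_HZ and the measurement formula are assumptions.
CLOCK_HZ = 322.266e6

def apparent_gbps(payload_bytes: int, n_packets: int,
                  true_gbps: float, cycle_error: int) -> float:
    bits = payload_bytes * 8 * n_packets
    true_cycles = bits / (true_gbps * 1e9) * CLOCK_HZ
    # throughput computed over a window that is cycle_error cycles too short
    return bits * CLOCK_HZ / (true_cycles - cycle_error) / 1e9

print(apparent_gbps(1472, 10**9, 95.7, 100))  # ~95.7000002 (negligible)
print(apparent_gbps(1472, 10**6, 95.7, 100))  # ~95.70024   (nowhere near +2 Gbps)
```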
Q2: I observe relatively lower throughput with a 128B payload. Why might that be? Is it to do with the number of bytes transmitted per flit? If so, a 64B payload should have a similar issue (55 bytes per flit). But we see that only the 128B payload has a ~5 Gbps gap to the theoretical maximum, while other payload sizes don't have much of a gap.
Calculation:
A flit is one AXI transfer of the data width (64B). For a 128B payload, frame size = 128 + UDP (8) + IP (20) + Ethernet (14) + FCS (4) = 174B, which requires 3 flits (3 × 64B ≥ 174B), i.e., 174 / 3 = 58 bytes sent per flit.
Similarly, for a 64B payload, bytes per flit is 55 (110B frame over 2 flits). For other payload sizes, bytes per flit is ≥ 60.
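A small sketch of the same arithmetic, assuming a 64B (512-bit) AXI data width and the UDP/IP/Ethernet/FCS header sizes used above:

```python
import math

FLIT_BYTES = 64               # assumed 512-bit AXI4-Stream data width
HDRS = 8 + 20 + 14 + 4        # UDP + IP + Ethernet + FCS

for payload in (64, 128, 256, 512, 1024, 1472):
    frame = payload + HDRS
    flits = math.ceil(frame / FLIT_BYTES)
    print(f"{payload:5d} B payload -> {frame:4d} B frame, "
          f"{flits:2d} flits, {frame / flits:6.2f} B/flit")
# 64 B  ->  110 B frame,  2 flits, 55.00 B/flit
# 128 B ->  174 B frame,  3 flits, 58.00 B/flit
# 256 B and above stay at >= 60 B/flit (1472 B -> 24 flits, 63.25 B/flit)
```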