Performance metrics vary between active lab machines #221
Open
ProjectsByJackHe opened this issue Jul 1, 2024 · 4 comments
Assignees: ProjectsByJackHe
Labels: lab (Specific to lab environment), P0
Milestone: v1

@ProjectsByJackHe (Collaborator) commented Jul 1, 2024

From many recent secnetperf runs, there is a noticeable performance difference between some lab machines, even though the code is identical, the BIOS configuration is the same, and SR-IOV is enabled on both.

For example, in this run: https://github.com/microsoft/netperf/actions/runs/11565315387/job/32192210637,
lab machines 05 and 10 were assigned to this Windows IOCP test job (WIN-2CSMQHE8ML4)
and got just under 20 Gbps throughput on TCP + IOCP.

But in this run: https://github.com/microsoft/netperf/actions/runs/11564283006/job/32189283432,
lab machines 25 and 26 were assigned to the same Windows IOCP test job (WIN-B9SEU47NHOT)
and got just under 9.8 Gbps throughput on TCP + IOCP.

The same gap shows up when comparing the lab machines hosting the static lab VMs against the dynamic lab VMs from the stateless lab (lab machines 42 and 43).

We need to investigate why the performance data is so different between the static lab and the stateless lab.

Fixing this unblocks #73.

@ProjectsByJackHe ProjectsByJackHe added the lab Specific to lab environment label Jul 1, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Netperf 1.0 TODO: Onboard more lab machines as self hosted runners so we are at 30-50% utilization rate. Netperf Version 1 TODO: Onboard more lab machines as self hosted runners so we are at 30-50% utilization rate. Jul 1, 2024
@nibanks nibanks added this to the v1 milestone Jul 3, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Netperf Version 1 TODO: Onboard more lab machines as self hosted runners so we are at 30-50% utilization rate. Achieve a 30-50% lab utilization rate. Jul 5, 2024
@ProjectsByJackHe ProjectsByJackHe self-assigned this Aug 8, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Achieve a 30-50% lab utilization rate. Uniform performance metrics for the entire cluster Oct 24, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Uniform performance metrics for the entire cluster Uniform performance metrics across all lab machines in-use Oct 24, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Uniform performance metrics across all lab machines in-use Uniform performance metrics across all active lab machines Oct 24, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Uniform performance metrics across all active lab machines Performance metrics vary between machines in our perf lab Oct 24, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Performance metrics vary between machines in our perf lab Performance metrics vary between active lab machines Oct 24, 2024
@ProjectsByJackHe (Collaborator, Author)

Here is what the RSS configurations look like on a "good perf" pair:

client: RR1-NETPERF-05-RSS.txt
server: RR1-NETPERF-10-RSS.txt

Here is what the RSS configurations look like on a "bad perf" pair:

client: RR1-NETPERF-25-RSS.txt
server: RR1-NETPERF-26-RSS.txt

TL;DR: they look the same.
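
For anyone repeating this check, dumps like the ones attached can be produced with the in-box NetAdapter cmdlets. A minimal sketch (the adapter names come from the configs quoted below and may differ per machine):

# Sketch: dump the RSS configuration for all adapters on this machine.
Get-NetAdapterRss | Out-File "$env:COMPUTERNAME-RSS.txt"
# Or target the physical NIC directly:
Get-NetAdapterRss -Name "SLOT 2 Port 1"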

@nibanks (Member) commented Oct 30, 2024

cc @mtfriesen: do you see any notable differences in these? I agree that the physical adapters ('Mellanox ConnectX-6 Dx Adapter') all look the same:

Name                                            : SLOT 2 Port 1
InterfaceDescription                            : Mellanox ConnectX-6 Dx Adapter
Enabled                                         : True
NumberOfReceiveQueues                           : 8
Profile                                         : Closest
BaseProcessor: [Group:Number]                   : 0:0
MaxProcessor: [Group:Number]                    : 1:38
MaxProcessors                                   : 8
RssProcessorArray: [Group:Number/NUMA Distance] : 1:0/0  1:2/0  1:4/0  1:6/0  1:8/0  1:10/0  1:12/0  1:14/0
                                                  1:16/0  1:18/0  1:20/0  1:22/0  1:24/0  1:26/0  1:28/0  1:30/0
                                                  1:32/0  1:34/0  1:36/0  1:38/0  0:0/32767  0:2/32767  0:4/32767  0:6/32767
                                                  0:8/32767  0:10/32767  0:12/32767  0:14/32767  0:16/32767  0:18/32767  0:20/32767  0:22/32767
                                                  0:24/32767  0:26/32767  0:28/32767  0:30/32767  0:32/32767  0:34/32767  0:36/32767  0:38/32767
IndirectionTable: [Group:Number]                : 1:0  1:2  1:4  1:6  1:8  1:10  1:12  0:0
                                                  1:0  1:2  1:4  1:6  1:8  1:10  1:12  0:0

The virtual adapters also look the same, except for some minor naming differences (Hyper-V Virtual Ethernet Adapter vs vEthernet (Mellanox ConnectX-6 Dx Adapter - Virtual Switch)):

Name                                            : vEthernet (200G)
InterfaceDescription                            : Hyper-V Virtual Ethernet Adapter
Enabled                                         : True
NumberOfReceiveQueues                           : 16
Profile                                         : ClosestStatic
BaseProcessor: [Group:Number]                   : 0:0
MaxProcessor: [Group:Number]                    : 0:63
MaxProcessors                                   : 8
RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0  0:2/0  0:4/0  0:6/0  0:8/0  0:10/0  0:12/0  0:14/0
                                                  0:16/0  0:18/0  0:20/0  0:22/0  0:24/0  0:26/0  0:28/0  0:30/0
                                                  0:32/0  0:34/0  0:36/0  0:38/0  
IndirectionTable: [Group:Number]                : 1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
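
As a side note, a mechanical way to confirm nothing is hiding in the attached dumps (file names from the earlier comment) is to diff them directly, e.g.:

# Sketch: show only the lines that differ between a 'good' and a 'bad' machine's RSS dump.
Compare-Object (Get-Content .\RR1-NETPERF-05-RSS.txt) (Get-Content .\RR1-NETPERF-25-RSS.txt)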

@mtfriesen (Collaborator)

I don't see any obvious functional differences, but the two pairs of machines were clearly not set up with the same steps, because one set has "vEthernet (200G)" as the NIC name and the other has the default "vEthernet (Mellanox ConnectX-6 Dx Adapter - Virtual Switch)".

If it ain't automated, it ain't gonna be repeatable or consistent, no matter how hard we manually try to get these two to behave the same.
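
To illustrate, the kind of scripted setup that would eliminate this drift is a one-liner per host. A minimal sketch, assuming the switch name "200G" and the physical NIC name from the dumps above (this is not the lab's actual provisioning script):

# Create the external vSwitch with a fixed, explicit name so every host
# ends up with the same "vEthernet (200G)" management vNIC.
New-VMSwitch -Name "200G" -NetAdapterName "SLOT 2 Port 1" -EnableIov $true -AllowManagementOS $true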

@nibanks (Member) commented Oct 30, 2024

I agree we need to standardize the setup process. That will be done as a part of moving to dynamic machines.

But since RSS seems to be the same, I suspect the problem is actually in the code. Unless there is code to consistently line up with RSS and NUMA, tests will end up on different processors/threads/NUMA nodes more or less at random; the scheduler or thread pool can only do so much for you. So the next step is to grab some CPU traces for 'good' and 'bad' runs. I suspect you will be able to get both on the same set of machines, given enough runs.
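
For reference, a minimal sketch of capturing such a trace with the in-box Windows Performance Recorder (the "CPU" profile is built in; the output file name is just an example):

wpr -start CPU                 # begin CPU sampling
# ... run the secnetperf scenario here ...
wpr -stop netperf-cpu.etl      # stop and write the trace
# Open the .etl in Windows Performance Analyzer (WPA) and compare CPU usage
# and thread scheduling between a 'good' and a 'bad' run.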
