Performance metrics vary between active lab machines #221
Open
ProjectsByJackHe opened this issue Jul 1, 2024 · 4 comments
Assignees: ProjectsByJackHe
Labels: lab (Specific to lab environment), P0
Milestone: v1

@ProjectsByJackHe (Collaborator) commented Jul 1, 2024

From many recent secnetperf runs, there is a noticeable performance difference between some lab machines, even though the code is identical, the BIOS configuration is the same, and SR-IOV is enabled on both.

For example, in this run: https://github.com/microsoft/netperf/actions/runs/11565315387/job/32192210637,
lab machines 05 and 10 were assigned to this Windows IOCP test job (WIN-2CSMQHE8ML4)
and got just under 20 Gbps throughput on TCP + IOCP.

But in this run: https://github.com/microsoft/netperf/actions/runs/11564283006/job/32189283432,
lab machines 25 and 26 were assigned to the same Windows IOCP test job (WIN-B9SEU47NHOT)
and got just under 9.8 Gbps throughput on TCP + IOCP.

The same gap shows up when comparing the lab machines hosting the static lab VMs against the dynamic lab VMs from the stateless lab (lab machines 42 and 43).

We need to investigate why the performance data is so different between the static lab and the stateless lab.

Fixing this unblocks #73.

@ProjectsByJackHe ProjectsByJackHe added the lab Specific to lab environment label Jul 1, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Netperf 1.0 TODO: Onboard more lab machines as self hosted runners so we are at 30-50% utilization rate. Netperf Version 1 TODO: Onboard more lab machines as self hosted runners so we are at 30-50% utilization rate. Jul 1, 2024
@nibanks nibanks added this to the v1 milestone Jul 3, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Netperf Version 1 TODO: Onboard more lab machines as self hosted runners so we are at 30-50% utilization rate. Achieve a 30-50% lab utilization rate. Jul 5, 2024
@ProjectsByJackHe ProjectsByJackHe self-assigned this Aug 8, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Achieve a 30-50% lab utilization rate. Uniform performance metrics for the entire cluster Oct 24, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Uniform performance metrics for the entire cluster Uniform performance metrics across all lab machines in-use Oct 24, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Uniform performance metrics across all lab machines in-use Uniform performance metrics across all active lab machines Oct 24, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Uniform performance metrics across all active lab machines Performance metrics vary between machines in our perf lab Oct 24, 2024
@ProjectsByJackHe ProjectsByJackHe changed the title Performance metrics vary between machines in our perf lab Performance metrics vary between active lab machines Oct 24, 2024
@ProjectsByJackHe (Collaborator, Author)

Here is what the RSS configurations look like on a "good perf" pair:

client: RR1-NETPERF-05-RSS.txt
server: RR1-NETPERF-10-RSS.txt

Here is what the RSS configurations look like on a "bad perf" pair:

client: RR1-NETPERF-25-RSS.txt
server: RR1-NETPERF-26-RSS.txt

TL;DR: they look the same.
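
For anyone repeating this check, dumps like the ones attached can be produced with the in-box NetAdapter cmdlets. A minimal sketch (the adapter names come from the configs quoted below and may differ per machine):

# Sketch: dump the RSS configuration for all adapters on this machine.
Get-NetAdapterRss | Out-File "$env:COMPUTERNAME-RSS.txt"
# Or target the physical NIC directly:
Get-NetAdapterRss -Name "SLOT 2 Port 1"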

@nibanks (Member) commented Oct 30, 2024

cc @mtfriesen: do you see any notable differences in these? I agree that the physical adapters ('Mellanox ConnectX-6 Dx Adapter') all look the same:

Name                                            : SLOT 2 Port 1
InterfaceDescription                            : Mellanox ConnectX-6 Dx Adapter
Enabled                                         : True
NumberOfReceiveQueues                           : 8
Profile                                         : Closest
BaseProcessor: [Group:Number]                   : 0:0
MaxProcessor: [Group:Number]                    : 1:38
MaxProcessors                                   : 8
RssProcessorArray: [Group:Number/NUMA Distance] : 1:0/0  1:2/0  1:4/0  1:6/0  1:8/0  1:10/0  1:12/0  1:14/0
                                                  1:16/0  1:18/0  1:20/0  1:22/0  1:24/0  1:26/0  1:28/0  1:30/0
                                                  1:32/0  1:34/0  1:36/0  1:38/0  0:0/32767  0:2/32767  0:4/32767  0:6/32767
                                                  0:8/32767  0:10/32767  0:12/32767  0:14/32767  0:16/32767  0:18/32767  0:20/32767  0:22/32767
                                                  0:24/32767  0:26/32767  0:28/32767  0:30/32767  0:32/32767  0:34/32767  0:36/32767  0:38/32767
IndirectionTable: [Group:Number]                : 1:0  1:2  1:4  1:6  1:8  1:10  1:12  0:0
                                                  1:0  1:2  1:4  1:6  1:8  1:10  1:12  0:0

The virtual adapters also look the same, except for some minor naming differences (Hyper-V Virtual Ethernet Adapter vs vEthernet (Mellanox ConnectX-6 Dx Adapter - Virtual Switch)):

Name                                            : vEthernet (200G)
InterfaceDescription                            : Hyper-V Virtual Ethernet Adapter
Enabled                                         : True
NumberOfReceiveQueues                           : 16
Profile                                         : ClosestStatic
BaseProcessor: [Group:Number]                   : 0:0
MaxProcessor: [Group:Number]                    : 0:63
MaxProcessors                                   : 8
RssProcessorArray: [Group:Number/NUMA Distance] : 0:0/0  0:2/0  0:4/0  0:6/0  0:8/0  0:10/0  0:12/0  0:14/0
                                                  0:16/0  0:18/0  0:20/0  0:22/0  0:24/0  0:26/0  0:28/0  0:30/0
                                                  0:32/0  0:34/0  0:36/0  0:38/0  
IndirectionTable: [Group:Number]                : 1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
                                                  1:14	1:0	1:2	1:4	1:6	1:8	1:10	1:12	
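
As a side note, a mechanical way to confirm nothing is hiding in the attached dumps (file names from the earlier comment) is to diff them directly, e.g.:

# Sketch: show only the lines that differ between a 'good' and a 'bad' machine's RSS dump.
Compare-Object (Get-Content .\RR1-NETPERF-05-RSS.txt) (Get-Content .\RR1-NETPERF-25-RSS.txt)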

@mtfriesen (Collaborator)

I don't see any obvious functional differences, but the two pairs of machines were clearly not set up with the same steps, because one set has "vEthernet (200G)" as the NIC name and the other has the default "vEthernet (Mellanox ConnectX-6 Dx Adapter - Virtual Switch)".

If it ain't automated, it ain't gonna be repeatable or consistent, no matter how hard we manually try to get these two to behave the same.
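
To illustrate, the kind of scripted setup that would eliminate this drift is a one-liner per host. A minimal sketch, assuming the switch name "200G" and the physical NIC name from the dumps above (this is not the lab's actual provisioning script):

# Create the external vSwitch with a fixed, explicit name so every host
# ends up with the same "vEthernet (200G)" management vNIC.
New-VMSwitch -Name "200G" -NetAdapterName "SLOT 2 Port 1" -EnableIov $true -AllowManagementOS $true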

@nibanks (Member) commented Oct 30, 2024

I agree we need to standardize the setup process. That will be done as a part of moving to dynamic machines.

But since RSS seems to be the same, I suspect the problem is actually in the code. Unless there is code to consistently line up with RSS and NUMA, tests will end up on different processors/threads/NUMA nodes more or less at random; the scheduler or thread pool can only do so much for you. So the next step is to grab some CPU traces for 'good' and 'bad' runs. I suspect you will be able to get both on the same set of machines, given enough runs.
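
For reference, a minimal sketch of capturing such a trace with the in-box Windows Performance Recorder (the "CPU" profile is built in; the output file name is just an example):

wpr -start CPU                 # begin CPU sampling
# ... run the secnetperf scenario here ...
wpr -stop netperf-cpu.etl      # stop and write the trace
# Open the .etl in Windows Performance Analyzer (WPA) and compare CPU usage
# and thread scheduling between a 'good' and a 'bad' run.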
