Reasoning inconsistency general statistics offered for 'vmcategory' for AzurePublicDatasetV2 #38

clevilll opened this issue Aug 8, 2024 · 0 comments


clevilll commented Aug 8, 2024

Hi,
I was exploring AzurePublicDatasetV2 and checked the accompanying notebook Azure 2019 Public Dataset V2 - Trace Analysis.ipynb, including its General Statistics section. There it is shown that, after reading vmtable.csv and applying a slight transformation, two extra columns derived from the other features, 'lifetime' and 'corehour', are added to trace_dataframe.

So I could reproduce the results shown in the notebook with minor updates:

TraceLegend = "Azure 2019 - Public Dataset V2"
AzureLegend = "Azure 2019"

def CategoryPlot(df, azure2019):
    # Original notebook implementation, kept for reference:
    #dataset = pd.DataFrame(df.groupby('vmcategory')['corehour'].sum().rename('corehour'))
    #dataset = dataset.rename(columns={'vmcategory': 'Bucket'})
    #dataset['value'] = dataset['corehour']/dataset['corehour'].sum() * 100
    #dataset = dataset.drop('corehour', axis=1)
    #dataset = dataset.sort_index().T
    #azure2019 = azure2019.to_frame().T
    #frames = [dataset, azure2019]
    #result = pd.concat(frames)

    # Updated implementation:
    dataset = pd.DataFrame(df.groupby('vmcategory')['corehour'].sum().rename('corehour'))
    dataset['percentage'] = dataset['corehour'] / dataset['corehour'].sum() * 100
    del dataset['corehour']
    dataset = dataset.reset_index().rename({'index': 'vmcategory'}, axis='columns')
    result = pd.concat([dataset, azure2019], axis=1, ignore_index=False, sort=True)
    del result['vmcategory']
    result = result.T
    ax = result.plot.bar(stacked=True, title='VM Category Distribution', color=['lightskyblue', 'orange', '0.75'], ylim=(0, 100))
    ax.set_ylabel('% of core hours')
    ax.set_xticklabels([TraceLegend, AzureLegend], rotation=0)
    ax.legend(["Delay-insensitive", "Interactive", "Unknown"], loc='upper center', title='Categories', bbox_to_anchor=(0.5, -0.10), ncol=3, fontsize=10.5)

CategoryPlot(trace_dataframe, category_azure2019)
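
In case someone wants to sanity-check CategoryPlot without downloading the trace, here is a minimal sketch with made-up inputs; the shape of category_azure2019 (columns 'vmcategory' and 'value') is my assumption based on the tables printed further below:

import pandas as pd

# hypothetical per-VM core hours, roughly matching the real split
toy_trace = pd.DataFrame({
    'vmcategory': ['Delay-insensitive', 'Interactive', 'Unknown', 'Unknown'],
    'corehour':   [580.0, 320.0, 50.0, 40.0],
})
# reference percentages, taken from the published numbers shown below
toy_azure2019 = pd.DataFrame({
    'vmcategory': ['Delay-insensitive', 'Interactive', 'Unknown'],
    'value':      [51, 43, 6],
})
CategoryPlot(toy_trace, toy_azure2019)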

[Figure: stacked bar chart comparing the VM category distribution of the reproduced trace against the Azure 2019 reference.]

In short, this approach yields the following for "Azure 2019 - Public Dataset V2" and "Azure 2019":

dataset = pd.DataFrame(trace_dataframe.groupby('vmcategory')['corehour'].sum().rename('corehour'))
dataset['percentage'] = dataset['corehour'] / dataset['corehour'].sum() * 100
#dataset = dataset.drop('corehour', axis=1)
del dataset['corehour']
dataset = dataset.reset_index().rename({'index': 'vmcategory'}, axis='columns')
#dataset = dataset.sort_index()#.T
print(dataset.to_markdown(tablefmt="grid"))
+----+-------------------+--------------+
|    | vmcategory        |   percentage |
+====+===================+==============+
|  0 | Delay-insensitive |     58.4396  |
+----+-------------------+--------------+
|  1 | Interactive       |     32.5105  |
+----+-------------------+--------------+
|  2 | Unknown           |      9.04996 |
+----+-------------------+--------------+
azure2019 = category_azure2019#.T
print(azure2019.to_markdown(tablefmt="grid"))
+----+-------------------+---------+
|    | vmcategory        |   value |
+====+===================+=========+
|  0 | Delay-insensitive |      51 |
+----+-------------------+---------+
|  1 | Interactive       |      43 |
+----+-------------------+---------+
|  2 | Unknown           |       6 |
+----+-------------------+---------+
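
For readability, merging the two frames (both have a 'vmcategory' column, as printed above) puts the reproduced percentages next to the published values in one table:

comparison = dataset.merge(azure2019, on='vmcategory')
print(comparison.to_markdown(tablefmt="grid"))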

So far we have reproduced the results of the provided notebook. But when you take the following approach, reading vmtable.csv directly and processing only the column of interest, 'vmcategory', you get a different result!

import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

#data_path = 'https://azurecloudpublicdataset2.z19.web.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'
data_path  = 'https://azurepublicdatasettraces.blob.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'

headers = ['vmid', 'subscriptionid', 'deploymentid', 'vmcreated', 'vmdeleted', 'maxcpu', 'avgcpu', 'p95maxcpu',
           'vmcategory', 'vmcorecountbucket', 'vmmemorybucket']
data = pd.read_csv(data_path, header=None, index_col=False, names=headers, delimiter=',')
#data.head(10)

# slice data
df = data[['vmcategory', 'vmid']]

# count occurrences of each value in the 'vmcategory' column
counts = df.vmcategory.value_counts()

# the same counts as a fraction of the total
percs = df.vmcategory.value_counts(normalize=True)

# and as a formatted percentage string
perc = df.vmcategory.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

# concatenate results into one DataFrame
tf = pd.concat([counts, percs, perc], axis=1, keys=['count', 'percentage', '%'])
print(tf.to_markdown(tablefmt="grid"))

+-------------------+---------+--------------+-------+
| vmcategory        |   count |   percentage | %     |
+===================+=========+==============+=======+
| Unknown           | 2457455 |    0.911672  | 91.2% |
+-------------------+---------+--------------+-------+
| Delay-insensitive |  159615 |    0.0592143 | 5.9%  |
+-------------------+---------+--------------+-------+
| Interactive       |   78478 |    0.0291139 | 2.9%  |
+-------------------+---------+--------------+-------+

[Figure: bar chart of per-VM counts by 'vmcategory', dominated by the Unknown class.]

This shows that the latter approach puts the Unknown class at 91% of all VMs, while the first approach gives the Delay-insensitive class a 58% share!

  • Which one is correct? Can someone explain why, instead of applying df.vmcategory.value_counts(normalize=True) directly to the column of interest, 'vmcategory', the data provider used df.groupby('vmcategory')['corehour'].sum()?
  • Can someone shed light, from domain knowledge of cloud data, on these inconsistent outputs, and on what role 'corehour' plays in the VM categorization of the first approach? (See the toy sketch below.)
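
My working guess, which I could not confirm anywhere in the repo, is that the two statistics answer different questions: value_counts(normalize=True) weights every VM equally, while groupby('vmcategory')['corehour'].sum() weights each VM by lifetime × cores, so a few long-lived, many-core VMs can dominate. A toy example (all numbers made up) shows how the same data flips between the two views:

import pandas as pd

# nine short-lived single-core Unknown VMs vs. one long-lived 8-core Delay-insensitive VM
toy = pd.DataFrame({
    'vmcategory': ['Unknown'] * 9 + ['Delay-insensitive'],
    'lifetime': [1.0] * 9 + [100.0],          # hours
    'vmcorecountbucket': [1] * 9 + [8],       # cores
})
toy['corehour'] = toy['lifetime'] * toy['vmcorecountbucket']

# per-VM count: Unknown dominates (90%)
print(toy.vmcategory.value_counts(normalize=True).mul(100).round(1))

# corehour-weighted: Delay-insensitive dominates (~98.9%)
by_corehour = toy.groupby('vmcategory')['corehour'].sum()
print((by_corehour / by_corehour.sum() * 100).round(1))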

PS: in case one is interested in how 'corehour' is calculated out of 'lifetime' and 'vmcorecountbucket', and how they are related:

# Compute VM lifetime from the 'vmcreated'/'vmdeleted' timestamps (seconds), floored at 300 s and converted to hours
trace_dataframe['lifetime'] = np.maximum((trace_dataframe['vmdeleted'] - trace_dataframe['vmcreated']), 300) / 3600

# Compute VM corehour
trace_dataframe['corehour'] = trace_dataframe['lifetime'] * trace_dataframe['vmcorecountbucket']
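
Assuming 'vmcreated'/'vmdeleted' in the raw file are second-granularity timestamps as in the notebook, the corehour-weighted split can also be recomputed straight from the `data` frame loaded in the second snippet, which should land near the 58/33/9 split reproduced above:

data['lifetime'] = np.maximum(data['vmdeleted'] - data['vmcreated'], 300) / 3600   # hours, floored at 300 s
data['corehour'] = data['lifetime'] * data['vmcorecountbucket']
by_corehour = data.groupby('vmcategory')['corehour'].sum()
print((by_corehour / by_corehour.sum() * 100).round(2))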