Reasoning inconsistency general statistics offered for 'vmcategory' for AzurePublicDatasetV2 #38

clevilll opened this issue Aug 8, 2024 · 0 comments


clevilll commented Aug 8, 2024

Hi,
I was exploring AzurePublicDatasetV2 and checked the accompanying notebook Azure 2019 Public Dataset V2 - Trace Analysis.ipynb, including its General Statistics section. There it is shown that, after reading vmtable.csv and applying a slight transformation, two extra columns derived from the other features, 'lifetime' and 'corehour', are added to trace_dataframe.

So I could reproduce the results shown in the notebook with minor updates:

TraceLegend = "Azure 2019 - Public Dataset V2"
AzureLegend = "Azure 2019"

def CategoryPlot(df, azure2019):
    # Original notebook implementation, kept for reference:
    #dataset = pd.DataFrame(df.groupby('vmcategory')['corehour'].sum().rename('corehour'))
    #dataset = dataset.rename(columns={'vmcategory': 'Bucket'})
    #dataset['value'] = dataset['corehour']/dataset['corehour'].sum() * 100
    #dataset = dataset.drop('corehour', axis=1)
    #dataset = dataset.sort_index().T
    #azure2019 = azure2019.to_frame().T
    #frames = [dataset, azure2019]
    #result = pd.concat(frames)

    # Updated implementation:
    dataset = pd.DataFrame(df.groupby('vmcategory')['corehour'].sum().rename('corehour'))
    dataset['percentage'] = dataset['corehour'] / dataset['corehour'].sum() * 100
    del dataset['corehour']
    dataset = dataset.reset_index().rename({'index': 'vmcategory'}, axis='columns')
    result = pd.concat([dataset, azure2019], axis=1, ignore_index=False, sort=True)
    del result['vmcategory']
    result = result.T
    ax = result.plot.bar(stacked=True, title='VM Category Distribution', color=['lightskyblue', 'orange', '0.75'], ylim=(0, 100))
    ax.set_ylabel('% of core hours')
    ax.set_xticklabels([TraceLegend, AzureLegend], rotation=0)
    ax.legend(["Delay-insensitive", "Interactive", "Unknown"], loc='upper center', title='Categories', bbox_to_anchor=(0.5, -0.10), ncol=3, fontsize=10.5)

CategoryPlot(trace_dataframe, category_azure2019)
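
In case someone wants to sanity-check CategoryPlot without downloading the trace, here is a minimal sketch with made-up inputs; the shape of category_azure2019 (columns 'vmcategory' and 'value') is my assumption based on the tables printed further below:

import pandas as pd

# hypothetical per-VM core hours, roughly matching the real split
toy_trace = pd.DataFrame({
    'vmcategory': ['Delay-insensitive', 'Interactive', 'Unknown', 'Unknown'],
    'corehour':   [580.0, 320.0, 50.0, 40.0],
})
# reference percentages, taken from the published numbers shown below
toy_azure2019 = pd.DataFrame({
    'vmcategory': ['Delay-insensitive', 'Interactive', 'Unknown'],
    'value':      [51, 43, 6],
})
CategoryPlot(toy_trace, toy_azure2019)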

[Figure: stacked bar chart comparing the VM category distribution of the reproduced trace against the Azure 2019 reference.]

In short, this approach yields the following for "Azure 2019 - Public Dataset V2" and "Azure 2019":

dataset = pd.DataFrame(trace_dataframe.groupby('vmcategory')['corehour'].sum().rename('corehour'))
dataset['percentage'] = dataset['corehour'] / dataset['corehour'].sum() * 100
#dataset = dataset.drop('corehour', axis=1)
del dataset['corehour']
dataset = dataset.reset_index().rename({'index': 'vmcategory'}, axis='columns')
#dataset = dataset.sort_index()#.T
print(dataset.to_markdown(tablefmt="grid"))
+----+-------------------+--------------+
|    | vmcategory        |   percentage |
+====+===================+==============+
|  0 | Delay-insensitive |     58.4396  |
+----+-------------------+--------------+
|  1 | Interactive       |     32.5105  |
+----+-------------------+--------------+
|  2 | Unknown           |      9.04996 |
+----+-------------------+--------------+
azure2019 = category_azure2019#.T
print(azure2019.to_markdown(tablefmt="grid"))
+----+-------------------+---------+
|    | vmcategory        |   value |
+====+===================+=========+
|  0 | Delay-insensitive |      51 |
+----+-------------------+---------+
|  1 | Interactive       |      43 |
+----+-------------------+---------+
|  2 | Unknown           |       6 |
+----+-------------------+---------+
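
For readability, merging the two frames (both have a 'vmcategory' column, as printed above) puts the reproduced percentages next to the published values in one table:

comparison = dataset.merge(azure2019, on='vmcategory')
print(comparison.to_markdown(tablefmt="grid"))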

So far we have reproduced the results of the provided notebook. But when you take the following approach, reading vmtable.csv directly and processing only the column of interest, 'vmcategory', you get a different result!

import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

#data_path = 'https://azurecloudpublicdataset2.z19.web.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'
data_path  = 'https://azurepublicdatasettraces.blob.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'

headers = ['vmid', 'subscriptionid', 'deploymentid', 'vmcreated', 'vmdeleted', 'maxcpu', 'avgcpu', 'p95maxcpu',
           'vmcategory', 'vmcorecountbucket', 'vmmemorybucket']
data = pd.read_csv(data_path, header=None, index_col=False, names=headers, delimiter=',')
#data.head(10)

# slice data
df = data[['vmcategory', 'vmid']]

# count occurrences of each value in the 'vmcategory' column
counts = df.vmcategory.value_counts()

# the same counts as a fraction of the total
percs = df.vmcategory.value_counts(normalize=True)

# and as a formatted percentage string
perc = df.vmcategory.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

# concatenate results into one DataFrame
tf = pd.concat([counts, percs, perc], axis=1, keys=['count', 'percentage', '%'])
print(tf.to_markdown(tablefmt="grid"))

+-------------------+---------+--------------+-------+
| vmcategory        |   count |   percentage | %     |
+===================+=========+==============+=======+
| Unknown           | 2457455 |    0.911672  | 91.2% |
+-------------------+---------+--------------+-------+
| Delay-insensitive |  159615 |    0.0592143 | 5.9%  |
+-------------------+---------+--------------+-------+
| Interactive       |   78478 |    0.0291139 | 2.9%  |
+-------------------+---------+--------------+-------+

[Figure: bar chart of per-VM counts by 'vmcategory', dominated by the Unknown class.]

This shows that the latter approach puts the Unknown class at 91% of all VMs, while the first approach gives the Delay-insensitive class a 58% share!

  • Which one is correct? Can someone explain why, instead of applying df.vmcategory.value_counts(normalize=True) directly to the column of interest, 'vmcategory', the data provider used df.groupby('vmcategory')['corehour'].sum()?
  • Can someone shed light, from domain knowledge of cloud data, on these inconsistent outputs, and on what role 'corehour' plays in the VM categorization of the first approach? (See the toy sketch below.)
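
My working guess, which I could not confirm anywhere in the repo, is that the two statistics answer different questions: value_counts(normalize=True) weights every VM equally, while groupby('vmcategory')['corehour'].sum() weights each VM by lifetime × cores, so a few long-lived, many-core VMs can dominate. A toy example (all numbers made up) shows how the same data flips between the two views:

import pandas as pd

# nine short-lived single-core Unknown VMs vs. one long-lived 8-core Delay-insensitive VM
toy = pd.DataFrame({
    'vmcategory': ['Unknown'] * 9 + ['Delay-insensitive'],
    'lifetime': [1.0] * 9 + [100.0],          # hours
    'vmcorecountbucket': [1] * 9 + [8],       # cores
})
toy['corehour'] = toy['lifetime'] * toy['vmcorecountbucket']

# per-VM count: Unknown dominates (90%)
print(toy.vmcategory.value_counts(normalize=True).mul(100).round(1))

# corehour-weighted: Delay-insensitive dominates (~98.9%)
by_corehour = toy.groupby('vmcategory')['corehour'].sum()
print((by_corehour / by_corehour.sum() * 100).round(1))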

PS: in case one is interested in how 'corehour' is calculated out of 'lifetime' and 'vmcorecountbucket', and how they are related:

# Compute VM lifetime from the 'vmcreated'/'vmdeleted' timestamps (seconds), floored at 300 s and converted to hours
trace_dataframe['lifetime'] = np.maximum((trace_dataframe['vmdeleted'] - trace_dataframe['vmcreated']), 300) / 3600

# Compute VM corehour
trace_dataframe['corehour'] = trace_dataframe['lifetime'] * trace_dataframe['vmcorecountbucket']
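
Assuming 'vmcreated'/'vmdeleted' in the raw file are second-granularity timestamps as in the notebook, the corehour-weighted split can also be recomputed straight from the `data` frame loaded in the second snippet, which should land near the 58/33/9 split reproduced above:

data['lifetime'] = np.maximum(data['vmdeleted'] - data['vmcreated'], 300) / 3600   # hours, floored at 300 s
data['corehour'] = data['lifetime'] * data['vmcorecountbucket']
by_corehour = data.groupby('vmcategory')['corehour'].sum()
print((by_corehour / by_corehour.sum() * 100).round(2))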