Hi,
I was exploring AzurePublicDatasetV2 and looked at the accompanying notebook Azure 2019 Public Dataset V2 - Trace Analysis.ipynb, including its General Statistics section. There, after reading vmtable.csv and applying a slight transformation, two columns derived from the other features, 'corehour' and 'lifetime', are added to trace_dataframe.
I could reproduce the results shown there with minor updates.
So far we reproduced the results of the offered notebook, but if you instead take the following approach, reading vmtable.csv directly and processing only the column of interest, 'vmcategory', you get a different result:
```python
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

#data_path = 'https://azurecloudpublicdataset2.z19.web.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'
data_path = 'https://azurepublicdatasettraces.blob.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'
headers = ['vmid', 'subscriptionid', 'deploymentid', 'vmcreated', 'vmdeleted',
           'maxcpu', 'avgcpu', 'p95maxcpu',
           'vmcategory', 'vmcorecountbucket', 'vmmemorybucket']
data = pd.read_csv(data_path, header=None, index_col=False, names=headers, delimiter=',')
#data.head(10)

# slice data
df = data[['vmcategory', 'vmid']]

# count occurrence of each value in 'vmcategory' column
counts = df.vmcategory.value_counts()

# count occurrence of each value in 'vmcategory' column as percentage of total
percs = df.vmcategory.value_counts(normalize=True)

# same, formatted as a percentage string
perc = df.vmcategory.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

# concatenate results into one DataFrame
tf = pd.concat([counts, percs, perc], axis=1, keys=['count', 'percentage', '%'])
print(tf.to_markdown(tablefmt="grid"))
```
```
+-------------------+---------+--------------+-------+
| vmcategory        |   count |   percentage | %     |
+===================+=========+==============+=======+
| Unknown           | 2457455 |    0.911672  | 91.2% |
+-------------------+---------+--------------+-------+
| Delay-insensitive |  159615 |    0.0592143 | 5.9%  |
+-------------------+---------+--------------+-------+
| Interactive       |   78478 |    0.0291139 | 2.9%  |
+-------------------+---------+--------------+-------+
```
This shows that the direct count puts the Unknown class at 91% of all VMs, while the first approach (the notebook) shows the Delay-insensitive class with a 58% share!
Which one is correct? Can someone explain why, instead of applying df.vmcategory.value_counts(normalize=True) directly on the column of interest, 'vmcategory', the data provider used df.groupby('vmcategory')['corehour'].sum().rename('corehour')?
Can someone shed light on these inconsistent outputs from cloud-data domain knowledge, and explain what role 'corehour' plays in the VM category statistics of the first approach?
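For anyone comparing the two aggregations, here is a minimal sketch on synthetic data (hypothetical values, not from the real trace) showing why a per-VM count and a corehour-weighted share can rank categories very differently: many short-lived VMs can dominate the instance count while a few long-lived, many-core VMs dominate the core-hours.

```python
import pandas as pd

# Hypothetical toy table: nine short-lived 'Unknown' VMs (1 core-hour each)
# and one long-lived 'Delay-insensitive' VM (100 core-hours).
vms = pd.DataFrame({
    'vmcategory': ['Unknown'] * 9 + ['Delay-insensitive'],
    'corehour':   [1.0] * 9 + [100.0],   # lifetime (h) * core count
})

# Approach 2 from the issue: share of VM *instances* per category
by_count = vms['vmcategory'].value_counts(normalize=True)

# Approach 1 (notebook style): share of *core-hours* per category
by_corehour = vms.groupby('vmcategory')['corehour'].sum()
by_corehour = by_corehour / by_corehour.sum()

print(by_count)     # Unknown: 0.9, Delay-insensitive: 0.1
print(by_corehour)  # Delay-insensitive: ~0.92, Unknown: ~0.08
```

Both results are "correct"; they just answer different questions (population of VMs vs. consumed compute capacity).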
PS: if anyone is interested, this is how 'corehour' is calculated out of 'lifetime' and 'vmcorecountbucket', and how they are related:
```python
# Compute VM lifetime (hours) from the VM created/deleted timestamps,
# with a floor of 300 seconds
trace_dataframe['lifetime'] = np.maximum(
    trace_dataframe['vmdeleted'] - trace_dataframe['vmcreated'], 300) / 3600

# Compute VM core-hours
trace_dataframe['corehour'] = (
    trace_dataframe['lifetime'] * trace_dataframe['vmcorecountbucket'])
```
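To make the formulas concrete, here is a small self-contained check on two hypothetical rows (timestamps in seconds, as in the trace), one of which hits the 300-second floor:

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): a 2-hour VM and a 60-second VM
trace_dataframe = pd.DataFrame({
    'vmcreated':         [0,    0],
    'vmdeleted':         [7200, 60],
    'vmcorecountbucket': [4,    2],
})

# Same formulas as above: floor the lifetime at 300 s, convert to hours
trace_dataframe['lifetime'] = np.maximum(
    trace_dataframe['vmdeleted'] - trace_dataframe['vmcreated'], 300) / 3600
trace_dataframe['corehour'] = (
    trace_dataframe['lifetime'] * trace_dataframe['vmcorecountbucket'])

print(trace_dataframe[['lifetime', 'corehour']])
# lifetime: 2.0 h and 300/3600 h; corehour: 8.0 and 2 * 300/3600
```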
clevilll changed the title from "Reasoning incompatible general statistics offered for 'vmcategory' for AzurePublicDatasetV2" to "Reasoning inconsistency general statistics offered for 'vmcategory' for AzurePublicDatasetV2" on Aug 8, 2024.