This repository provides modules to visualize the various features of the Decision Tree Learning methods such as feature importance,split points and mutual information.Through visualization of these features users will be able to understand the in depth working of the Decision Tree modules in a comprehensive manner.Visualization of decision tree features can be done using various tools and libraries in Python by embedding it in the ECL code.One of the most commonly used libraries for visualizing decision trees is numpy along with the scikit-learn library.
- Provide a clear visual representation of the decision tree structure in ECL.
- Highlight the importance of different features in making decisions.
- Present decision rules in a human-readable format to enhance interpretability.
- Provide an interactive visualization tool that allows users to explore different parts of the decision tree.
- Serve as an educational tool for users unfamiliar with the specifics of decision trees in ECL.
- Provide insights into the inner workings of the model for training and knowledge transfer.
- Optimize the visualization for performance to handle larger datasets and trees efficiently.
This repository contains 3 files
- feature_imp.ecl
Feature importance in decision trees refers to the quantification of the contribution of each feature to the model's predictive performance. It helps to identify which features have a more significant impact on the decision-making process of the tree.The module visualizes the feature importance of each independent feature in the form of a bar graph with the specified names for the features and depicts the relative importance of each feature on a scale of 1.00. - mutualinfo.ecl
Mutual information is a measure of the amount of information that knowing the value of one variable (feature) provides about another variable (target class). In the context of decision trees, mutual information is commonly used as a criterion to determine the best feature to split on at each node. It helps quantify the reduction in uncertainty about the target variable that is achieved by considering a particular feature for the split.This module visualizes the mutual information between each independent and the dependent feature in the form of a bar graph showing the relative information gain on a scale of 1.00 - splitpoint.eclnb
In decision trees, split points refer to the values or conditions used to partition the data at each internal node. These split points determine how the data is divided into subsets as the tree branches from the root to the leaves. The choice of split points is a crucial aspect of building an effective decision tree.This module visualizes the split points of each independent feature as points on a bar graph showcasing each feature's min-max values.
The input dataset records are loaded into ECL and Python along with necessary libraries such as pandas, numpy and Scikit-learn libraries that are used to construct Decision Trees models are embedded into ECL code using the EMBED function.The required feature from the decision tree is extracted using the dedicated library function and the results are sent to ECL in the form of a one dimensional list.The visualizer in ECL is used to visualize the information for quicker and better understanding of the Decision tree working.
- scikit-learn
- pandas
- numpy
The input dataset records are loaded into ECL and Python along with necessary libraries such as pandas, numpy and Scikit-learn libraries that are used to construct Decision Trees models are embedded into ECL code using the EMBED function.The required feature from the decision tree is extracted using the dedicated library function and the results are sent to ECL in the form of a one dimensional list.The visualizer in ECL is used to visualize the information for quicker and better understanding of the Decision tree working.
Note that all models are trained on the IRIS dataset.
- feature_imp.ecl
Graph showcasing the feature importance of the independent features of the IRIS dataset. - mutualinfo.ecl
Graph showcasing the mutual importance of each independent feature with the dependent. - splitpoint.eclnb
Graph showcasing the max-min values of each independent feature and the different splits points of each feature.
Assuming that HPCC cluster is up and running in your computer: -
- Install ML Core as an ECL bundle by running the below in your terminal or command prompt.
ecl bundle install https://github.com/hpcc-systems/ML_Core.git
- Install Visualizer as an ECL bundle by running the below in your termianl or command prompt.
ecl bundle install https://github.com/hpcc-systems/Visualizer.git
- Now that the dependencies have been taken care of, we can run
feature_imp.ecl
on thor
ecl run thor feature_imp.ecl
- Note that splitpoint.eclnb will only run in Visual Studio Code with the required ECL extension.