| title |
| --- |
| Hands On Machine Learning |
Searching for good hyperparameters manually is very tedious. Instead, GridSearchCV can do the searching for you: tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all possible combinations of hyperparameter values, using cross-validation.
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Two grids: 3 x 4 combinations, plus 2 x 3 combinations with bootstrapping
# turned off, i.e. 18 combinations in total, each evaluated with 5-fold CV.
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
```
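Once the search is done, the best combination and the refit model can be read off the search object. A quick sketch of inspecting the results (standard GridSearchCV attributes; the RMSE conversion assumes the negative-MSE scoring used above):

```python
import numpy as np

# Best hyperparameter combination, and the model refit on the whole training
# set with those hyperparameters (refit=True is the default).
print(grid_search.best_params_)
best_model = grid_search.best_estimator_

# Cross-validation score of every combination that was tried.
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)  # RMSE, since scoring is neg MSE
```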
!!! tip
    When you have no idea what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10 (or a smaller number if you want a more fine-grained search).
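For example, a coarse, hypothetical grid stepping n_estimators through powers of 10 (illustrative values only, reusing the imports above):

```python
# Consecutive powers of 10 locate the right order of magnitude first;
# a finer grid around the winner can then follow in a second search.
coarse_grid = {'n_estimators': [1, 10, 100, 1000]}
coarse_search = GridSearchCV(RandomForestRegressor(), coarse_grid, cv=5,
                             scoring='neg_mean_squared_error')
```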
Training a Decision Tree classifier on the iris petal features:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
tree_clf.predict([[5, 1.5]])
```
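The fitted classifier can also estimate class probabilities, namely the ratio of each class among the training instances in the leaf the input falls into. A quick usage sketch:

```python
# Class probabilities and the predicted class for a flower with
# 5 cm long, 1.5 cm wide petals (the class ratios of its leaf).
print(tree_clf.predict_proba([[5, 1.5]]))
print(tree_clf.predict([[5, 1.5]]))
```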
Scikit-Learn uses the CART algorithm. The CART cost function for classification is

$$J(k, t_k) = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right}$$

where $k$ is the feature and $t_k$ the threshold of the candidate split, $G_{left/right}$ measures the impurity of the left/right subset, and $m_{left/right}$ is the number of instances in the left/right subset.
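For reference (standard definitions, not spelled out in the notes above), the impurity measure $G$ of a node $i$ is either its Gini impurity or its entropy:

$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2, \qquad H_i = - \sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k} \log_2(p_{i,k})$$

where $p_{i,k}$ is the ratio of class-$k$ instances among the training instances in node $i$.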
The CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each subsequent level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. Finding the optimal tree is known to be an NP-Complete problem: it requires $O(\exp(m))$ time, which makes it intractable even for fairly small training sets, so we have to settle for a "reasonably good" solution. Computational complexity:
- Predictions: $O(\log m)$, where $m$ is the number of instances.
    - Making a prediction requires traversing the tree from the root to a leaf, and the height of a roughly balanced tree is $O(\log m)$.
- Training: $O(nm \log m)$, where $n$ is the number of features.
    - The algorithm compares all $n$ features on all $m$ samples at every level of the tree.
Gini impurity or entropy? Most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ,

- Gini impurity tends to isolate the most frequent class in its own branch of the tree,
- while entropy tends to produce slightly more balanced trees.
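In Scikit-Learn the impurity measure is chosen with the criterion hyperparameter ("gini" is the default); a minimal sketch, reusing X and y from above:

```python
# Same tree as before, but split quality is measured with entropy
# instead of the default Gini impurity.
tree_clf_entropy = DecisionTreeClassifier(max_depth=2, criterion="entropy")
tree_clf_entropy.fit(X, y)
```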
Decision Trees can also perform regression tasks:

```python
from sklearn.tree import DecisionTreeRegressor

# Fit a regression tree (reusing X and y defined above).
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
```
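Instead of predicting a class in each leaf, a regression tree predicts a value: the average target value of the training instances in that leaf. A quick usage sketch (the input is illustrative only):

```python
# The prediction is the mean target value of the training instances that
# end up in the same leaf as the input.
print(tree_reg.predict([[5, 1.5]]))
```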
CART cost function for regression:
$$J(k, t_k) = \frac{m_{left}}{m} \text{MSE}_{left} + \frac{m_{right}}{m} \text{MSE}_{right}$$
Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. For example, after a simple linearly separable dataset is rotated by 45°, the decision boundary becomes an unnecessarily convoluted staircase.
One way to limit this problem is to use PCA, which often results in a better orientation of the training data.
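A minimal sketch of that idea, assuming a scaling + PCA + tree pipeline (the exact preprocessing steps are a choice, not prescribed above):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Project the data onto its principal axes before growing the tree, so the
# axis-aligned splits line up better with the data's main directions.
pca_tree_clf = make_pipeline(StandardScaler(), PCA(n_components=2),
                             DecisionTreeClassifier(max_depth=2))
pca_tree_clf.fit(X, y)
```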
More generally, the main issue with Decision Trees is that they are very sensitive to small variations in the training data.
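One way to observe this instability (an illustrative sketch, not the book's code): refit after removing a single training instance and compare the learned splits. Because the training algorithm is also stochastic, random_state is fixed so the comparison is fair.

```python
import numpy as np
from sklearn.tree import export_text

# Train on the full data, then on the data with one instance removed.
full_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
reduced_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(
    np.delete(X, 0, axis=0), np.delete(y, 0))

# Print and compare the split structure of the two trees.
print(export_text(full_tree))
print(export_text(reduced_tree))
```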