Decision Tree
Class Reference
- class pykitml.DecisionTree(input_size, output_size, feature_type=[], max_depth=6, min_split=2, max_splits_eval=100, regression=False)
Implements a decision tree model.
- __init__(input_size, output_size, feature_type=[], max_depth=6, min_split=2, max_splits_eval=100, regression=False)
- Parameters:
input_size (int) – Size of input data or number of input features.
output_size (int) – Number of categories or groups.
feature_type (list) – List of strings describing the type of feature for each column. Can be 'continues', 'ranked', or 'categorical'.
max_depth (int) – The maximum depth the tree can grow to. Limits overfitting to some extent.
min_split (int) – The minimum number of data points a node should have to get split.
max_splits_eval (int) – The maximum number of split points to evaluate for an attribute. If the number of candidate split points exceeds this, max_splits_eval split candidates will be randomly sampled from the candidates, and only the sampled ones will be evaluated when finding the best split point.
regression (bool) – Whether the tree is being trained on a regression problem.
- Raises:
InvalidFeatureType – Invalid/unknown feature type. Can only be 'continues', 'ranked', or 'categorical'.
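The sampling behaviour described for max_splits_eval can be sketched without the library. This is a minimal illustration, not pykitml's implementation: the helper name candidate_splits is hypothetical, and using midpoints between consecutive distinct values as candidates is an assumption made for the example.

```python
import random

def candidate_splits(values, max_splits_eval=100, rng=None):
    """Return at most max_splits_eval candidate thresholds for one feature."""
    rng = rng or random.Random()
    unique = sorted(set(values))
    # Assumption: candidates are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    # Too many candidates: evaluate only a random sample of them.
    if len(candidates) > max_splits_eval:
        candidates = rng.sample(candidates, max_splits_eval)
    return candidates

# 1000 distinct values give 999 midpoints, so sampling kicks in.
splits = candidate_splits(range(1000), max_splits_eval=100, rng=random.Random(0))
print(len(splits))  # 100
```

Capping the number of evaluated candidates trades a little split quality for a bounded cost per attribute on large datasets.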
- feed(input_data)
Accepts input array and feeds it to the model.
- Parameters:
input_data (numpy.array) – The input to feed the model.
- Raises:
ValueError – If the input data has invalid dimensions/shape.
Note
This function only feeds the input data. To get the output after calling this function, use get_output() or get_output_onehot().
- get_output()
Returns the output activations of the model.
- Returns:
The output activations.
- Return type:
numpy.array
- get_output_onehot()
Returns the output layer activations of the model as a one-hot array. A one-hot array is an array of bits in which only one of the bits is high/true. In this case, the bit corresponding to the neuron/node with the highest activation will be high/true.
- Returns:
The one-hot output activations array.
- Return type:
numpy.array
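The one-hot conversion described above can be illustrated with a small stand-in. The helper to_onehot is hypothetical, plain Python lists stand in for numpy arrays, and resolving ties to the first maximum is an assumption that the reference does not specify.

```python
def to_onehot(activations):
    """Return a list with 1 at the position of the highest activation, 0 elsewhere."""
    # Index of the largest value; ties resolve to the first occurrence (assumption).
    best = max(range(len(activations)), key=activations.__getitem__)
    return [1 if i == best else 0 for i in range(len(activations))]

print(to_onehot([0.1, 0.7, 0.2]))  # [0, 1, 0]
```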
- train(inputs, outputs)
Trains the model on the training data.
- Parameters:
inputs (numpy.array) – numpy array containing training data.
outputs (numpy.array) – numpy array containing training targets, corresponding to the training data.
- Raises:
numpy.AxisError – If output_size is less than two. Use pykitml.onehot() to change 0/False to [1, 0] and 1/True to [0, 1] for binary classification.
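The label mapping mentioned in the Raises note (0/False to [1, 0] and 1/True to [0, 1]) looks like this in plain Python. labels_to_onehot is a hypothetical stand-in written for illustration; it is not pykitml.onehot(), whose exact signature is not shown in this reference.

```python
def labels_to_onehot(labels, n_classes):
    """Map each integer (or boolean) label to a one-hot row of length n_classes."""
    # In Python, False == 0 and True == 1, so booleans map the same way as 0/1.
    return [[1 if c == label else 0 for c in range(n_classes)] for label in labels]

print(labels_to_onehot([0, 1, True, False], 2))
# [[1, 0], [0, 1], [0, 1], [1, 0]]
```

Encoding binary targets this way gives the model two output columns, which satisfies the requirement that output_size be at least two.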
- accuracy(testing_data, testing_targets)
Tests the accuracy of the model on the testing data passed to the function. This function should only be used for classification.
- Parameters:
testing_data (numpy.array) – numpy array containing testing data.
testing_targets (numpy.array) – numpy array containing testing targets, corresponding to the testing data.
- Returns:
accuracy – The accuracy of the model over the testing data, i.e., how many testing examples the model predicted correctly.
- Return type:
float
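As a rough illustration of what a classification accuracy measures, the sketch below compares the predicted class (highest output) with the target class for each example. The helper names are hypothetical, the data is made up, and returning a fraction is an assumption: this reference does not say whether the library reports a fraction or a percentage.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_fraction(predictions, targets):
    """Fraction of examples whose predicted class matches the one-hot target."""
    correct = sum(argmax(p) == argmax(t) for p, t in zip(predictions, targets))
    return correct / len(targets)

# Made-up outputs for four examples and their one-hot targets.
predictions = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.6, 0.3, 0.1]]
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
print(accuracy_fraction(predictions, targets))  # 0.75
```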
- confusion_matrix(test_data, test_targets, gnames=[], plot=True)
Returns and plots the confusion matrix for the given test data.
- Parameters:
test_data (numpy.array) – numpy array containing test data.
test_targets (numpy.array) – numpy array containing the targets corresponding to the test data.
gnames (list) – List of string names for each class/group.
plot (bool) – If set to False, the matrix will not be plotted. Default is True.
- Returns:
confusion_matrix – The confusion matrix.
- Return type:
numpy.array
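A confusion matrix counts how often each actual class is predicted as each class. The sketch below builds one from class indices. It is library-independent, the helper name and data are made up, and the orientation (rows are actual classes, columns are predicted classes) is an assumption; the matrix returned by confusion_matrix() may be laid out or normalized differently.

```python
def confusion_counts(predicted, actual, n_classes):
    """Count matrix with rows = actual class and columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for p, a in zip(predicted, actual):
        matrix[a][p] += 1
    return matrix

# Made-up class indices for six test examples.
actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2]
for row in confusion_counts(predicted, actual, 3):
    print(row)
# [2, 0, 0]
# [0, 1, 1]
# [0, 0, 2]
```

Off-diagonal entries show which classes are confused with each other; here one example of class 1 was predicted as class 2.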
- r2score(testing_data, testing_targets)
Returns the R-squared (coefficient of determination) value.
- Parameters:
testing_data (numpy.array) – numpy array containing testing data.
testing_targets (numpy.array) – numpy array containing testing targets, corresponding to the testing data.
- Returns:
r2score – The R-squared (coefficient of determination) value of the model over the testing data.
- Return type:
float
- Raises:
ValueError – If testing_data or testing_targets has invalid dimensions/shape.
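For reference, the standard definition of R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it for a single output on made-up values. r2_score is a hypothetical helper, and it is assumed, not confirmed by this reference, that r2score() follows the same definition.

```python
def r2_score(predictions, targets):
    """R-squared = 1 - SS_res / SS_tot for a single regression output."""
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(predictions, targets))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    return 1 - ss_res / ss_tot

targets = [1.0, 2.0, 3.0, 4.0]
print(r2_score([1.0, 2.0, 3.0, 4.0], targets))  # 1.0, perfect predictions
print(r2_score([2.5, 2.5, 2.5, 2.5], targets))  # 0.0, always predicting the mean
```

A value of 1 means the predictions match the targets exactly, while 0 means the model does no better than predicting the mean target.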
- show_tree()
Draws a visualization/graph of the tree.
Example: Classifying Iris
Dataset
Iris - pykitml.datasets.iris module
Training
import pykitml as pk
from pykitml.datasets import iris
# Load iris data set
inputs_train, outputs_train, inputs_test, outputs_test = iris.load()
# Create model
tree_iris_classifier = pk.DecisionTree(4, 3, max_depth=5, feature_type=['continues']*4)
# Train
tree_iris_classifier.train(inputs_train, outputs_train)
# Save it
pk.save(tree_iris_classifier, 'tree_iris_classifier.pkl')
# Print accuracy
accuracy = tree_iris_classifier.accuracy(inputs_train, outputs_train)
print('Train accuracy:', accuracy)
accuracy = tree_iris_classifier.accuracy(inputs_test, outputs_test)
print('Test accuracy:', accuracy)
# Plot confusion matrix
tree_iris_classifier.confusion_matrix(inputs_test, outputs_test,
    gnames=['Setosa', 'Versicolor', 'Virginica'])
# Plot decision tree
tree_iris_classifier.show_tree()
Predict type of species with sepal-length, sepal-width, petal-length, petal-width: 5.8, 2.7, 3.9, 1.2
import numpy as np
import pykitml as pk
# Predict type of species with
# sepal-length sepal-width petal-length petal-width
# 5.8, 2.7, 3.9, 1.2
input_data = np.array([5.8, 2.7, 3.9, 1.2])
# Load the model
tree_iris_classifier = pk.load('tree_iris_classifier.pkl')
# Get output
tree_iris_classifier.feed(input_data)
model_output = tree_iris_classifier.get_output_onehot()
# Print result
print(model_output)
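The printed one-hot array can be turned into a species name with a small lookup. This sketch assumes the output positions follow the same order as the gnames list used for the confusion matrix in the training script. The helper species_from_onehot is hypothetical and example_output is a placeholder, not a real model result.

```python
# Class names in the order assumed for the model's output positions.
species_names = ['Setosa', 'Versicolor', 'Virginica']

def species_from_onehot(onehot):
    """Return the species name at the position of the 1 in a one-hot array."""
    return species_names[list(onehot).index(1)]

example_output = [0, 1, 0]  # placeholder, not a real model result
print(species_from_onehot(example_output))  # Versicolor
```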
Tree Graph
Confusion Matrix