Preprocessing Datasets

Dealing with categorical/one-hot values

pykitml.onehot(input_array)

Converts the input array to a one-hot array.

Parameters: input_array (numpy.array) – The input numpy array.
Returns: one_hot – The converted one-hot array.
Return type: numpy.array

Example

>>> import numpy as np
>>> import pykitml as pk
>>> a = np.array([0, 1, 2])
>>> pk.onehot(a)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
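
To make the encoding concrete, the following is a minimal NumPy-only sketch of the idea, not pykitml's actual implementation. The helper name onehot_sketch is hypothetical, and the sketch assumes the labels are non-negative integers and that the output width is the largest label plus one.

```python
import numpy as np

def onehot_sketch(labels):
    """Hypothetical helper: one-hot encode a 1-D array of non-negative integer labels."""
    labels = np.asarray(labels, dtype=int)
    # Assumption: the number of classes is the largest label plus one.
    width = labels.max() + 1
    # Row i of the identity matrix is the one-hot vector for label i.
    return np.eye(width)[labels]

encoded = onehot_sketch(np.array([0, 1, 2]))
```

For the input [0, 1, 2] this produces the 3x3 identity matrix, the same values as the pk.onehot output shown above.
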

pykitml.onehot_cols(dataset, cols)

Replaces the given columns of the dataset with their one-hot encodings.

Parameters:
  • dataset (numpy.array) – The input dataset.
  • cols (list) – The columns to replace with one-hot values.
Returns:

dataset_new – The new dataset with replaced columns.

Return type:

numpy.array

Example

>>> import pykitml as pk
>>> import numpy as np
>>> a = np.array([[0, 1, 2.2], [1, 2, 3.4], [0, 0, 1.1]])
>>> a
array([[0. , 1. , 2.2],
       [1. , 2. , 3.4],
       [0. , 0. , 1.1]])
>>> pk.onehot_cols(a, cols=[0, 1])
array([[1. , 0. , 0. , 1. , 0. , 2.2],
       [0. , 1. , 0. , 0. , 1. , 3.4],
       [1. , 0. , 1. , 0. , 0. , 1.1]])
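
The column replacement in the example above can be sketched in plain NumPy as follows. This is an illustration under stated assumptions, not the library's code: onehot_cols_sketch is a hypothetical name, the selected columns are assumed to hold non-negative integer labels, and each encoded block is assumed to be as wide as the column's largest label plus one.

```python
import numpy as np

def onehot_cols_sketch(dataset, cols):
    """Hypothetical helper: replace the selected columns with one-hot blocks, in place order."""
    dataset = np.asarray(dataset, dtype=float)
    pieces = []
    for j in range(dataset.shape[1]):
        column = dataset[:, j]
        if j in cols:
            labels = column.astype(int)
            # Assumption: block width is the largest label in this column plus one.
            pieces.append(np.eye(labels.max() + 1)[labels])
        else:
            # Columns that are not selected are copied through unchanged.
            pieces.append(column[:, None])
    return np.hstack(pieces)

a = np.array([[0, 1, 2.2], [1, 2, 3.4], [0, 0, 1.1]])
result = onehot_cols_sketch(a, cols=[0, 1])
```

Here column 0 expands to two columns, column 1 expands to three, and the last column is kept as-is, giving six columns in total as in the pk.onehot_cols output above.
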

pykitml.onehot_cols_traintest(dataset_train, dataset_test, cols)

Replaces the given columns of dataset_train and dataset_test with their one-hot encodings, so that both datasets end up with the same number of columns.

Parameters:
  • dataset_train (numpy.array) – The training dataset.
  • dataset_test (numpy.array) – The testing dataset.
  • cols (list) – The columns to replace with one-hot values.
Returns:

  • dataset_train_new (numpy.array) – The new training dataset with replaced columns.
  • dataset_test_new (numpy.array) – The new testing dataset with replaced columns.

Example

>>> import pykitml as pk
>>> import numpy as np
>>> a_train = np.array([[0, 1, 3.2], [1, 2, 3.5], [0, 0, 3.4]])
>>> a_test = np.array([[0, 3, 3.2], [1, 2, 4.5], [1, 3, 4.5]])
>>> a_train_onehot, a_test_onehot = pk.onehot_cols_traintest(a_train, a_test, cols=[0,1])
>>> a_train_onehot
array([[1. , 0. , 0. , 1. , 0. , 0. , 3.2],
       [0. , 1. , 0. , 0. , 1. , 0. , 3.5],
       [1. , 0. , 1. , 0. , 0. , 0. , 3.4]])
>>> a_test_onehot
array([[1. , 0. , 0. , 0. , 0. , 1. , 3.2],
       [0. , 1. , 0. , 0. , 1. , 0. , 4.5],
       [0. , 1. , 0. , 0. , 0. , 1. , 4.5]])
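
In the example above, column 1 takes the values 0 to 2 in the training set but 2 and 3 in the test set, so encoding each set on its own would give blocks of different widths. A minimal sketch of a joint encoding, assuming the width of each block is taken from the combined data, might look like this. The name onehot_cols_traintest_sketch is hypothetical and the code is not pykitml's implementation.

```python
import numpy as np

def onehot_cols_traintest_sketch(train, test, cols):
    """Hypothetical helper: one-hot encode columns using widths shared by train and test."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    # Assumption: the block width comes from the largest label seen in either dataset.
    combined = np.vstack([train, test])

    def encode(dataset):
        pieces = []
        for j in range(dataset.shape[1]):
            if j in cols:
                width = int(combined[:, j].max()) + 1
                pieces.append(np.eye(width)[dataset[:, j].astype(int)])
            else:
                # Columns that are not selected are copied through unchanged.
                pieces.append(dataset[:, [j]])
        return np.hstack(pieces)

    return encode(train), encode(test)

a_train = np.array([[0, 1, 3.2], [1, 2, 3.5], [0, 0, 3.4]])
a_test = np.array([[0, 3, 3.2], [1, 2, 4.5], [1, 3, 4.5]])
train_encoded, test_encoded = onehot_cols_traintest_sketch(a_train, a_test, cols=[0, 1])
```

Both results have seven columns: two for column 0, four for column 1, and the untouched last column.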

Generating Polynomial Features

pykitml.polynomial(dataset_inputs, degree=3, cols=[])

Generates polynomial features from the input dataset. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [a, b, a^2, a*b, b^2], and the degree-3 polynomial features are [a, b, a^2, a*b, b^2, a^3, (a^2)*b, a*(b^2), b^3].

Parameters:
  • dataset_inputs (numpy.array) – The input dataset to generate the polynomials from.
  • degree (int) – The degree of the polynomial.
  • cols (list) – The columns to use when generating polynomial features. Columns not in this list are left out of the polynomial terms but, as the last example shows, are still kept in the output. If empty (the default), all columns are used.
Returns:

The new dataset with polynomial features.

Return type:

numpy.array

Example

>>> import numpy as np
>>> import pykitml as pk
>>> pk.polynomial(np.array([[1, 2], [2, 3]]), degree=2)
array([[1., 2., 1., 2., 4.],
       [2., 3., 4., 6., 9.]])
>>> pk.polynomial(np.array([[1, 2], [2, 3]]), degree=3)
array([[ 1.,  2.,  1.,  2.,  4.,  1.,  2.,  4.,  8.],
       [ 2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.]])
>>> pk.polynomial(np.array([[1, 4, 5, 2], [2, 5, 6, 3]]), degree=2, cols=[0, 3])
array([[1., 4., 5., 2., 1., 2., 4.],
       [2., 5., 6., 3., 4., 6., 9.]])
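
The feature ordering in these examples can be reproduced with a short NumPy sketch built on itertools.combinations_with_replacement. This is an illustration only, not pykitml's implementation: polynomial_sketch is a hypothetical name, and the sketch assumes that the original columns come first, followed by all products of degree 2, then degree 3, and so on.

```python
from itertools import combinations_with_replacement

import numpy as np

def polynomial_sketch(inputs, degree=3, cols=None):
    """Hypothetical helper: append products of the selected columns up to `degree`."""
    inputs = np.asarray(inputs, dtype=float)
    # Assumption: an empty or missing `cols` means every column is used.
    cols = list(cols) if cols else list(range(inputs.shape[1]))
    # The original columns are kept unchanged at the front.
    features = [inputs]
    for d in range(2, degree + 1):
        # Each combination such as (0, 0, 1) stands for the monomial a*a*b.
        for combo in combinations_with_replacement(cols, d):
            features.append(np.prod(inputs[:, list(combo)], axis=1, keepdims=True))
    return np.hstack(features)

degree2 = polynomial_sketch(np.array([[1, 2], [2, 3]]), degree=2)
subset = polynomial_sketch(np.array([[1, 4, 5, 2], [2, 5, 6, 3]]), degree=2, cols=[0, 3])
```

With two input columns this yields 5 features at degree 2 and 9 at degree 3, matching the outputs above. With cols=[0, 3], only columns 0 and 3 contribute product terms, while all four original columns are kept.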