Datasets

MNIST - pykitml.datasets.mnist module

This module contains helper functions to download and load MNIST and MNIST-like datasets.

pykitml.datasets.mnist.get(type='classic')

Downloads the MNIST dataset and saves it as a pickle file, mnist.pkl.

Parameters:

  • type (str) – The type of MNIST dataset to download.

Raises:
  • urllib.error.URLError – If internet connection is not available or the URL is not accessible.
  • OSError – If the file cannot be created due to a system-related error.
  • KeyError – If invalid/unknown type.

Note

You only need to call this method once. After the dataset has been downloaded and you have the mnist.pkl file, you don’t need to call it again.
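The download-once behaviour described in the note can be sketched as a small guard around get(). The helper name download_once is hypothetical and not part of pykitml, and the sketch assumes that get() writes mnist.pkl to the current working directory, which this page does not state explicitly.

```python
import os


def download_once(filename, download_fn):
    """Call download_fn only if filename does not exist yet.

    Returns True if a download was triggered and False if the file was
    already present. Illustrative helper, not part of pykitml.
    """
    if os.path.exists(filename):
        return False
    download_fn()
    return True


# Intended use (assumes mnist.pkl lands in the working directory):
#
#     from pykitml.datasets import mnist
#     download_once('mnist.pkl', mnist.get)
```

The same guard should work for the other get() functions on this page if the matching pkl file name is passed instead.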

pykitml.datasets.mnist.load()

Loads the MNIST dataset from the saved pickle file mnist.pkl into numpy arrays.

Returns:
  • training_data (numpy.array) – 60,000x784 numpy array, each row contains one flattened training image.
  • training_targets (numpy.array) – 60,000x10 numpy array, each row is the one-hot target vector of the corresponding training image.
  • testing_data (numpy.array) – 10,000x784 numpy array, each row contains one flattened test image.
  • testing_targets (numpy.array) – 10,000x10 numpy array, each row is the one-hot target vector of the corresponding test image.
Raises:
  • FileNotFoundError – If the mnist.pkl file does not exist, i.e., if the dataset was not downloaded and saved using the get() method.
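As a rough illustration of working with the returned shapes, the sketch below turns one-hot target rows back into integer class indices and reshapes a flattened 784-element row into a 28x28 image. The helper names are hypothetical, and the arrays are synthetic stand-ins rather than real MNIST data.

```python
import numpy as np


def decode_one_hot(targets):
    """Convert an (n, 10) one-hot array into n integer class indices."""
    return np.argmax(targets, axis=1)


def as_image(row):
    """Reshape one flattened 784-element row into a 28x28 image."""
    return row.reshape(28, 28)


# Synthetic stand-ins with the documented column counts (not real MNIST).
data = np.zeros((3, 784))
targets = np.eye(10)[[7, 0, 3]]   # one-hot rows for classes 7, 0 and 3

labels = decode_one_hot(targets)
image = as_image(data[0])
```

With arrays returned by load(), the same helpers would be applied to training_targets and to individual rows of training_data.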

Iris - pykitml.datasets.iris module

pykitml.datasets.iris.load()

Loads the iris dataset without any preprocessing. The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.

Inputs have the following features/columns:

sepal-length sepal-width petal-length petal-width

Outputs:

[1, 0, 0] - Iris-setosa, [0, 1, 0] - Iris-versicolor, [0, 0, 1] - Iris-virginica.
Returns:
  • inputs_train (numpy.array) – 120x4 numpy array, each row having 4 features.
  • outputs_train (numpy.array) – 120x3 numpy array, contains 120 one-hot vectors, each corresponding to a category.
  • inputs_test (numpy.array) – 30x4 numpy array, each row having 4 features.
  • outputs_test (numpy.array) – 30x3 numpy array, contains 30 one-hot vectors, each corresponding to a category.
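Because the data is returned without preprocessing, callers may want to scale the features themselves. The sketch below shows one common choice, standardizing both splits with the mean and standard deviation of the training split. The helper name standardize is hypothetical and the arrays are synthetic stand-ins with the documented shapes, not real Iris measurements.

```python
import numpy as np


def standardize(train, test):
    """Scale both splits using the mean and std of the training split."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (test - mean) / std


# Synthetic stand-ins shaped like the documented splits (not real Iris data).
rng = np.random.default_rng(0)
inputs_train = rng.uniform(0.1, 8.0, size=(120, 4))
inputs_test = rng.uniform(0.1, 8.0, size=(30, 4))

train_scaled, test_scaled = standardize(inputs_train, inputs_test)
```

Using only training statistics keeps information from the test split out of the preprocessing step.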

Fish Length - pykitml.datasets.fishlength module

pykitml.datasets.fishlength.load()

Loads the fish length dataset without any preprocessing. Source: https://people.sc.fsu.edu/~jburkardt/datasets/regression/x06.txt

The length of a species of fish is to be represented as a function of the age and water temperature. The fish are kept in tanks at 25, 27, 29 and 31 degrees Celsius. After birth, a test specimen is chosen at random every 14 days and its length measured.

Returns:
  • inputs (numpy.array) – 44x2 numpy array, each row having 2 features: age and temperature.
  • outputs (numpy.array) – Length of fish, numpy array with 44 elements.
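To illustrate representing length as a function of age and temperature, here is a minimal least-squares sketch in plain numpy. It is not pykitml's regression API. The helper name fit_linear, the linear model, and the synthetic data layout (11 ages per tank temperature) are all assumptions made for this example.

```python
import numpy as np


def fit_linear(inputs, outputs):
    """Least-squares fit of outputs ~ w0 + w1*age + w2*temperature."""
    # Prepend a column of ones so the first weight acts as the intercept.
    design = np.column_stack([np.ones(len(inputs)), inputs])
    weights, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return weights


# Synthetic stand-in shaped like the documented arrays (not the real data):
# an assumed 11 ages (every 14 days) for each of the 4 tank temperatures.
ages = np.tile(np.arange(14, 14 * 12, 14), 4)
temps = np.repeat([25, 27, 29, 31], 11)
inputs = np.column_stack([ages, temps]).astype(float)
outputs = 100.0 + 20.0 * inputs[:, 0] - 5.0 * inputs[:, 1]

weights = fit_linear(inputs, outputs)   # intercept, age and temperature weights
```

The real relationship is unlikely to be purely linear, so this only shows how the 44x2 inputs and 44-element outputs fit together.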

Heart Disease - pykitml.datasets.heartdisease module

pykitml.datasets.heartdisease.get()

Downloads the heart disease dataset from https://archive.ics.uci.edu/ml/datasets/Heart+Disease and saves it as a pkl file heartdisease.pkl.

Raises:
  • urllib.error.URLError – If internet connection is not available or the URL is not accessible.
  • OSError – If the file cannot be created due to a system-related error.
  • KeyError – If invalid/unknown type.

Note

You only need to call this method once. After the dataset has been downloaded and you have the heartdisease.pkl file, you don’t need to call it again.

pykitml.datasets.heartdisease.load()

Loads the heart disease dataset from the saved pickle file heartdisease.pkl into numpy arrays, without any preprocessing.

Returns:
  • inputs (numpy.array) – 297x13 numpy array. 297 training examples, each example having 13 inputs (columns). The 13 columns correspond to: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal.
    • age : Age in years
    • sex : 1=male, 0=female
    • cp : Chest pain type (1=typical-angina, 2=atypical-angina, 3=non-anginal, 4=asymptomatic)
    • trestbps : Resting blood pressure in mmHg
    • chol : Serum cholesterol in mg/dl
    • fbs : Fasting blood sugar > 120 mg/dl? (1=true, 0=false)
    • restecg : Resting electrocardiographic results (0=normal, 1=ST-T-abnormality, 2=left-ventricular-hypertrophy)
    • thalach : Maximum heart rate achieved
    • exang : Exercise induced angina (1=yes, 0=no)
    • oldpeak : ST depression induced by exercise relative to rest
    • slope : Slope of the peak exercise ST segment (1=upsloping, 2=flat, 3=downsloping)
    • ca : Number of major vessels colored by fluoroscopy (0-3)
    • thal : 3=normal, 6=fixed-defect, 7=reversible-defect
  • outputs (numpy.array) – Numpy array with 297 elements.
    • 0: < 50% diameter narrowing
    • 1: > 50% diameter narrowing
Raises:
  • FileNotFoundError – If the heartdisease.pkl file does not exist, i.e., if the dataset was not downloaded and saved using the get() method.
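Since the data is loaded without preprocessing, categorical columns such as cp arrive as integer codes. The sketch below shows one possible way to expand cp into four indicator columns. The helper name one_hot_cp is hypothetical, the column position follows the order listed above, and the array is a synthetic stand-in rather than real patient data.

```python
import numpy as np

CP_COLUMN = 2  # position of cp in: age sex cp trestbps ...


def one_hot_cp(inputs):
    """Replace the cp column (values 1-4) with four 0/1 indicator columns."""
    cp = inputs[:, CP_COLUMN].astype(int)
    indicators = np.eye(4)[cp - 1]          # value k -> row k-1 of identity
    rest = np.delete(inputs, CP_COLUMN, axis=1)
    return np.hstack([rest, indicators])


# Synthetic stand-in with 13 columns (not real patient data).
inputs = np.zeros((5, 13))
inputs[:, CP_COLUMN] = [1, 4, 2, 3, 4]

encoded = one_hot_cp(inputs)   # 12 remaining columns + 4 indicator columns
```

The same idea would apply to other coded columns such as restecg, slope and thal, with the value-to-index mapping adjusted to their code ranges.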

Adult - pykitml.datasets.adult module

pykitml.datasets.adult.get()

Downloads the adult dataset from https://archive.ics.uci.edu/ml/datasets/adult and saves it as the pkl files adult.data.pkl and adult.test.pkl.

Raises:
  • urllib.error.URLError – If internet connection is not available or the URL is not accessible.
  • OSError – If the files cannot be created due to a system-related error.
  • KeyError – If invalid/unknown type.

Note

You only need to call this method once. After the dataset has been downloaded and you have the adult.data.pkl and adult.test.pkl files, you don’t need to call it again.

pykitml.datasets.adult.load()

Loads the adult dataset from adult.data.pkl and adult.test.pkl files. The inputs have the following columns:

  • age
  • workclass : Private=0, Self-emp-not-inc=1, Self-emp-inc=2, Federal-gov=3, Local-gov=4, State-gov=5, Without-pay=6, Never-worked=7
  • fnlwgt
  • education : Bachelors=0, Some-college=1, 11th=2, HS-grad=3, Prof-school=4, Assoc-acdm=5, Assoc-voc=6, 9th=7, 7th-8th=8, 12th=9, Masters=10, 1st-4th=11, 10th=12, Doctorate=13, 5th-6th=14, Preschool=15
  • marital-status : Married-civ-spouse=0, Divorced=1, Never-married=2, Separated=3, Widowed=4, Married-spouse-absent=5, Married-AF-spouse=6
  • occupation : Tech-support=0, Craft-repair=1, Other-service=2, Sales=3, Exec-managerial=4, Prof-specialty=5, Handlers-cleaners=6, Machine-op-inspct=7, Adm-clerical=8, Farming-fishing=9, Transport-moving=10, Priv-house-serv=11, Protective-serv=12, Armed-Forces=13
  • relationship : Wife=0, Own-child=1, Husband=2, Not-in-family=3, Other-relative=4, Unmarried=5
  • race : White=0, Asian-Pac-Islander=1, Amer-Indian-Eskimo=2, Other=3, Black=4
  • sex : Female=0, Male=1
  • capital-gain
  • capital-loss
  • hours-per-week
  • native-country : United-States=0, Cambodia=1, England=2, Puerto-Rico=3, Canada=4, Germany=5, Outlying-US(Guam-USVI-etc)=6, India=7, Japan=8, Greece=9, South=10, China=11, Cuba=12, Iran=13, Honduras=14, Philippines=15, Italy=16, Poland=17, Jamaica=18, Vietnam=19, Mexico=20, Portugal=21, Ireland=22, France=23, Dominican-Republic=24, Laos=25, Ecuador=26, Taiwan=27, Haiti=28, Columbia=29, Hungary=30, Guatemala=31, Nicaragua=32, Scotland=33, Thailand=34, Yugoslavia=35, El-Salvador=36, Trinadad&Tobago=37, Peru=38, Hong=39, Holand-Netherlands=40

The outputs are:

  • <=50K = 0/False
  • >50K = 1/True
Returns:
  • inputs_train (numpy.array) – 392106x13 numpy array containing training inputs.
  • outputs_train (numpy.array) – Numpy array of size 392106.
  • inputs_test (numpy.array) – 195780x13 numpy array containing testing inputs.
  • outputs_test (numpy.array) – Numpy array of size 195780.
Raises:
  • FileNotFoundError – If the adult.data.pkl or adult.test.pkl file does not exist, i.e., if the dataset was not downloaded and saved using the get() method.
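The integer codes in the categorical columns can be mapped back to readable labels with a lookup table. The sketch below does this for workclass and sex using the codes listed above. The helper name decode_column is hypothetical, the column positions follow the documented order, and the sketch assumes every code in a column appears in its table. The input array is a synthetic stand-in.

```python
import numpy as np

# Code -> label tables copied from the column list above.
WORKCLASS = np.array([
    'Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov',
    'Local-gov', 'State-gov', 'Without-pay', 'Never-worked',
])
SEX = np.array(['Female', 'Male'])

WORKCLASS_COLUMN = 1   # second column in the documented order
SEX_COLUMN = 8         # ninth column in the documented order


def decode_column(inputs, column, labels):
    """Map the integer codes in one column back to their string labels."""
    codes = inputs[:, column].astype(int)
    return labels[codes]


# Synthetic stand-in with 13 columns (not real census rows).
inputs = np.zeros((3, 13))
inputs[:, WORKCLASS_COLUMN] = [0, 5, 3]
inputs[:, SEX_COLUMN] = [1, 0, 1]

workclass_names = decode_column(inputs, WORKCLASS_COLUMN, WORKCLASS)
sex_names = decode_column(inputs, SEX_COLUMN, SEX)
```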

Banknote - pykitml.datasets.banknote module

pykitml.datasets.banknote.get()

Downloads the banknote dataset from http://archive.ics.uci.edu/ml/datasets/banknote+authentication and saves it as a pkl file banknote.pkl.

Raises:
  • urllib.error.URLError – If internet connection is not available or the URL is not accessible.
  • OSError – If the file cannot be created due to a system-related error.
  • KeyError – If invalid/unknown type.

Note

You only need to call this method once. After the dataset has been downloaded and you have the banknote.pkl file, you don’t need to call it again.

pykitml.datasets.banknote.load()

Loads the banknote data from pkl file.

The inputs have the following columns:

  • Variance of Wavelet Transformed image (continuous)
  • Skewness of Wavelet Transformed image (continuous)
  • Kurtosis of Wavelet Transformed image (continuous)
  • Entropy of image (continuous)

The outputs are:

  • 0 = Real
  • 1 = Counterfeit
Returns:
  • inputs_train (numpy.array) – 1102x4 numpy array containing training inputs.
  • outputs_train (numpy.array) – Numpy array of size 1102.
  • inputs_test (numpy.array) – 270x4 numpy array containing testing inputs.
  • outputs_test (numpy.array) – Numpy array of size 270.

Sonar Rocks and Mines - pykitml.datasets.sonar module

pykitml.datasets.sonar.get()

Downloads sonar dataset from https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks) and saves it as a pkl file sonar.pkl.

Raises:
  • urllib.error.URLError – If internet connection is not available or the URL is not accessible.
  • OSError – If the file cannot be created due to a system-related error.
  • KeyError – If invalid/unknown type.

Note

You only need to call this method once. After the dataset has been downloaded and you have the sonar.pkl file, you don’t need to call it again.

pykitml.datasets.sonar.load()

Loads the sonar dataset from the sonar.pkl file.

Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.

The label associated with each record contains the letter “R” if the object is a rock and “M” if it is a mine (metal cylinder).

Returns:
  • inputs_train (numpy.array) – 190x60 numpy array containing training inputs.
  • outputs_train (numpy.array) – Numpy array of size 190.
  • inputs_test (numpy.array) – 18x60 numpy array containing testing inputs.
  • outputs_test (numpy.array) – Numpy array of size 18.
Raises:
  • FileNotFoundError – If the sonar.pkl file does not exist, i.e., if the dataset was not downloaded and saved using the get() method.

Boston Housing - pykitml.datasets.boston module

pykitml.datasets.boston.get()

Downloads the boston dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ and saves it as a pkl file boston.pkl.

Raises:
  • urllib.error.URLError – If internet connection is not available or the URL is not accessible.
  • OSError – If the file cannot be created due to a system-related error.
  • KeyError – If invalid/unknown type.

Note

You only need to call this method once. After the dataset has been downloaded and you have the boston.pkl file, you don’t need to call it again.

pykitml.datasets.boston.load()

Loads the boston housing dataset from pkl file.

The inputs have the following columns:

  • CRIM : per capita crime rate by town
  • ZN : proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS : proportion of non-retail business acres per town
  • CHAS : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX : nitric oxides concentration (parts per 10 million)
  • RM : average number of rooms per dwelling
  • AGE : proportion of owner-occupied units built prior to 1940
  • DIS : weighted distances to five Boston employment centres
  • RAD : index of accessibility to radial highways
  • TAX : full-value property-tax rate per $10,000
  • PTRATIO : pupil-teacher ratio by town
  • B : 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town
  • LSTAT : % lower status of the population

The outputs are:

  • MEDV : Median value of owner-occupied homes in $1000’s
Returns:
  • inputs_train (numpy.array)
  • outputs_train (numpy.array)
  • inputs_test (numpy.array)
  • outputs_test (numpy.array)

S1 Clustering - pykitml.datasets.s1clustering module

pykitml.datasets.s1clustering.get()

Downloads the s1 clustering dataset from http://cs.joensuu.fi/sipu/datasets/ and saves it as a pkl file s1.pkl.

Raises:
  • urllib.error.URLError – If internet connection is not available or the URL is not accessible.
  • OSError – If the file cannot be created due to a system-related error.
  • KeyError – If invalid/unknown type.

Note

You only need to call this method once. After the dataset has been downloaded and you have the s1.pkl file, you don’t need to call it again.

pykitml.datasets.s1clustering.load()

Loads x, y points from the s1 clustering dataset from the saved pickle file s1.pkl into a numpy array. The S1 clustering dataset contains 15 clusters.

Returns:
  • training_data (numpy.array) – 5000x2 numpy array containing x, y points.
Raises:
  • FileNotFoundError – If the s1.pkl file does not exist, i.e., if the dataset was not downloaded and saved using the get() method.
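To give a sense of how an array of x, y points might be clustered, the sketch below performs a single k-means iteration in plain numpy: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is not pykitml's clustering API. The function names are hypothetical and the points are a tiny synthetic stand-in, not the real S1 data.

```python
import numpy as np


def assign_clusters(points, centroids):
    """Return the index of the nearest centroid for every point."""
    # Pairwise squared distances, shape (n_points, n_centroids).
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    return np.argmin((diffs ** 2).sum(axis=2), axis=1)


def update_centroids(points, assignments, centroids):
    """Move each centroid to the mean of its points (unchanged if empty)."""
    updated = centroids.copy()
    for k in range(len(centroids)):
        members = points[assignments == k]
        if len(members) > 0:
            updated[k] = members.mean(axis=0)
    return updated


# Tiny synthetic stand-in (two obvious groups), not the real S1 data.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

assignments = assign_clusters(points, centroids)
centroids = update_centroids(points, assignments, centroids)
```

For the 5000x2 array returned by load(), the two steps would be repeated with 15 centroids until the assignments stop changing.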