Data Mining Assignment: Fashion MNIST DataSet

Created by: Dheenodara Rao UTAR


Using Keras Sequential Model

Using Decision Tree Classifier


1. Reading and Loading Data - (Pandas)

1.1 Reading csv file by using pandas

pandas.read_csv(fileName)

1.2 Converting read data into numpy array

.values

1.3 Create makedataset() to separate data and target so they are easy to access

makedataset(numpyArray)
In [2]:
import numpy as np
# Seed NumPy's random generator so results are reproducible
np.random.seed(123)
import pandas as p
import matplotlib.pyplot as plt

trainFile = 'fashion-mnist_train.csv'
# .values returns the DataFrame as a NumPy array (.as_matrix() was removed in pandas 1.0)
trainData = p.read_csv(trainFile).values

testFile = 'fashion-mnist_test.csv'
testData = p.read_csv(testFile).values

def makedataset(npArray):
    target = npArray[:,0]
    data = npArray[:,1:]
    
    dataset = {
        "target":target,
        "data": data
    }
    
    return dataset

trainDataSet = makedataset(trainData)
testDataSet = makedataset(testData)
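
As a minimal sketch of the layout makedataset() relies on: the first column holds the label and the remaining columns hold the pixels. The tiny array and the names `toy_rows` and `split_label_and_pixels` below are hypothetical stand-ins, not part of the notebook; the real files have 784 pixel columns.

```python
import numpy as np

# Hypothetical stand-in for the CSV contents: 3 rows, 1 label column
# followed by 4 "pixel" columns (the real files have 784 pixel columns).
toy_rows = np.array([
    [9, 0, 10, 20, 30],
    [0, 5, 15, 25, 35],
    [3, 1, 11, 21, 31],
])

def split_label_and_pixels(array):
    """Return (labels, pixels), taking column 0 as the label."""
    return array[:, 0], array[:, 1:]

toy_labels, toy_pixels = split_label_and_pixels(toy_rows)
print(toy_labels)        # one label per row
print(toy_pixels.shape)  # (3, 4)
```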

1.4 Separating training and test data set

  • X_train = images data set for training
  • y_train = label data set for training
  • X_test = images data set for testing
  • y_test = label data set for testing
In [3]:
X_train = np.array(trainDataSet["data"])
y_train = np.array(trainDataSet["target"])
X_test = np.array(testDataSet["data"])
y_test = np.array(testDataSet["target"])

print("Train Data Set Shape :\t",format(X_train.shape))
print("Train Label Set Shape :\t",format(y_train.shape))
print("Test Data Set Shape :\t",format(X_test.shape))
print("Test Label Set Shape :\t",format(y_test.shape))
Train Data Set Shape :	 (60000, 784)
Train Label Set Shape :	 (60000,)
Test Data Set Shape :	 (10000, 784)
Test Label Set Shape :	 (10000,)

2. Visualize Dataset

2.1 Use a dictionary (labels) to label each image

labels = {
    0: "T Shirt/Top", 1: "Trouser", 2: "PullOver", 3: "Dress", 4: "Coat", 5: "Sandal",
    6: "Shirt", 7: "Sneaker", 8: "Bag", 9: "Ankle Boot"
}
In [5]:
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.figure import Figure

labels = {
    0:"T Shirt/Top", 1: "Trouser", 2:"PullOver", 3:"Dress", 4:"Coat", 5:"Sandal", 6 : "Shirt", 7:"Sneaker",
    8: "Bag", 9: "Ankle Boot"
}

fig = plt.figure(figsize=(15,15))
for i in range(20):
    fig.add_subplot(4,5,i+1)
    plt.title(labels[y_train[i]])
    plt.imshow(X_train[i].reshape(28,28),cmap="binary")
plt.show()

3. Keras Sequential Model with TensorFlow backend Implementation


3.1 The Sequential model suits a feed-forward CNN

from keras.models import Sequential

3.2 Adding layers that are commonly used for neural networks

Sequential_model.add()

3.3 Reshaping the image datasets (training and testing) is required to match the TensorFlow backend

X_train.reshape(number_of_elements, height, width, 1)  # depth = 1

3.4 Preprocessing data to normalize it - the value will range from [0 - 1] rather than [0 - 255]

X_train /=255

3.5 Classifier model configurations and parameters

model = Sequential()

model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1)))
model.add(Convolution2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
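
The layer sizes implied by this configuration can be worked out by hand. The sketch below assumes the Keras defaults for these layers (3x3 kernels with 'valid' padding, stride 1, and a bias term per unit); the helper names are hypothetical and only illustrate the arithmetic.

```python
def conv_out(size, kernel):
    """Output side length of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output channel."""
    return out_channels * (kernel * kernel * in_channels + 1)

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

side = conv_out(28, 3)          # 26 after the first convolution
side = conv_out(side, 3)        # 24 after the second convolution
side = side // 2                # 12 after 2x2 max pooling
flat_units = side * side * 32   # 4608 values enter the first Dense layer

total_params = (
    conv_params(1, 32, 3)
    + conv_params(32, 32, 3)
    + dense_params(flat_units, 128)
    + dense_params(128, 10)
)
print(flat_units)    # 4608
print(total_params)  # 600810
```

Almost all of the parameters sit in the first Dense layer, which is one reason Dropout(0.5) is placed right after it.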
In [3]:
from keras.models import Sequential

from keras.layers import Dense,Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
Using TensorFlow backend.
In [4]:
# Reshape to the layout the TensorFlow backend expects:
# (number_of_elements, height, width, depth), with depth = 1 for grayscale
X_train = X_train.reshape(X_train.shape[0], 28, 28,1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
print("Train Data Set Shape :\t",format(X_train.shape))
print("Test Data Set Shape :\t",format(X_test.shape))
Train Data Set Shape :	 (60000, 28, 28, 1)
Test Data Set Shape :	 (10000, 28, 28, 1)
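
A small sketch of what this reshape does to one flattened row, assuming NumPy's default row-major order: pixel k of the 784-value row ends up at row k // 28, column k % 28. The array `flat` is a hypothetical stand-in for a single image.

```python
import numpy as np

# One fake flattened image: 784 values numbered 0..783.
flat = np.arange(784).reshape(1, 784)

# Same call shape as in the notebook: (samples, height, width, depth).
image = flat.reshape(flat.shape[0], 28, 28, 1)

# Row-major order: pixel k of the flat row lands at (k // 28, k % 28).
k = 100
row_idx, col_idx = divmod(k, 28)
print(image.shape)                    # (1, 28, 28, 1)
print(image[0, row_idx, col_idx, 0])  # 100
```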
In [5]:
#Normalizing data so that value ranges from [0,1]
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
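
The cast to float32 matters before the in-place division. A minimal sketch, assuming the pixel values were loaded as integers (which is what read_csv normally produces for this kind of file):

```python
import numpy as np

pixels = np.array([[0, 128, 255]], dtype=np.int64)

# In-place true division on an integer array is rejected by NumPy,
# because the float result cannot be stored back into an int array.
try:
    pixels /= 255
    in_place_failed = False
except TypeError:
    in_place_failed = True

# Casting first makes the in-place division valid.
scaled = pixels.astype("float32")
scaled /= 255

print(in_place_failed)             # True
print(scaled.min(), scaled.max())  # 0.0 1.0
```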

3.6 Making model to map to 10 classes (nodes) at the end of the network

y_train = np_utils.to_categorical(y_train,10)
In [6]:
# One-hot encode the labels: shape (n,) becomes (n, 10)
y_train = np_utils.to_categorical(y_train,10)
y_test = np_utils.to_categorical(y_test,10)
print(y_train.shape)
print(y_test.shape)
(60000, 10)
(10000, 10)
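
For intuition, the same one-hot encoding can be sketched in plain NumPy. The helper `one_hot` is a hypothetical illustration, not the Keras implementation; argmax recovers the original class ids.

```python
import numpy as np

def one_hot(class_ids, num_classes):
    """Row i is all zeros except for a 1 at column class_ids[i]."""
    return np.eye(num_classes, dtype="float32")[class_ids]

toy_ids = np.array([0, 9, 2])
encoded = one_hot(toy_ids, 10)
decoded = encoded.argmax(axis=1)  # back to integer class ids

print(encoded.shape)  # (3, 10)
print(decoded)        # [0 9 2]
```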
In [7]:
model = Sequential()

model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1)))
model.add(Convolution2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

3.7 After adding the required layers, the model is compiled

Sequential_model.compile()

3.7.1 optimizer='adam' uses learning rate = 0.001

3.7.2 optimizer='Adadelta' uses learning rate = 1.0 (in the Keras version used here) but adapts its step size during training

3.7.3 metrics=['accuracy'] means accuracy is reported during training

In [8]:
model.compile(loss='categorical_crossentropy',
              optimizer='Adadelta',
              metrics=['accuracy'])
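
To make the loss concrete, here is a hedged NumPy sketch of categorical cross-entropy on one-hot targets. The function and the three-class toy values are hypothetical; Keras computes the same quantity internally and averages it over the batch.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Per-sample loss: minus the log of the probability given to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_prob), axis=1)

# Two toy samples with three classes each.
toy_true = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)
toy_prob = np.array([[0.1, 0.2, 0.7], [0.5, 0.3, 0.2]])

losses = categorical_crossentropy(toy_true, toy_prob)
print(losses)         # first sample is the more confident one, so its loss is lower
print(losses.mean())  # the batch average is what the training log reports as loss
```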

3.8 Model training starts

3.8.1 epochs = 12 means 12 full passes over the training data; this value can be set higher

3.8.2 verbose = 1 displays a progress bar during training (verbose = 0 is silent)

In [9]:
model.fit(X_train, y_train, 
          batch_size=32, epochs=12, verbose=1, validation_split=0.1)
Train on 54000 samples, validate on 6000 samples
Epoch 1/12
54000/54000 [==============================] - 122s 2ms/step - loss: 0.5012 - acc: 0.8222 - val_loss: 0.3222 - val_acc: 0.8840
Epoch 2/12
54000/54000 [==============================] - 119s 2ms/step - loss: 0.3420 - acc: 0.8803 - val_loss: 0.2993 - val_acc: 0.8965
Epoch 3/12
54000/54000 [==============================] - 112s 2ms/step - loss: 0.2999 - acc: 0.8944 - val_loss: 0.2729 - val_acc: 0.9045
Epoch 4/12
54000/54000 [==============================] - 110s 2ms/step - loss: 0.2752 - acc: 0.9040 - val_loss: 0.2709 - val_acc: 0.9118
Epoch 5/12
54000/54000 [==============================] - 112s 2ms/step - loss: 0.2584 - acc: 0.9097 - val_loss: 0.2367 - val_acc: 0.9205
Epoch 6/12
54000/54000 [==============================] - 126s 2ms/step - loss: 0.2478 - acc: 0.9141 - val_loss: 0.2473 - val_acc: 0.9170
Epoch 7/12
54000/54000 [==============================] - 128s 2ms/step - loss: 0.2397 - acc: 0.9158 - val_loss: 0.2672 - val_acc: 0.9085
Epoch 8/12
54000/54000 [==============================] - 135s 3ms/step - loss: 0.2328 - acc: 0.9200 - val_loss: 0.2345 - val_acc: 0.9167
Epoch 9/12
54000/54000 [==============================] - 147s 3ms/step - loss: 0.2263 - acc: 0.9215 - val_loss: 0.2301 - val_acc: 0.9228
Epoch 10/12
54000/54000 [==============================] - 147s 3ms/step - loss: 0.2212 - acc: 0.9240 - val_loss: 0.2243 - val_acc: 0.9222
Epoch 11/12
54000/54000 [==============================] - 136s 3ms/step - loss: 0.2153 - acc: 0.9255 - val_loss: 0.2194 - val_acc: 0.9260
Epoch 12/12
54000/54000 [==============================] - 143s 3ms/step - loss: 0.2115 - acc: 0.9269 - val_loss: 0.2325 - val_acc: 0.9252
Out[9]:
<keras.callbacks.History at 0x7f4e4d520470>
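
The numbers in this log can be checked with a little arithmetic. The sketch below assumes Keras holds out the last validation_split fraction of the samples before shuffling; the per-epoch seconds are copied from the log above.

```python
import math

n_samples = 60000        # rows in the training CSV
validation_split = 0.1
batch_size = 32

# Keras holds out the last fraction of the samples (taken before shuffling).
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train
batches_per_epoch = math.ceil(n_train / batch_size)

# Per-epoch wall-clock seconds copied from the training log above.
epoch_seconds = [122, 119, 112, 110, 112, 126, 128, 135, 147, 147, 136, 143]
total_minutes = sum(epoch_seconds) / 60

print(n_train, n_val)           # 54000 6000
print(batches_per_epoch)        # 1688
print(round(total_minutes, 1))  # 25.6
```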

Notes on Keras Sequential Model Experiment

  • The Adam and Adadelta optimizers gave almost the same accuracy
  • Accuracy increased when validation_split was set below 0.2
  • Accuracy increased slightly with more epochs, but training takes longer to complete

3.9 Saving the model to save time and make reuse easy

from keras.models import load_model
model.save('my_model.h5')
model = load_model('my_model.h5') ## loading model

4. Decision Tree Classifier


4.1 Creating Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
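
With default settings, DecisionTreeClassifier chooses splits by Gini impurity (the fitted classifier's printout lists criterion='gini'). A minimal sketch of that measure follows; the helper name and the toy labels are hypothetical.

```python
from collections import Counter

def gini_impurity(class_ids):
    """1 minus the sum of squared class proportions in a node."""
    n = len(class_ids)
    counts = Counter(class_ids)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

pure_node = gini_impurity([7, 7, 7, 7])   # a single class: nothing left to split
mixed_node = gini_impurity([0, 0, 6, 6])  # two classes in equal shares

print(pure_node)   # 0.0
print(mixed_node)  # 0.5
```

The tree prefers splits that lower the weighted impurity of the child nodes the most.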

4.2 Classifier model configurations and parameters

clf.fit(trainData, trainlabels)
In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import Normalizer

clf = DecisionTreeClassifier()
Xdc_train = trainDataSet["data"]
ydc_train = trainDataSet["target"]
Xdc_test = testDataSet["data"]
ydc_test = testDataSet["target"]

# Normalizer rescales each sample (row) to unit L2 norm
scaler = Normalizer().fit(Xdc_train)
Xdc_train_norm = scaler.transform(Xdc_train)
Xdc_test_norm = scaler.transform(Xdc_test)

clf.fit(Xdc_train_norm, ydc_train)
Out[19]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
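
Note that Normalizer is not the same preprocessing as dividing by 255: with its default settings it rescales each row (each image) to unit L2 norm. A NumPy sketch of that operation, with a hypothetical helper name and toy data:

```python
import numpy as np

def l2_normalize_rows(x):
    """Divide each row by its Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return x / norms

toy = np.array([[3.0, 4.0], [0.0, 10.0]])
unit_rows = l2_normalize_rows(toy)

print(unit_rows)                          # rows [0.6, 0.8] and [0.0, 1.0]
print(np.linalg.norm(unit_rows, axis=1))  # every row now has length 1
```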

Evaluation


5. Prediction Accuracy & Precision

5.1 Sequential Model

seq_prediction = np.argmax(model.predict(X_test, batch_size=32, verbose=1), axis=1)
seq_accuracy = model.evaluate(X_test,y_test)

Accuracy: 92.38%

5.2 Decision Tree Classifier

from sklearn.metrics import accuracy_score
dt_prediction = clf.predict(Xdc_test_norm)
dt_accuracy = accuracy_score(ydc_test,dt_prediction)

Accuracy: 80.14%

In [24]:
#Sequential Model
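# predict_classes was removed from recent Keras; argmax over the softmax output is equivalent
seq_prediction = np.argmax(model.predict(X_test, batch_size=32, verbose=1), axis=1)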
seq_accuracy = model.evaluate(X_test,y_test)
print("Accuracy Keras Model: %.2f percent" %(seq_accuracy[1] * 100))
10000/10000 [==============================] - 6s 623us/step
10000/10000 [==============================] - 5s 537us/step
Accuracy Keras Model: 92.38 percent
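
The accuracy figure is the share of test images whose most probable class matches the true label. A small sketch with hypothetical three-class softmax outputs:

```python
import numpy as np

# Hypothetical softmax outputs for three samples and three classes.
toy_probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
])
toy_true_one_hot = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
])

predicted = toy_probs.argmax(axis=1)            # most probable class per sample
true_ids = toy_true_one_hot.argmax(axis=1)      # undo the one-hot encoding
toy_accuracy = (predicted == true_ids).mean()   # share of correct predictions

print(predicted)     # [1 0 2]
print(toy_accuracy)  # two of the three samples are correct
```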
In [26]:
#DecisionTree Classifier
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score
from sklearn.model_selection import cross_val_score

dt_prediction = clf.predict(Xdc_test_norm)
dt_accuracy = accuracy_score(ydc_test,dt_prediction)
print("Accuracy Decision Tree: %.2f percent" %(dt_accuracy * 100))

scores = cross_val_score(clf, Xdc_train_norm, ydc_train, cv=5)
print(scores)
Accuracy Decision Tree: 80.14 percent
[ 0.79133333  0.78475     0.79183333  0.7985      0.78866667]
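
The five fold scores are easier to compare with the single test accuracy once they are summarised as a mean and a spread. The values below are copied from the output above.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
fold_scores = np.array([0.79133333, 0.78475, 0.79183333, 0.7985, 0.78866667])

mean_score = fold_scores.mean()
std_score = fold_scores.std()

print("CV accuracy: %.2f percent (std %.2f)" % (mean_score * 100, std_score * 100))
```

The mean is about 79.1%, slightly below the 80.14% measured on the test set; each fold trains on only 80% of the training data, which may explain part of the gap.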

5.3 Precision Visualization

Get precision

seq_precision = precision_score(ydc_test,seq_prediction, average=None)
dt_precision = precision_score(ydc_test,dt_prediction, average=None)
In [13]:
seq_precision = precision_score(ydc_test,seq_prediction, average=None)
dt_precision = precision_score(ydc_test,dt_prediction, average=None)
In [14]:
labels = ["T Shirt/Top", "Trouser", "PullOver", "Dress", "Coat", "Sandal", "Shirt", "Sneaker",
        "Bag","Ankle Boot"]

fig = plt.figure(figsize=(12,6))
plt.title("Precision Score for Both Models")
plt.ylabel("Precision (%)")
plt.xlabel("Classes")
x_coordinate = [0,1,2,3,4,5,6,7,8,9]
plt.xticks(range(10), labels[:10])
blueLine = plt.plot(x_coordinate,seq_precision * 100, 'b',label="Keras Sequential Model")
greenLine = plt.plot(x_coordinate,dt_precision * 100, 'g',label="Decision Tree Model")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

5.3.1 Precision Comment

From the graph, Shirt recorded the lowest precision for both models, while Trouser, Sandal, Sneaker, Bag and Ankle Boot recorded high precision. T Shirt, PullOver, Dress and Coat have lower precision. This means that for these low-precision classes, a predicted label is correct only about 80% of the time for the Keras model and about 70% of the time for the decision tree model.
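
Precision for a class is the share of predictions of that class that are actually correct. A hedged sketch on a hypothetical 3-class confusion matrix, using scikit-learn's convention of rows for true classes and columns for predicted classes:

```python
import numpy as np

# Rows = true class, columns = predicted class.
toy_conf = np.array([
    [8, 2, 0],
    [1, 6, 3],
    [0, 2, 8],
])

true_positives = np.diag(toy_conf)
toy_precision = true_positives / toy_conf.sum(axis=0)  # divide by column totals
toy_recall = true_positives / toy_conf.sum(axis=1)     # divide by row totals

print(toy_precision)  # per-class precision
print(toy_recall)     # per-class recall, for comparison
```

A class can have high recall but low precision when other classes are frequently mistaken for it.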

6. Confusion Matrix

6.1 Getting Confusion Matrix

seq_confMat = confusion_matrix(ydc_test,seq_prediction)
dt_confMat = confusion_matrix(ydc_test,dt_prediction)

6.2 Visualizing Confusion Matrix

def plotConfMatrix(conf_mat, title)
In [15]:
seq_confMat = confusion_matrix(ydc_test,seq_prediction)
dt_confMat = confusion_matrix(ydc_test,dt_prediction)
In [27]:
def plotConfMatrix(conf_mat, title):
    norm_conf = []
    for i in conf_mat:
        a = 0
        tmp_arr = []
        a = sum(i, 0)
        for j in i:
            tmp_arr.append(float(j)/float(a))
        norm_conf.append(tmp_arr)

    fig = plt.figure(figsize=(10,10))
    plt.clf()
    ax = fig.add_subplot(111)
    ax.set_aspect(1)
    res = ax.imshow(np.array(norm_conf), cmap=plt.cm.jet, 
                    interpolation='nearest')

    width, height = conf_mat.shape

    for x in range(width):
        for y in range(height):
            ax.annotate(str(conf_mat[x][y]), xy=(y, x), 
                        horizontalalignment='center',
                        verticalalignment='center')
    labels = ["T Shirt/Top", "Trouser", "PullOver", "Dress", "Coat", "Sandal", "Shirt", "Sneaker",
        "Bag","Ankle Boot"]
    plt.title(title)
    plt.xticks(range(width), labels[:width])
    plt.yticks(range(height), labels[:height])
    plt.show()
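
The nested loop at the top of plotConfMatrix divides every count by its row total, so the colours show shares per true class while the annotations show raw counts. A vectorised NumPy sketch of the same normalisation, on hypothetical counts:

```python
import numpy as np

# Hypothetical 2-class confusion matrix (rows = true class).
toy_counts = np.array([
    [90, 10],
    [25, 75],
])

# keepdims=True keeps the row totals as a column so broadcasting
# divides each row by its own total.
row_share = toy_counts / toy_counts.sum(axis=1, keepdims=True)

print(row_share)              # rows [0.9, 0.1] and [0.25, 0.75]
print(row_share.sum(axis=1))  # each row sums to 1
```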

6.3 Plotting Confusion Matrix

In [28]:
plotConfMatrix(seq_confMat, "Keras Sequential Model")
plotConfMatrix(dt_confMat, "Decision Tree Classifier")

6.4 Confusion Matrix Comments

Most of the confusion (incorrect predictions) comes from T-Shirt and Shirt. Pull Over, Coat and Bag also add confusion to the predictions. Their similar shapes and designs may contribute to this.

According to the confusion matrix for the Keras Sequential Model, over 143 T-Shirts are mistaken for Shirts, 96 Pull Overs are mistaken for Coats, and 45 Dresses are mistaken for T-Shirts or Shirts. The other mistakes are the same confusions in the opposite direction.

Trouser, Sneaker, Sandal, Dress, Ankle Boot and Bag are confused with each other much less. This may be because of their distinct shapes.

7. Benchmarking


7.1 Keras Sequential Model

7.1.1 Models with two convolutional layers and preprocessing reach accuracies of 91.0% and above

7.1.2 My Model: 92.38%

7.2 Decision Tree Classifier

7.2.1 The model with the highest accuracy reaches 81.01%

7.2.2 My Model: 80.14%

8. Observation and Comments on the Implementation


*Keras Sequential Model.*

This model did well. It took 26 minutes to train and 1 minute to predict the test data. The more epochs you run, the longer training takes. While accuracy improved slightly with the number of epochs, it plateaus at around 93%. The model achieved **92.38%** accuracy in predicting the correct labels of the test data. According to the precision plot for this model, its reliability in predicting labels such as Trouser, Sneaker, Sandal, Dress, Ankle Boot and Bag is very high, so the model should distinguish these labels very precisely.

However, this model has a hard time predicting T-Shirt and Shirt. Precision for both labels is just over **80%**, the lowest among all classes.

**Ways To Improve this model**

This model requires more training on the T-Shirt and Shirt labels. One way to improve the model on these particular labels is to give it more sample images to train on. However, the amount of data already provided is huge, so more data may not increase the accuracy of the model by much.

Another consideration: T-Shirt and Shirt share at least 80% of their shape. However, Shirts have collars while T-Shirts don't. The question is whether the model can take this small detail into account. More research can be done on this point.

From a neural network perspective, adding more hidden layers and neurons can help the model train more precisely and accurately. This is well demonstrated in TensorFlow Playground. There are more parameters that can be tested and modified to increase the efficiency of training, such as the validation split and the data preprocessing.

*Decision Tree Classifier.*

This model did well overall, with 80% accuracy, but poorly compared to the Keras Sequential classifier. It took 15 minutes to train and 0.5 minutes to predict the test data.

Its performance is acceptable but lower than that of the Keras Sequential Model. It also faces the same problems as the Keras Sequential Model.

**Ways To Improve this model**

This model requires more training on the T-Shirt and Shirt labels especially, and on all other labels too. One way to improve the model on these particular labels is to give it more sample images to train on. However, the amount of data already provided is huge, so more data may not increase the accuracy of the model by much.

A decision tree works on a feature-by-feature basis, and improving it might not be as straightforward as it is for the Keras Sequential Model. That is why continuing to train the neural network model is recommended. In addition, preprocessing the data increased accuracy by only 0.5%, which may indicate how limited its effect is.

9. Suggestion to Improve the System


1. I would like to explore more optimizers for the Keras Sequential Model.
2. The extra white background may have contributed to the similarities between classes; I would like to crop it before training the classifiers.
3. I would like to transform the data into black and white pixels only, i.e. pixel values of either 0 or 255 instead of the full range [0, 255]. This is because what matters is the shape of the images, not the pattern on top of them. It didn't give me the results I wanted, but I believe this can work.
4. I would like to train the model using KFold.
5. I would like to add more layers to the Keras Sequential Model and add more nodes to train it.
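
The black-and-white idea in suggestion 3 could be sketched as a simple threshold. The cut-off of 127 and the helper name `binarize` are assumptions chosen for illustration; the notebook does not say which threshold was tried.

```python
import numpy as np

def binarize(img, threshold=127):
    """Map every pixel to 0 or 255 depending on a fixed threshold (assumed value)."""
    return np.where(img > threshold, 255, 0).astype(np.uint8)

toy_img = np.array([
    [0, 50, 200],
    [127, 128, 255],
], dtype=np.uint8)

black_white = binarize(toy_img)
print(black_white)             # only the values 0 and 255 remain
print(np.unique(black_white))  # [  0 255]
```

This keeps the silhouette of each garment and discards texture, which is the intent described above.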

Created By: Dheenodara Rao
Class: Data Mining
DataSet: Fashion MNIST