Machine learning is the talk of the tech industry, as organisations are faced with an explosion of data and are looking for ways to gain a competitive advantage through monetising this data. Examples of such monetisation are targeted marketing, fraud detection, churn prevention, network analysis, predictive maintenance etc. This series of articles describes various machine-learning techniques using the fantastic open source library Apache Spark MLlib. MLlib is a machine-learning library that transformation of large amounts of data into meaningful patterns and rules that can be used for both statistical and predictive analytics. Combined with Sparks powerful support for horizontally scalable machine learning through Resilient Distributed Data Structures (RDD) , MLlib provides incorporates dozens of algorithms to solve various classification, regression, and clustering problems.
This first part deals with one of the oldest and popular technique of machine learning - Decision Trees. As they are intuitive and easily readable, decision trees are perfect algorithms to start machine learning. We will focus on basic Decision Trees but advise exploring extensions of this technique, which are more likely to be used in practical applications i.e. Ensembles or Multiple Decision Trees (random forests). We will use IPython Notebook to execute load the data set and execute the algorithm. For details on how to install spark on Ubuntu or OS-X, please refer to
We will attempt to use this technique to classification of medical problems in medical diagnoses. In medical diagnoses the role of Classification algorithms such as Decision Trees is very important for medical practitioners because they are very fast to train and produce transparent models. This particular application of classification relates to breast cancer data. Breast cancer is the second most important cause of death among women and is characterised by malignant tumours when cells in the breast tissue divide and grow without normal controls on cell death and cell division. It affects approximately 10% of all women at some period of their life. Mammography screening is the dominant method of early detection of breast cancer. Digital mammogram images are sometimes difficult to read due to their low contrast and differences in the types of tissues and the diagnoses has to be supported with fine needle aspiration cytology in order to extract tissue for a biopsy. However, statistics show that only 20-30 percentages of breast biopsy cases prove cancerous (false positives). However, a false negative carries much greater risk in terms of potential loss of human life. Therefore, it is necessary to develop better identification methods to recognise breast cancer. Machine learning methods can help to reduce the number of false positives and false negative decisions [1].
The objective is to assign patients to either a ‘Benign’ group/class that does not have breast cancer or a ‘Malignant’ group that has strong evidence of having breast cancer based on cytological proven tumour data. For this illustration, we will use the Wisconsin Breast Cancer Data set. The attributes in the data set are: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, and Class (Benign or Malignant)
We will first load the csv into a raw data set and split it into the attribute fields.
path = "breast-cancer-wisconsin.csv"
raw_data = sc.textFile(path)
num_data = raw_data.count()
data = raw_data.map(lambda x: x.split(","))
first = data.first()
print first
print num_data
Out:
[u'5', u'1', u'1', u'1', u'2', u'1', u'3', u'1', u'1', u'2']
683
We will next parse the raw data to create an RDD that we can subsequently use to create an the training, validation and test data sets. This is essential to overcome the problem of 'over-fitting: If we supply too much training data, the model will actually be created perfectly, but just for that data. The model should be able to predict future unknowns and not perfectly predict values that are already known. This is why we do a random split of the raw data set into 2 buckets - training (create the model) and test (validate the model).
Initially we translate the raw records into labelled points, which are key-value pairs. The key or label is the target classification or class of the observation and the values is a feature vector stored as an array. The feature values as converted to floats. Note that decision tree models can work on raw features i.e. it does not matter if the feature is categorical (Country: US, UK) or numerical (Income: 5000). In other methods, which we will discuss in later articles, the categorical features need to be encoded to convert into a numeric feature and reduce the side effects such as intrinsic ordering.
from pyspark.mllib.regression import LabeledPoint
import numpy as np
def extract_features_dt(record):
return np.array(map(float, record[0:9]))
def extract_label(record):
return float(record[-1])
data_dt = data.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))
first_point_dt = data_dt.first()
print "Decision Tree feature vector: " + str(first_point_dt.features)
print "Decision Tree feature vector length: " + str(len(first_point_dt.features))
train_data_dt, test_data_dt = data_dt.randomSplit([0.7, 0.3])
train_data_dt.cache()
test_data_dt.cache
The following line illustrates the first feature vector which has 9 numeric features.
Out:
Decision Tree feature vector: [5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0]
Decision Tree feature vector length: 9
We will first train a classification tree including every single feature. We can then use our results for model selection and perform model pruning. Pruning, as the name suggests, involves removing branches of the classification tree. This again is to avoid over-fitting. As the data set grows larger and the number of features grows, the classification trees can get increasingly complex (depending on the depth and number of buckets configured for the model creation). We want our tree to be as simple as possible, with as few nodes and depth as possible without sacrificing too much accuracy. This can be a balancing act and leave it as an exercise for the reader (http://spark.apache.org/docs/latest/mllib-decision-tree.html)
We now proceed with building the decision model. Here, the use of trainClassifier method of DecisionTree instead of trainRegressor is because the class or target of the LabeledPoint should be treated as a distinct category number, not a numeric feature value.
from pyspark.mllib.tree import DecisionTree
dt_model = DecisionTree.trainClassifier(train_data_dt,5,{},"gini",4,10)
print dt_model.toDebugString()
The resultant decision model is traced below. You can experiment with different depths and maxBins to see how it impacts the complexity and time required to train the model.
Out:
DecisionTreeModel classifier of depth 4 with 25 nodes
If (feature 1 <= 3.0)
If (feature 5 <= 5.0)
If (feature 5 <= 2.0)
Predict: 2.0
Else (feature 5 > 2.0)
If (feature 2 <= 2.0)
Predict: 2.0
Else (feature 2 > 2.0)
Predict: 4.0
Else (feature 5 > 5.0)
If (feature 0 <= 1.0)
Predict: 2.0
Else (feature 0 > 1.0)
If (feature 8 <= 2.0)
Predict: 4.0
Else (feature 8 > 2.0)
Predict: 4.0
Else (feature 1 > 3.0)
If (feature 3 <= 1.0)
If (feature 0 <= 6.0)
If (feature 6 <= 3.0)
Predict: 2.0
Else (feature 6 > 3.0)
Predict: 4.0
Else (feature 0 > 6.0)
Predict: 4.0
Else (feature 3 > 1.0)
If (feature 1 <= 4.0)
If (feature 5 <= 7.0)
Predict: 4.0
Else (feature 5 > 7.0)
Predict: 4.0
Else (feature 1 > 4.0)
If (feature 7 <= 2.0)
Predict: 4.0
Else (feature 7 > 2.0)
Predict: 4.0
Finally we can use the created model against the test set to predict the class ("Benign=2" or "Malignant=4") of the test observation.
preds = dt_model.predict(test_data_dt.map(lambda p: p.features))
actual = test_data_dt.map(lambda p: p.label)
true_vs_predicted_dt = actual.zip(preds)
print "Decision Tree predictions: " + str(true_vs_predicted_dt.take(50))
print "Decision Tree depth: " + str(dt_model.depth())
print "Decision Tree number of nodes: " + str(dt_model.numNodes())
Out:
Decision Tree predictions: [(2.0, 4.0), (2.0, 2.0), (4.0, 4.0), (4.0, 4.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (4.0, 4.0), (4.0, 4.0), (2.0, 2.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (4.0, 2.0), (4.0, 2.0), (4.0, 4.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (4.0, 4.0), (2.0, 2.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (4.0, 4.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (4.0, 4.0), (4.0, 4.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (4.0, 4.0)]
Decision Tree depth: 4
Decision Tree number of nodes: 25
Is the error is low enough?
te_count = true_vs_predicted_dt.count()
err_1 = true_vs_predicted_dt.filter(lambda (v, p): v != p).count() / float(te_count)
print "Error percentage: " + str(err_1)
In this case, the model seems to be about 97% accurate, which is pretty good for a prediction model.
Out:
Error percentage: 3.30188679245%
Based on a quick scan of the main splits in the decision tree, we will illustrate the classification using a 3D plot of 3 significant parameters (Clump Thickness, Uniformity of Cell Size, Bare Nuclei). The colours indicate the predicted class - Blue for Benign and Red for Malignant. A word of caution: it is certainly not accurate to say that these are the most significant features as there are other splits in the tree and tuning the model may reveal other significant features. It is left as an exercise to try pruning the model and build a minimal tree using just a few features and compare the performance and accuracy against the full set of predictors and different model hyper-parameters.
References:
[1]: Diagnosis of Breast Cancer using Decision Tree Models and SVM, Alaa. M. Elsayad,H. A. Elsalamony
Can anyone please brief how this feature set is generated (.csv).
ReplyDelete