2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
IBM Watson Studio Upgrade
Machine Learning - Project Notebook
VERSION AUTHOR LAST UPDATED LANGUAGE
Jeyashri Ganesan 30 Jan 2019, 5:29 PM Python 3.5
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 1/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
IBM Watson Studio Upgrade
(https://www.bigdatauniversity.com)
Classification with Python
In this notebook we try to practice all the classification algorithms that we learned in this course.
We load a dataset using Pandas library, and apply the following algorithms, and find the best one for this specific dataset by accuracy evaluation methods.
Lets first load required libraries:
In [2]: import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline
About dataset
This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loan are already paid off or defaulted. It includes
following fields:
Field Description
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 2/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson
IBM Watson Studio Loan_status Whether a loan is paid off on in collection Upgrade Dr Mohammad Ayoub K… DK
Principal Basic principal loan amount at the
Terms Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule
Effective_date When the loan got originated and took effects
Due_date Since it’s one-time payoff schedule, each loan has one single due date
Age Age of applicant
Education Education of applicant
Gender The gender of applicant
Lets download the dataset
In [3]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv
--2019-01-30 10:39:14-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E
Nv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’
100%[======================================>] 23,101 --.-K/s in 0.002s
2019-01-30 10:39:14 (13.3 MB/s) - ‘loan_train.csv’ saved [23101/23101]
Load Data From CSV File 3/39
In [4]: df = pd.read_csv('loan_train.csv')
df.head()
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
Out[4]: Dr Meodhuacmatmioand AGyeonudbeKr… DK
IBM Watson StuUdnionamed: 0 Unnamed: 0.1 loan_status Principal terms effective_dateUpdgruaed_edate age
00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male
12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female
23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male
34 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female
46 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male
In [5]: df.shape
Out[5]: (346, 10)
Convert to date time object
In [6]: df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()
Out[6]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender
00 0 PAIDOFF 1000 30 2016-09-08 2016-10-07 45 High School or Below male
12 2 PAIDOFF 1000 30 2016-09-08 2016-10-07 33 Bechalor female
23 3 PAIDOFF 1000 15 2016-09-08 2016-09-22 27 college male
34 4 PAIDOFF 1000 30 2016-09-09 2016-10-08 28 college female
46 6 PAIDOFF 1000 30 2016-09-09 2016-10-08 29 college male
Data visualization and pre-processing 4/39
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
Let’s see how many of each class is in our data set Upgrade
IBM Watson Studio
In [7]: df['loan_status'].value_counts()
Out[7]: PAIDOFF 260
COLLECTION 86
Name: loan_status, dtype: int64
260 people have paid off the loan on time while 86 have gone into collection
Lets plot some columns to underestand data better:
In [8]: # notice: installing seaborn might takes a few minutes
!conda install -c anaconda seaborn -y
Fetching package metadata .............
Solving package specifications: .
Package plan for installation in environment /opt/conda/envs/DSX-Python35:
The following packages will be UPDATED:
seaborn: 0.8.0-py35h15a2772_0 --> 0.9.0-py35_0 anaconda
seaborn-0.9.0- 100% |################################| Time: 0:00:00 6.79 MB/s
In [9]: import seaborn as sns
bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 5/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
IBM Watson Studio Upgrade
In [13]: bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
Pre-processing: Feature selection/extraction 6/39
Lets look at the day of the week people get the loan
In [14]: df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan status", palette="Set1", col wrap=2)
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 g sns.FacetGrid(df, col Gender , hue Mlaocahnin_esLteaartniunsg - ,PropjeactlNeottetbeook S- IeBtM1W,atsocnol_wrap 2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k") Upgrade Dr Mohammad Ayoub K… DK
IBM WatsongS.tauxdeiso[-1].legend()
plt.show()
We see that people who get the loan at the end of the week dont pay it off, so lets use Feature binarization to set a threshold values less then day 4
In [15]: df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
df.head()
Out[15]:
Unnamed: Unnamed: loan_status Principal terms effective_date due_date age education Gender dayofweek weekend
0 0.1
00 0 PAIDOFF 1000 30 2016-09-08 2016-10- 45 High male 3 0
07 School or
Below
12 2 PAIDOFF 1000 30 2016-09-08 2016-10- 33 Bechalor female 3 0
07
23 3 PAIDOFF 1000 15 2016-09-08 2016-09- 27 college male 3 0
22
34 4 PAIDOFF 1000 30 2016-09-09 2016-10- 28 college female 4 1
08
https://dataplatform.cloud.ibm4.com6/analytics/noteb6ooks/v2/9336fa5PdA-dIdDb2O-4FcF18-b57e1-0040801a3d20e23d0/view?ac2c0es1s6_t-o0k9en-0=f915d66bb02808912f652-18901-c7e24c95dd4cdod4ll5ef0gfe697d9ffc0me6a5lbeb4bb7d49199ebad4592 1 7/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson g DK
Dr Mohammad Ayoub K…
IBM Watson Studio 08
Upgrade
Convert Categorical features to numerical values
Lets look at gender:
In [16]: df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
Out[16]: Gender loan_status
female PAIDOFF 0.865385
COLLECTION 0.134615
male PAIDOFF 0.731293
COLLECTION 0.268707
Name: loan_status, dtype: float64
86 % of female pay there loans while only 73 % of males pay there loan
Lets convert male to 0 and female to 1:
In [17]: df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
df.head()
Out[17]:
Unnamed: Unnamed: loan_status Principal terms effective_date due_date age education Gender dayofweek weekend
0 0.1
00 0 PAIDOFF 1000 30 2016-09-08 2016-10- 45 High 0 3 0
07 School or
Below
12 2 PAIDOFF 1000 30 2016-09-08 2016-10- 33 Bechalor 1 3 0
07
23 3 PAIDOFF 1000 15 2016-09-08 2016-09- 27 college 0 3 0
22
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 8/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson
IBM Watson 3Stu4dio 4 PAIDOFF 1000 30 2016-09-09 02081U6p-1g0ra-de28 college 1Dr Moham4mad Ayoub K1… DK
46 6 PAIDOFF 1000
30 2016-09-09 2016-10- 29 college 04 1
08
One Hot Encoding
How about education?
In [18]: df.groupby(['education'])['loan_status'].value_counts(normalize=True)
Out[18]: education loan_status
Bechalor PAIDOFF 0.750000
0.250000
COLLECTION 0.741722
0.258278
High School or Below PAIDOFF 0.500000
0.500000
COLLECTION 0.765101
0.234899
Master or Above COLLECTION
PAIDOFF
college PAIDOFF
COLLECTION
Name: loan_status, dtype: float64
Feature befor One Hot Encoding
In [19]: df[['Principal','terms','age','Gender','education']].head()
Out[19]:
Principal terms age Gender education
0 1000 30 45 0 High School or Below
1 1000 30 33 1 Bechalor
2 1000 15 27 0 college
3 1000 30 28 1 college
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 9/39
2/19/2019 30 29 0 college Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
IBM Watson 4Stu1d0i0o0 Upgrade
Use one hot encoding technique to conver categorical varables to binary variables and append them to the feature Data Frame
In [20]: Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()
Out[20]:
Principal terms age Gender weekend Bechalor High School or Below college
0 1000 30 45 0 0 0 1 0
1 1000 30 33 1 0 1 0 0
2 1000 15 27 0 0 0 0 1
3 1000 30 28 1 1 0 0 1
4 1000 30 29 0 1 0 0 1
Feature selection
Lets defind feature sets, X:
In [21]: X = Feature
X[0:5]
Out[21]:
Principal terms age Gender weekend Bechalor High School or Below college
0 1000 30 45 0 0 0 1 0
1 1000 30 33 1 0 1 0 0
2 1000 15 27 0 0 0 0 1
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 10/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson
IBM Watson 3Stu1d0i0o0 30 28 1 1 00 Upg1rade Dr Mohammad Ayoub K… DK
4 1000 30 29 0 1 00 1
What are our lables?
In [22]: y = df['loan_status'].values
y[0:5]
Out[22]: array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)
Normalize Data
Data Standardization give data zero mean and unit variance (technically should be done after train test split )
In [23]: X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
Out[23]: array([[ 0.51578458, 0.92071769, 2.33152555, -0.42056004, -1.20577805,
-0.38170062, 1.13639374, -0.86968108],
[ 0.51578458, 0.92071769, 0.34170148, 2.37778177, -1.20577805,
2.61985426, -0.87997669, -0.86968108],
[ 0.51578458, -0.95911111, -0.65321055, -0.42056004, -1.20577805,
-0.38170062, -0.87997669, 1.14984679],
[ 0.51578458, 0.92071769, -0.48739188, 2.37778177, 0.82934003,
-0.38170062, -0.87997669, 1.14984679],
[ 0.51578458, 0.92071769, -0.3215732 , -0.42056004, 0.82934003,
-0.38170062, -0.87997669, 1.14984679]])
Classification
Now, it is your turn, use the training set to build an accurate model. Then use the test set to report the accuracy of the model You should use the following 11/39
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
algorithm: Upgrade
IBM Watson Studio
K Nearest Neighbor(KNN)
Decision Tree
Support Vector Machine
Logistic Regression
Notice:
You can go above and change the pre-processing, feature selection, feature-extraction, and so on, to make a better model.
You should use either scikit-learn, Scipy or Numpy libraries for developing the classification algorithms.
You should include the code of the algorithm in the following cells.
K Nearest Neighbor(KNN)
Notice: You should find the best k to build the model with the best accuracy.
warning: You should not use the loan_test.csv for finding the best k, however, you can split your train_loan.csv into train and test to find the best k.
In [156]: import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline
In [157]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv
--2019-01-30 11:57:45-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E 12/39
Nv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Length: 23101 (23K) [text/csv] Machine Learning - Project Notebook - IBM Watson
IBM WatsonSSatvuidnigo to: ‘loan_train.csv’ Upgrade Dr Mohammad Ayoub K… DK
100%[======================================>] 23,101 --.-K/s in 0.002s
2019-01-30 11:57:45 (14.3 MB/s) - ‘loan_train.csv’ saved [23101/23101]
In [158]: df = pd.read_csv('loan_train.csv')
df.head()
Out[158]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender
00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male
12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female
23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male
34 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female
46 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male
In [159]: X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]
/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/sklearn/utils/validation.py:475: DataConversionWarning:
Data with input dtype object was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
Out[159]: array([[ 0.58, 0.52, 0.92, 2.6 , -0.42, 2.33, -0.65, 0.42],
[ 0.58, 0.52, 0.92, 2.6 , -0.42, 0.34, -1.52, -2.38],
[ 0.58, 0.52, -0.96, 2.6 , 0.72, -0.65, 1.1 , 0.42],
[ 0.58, 0.52, 0.92, 3.39, -0.31, -0.49, 1.1 , -2.38],
[ 0.58, 0.52, 0.92, 3.39, -0.31, -0.32, 1.1 , 0.42]])
In [160]: from sklearn.model_selection import train_test_split 13/39
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
Train set: (276, 8) (276,)
Test set: (70 8) (70 )
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
Test set: (70, 8) (70,)
IBM Watson Studio Upgrade Dr Mohammad Ayoub K… DK
In [161]: from sklearn.neighbors import KNeighborsClassifier
In [162]: k = 6
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
Out[162]: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=6, p=2,
weights='uniform')
In [163]: yhat = neigh.predict(X_test)
yhat[0:5]
Out[163]: array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)
In [164]: from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
Train set Accuracy: 0.996376811594
Test set Accuracy: 1.0
In [165]: from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
Out[165]: 1.0
In [166]: print (classification_report(y_test, yhat))
precision recall f1-score support
COLLECTION 1.00 1.00 1.00 15
PAIDOFF 1.00 1.00 1.00 55
avg / total 1.00 1.00 1.00 70
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 14/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
DeIBcM iWsaitsoonnStTudrioee Upgrade
In [139]: import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
In [140]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv
--2019-01-30 11:55:38-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E
Nv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’
100%[======================================>] 23,101 --.-K/s in 0.002s
2019-01-30 11:55:38 (14.7 MB/s) - ‘loan_train.csv’ saved [23101/23101]
In [141]: my_data = pd.read_csv("loan_train.csv", delimiter=",")
my_data[0:5]
Out[141]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender
00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male
12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female
23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male
34 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female
46 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male
In [142]: X = my_data[['loan_status', 'Principal', 'terms', 'effective_date', 'due_date', 'age', 'education', 'Gender']].val
ues
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 15/39
2/19/2019 ues Machine Learning - Project Notebook - IBM Watson
X[0:7] Upgrade Dr Mohammad Ayoub K… DK
IBM Watson Studio
Out[142]: array([['PAIDOFF', 1000, 30, '9/8/2016', '10/7/2016', 45,
'High School or Below', 'male'],
['PAIDOFF', 1000, 30, '9/8/2016', '10/7/2016', 33, 'Bechalor',
'female'],
['PAIDOFF', 1000, 15, '9/8/2016', '9/22/2016', 27, 'college', 'male'],
['PAIDOFF', 1000, 30, '9/9/2016', '10/8/2016', 28, 'college',
'female'],
['PAIDOFF', 1000, 30, '9/9/2016', '10/8/2016', 29, 'college', 'male'],
['PAIDOFF', 1000, 30, '9/9/2016', '10/8/2016', 36, 'college', 'male'],
['PAIDOFF', 1000, 30, '9/9/2016', '10/8/2016', 28, 'college', 'male']], dtype=object)
In [143]: from sklearn import preprocessing
le_loan_date = preprocessing.LabelEncoder()
le_loan_date.fit(['COLLECTION', 'PAIDOFF'])
X[:,0] = le_loan_date.transform(X[:,0])
le_eff_date = preprocessing.LabelEncoder()
le_eff_date.fit(['9/8/2016', '9/9/2016', '9/10/2016', '9/11/2016', '9/12/2016', '9/13/2016', '9/14/2016'])
X[:,3] = le_eff_date.transform(X[:,3])
le_due_date = preprocessing.LabelEncoder()
le_due_date.fit(['9/16/2016', '9/17/2016', '9/18/2016', '9/19/2016', '9/22/2016', '9/23/2016', '9/24/2016', '9/25/
2016', '9/26/2016', '9/27/2016', '9/28/2016',
'10/7/2016', '10/8/2016', '10/9/2016', '10/10/2016', '10/11/2016', '10/12/2016', '10/13/2016', '1
0/25/2016', '10/26/2016', '11/9/2016', '11/10/2016',
'11/12/2016'])
X[:,4] = le_due_date.transform(X[:,4])
le_education = preprocessing.LabelEncoder()
le_education.fit(['Bechalor', 'High School or Below', 'college', 'Master or Above'])
X[:,6] = le_education.transform(X[:,6])
le_Gender = preprocessing.LabelEncoder()
le_Gender.fit(['female', 'male'])
X[:,7] = le_Gender.transform(X[:,7])
X[0:7] 16/39
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
Out[143]: array([[1, 1000, 30, 5, 6, 45, 1, 1], Upgrade Dr Mohammad Ayoub K… DK
IBM Watson Studio [1, 1000, 30, 5, 6, 33, 0, 0],
17/39
[1, 1000, 15, 5, 16, 27, 3, 1],
[1, 1000, 30, 6, 7, 28, 3, 0],
[1, 1000, 30, 6, 7, 29, 3, 1],
[1, 1000, 30, 6, 7, 36, 3, 1],
[1, 1000, 30, 6, 7, 28, 3, 1]], dtype=object)
In [144]: y = my_data["loan_status"]
y[0:7]
Out[144]: 0 PAIDOFF
1 PAIDOFF
2 PAIDOFF
3 PAIDOFF
4 PAIDOFF
5 PAIDOFF
6 PAIDOFF
Name: loan_status, dtype: object
In [145]: from sklearn.model_selection import train_test_split
In [146]: X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)
In [147]: X_trainset.shape
y_trainset.shape
Out[147]: (242,)
In [148]: X_testset.shape
y_testset.shape
Out[148]: (104,)
In [149]: LoanTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
LoanTree
Out[149]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min samples leaf=1 min samples split=2
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
IBM Watson Studio min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_stUaptger=aNdoene, Dr Mohammad Ayoub K… DK
splitter='best')
18/39
In [150]: LoanTree.fit(X_trainset,y_trainset)
Out[150]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
In [151]: predTree = LoanTree.predict(X_testset)
In [152]: print (predTree [0:5])
print (y_testset [0:5])
['PAIDOFF' 'PAIDOFF' 'COLLECTION' 'COLLECTION' 'PAIDOFF']
73 PAIDOFF
24 PAIDOFF
282 COLLECTION
295 COLLECTION
163 PAIDOFF
Name: loan_status, dtype: object
In [153]: from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))
DecisionTrees's Accuracy: 1.0
In [154]: print (classification_report(y_test, yhat))
precision recall f1-score support
1 0.79 1.00 0.88 55
15
2 0.00 0.00 0.00
avg / total 0.62 0.79 0.69 70
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricW DK
IBM WatsonaSrtnuidnigo: Precision and F-score are ill-defined and being set to 0.0Upignraldaebels with no preDdriMctoehdamsmamapdlAeyso.ub K…
'precision', 'predicted', average, warn_for)
In [155]: from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
Out[155]: 0.7857142857142857
In [112]:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-112-aa589220b1ce> in <module>()
3 featureNames = my_data.columns[0:5]
4 targetNames = my_data["loan_status"].unique().tolist()
----> 5 out=tree.export_graphviz(LoanTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y
_trainset), filled=True, special_characters=True,rotate=False)
6 graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
7 graph.write_png(filename)
NameError: name 'tree' is not defined
Support Vector Machine
In [50]: import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
In [51]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv
--2019-01-30 10:47:39-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E 19/39
Nv3/labs/loan train csv
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Nv3/labs/loan_train.csv Machine Learning - Project Notebook - IBM Watson
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193 DK
IBM WatsonCSotnundeicoting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.usU-pggeroa.doebjectstorage.sofDtrlMayoehra.mnmeta)d|A6y7o.u2b28K.…254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’
100%[======================================>] 23,101 --.-K/s in 0.002s
2019-01-30 10:47:39 (11.3 MB/s) - ‘loan_train.csv’ saved [23101/23101]
In [53]: loan_df = pd.read_csv("loan_train.csv")
loan_df.head()
Out[53]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender
00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male
12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female
23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male
34 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female
46 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male
In [58]: ax = loan_df[loan_df['loan_status'] == 'PAIDOFF'][0:50].plot(kind='scatter', x='age', y='Principal', color='DarkBl
ue', label='Paidoff');
loan_df[loan_df['loan_status'] == 'COLLECTION'][0:50].plot(kind='scatter', x='age', y='Principal', color='Yellow',
label='Collection', ax=ax);
plt.show()
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 20/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
IBM Watson Studio Upgrade
In [59]: loan_df.dtypes int64
int64
Out[59]: Unnamed: 0 object
Unnamed: 0.1 int64
loan_status int64
Principal object
terms object
effective_date int64
due_date object
age object
education
Gender
dtype: object
In [86]: import pandas as pd
file_handler = open("loan_train.csv", 'r')
mydata = pd.read_csv(file_handler, sep = ",")
file_handler.close()
gender = {'male': 1,'female': 2}
mydata.Gender = [gender[item] for item in mydata.Gender]
loanstatus = {'PAIDOFF': 1,'COLLECTION': 2}
mydata.loan_status = [loanstatus[item] for item in mydata.loan_status]
education1 = {'High School or Below': 1, 'college': 2, 'Bechalor': 3, 'Master or Above': 4}
mydata.education = [education1[item] for item in mydata.education]
print(mydata)
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date \
0 0 0 1 1000 30 9/8/2016
1 2 2 1 1000 30 9/8/2016
2 3 3 1 1000 15 9/8/2016
3 4 4 1 1000 30 9/9/2016
4 6 6 1 1000 30 9/9/2016
5 7 7 1 1000 30 9/9/2016
6 8 8 1 1000 30 9/9/2016
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 21/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson
6 8 8 1 1000 30 9/9/2016
IBM Watson78Studio 9 9 1 800 15 U99p//g11r00a//d22e001166 Dr Mohammad Ayoub K… DK
10 10 1 300 7
22/39
9 11 11 1 1000 15 9/10/2016
10 12 12 1 1000 30 9/10/2016
11 13 13 1 900 7 9/10/2016
12 14 14 1 1000 7 9/10/2016
13 15 15 1 800 15 9/10/2016
14 16 16 1 1000 30 9/10/2016
15 17 17 1 1000 15 9/10/2016
16 18 18 1 1000 30 9/10/2016
17 19 19 1 800 30 9/10/2016
18 20 20 1 1000 30 9/10/2016
19 22 22 1 1000 30 9/10/2016
20 23 23 1 1000 15 9/10/2016
21 25 25 1 1000 30 9/10/2016
22 26 26 1 800 15 9/10/2016
23 27 27 1 1000 15 9/10/2016
24 28 28 1 1000 30 9/11/2016
25 29 29 1 1000 30 9/11/2016
26 30 30 1 800 15 9/11/2016
27 31 31 1 1000 30 9/11/2016
28 32 32 1 1000 30 9/11/2016
29 33 33 1 1000 30 9/11/2016
.. ... ... ... ... ... ...
316 367 367 2 800 15 9/11/2016
317 368 368 2 1000 30 9/11/2016
318 371 371 2 1000 30 9/11/2016
319 372 372 2 800 15 9/11/2016
320 373 373 2 1000 15 9/11/2016
321 374 374 2 1000 30 9/11/2016
322 375 375 2 800 15 9/11/2016
323 376 376 2 1000 15 9/11/2016
324 377 377 2 1000 30 9/11/2016
325 378 378 2 1000 30 9/11/2016
326 379 379 2 1000 30 9/11/2016
327 380 380 2 1000 30 9/11/2016
328 381 381 2 800 15 9/11/2016
329 382 382 2 1000 15 9/11/2016
330 383 383 2 1000 30 9/11/2016
331 384 384 2 1000 30 9/11/2016
332 385 385 2 1000 30 9/11/2016
333 386 386 2 800 15 9/11/2016
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 333 386 386 M2achine Learn8in0g0- Project N1o5tebook - IB9M/W11at/so2n016
387 387
334 388 388 2 1000 30 9/11/2016
IBM Watson3S3t5udio 389 389 U9p/g1r1a/d2e016 Dr Mohammad Ayoub K… DK
390 390 2 1000 15
391 391 23/39
336 392 392 2 1000 15 9/11/2016
393 393
337 394 394 2 1000 15 9/11/2016
395 395
338 397 397 2 800 15 9/11/2016
398 398
339 399 399 2 1000 30 9/11/2016
340 2 1000 30 9/11/2016
341 2 800 15 9/11/2016
342 2 1000 30 9/11/2016
343 2 800 15 9/12/2016
344 2 1000 30 9/12/2016
345 2 1000 30 9/12/2016
due_date age education Gender
0 10/7/2016 45 11
1 10/7/2016 33 32
2 9/22/2016 27 21
3 10/8/2016 28 22
4 10/8/2016 29 21
5 10/8/2016 36 21
6 10/8/2016 28 21
7 9/24/2016 26 21
8 9/16/2016 29 21
9 10/9/2016 39 11
10 10/9/2016 26 21
11 9/16/2016 26 22
12 9/16/2016 27 11
13 9/24/2016 26 21
14 10/9/2016 40 11
15 9/24/2016 32 11
16 10/9/2016 32 11
17 10/9/2016 26 21
18 10/9/2016 26 21
19 10/9/2016 25 11
20 9/24/2016 26 21
21 10/9/2016 29 11
22 9/24/2016 39 31
23 9/24/2016 34 31
24 10/10/2016 31 21
25 10/10/2016 33 21
26 9/25/2016 33 11
27 10/10/2016 37 21
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 0/ 0/ 0 6 3 Machine Learning - Project Notebook - IBM Watson
27
28 10/10/2016 37 21
IBM Watson2S9tudi1o0/10/2016 ... 21 Upgrade Dr Mohammad Ayoub K… DK
28 ... ...
.. ... 24 31 24/39
18 21
316 9/25/2016 25 21
40 11
317 10/10/2016 29 11
26 21
318 10/10/2016 30 12
33 21
319 9/25/2016 30 21
32 21
320 9/25/2016 25 21
35 11
321 10/10/2016 30 11
26 31
322 9/25/2016 29 11
26 11
323 9/25/2016 46 11
36 11
324 10/10/2016 38 11
32 31
325 10/10/2016 30 11
35 21
326 10/10/2016 29 11
26 22
327 10/10/2016 32 21
25 11
328 9/25/2016 39 11
28 21
329 9/25/2016 26 21
21
330 10/10/2016
331 10/10/2016
332 11/9/2016
333 9/25/2016
334 10/10/2016
335 9/25/2016
336 10/25/2016
337 9/25/2016
338 9/25/2016
339 10/10/2016
340 11/9/2016
341 9/25/2016
342 10/10/2016
343 9/26/2016
344 11/10/2016
345 10/11/2016
[346 rows x 10 columns]
In [95]: mydata.drop(mydata.columns[[5,6]], axis=1, inplace=True)
In [96]: mydata.dtypes
Out[96]: Unnamed: 0 int64
https://dataplatform.cloud.ibmU.com/anadlytics0/no1tebooksi/v2t/963436fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 int64 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
int64
Unnamed: 0.1 int64 Upgrade 25/39
IBM WatsonlSotaund_isotatus int64
int64
Principal int64
terms int64
age
education
Gender
dtype: object
In [97]: feature_df = mydata[['loan_status', 'Principal', 'terms', 'age', 'education', 'Gender']]
X = np.asarray(feature_df)
X[0:5]
Out[97]: array([[ 1, 1000, 30, 45, 1, 1],
[ 1, 1000, 30, 33, 3, 2],
[ 1, 1000, 15, 27, 2, 1],
[ 1, 1000, 30, 28, 2, 2],
[ 1, 1000, 30, 29, 2, 1]])
In [90]: mydata['loan_status'] = mydata['loan_status'].astype('int')
y = np.asarray(mydata['loan_status'])
y [0:5]
Out[90]: array([1, 1, 1, 1, 1])
In [98]: X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
Train set: (276, 6) (276,)
Test set: (70, 6) (70,)
In [99]: from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
Out[99]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
In [100]: yhat clf predict(X test)
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
In [100]: yhat = clf.predict(X_test) Upgrade 26/39
IBM WatsonyShtautdi[o0:5]
Out[100]: array([1, 1, 1, 1, 1])
In [109]: from sklearn.metrics import classification_report, confusion_matrix
import itertools
In [111]: def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show
In [113]: cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
IBM WatsonpSrtiundtio(classification_report(y_test, yhat)) Upgrade Dr Mohammad Ayoub K… DK
# Plot non-normalized confusion matrix title='Confusion matri
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['PAIDOFF(1)','COLLECTION(2)'],normalize= False,
x')
precision recall f1-score support
1 0.90 1.00 0.95 55
15
2 1.00 0.60 0.75
avg / total 0.92 0.91 0.91 70
Confusion matrix, without normalization
[[9 0]
[0 0]]
In [114]: from sklearn.metrics import f1_score 27/39
f1_score(y_test, yhat, average='weighted')
Out[114]: 0.90578817733990147
https://dataIplatfo[rm11.c5lo]ud.ibmf.com/anaklyltics/notebookts/vi2/9336ifa5d-ddtb2-j4c18-b57de-048i1ai3dl20ei2dt/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
In [115]: from sklearn.metrics import jaccard_similarity_score
IBM WatsonjSatcucdairod_similarity_score(y_test, yhat) Upgrade Dr Mohammad Ayoub K… DK
Out[115]: 0.91428571428571426
Logistic Regression
In [116]: import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt
In [117]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv
--2019-01-30 11:36:35-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E
Nv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’
100%[======================================>] 23,101 --.-K/s in 0.002s
2019-01-30 11:36:35 (12.6 MB/s) - ‘loan_train.csv’ saved [23101/23101]
In [118]: loan_df = pd.read_csv("loan_train.csv")
loan_df.head()
Out[118]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender
00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male
12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 28/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson
23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male
IBM Watson 3Stu4dio 4 PAIDOFF
6 PAIDOFF 1000 30 9/9/2016 Up1g0ra/8d/e2016 28 college Dr Mohammad AfeymouableK… DK
46
1000 30 9/9/2016 10/8/2016 29 college male 29/39
In [119]: import pandas as pd
gender = {'male': 1,'female': 2}
loan_df.Gender = [gender[item] for item in loan_df.Gender]
loanstatus = {'PAIDOFF': 1,'COLLECTION': 2}
loan_df.loan_status = [loanstatus[item] for item in loan_df.loan_status]
education1 = {'High School or Below': 1, 'college': 2, 'Bechalor': 3, 'Master or Above': 4}
loan_df.education = [education1[item] for item in loan_df.education]
loan_df.drop(loan_df.columns[[5,6]], axis=1, inplace=True)
print(loan_df)
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms age education \
0 0 0 1 1000 30 45 1
1 2 2 1 1000 30 33 3
2 3 3 1 1000 15 27 2
3 4 4 1 1000 30 28 2
4 6 6 1 1000 30 29 2
5 7 7 1 1000 30 36 2
6 8 8 1 1000 30 28 2
7 9 9 1 800 15 26 2
8 10 10 1 300 7 29 2
9 11 11 1 1000 15 39 1
10 12 12 1 1000 30 26 2
11 13 13 1 900 7 26 2
12 14 14 1 1000 7 27 1
13 15 15 1 800 15 26 2
14 16 16 1 1000 30 40 1
15 17 17 1 1000 15 32 1
16 18 18 1 1000 30 32 1
17 19 19 1 800 30 26 2
18 20 20 1 1000 30 26 2
19 22 22 1 1000 30 25 1
20 23 23 1 1000 15 26 2
21 25 25 1 1000 30 29 1
22 26 26 1 800 15 39 3
23 27 27 1 1000 15 34 3
24 28 28 1 1000 30 31 2
https://dataplatform.cloud.ibm2.5com/analytics/noteb2o9oks/v2/9336fa5d-ddb22-94c18-b57e-0481a3d210e2d/view?a1c0ce0s0s_token=f3105d66bb38392f52891c7e4c5d2d4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
25 29 29 1 1000 30 33 2
IBM Watson2S6tudio 30
31 30 1 800 15 33 Upgrade 1 Dr Mohammad Ayoub K… DK
27 32 31 1 1000 30 37 2
28 33 30/39
29 ... 32 1 1000 30 27 2
.. 367
316 368 33 1 1000 30 37 2
317 371
318 372 ... ... ... ... ... ...
319 373
320 374 367 2 800 15 28 3
321 375
322 376 368 2 1000 30 24 2
323 377
324 378 371 2 1000 30 18 2
325 379
326 380 372 2 800 15 25 1
327 381
328 382 373 2 1000 15 40 1
329 383
330 384 374 2 1000 30 29 2
331 385
332 386 375 2 800 15 26 1
333 387
334 388 376 2 1000 15 30 2
335 389
336 390 377 2 1000 30 33 2
337 391
338 392 378 2 1000 30 30 2
339 393
340 394 379 2 1000 30 32 2
341 395
342 397 380 2 1000 30 25 1
343 398
344 399 381 2 800 15 35 1
345
382 2 1000 15 30 3
383 2 1000 30 26 1
384 2 1000 30 29 1
385 2 1000 30 26 1
386 2 800 15 46 1
387 2 1000 30 36 1
388 2 1000 15 38 3
389 2 1000 15 32 1
390 2 1000 15 30 2
391 2 800 15 35 1
392 2 1000 30 29 2
393 2 1000 30 26 2
394 2 800 15 32 1
395 2 1000 30 25 1
397 2 800 15 39 2
398 2 1000 30 28 2
399 2 1000 30 26 2
Gender
01
12
21
32
41
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
41
IBM Watson65Studio 1 Upgrade Dr Mohammad Ayoub K… DK
1
31/39
71
81
91
10 1
11 2
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 1
25 1
26 1
27 1
28 1
29 1
.. ...
316 1
317 1
318 1
319 1
320 1
321 1
322 2
323 1
324 1
325 1
326 1
327 1
328 1
329 1
330 1
331 1
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 331 1 Machine Learning - Project Notebook - IBM Watson
1
IBM Watson33S33t23udio 1 Upgrade Dr Mohammad Ayoub K… DK
1
334 1 32/39
1
335 1
1
336 2
1
337 1
1
338 1
1
339 1
340
341
342
343
344
345
[346 rows x 8 columns]
In [120]: loan_df.dtypes
Out[120]: Unnamed: 0 int64
Unnamed: 0.1 int64
loan_status int64
Principal int64
terms int64
age int64
education int64
Gender int64
dtype: object
In [121]: X = np.asarray(loan_df[['Principal', 'terms', 'age', 'education', 'Gender']])
X[0:5]
Out[121]: array([[1000, 30, 45, 1, 1],
[1000, 30, 33, 3, 2],
[1000, 15, 27, 2, 1],
[1000, 30, 28, 2, 2],
[1000, 30, 29, 2, 1]])
In [122]: y = np.asarray(loan_df['loan_status'])
y [0:5]
https://dataplatform.cloud.ibm.com/a(nalytics/notebooks/v2/9336)fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
Out[122]: array([1, 1, 1, 1, 1])
IBM Watson Studio Upgrade Dr Mohammad Ayoub K… DK
In [123]: from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/sklearn/utils/validation.py:475: DataConversionWarning:
Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
Out[123]: array([[ 0.52, 0.92, 2.33, -1. , -0.42],
[ 0.52, 0.92, 0.34, 1.84, 2.38],
[ 0.52, -0.96, -0.65, 0.42, -0.42],
[ 0.52, 0.92, -0.49, 0.42, 2.38],
[ 0.52, 0.92, -0.32, 0.42, -0.42]])
In [124]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
Train set: (276, 5) (276,)
Test set: (70, 5) (70,)
In [125]: from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
Out[125]: LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
In [126]: yhat = LR.predict(X_test)
yhat
Out[126]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1])
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 33/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson
In [127]: yhat_prob = LR.predict_proba(X_test) Upgrade Dr Mohammad Ayoub K… DK
IBM WatsonyShtautd_iporob
34/39
Out[127]: array([[ 0.56, 0.44],
[ 0.62, 0.38],
[ 0.6 , 0.4 ],
[ 0.55, 0.45],
[ 0.58, 0.42],
[ 0.59, 0.41],
[ 0.58, 0.42],
[ 0.59, 0.41],
[ 0.55, 0.45],
[ 0.58, 0.42],
[ 0.56, 0.44],
[ 0.57, 0.43],
[ 0.68, 0.32],
[ 0.56, 0.44],
[ 0.63, 0.37],
[ 0.66, 0.34],
[ 0.54, 0.46],
[ 0.6 , 0.4 ],
[ 0.56, 0.44],
[ 0.58, 0.42],
[ 0.63, 0.37],
[ 0.57, 0.43],
[ 0.55, 0.45],
[ 0.61, 0.39],
[ 0.67, 0.33],
[ 0.56, 0.44],
[ 0.56, 0.44],
[ 0.7 , 0.3 ],
[ 0.56, 0.44],
[ 0.67, 0.33],
[ 0.6 , 0.4 ],
[ 0.61, 0.39],
[ 0.61, 0.39],
[ 0.58, 0.42],
[ 0.68, 0.32],
[ 0.61, 0.39],
[ 0.56, 0.44],
[ 0.62, 0.38],
[ 0.62, 0.38],
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 0.39], Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
0.44],
[ 0.61, 0.42], Upgrade 35/39
IBM Watson Studio [ 0.56, 0.38],
0.44],
[ 0.58, 0.4 ],
[ 0.62, 0.43],
[ 0.56, 0.4 ],
[ 0.6 , 0.43],
[ 0.57, 0.39],
[ 0.6 , 0.39],
[ 0.57, 0.37],
[ 0.61, 0.39],
[ 0.61, 0.39],
[ 0.63, 0.42],
[ 0.61, 0.36],
[ 0.61, 0.33],
[ 0.58, 0.41],
[ 0.64, 0.36],
[ 0.67, 0.4 ],
[ 0.59, 0.44],
[ 0.64, 0.35],
[ 0.6 , 0.43],
[ 0.56, 0.39],
[ 0.65, 0.46],
[ 0.57, 0.42],
[ 0.61, 0.43],
[ 0.54, 0.43],
[ 0.58, 0.34],
[ 0.57, 0.38],
[ 0.57, 0.42]])
[ 0.66,
[ 0.62,
[ 0.58,
In [128]: from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
Out[128]: 0.7857142857142857
In [131]: from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix'
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
IBM Watson Studio title= Confusion matrix ,
cmap=plt.cm.Blues): Upgrade Dr Mohammad Ayoub K… DK
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
print(confusion_matrix(y_test, yhat, labels=[1,0]))
[[55 0]
[ 0 0]]
In [132]: cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)
plt.figure() 36/39
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')
Confusion matrix, without normalization
[[55 0]
[ 0 0]]
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 [ ]] Machine Learning - Project Notebook - IBM Watson
IBM Watson Studio Upgrade Dr Mohammad Ayoub K… DK
In [137]: print (classification_report(y_test, yhat))
precision recall f1-score support
1 0.79 1.00 0.88 55
15
2 0.00 0.00 0.00
avg / total 0.62 0.79 0.69 70
/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricW
arning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
In [136]: from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)
Out[136]: 0.60097718399940614
Model Evaluation using Test set 37/39
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592
2/19/2019 Machine Learning - Project Notebook - IBM Watson
IInBM[1 W33a]ts:onfSrtoumdisoklearn.metrics import jaccard_similarity_score Upgrade Dr Mohammad Ayoub K… DK
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
First, download and load the test set:
In [134]: !wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_test.csv
--2019-01-30 11:46:37-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E
Nv3/labs/loan_test.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3642 (3.6K) [text/csv]
Saving to: ‘loan_test.csv’
100%[======================================>] 3,642 --.-K/s in 0s
2019-01-30 11:46:37 (616 MB/s) - ‘loan_test.csv’ saved [3642/3642]
Load Test set for evaluation
In [135]: test_df = pd.read_csv('loan_test.csv')
test_df.head()
Out[135]:
Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender
01 1 PAIDOFF 1000 30 9/8/2016 10/7/2016 50 Bechalor female
15 5 PAIDOFF 300 7 9/9/2016 9/15/2016 35 Master or Above male
2 21 21 PAIDOFF 1000 30 9/10/2016 10/9/2016 43 High School or Below female
3 24 24 PAIDOFF 1000 30 9/10/2016 10/9/2016 26 college male
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 38/39
2/19/2019 Machine Learning - Project Notebook - IBM Watson
4 35 35 PAIDOFF 800 15 9/11/2016 9/25/2016 29 Bechalor male
IBM Watson Studio Upgrade Dr Mohammad Ayoub K… DK
In [ ]:
In [ ]:
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 39/39