The words you are searching are inside this book. To get more targeted content, please make full-text search by clicking here.

Machine Learning - Project Notebook - IBM Watson

Discover the best professional documents and content resources in AnyFlip Document Base.
Search
Published by Dr Ayoub Khan, 2019-03-05 01:01:34

Machine Learning - Project Notebook - IBM Watson

Machine Learning - Project Notebook - IBM Watson

2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

IBM Watson Studio Upgrade

Machine Learning - Project Notebook

VERSION AUTHOR LAST UPDATED LANGUAGE

Jeyashri Ganesan 30 Jan 2019, 5:29 PM Python 3.5

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 1/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

IBM Watson Studio Upgrade

(https://www.bigdatauniversity.com)

Classification with Python

In this notebook we try to practice all the classification algorithms that we learned in this course.

We load a dataset using Pandas library, and apply the following algorithms, and find the best one for this specific dataset by accuracy evaluation methods.

Lets first load required libraries:

In [2]: import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

About dataset

This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loan are already paid off or defaulted. It includes
following fields:

Field Description

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 2/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson

IBM Watson Studio Loan_status Whether a loan is paid off on in collection Upgrade Dr Mohammad Ayoub K… DK
Principal Basic principal loan amount at the

Terms Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule

Effective_date When the loan got originated and took effects

Due_date Since it’s one-time payoff schedule, each loan has one single due date

Age Age of applicant

Education Education of applicant

Gender The gender of applicant

Lets download the dataset

In [3]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv

--2019-01-30 10:39:14-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E
Nv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’

100%[======================================>] 23,101 --.-K/s in 0.002s

2019-01-30 10:39:14 (13.3 MB/s) - ‘loan_train.csv’ saved [23101/23101]

Load Data From CSV File 3/39

In [4]: df = pd.read_csv('loan_train.csv')
df.head()

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

Out[4]: Dr Meodhuacmatmioand AGyeonudbeKr… DK
IBM Watson StuUdnionamed: 0 Unnamed: 0.1 loan_status Principal terms effective_dateUpdgruaed_edate age

00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male

12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female

23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male

34 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female

46 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male

In [5]: df.shape
Out[5]: (346, 10)

Convert to date time object

In [6]: df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()

Out[6]:

Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender

00 0 PAIDOFF 1000 30 2016-09-08 2016-10-07 45 High School or Below male

12 2 PAIDOFF 1000 30 2016-09-08 2016-10-07 33 Bechalor female

23 3 PAIDOFF 1000 15 2016-09-08 2016-09-22 27 college male

34 4 PAIDOFF 1000 30 2016-09-09 2016-10-08 28 college female

46 6 PAIDOFF 1000 30 2016-09-09 2016-10-08 29 college male

Data visualization and pre-processing 4/39

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

Let’s see how many of each class is in our data set Upgrade
IBM Watson Studio

In [7]: df['loan_status'].value_counts()

Out[7]: PAIDOFF 260

COLLECTION 86

Name: loan_status, dtype: int64

260 people have paid off the loan on time while 86 have gone into collection

Lets plot some columns to underestand data better:

In [8]: # notice: installing seaborn might takes a few minutes
!conda install -c anaconda seaborn -y
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

seaborn: 0.8.0-py35h15a2772_0 --> 0.9.0-py35_0 anaconda

seaborn-0.9.0- 100% |################################| Time: 0:00:00 6.79 MB/s

In [9]: import seaborn as sns

bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 5/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

IBM Watson Studio Upgrade

In [13]: bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

Pre-processing: Feature selection/extraction 6/39

Lets look at the day of the week people get the loan

In [14]: df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan status", palette="Set1", col wrap=2)

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 g sns.FacetGrid(df, col Gender , hue Mlaocahnin_esLteaartniunsg - ,PropjeactlNeottetbeook S- IeBtM1W,atsocnol_wrap 2)

g.map(plt.hist, 'dayofweek', bins=bins, ec="k") Upgrade Dr Mohammad Ayoub K… DK
IBM WatsongS.tauxdeiso[-1].legend()

plt.show()

We see that people who get the loan at the end of the week dont pay it off, so lets use Feature binarization to set a threshold values less then day 4

In [15]: df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
df.head()

Out[15]:

Unnamed: Unnamed: loan_status Principal terms effective_date due_date age education Gender dayofweek weekend
0 0.1

00 0 PAIDOFF 1000 30 2016-09-08 2016-10- 45 High male 3 0
07 School or
Below

12 2 PAIDOFF 1000 30 2016-09-08 2016-10- 33 Bechalor female 3 0
07

23 3 PAIDOFF 1000 15 2016-09-08 2016-09- 27 college male 3 0
22

34 4 PAIDOFF 1000 30 2016-09-09 2016-10- 28 college female 4 1
08

https://dataplatform.cloud.ibm4.com6/analytics/noteb6ooks/v2/9336fa5PdA-dIdDb2O-4FcF18-b57e1-0040801a3d20e23d0/view?ac2c0es1s6_t-o0k9en-0=f915d66bb02808912f652-18901-c7e24c95dd4cdod4ll5ef0gfe697d9ffc0me6a5lbeb4bb7d49199ebad4592 1 7/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson g DK
Dr Mohammad Ayoub K…
IBM Watson Studio 08

Upgrade

Convert Categorical features to numerical values

Lets look at gender:

In [16]: df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)

Out[16]: Gender loan_status

female PAIDOFF 0.865385

COLLECTION 0.134615

male PAIDOFF 0.731293

COLLECTION 0.268707

Name: loan_status, dtype: float64

86 % of female pay there loans while only 73 % of males pay there loan

Lets convert male to 0 and female to 1:

In [17]: df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
df.head()

Out[17]:

Unnamed: Unnamed: loan_status Principal terms effective_date due_date age education Gender dayofweek weekend
0 0.1

00 0 PAIDOFF 1000 30 2016-09-08 2016-10- 45 High 0 3 0
07 School or
Below

12 2 PAIDOFF 1000 30 2016-09-08 2016-10- 33 Bechalor 1 3 0
07

23 3 PAIDOFF 1000 15 2016-09-08 2016-09- 27 college 0 3 0
22

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 8/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson

IBM Watson 3Stu4dio 4 PAIDOFF 1000 30 2016-09-09 02081U6p-1g0ra-de28 college 1Dr Moham4mad Ayoub K1… DK
46 6 PAIDOFF 1000
30 2016-09-09 2016-10- 29 college 04 1
08

One Hot Encoding

How about education?

In [18]: df.groupby(['education'])['loan_status'].value_counts(normalize=True)

Out[18]: education loan_status

Bechalor PAIDOFF 0.750000
0.250000
COLLECTION 0.741722
0.258278
High School or Below PAIDOFF 0.500000
0.500000
COLLECTION 0.765101
0.234899
Master or Above COLLECTION

PAIDOFF

college PAIDOFF

COLLECTION

Name: loan_status, dtype: float64

Feature befor One Hot Encoding

In [19]: df[['Principal','terms','age','Gender','education']].head()

Out[19]:

Principal terms age Gender education

0 1000 30 45 0 High School or Below

1 1000 30 33 1 Bechalor

2 1000 15 27 0 college

3 1000 30 28 1 college

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 9/39

2/19/2019 30 29 0 college Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

IBM Watson 4Stu1d0i0o0 Upgrade

Use one hot encoding technique to conver categorical varables to binary variables and append them to the feature Data Frame

In [20]: Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()

Out[20]:

Principal terms age Gender weekend Bechalor High School or Below college

0 1000 30 45 0 0 0 1 0

1 1000 30 33 1 0 1 0 0

2 1000 15 27 0 0 0 0 1

3 1000 30 28 1 1 0 0 1

4 1000 30 29 0 1 0 0 1

Feature selection

Lets defind feature sets, X:

In [21]: X = Feature
X[0:5]

Out[21]:

Principal terms age Gender weekend Bechalor High School or Below college

0 1000 30 45 0 0 0 1 0

1 1000 30 33 1 0 1 0 0

2 1000 15 27 0 0 0 0 1

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 10/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson

IBM Watson 3Stu1d0i0o0 30 28 1 1 00 Upg1rade Dr Mohammad Ayoub K… DK
4 1000 30 29 0 1 00 1

What are our lables?

In [22]: y = df['loan_status'].values
y[0:5]

Out[22]: array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)

Normalize Data

Data Standardization give data zero mean and unit variance (technically should be done after train test split )

In [23]: X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

Out[23]: array([[ 0.51578458, 0.92071769, 2.33152555, -0.42056004, -1.20577805,
-0.38170062, 1.13639374, -0.86968108],

[ 0.51578458, 0.92071769, 0.34170148, 2.37778177, -1.20577805,
2.61985426, -0.87997669, -0.86968108],

[ 0.51578458, -0.95911111, -0.65321055, -0.42056004, -1.20577805,
-0.38170062, -0.87997669, 1.14984679],

[ 0.51578458, 0.92071769, -0.48739188, 2.37778177, 0.82934003,
-0.38170062, -0.87997669, 1.14984679],

[ 0.51578458, 0.92071769, -0.3215732 , -0.42056004, 0.82934003,
-0.38170062, -0.87997669, 1.14984679]])

Classification

Now, it is your turn, use the training set to build an accurate model. Then use the test set to report the accuracy of the model You should use the following 11/39

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

algorithm: Upgrade
IBM Watson Studio
K Nearest Neighbor(KNN)
Decision Tree
Support Vector Machine
Logistic Regression

Notice:

You can go above and change the pre-processing, feature selection, feature-extraction, and so on, to make a better model.
You should use either scikit-learn, Scipy or Numpy libraries for developing the classification algorithms.
You should include the code of the algorithm in the following cells.

K Nearest Neighbor(KNN)

Notice: You should find the best k to build the model with the best accuracy.
warning: You should not use the loan_test.csv for finding the best k, however, you can split your train_loan.csv into train and test to find the best k.

In [156]: import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

In [157]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv

--2019-01-30 11:57:45-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E 12/39
Nv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Length: 23101 (23K) [text/csv] Machine Learning - Project Notebook - IBM Watson

IBM WatsonSSatvuidnigo to: ‘loan_train.csv’ Upgrade Dr Mohammad Ayoub K… DK

100%[======================================>] 23,101 --.-K/s in 0.002s

2019-01-30 11:57:45 (14.3 MB/s) - ‘loan_train.csv’ saved [23101/23101]

In [158]: df = pd.read_csv('loan_train.csv')
df.head()

Out[158]:

Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender

00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male

12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female

23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male

34 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female

46 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male

In [159]: X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/sklearn/utils/validation.py:475: DataConversionWarning:
Data with input dtype object was converted to float64 by StandardScaler.

warnings.warn(msg, DataConversionWarning)

Out[159]: array([[ 0.58, 0.52, 0.92, 2.6 , -0.42, 2.33, -0.65, 0.42],
[ 0.58, 0.52, 0.92, 2.6 , -0.42, 0.34, -1.52, -2.38],
[ 0.58, 0.52, -0.96, 2.6 , 0.72, -0.65, 1.1 , 0.42],
[ 0.58, 0.52, 0.92, 3.39, -0.31, -0.49, 1.1 , -2.38],
[ 0.58, 0.52, 0.92, 3.39, -0.31, -0.32, 1.1 , 0.42]])

In [160]: from sklearn.model_selection import train_test_split 13/39
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)

Train set: (276, 8) (276,)
Test set: (70 8) (70 )

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

Test set: (70, 8) (70,)

IBM Watson Studio Upgrade Dr Mohammad Ayoub K… DK
In [161]: from sklearn.neighbors import KNeighborsClassifier

In [162]: k = 6
#Train Model and Predict
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

Out[162]: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=6, p=2,
weights='uniform')

In [163]: yhat = neigh.predict(X_test)
yhat[0:5]

Out[163]: array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)

In [164]: from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

Train set Accuracy: 0.996376811594
Test set Accuracy: 1.0

In [165]: from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)

Out[165]: 1.0

In [166]: print (classification_report(y_test, yhat))
precision recall f1-score support

COLLECTION 1.00 1.00 1.00 15
PAIDOFF 1.00 1.00 1.00 55

avg / total 1.00 1.00 1.00 70

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 14/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

DeIBcM iWsaitsoonnStTudrioee Upgrade

In [139]: import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [140]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv

--2019-01-30 11:55:38-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E
Nv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’

100%[======================================>] 23,101 --.-K/s in 0.002s

2019-01-30 11:55:38 (14.7 MB/s) - ‘loan_train.csv’ saved [23101/23101]

In [141]: my_data = pd.read_csv("loan_train.csv", delimiter=",")
my_data[0:5]

Out[141]:

Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender

00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male

12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female

23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male

34 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female

46 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male

In [142]: X = my_data[['loan_status', 'Principal', 'terms', 'effective_date', 'due_date', 'age', 'education', 'Gender']].val

ues

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 15/39

2/19/2019 ues Machine Learning - Project Notebook - IBM Watson

X[0:7] Upgrade Dr Mohammad Ayoub K… DK
IBM Watson Studio

Out[142]: array([['PAIDOFF', 1000, 30, '9/8/2016', '10/7/2016', 45,

'High School or Below', 'male'],

['PAIDOFF', 1000, 30, '9/8/2016', '10/7/2016', 33, 'Bechalor',

'female'],

['PAIDOFF', 1000, 15, '9/8/2016', '9/22/2016', 27, 'college', 'male'],

['PAIDOFF', 1000, 30, '9/9/2016', '10/8/2016', 28, 'college',

'female'],

['PAIDOFF', 1000, 30, '9/9/2016', '10/8/2016', 29, 'college', 'male'],

['PAIDOFF', 1000, 30, '9/9/2016', '10/8/2016', 36, 'college', 'male'],

['PAIDOFF', 1000, 30, '9/9/2016', '10/8/2016', 28, 'college', 'male']], dtype=object)

In [143]: from sklearn import preprocessing
le_loan_date = preprocessing.LabelEncoder()
le_loan_date.fit(['COLLECTION', 'PAIDOFF'])
X[:,0] = le_loan_date.transform(X[:,0])

le_eff_date = preprocessing.LabelEncoder()
le_eff_date.fit(['9/8/2016', '9/9/2016', '9/10/2016', '9/11/2016', '9/12/2016', '9/13/2016', '9/14/2016'])
X[:,3] = le_eff_date.transform(X[:,3])

le_due_date = preprocessing.LabelEncoder()
le_due_date.fit(['9/16/2016', '9/17/2016', '9/18/2016', '9/19/2016', '9/22/2016', '9/23/2016', '9/24/2016', '9/25/
2016', '9/26/2016', '9/27/2016', '9/28/2016',

'10/7/2016', '10/8/2016', '10/9/2016', '10/10/2016', '10/11/2016', '10/12/2016', '10/13/2016', '1
0/25/2016', '10/26/2016', '11/9/2016', '11/10/2016',

'11/12/2016'])
X[:,4] = le_due_date.transform(X[:,4])

le_education = preprocessing.LabelEncoder()
le_education.fit(['Bechalor', 'High School or Below', 'college', 'Master or Above'])
X[:,6] = le_education.transform(X[:,6])

le_Gender = preprocessing.LabelEncoder()
le_Gender.fit(['female', 'male'])
X[:,7] = le_Gender.transform(X[:,7])

X[0:7] 16/39

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

Out[143]: array([[1, 1000, 30, 5, 6, 45, 1, 1], Upgrade Dr Mohammad Ayoub K… DK
IBM Watson Studio [1, 1000, 30, 5, 6, 33, 0, 0],
17/39
[1, 1000, 15, 5, 16, 27, 3, 1],
[1, 1000, 30, 6, 7, 28, 3, 0],
[1, 1000, 30, 6, 7, 29, 3, 1],
[1, 1000, 30, 6, 7, 36, 3, 1],
[1, 1000, 30, 6, 7, 28, 3, 1]], dtype=object)

In [144]: y = my_data["loan_status"]
y[0:7]

Out[144]: 0 PAIDOFF
1 PAIDOFF
2 PAIDOFF
3 PAIDOFF
4 PAIDOFF
5 PAIDOFF
6 PAIDOFF
Name: loan_status, dtype: object

In [145]: from sklearn.model_selection import train_test_split

In [146]: X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

In [147]: X_trainset.shape
y_trainset.shape

Out[147]: (242,)

In [148]: X_testset.shape
y_testset.shape

Out[148]: (104,)

In [149]: LoanTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
LoanTree

Out[149]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min samples leaf=1 min samples split=2

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

IBM Watson Studio min_samples_leaf=1, min_samples_split=2,

min_weight_fraction_leaf=0.0, presort=False, random_stUaptger=aNdoene, Dr Mohammad Ayoub K… DK
splitter='best')
18/39
In [150]: LoanTree.fit(X_trainset,y_trainset)

Out[150]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')

In [151]: predTree = LoanTree.predict(X_testset)

In [152]: print (predTree [0:5])
print (y_testset [0:5])

['PAIDOFF' 'PAIDOFF' 'COLLECTION' 'COLLECTION' 'PAIDOFF']
73 PAIDOFF
24 PAIDOFF
282 COLLECTION
295 COLLECTION
163 PAIDOFF
Name: loan_status, dtype: object

In [153]: from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))

DecisionTrees's Accuracy: 1.0

In [154]: print (classification_report(y_test, yhat))
precision recall f1-score support

1 0.79 1.00 0.88 55
15
2 0.00 0.00 0.00

avg / total 0.62 0.79 0.69 70

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricW DK

IBM WatsonaSrtnuidnigo: Precision and F-score are ill-defined and being set to 0.0Upignraldaebels with no preDdriMctoehdamsmamapdlAeyso.ub K…
'precision', 'predicted', average, warn_for)

In [155]: from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)

Out[155]: 0.7857142857142857

In [112]:

---------------------------------------------------------------------------

NameError Traceback (most recent call last)

<ipython-input-112-aa589220b1ce> in <module>()

3 featureNames = my_data.columns[0:5]

4 targetNames = my_data["loan_status"].unique().tolist()

----> 5 out=tree.export_graphviz(LoanTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y

_trainset), filled=True, special_characters=True,rotate=False)

6 graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

7 graph.write_png(filename)

NameError: name 'tree' is not defined

Support Vector Machine

In [50]: import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt

In [51]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv

--2019-01-30 10:47:39-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E 19/39
Nv3/labs/loan train csv

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Nv3/labs/loan_train.csv Machine Learning - Project Notebook - IBM Watson

Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193 DK
IBM WatsonCSotnundeicoting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.usU-pggeroa.doebjectstorage.sofDtrlMayoehra.mnmeta)d|A6y7o.u2b28K.…254.19

3|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: 23101 (23K) [text/csv]

Saving to: ‘loan_train.csv’

100%[======================================>] 23,101 --.-K/s in 0.002s

2019-01-30 10:47:39 (11.3 MB/s) - ‘loan_train.csv’ saved [23101/23101]

In [53]: loan_df = pd.read_csv("loan_train.csv")
loan_df.head()

Out[53]:

Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender

00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male

12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female

23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male

34 4 PAIDOFF 1000 30 9/9/2016 10/8/2016 28 college female

46 6 PAIDOFF 1000 30 9/9/2016 10/8/2016 29 college male

In [58]: ax = loan_df[loan_df['loan_status'] == 'PAIDOFF'][0:50].plot(kind='scatter', x='age', y='Principal', color='DarkBl
ue', label='Paidoff');
loan_df[loan_df['loan_status'] == 'COLLECTION'][0:50].plot(kind='scatter', x='age', y='Principal', color='Yellow',
label='Collection', ax=ax);
plt.show()

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 20/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

IBM Watson Studio Upgrade

In [59]: loan_df.dtypes int64
int64
Out[59]: Unnamed: 0 object
Unnamed: 0.1 int64
loan_status int64
Principal object
terms object
effective_date int64
due_date object
age object
education
Gender
dtype: object

In [86]: import pandas as pd
file_handler = open("loan_train.csv", 'r')
mydata = pd.read_csv(file_handler, sep = ",")
file_handler.close()
gender = {'male': 1,'female': 2}
mydata.Gender = [gender[item] for item in mydata.Gender]
loanstatus = {'PAIDOFF': 1,'COLLECTION': 2}
mydata.loan_status = [loanstatus[item] for item in mydata.loan_status]
education1 = {'High School or Below': 1, 'college': 2, 'Bechalor': 3, 'Master or Above': 4}
mydata.education = [education1[item] for item in mydata.education]
print(mydata)

Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date \

0 0 0 1 1000 30 9/8/2016

1 2 2 1 1000 30 9/8/2016

2 3 3 1 1000 15 9/8/2016

3 4 4 1 1000 30 9/9/2016

4 6 6 1 1000 30 9/9/2016

5 7 7 1 1000 30 9/9/2016

6 8 8 1 1000 30 9/9/2016
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 21/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson
6 8 8 1 1000 30 9/9/2016

IBM Watson78Studio 9 9 1 800 15 U99p//g11r00a//d22e001166 Dr Mohammad Ayoub K… DK
10 10 1 300 7
22/39
9 11 11 1 1000 15 9/10/2016

10 12 12 1 1000 30 9/10/2016

11 13 13 1 900 7 9/10/2016

12 14 14 1 1000 7 9/10/2016

13 15 15 1 800 15 9/10/2016

14 16 16 1 1000 30 9/10/2016

15 17 17 1 1000 15 9/10/2016

16 18 18 1 1000 30 9/10/2016

17 19 19 1 800 30 9/10/2016

18 20 20 1 1000 30 9/10/2016

19 22 22 1 1000 30 9/10/2016

20 23 23 1 1000 15 9/10/2016

21 25 25 1 1000 30 9/10/2016

22 26 26 1 800 15 9/10/2016

23 27 27 1 1000 15 9/10/2016

24 28 28 1 1000 30 9/11/2016

25 29 29 1 1000 30 9/11/2016

26 30 30 1 800 15 9/11/2016

27 31 31 1 1000 30 9/11/2016

28 32 32 1 1000 30 9/11/2016

29 33 33 1 1000 30 9/11/2016

.. ... ... ... ... ... ...

316 367 367 2 800 15 9/11/2016

317 368 368 2 1000 30 9/11/2016

318 371 371 2 1000 30 9/11/2016

319 372 372 2 800 15 9/11/2016

320 373 373 2 1000 15 9/11/2016

321 374 374 2 1000 30 9/11/2016

322 375 375 2 800 15 9/11/2016

323 376 376 2 1000 15 9/11/2016

324 377 377 2 1000 30 9/11/2016

325 378 378 2 1000 30 9/11/2016

326 379 379 2 1000 30 9/11/2016

327 380 380 2 1000 30 9/11/2016

328 381 381 2 800 15 9/11/2016

329 382 382 2 1000 15 9/11/2016

330 383 383 2 1000 30 9/11/2016

331 384 384 2 1000 30 9/11/2016

332 385 385 2 1000 30 9/11/2016

333 386 386 2 800 15 9/11/2016

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 333 386 386 M2achine Learn8in0g0- Project N1o5tebook - IB9M/W11at/so2n016
387 387
334 388 388 2 1000 30 9/11/2016
IBM Watson3S3t5udio 389 389 U9p/g1r1a/d2e016 Dr Mohammad Ayoub K… DK
390 390 2 1000 15
391 391 23/39
336 392 392 2 1000 15 9/11/2016
393 393
337 394 394 2 1000 15 9/11/2016
395 395
338 397 397 2 800 15 9/11/2016
398 398
339 399 399 2 1000 30 9/11/2016

340 2 1000 30 9/11/2016

341 2 800 15 9/11/2016

342 2 1000 30 9/11/2016

343 2 800 15 9/12/2016

344 2 1000 30 9/12/2016

345 2 1000 30 9/12/2016

due_date age education Gender

0 10/7/2016 45 11

1 10/7/2016 33 32

2 9/22/2016 27 21

3 10/8/2016 28 22

4 10/8/2016 29 21

5 10/8/2016 36 21

6 10/8/2016 28 21

7 9/24/2016 26 21

8 9/16/2016 29 21

9 10/9/2016 39 11

10 10/9/2016 26 21

11 9/16/2016 26 22

12 9/16/2016 27 11

13 9/24/2016 26 21

14 10/9/2016 40 11

15 9/24/2016 32 11

16 10/9/2016 32 11

17 10/9/2016 26 21

18 10/9/2016 26 21

19 10/9/2016 25 11

20 9/24/2016 26 21

21 10/9/2016 29 11

22 9/24/2016 39 31

23 9/24/2016 34 31

24 10/10/2016 31 21

25 10/10/2016 33 21

26 9/25/2016 33 11

27 10/10/2016 37 21

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 0/ 0/ 0 6 3 Machine Learning - Project Notebook - IBM Watson
27
28 10/10/2016 37 21
IBM Watson2S9tudi1o0/10/2016 ... 21 Upgrade Dr Mohammad Ayoub K… DK
28 ... ...
.. ... 24 31 24/39
18 21
316 9/25/2016 25 21
40 11
317 10/10/2016 29 11
26 21
318 10/10/2016 30 12
33 21
319 9/25/2016 30 21
32 21
320 9/25/2016 25 21
35 11
321 10/10/2016 30 11
26 31
322 9/25/2016 29 11
26 11
323 9/25/2016 46 11
36 11
324 10/10/2016 38 11
32 31
325 10/10/2016 30 11
35 21
326 10/10/2016 29 11
26 22
327 10/10/2016 32 21
25 11
328 9/25/2016 39 11
28 21
329 9/25/2016 26 21
21
330 10/10/2016

331 10/10/2016

332 11/9/2016

333 9/25/2016

334 10/10/2016

335 9/25/2016

336 10/25/2016

337 9/25/2016

338 9/25/2016

339 10/10/2016

340 11/9/2016

341 9/25/2016

342 10/10/2016

343 9/26/2016

344 11/10/2016

345 10/11/2016

[346 rows x 10 columns]

In [95]: mydata.drop(mydata.columns[[5,6]], axis=1, inplace=True)

In [96]: mydata.dtypes

Out[96]: Unnamed: 0 int64

https://dataplatform.cloud.ibmU.com/anadlytics0/no1tebooksi/v2t/963436fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 int64 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
int64
Unnamed: 0.1 int64 Upgrade 25/39
IBM WatsonlSotaund_isotatus int64
int64
Principal int64
terms int64
age
education
Gender
dtype: object

In [97]: feature_df = mydata[['loan_status', 'Principal', 'terms', 'age', 'education', 'Gender']]
X = np.asarray(feature_df)
X[0:5]

Out[97]: array([[ 1, 1000, 30, 45, 1, 1],
[ 1, 1000, 30, 33, 3, 2],
[ 1, 1000, 15, 27, 2, 1],
[ 1, 1000, 30, 28, 2, 2],
[ 1, 1000, 30, 29, 2, 1]])

In [90]: mydata['loan_status'] = mydata['loan_status'].astype('int')
y = np.asarray(mydata['loan_status'])
y [0:5]

Out[90]: array([1, 1, 1, 1, 1])

In [98]: X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)

Train set: (276, 6) (276,)
Test set: (70, 6) (70,)

In [99]: from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)

Out[99]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

In [100]: yhat clf predict(X test)

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK

In [100]: yhat = clf.predict(X_test) Upgrade 26/39
IBM WatsonyShtautdi[o0:5]

Out[100]: array([1, 1, 1, 1, 1])

In [109]: from sklearn.metrics import classification_report, confusion_matrix
import itertools

In [111]: def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):

if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")

else:
print('Confusion matrix, without normalization')

print(cm)

plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)

fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):

plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show

In [113]: cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

IBM WatsonpSrtiundtio(classification_report(y_test, yhat)) Upgrade Dr Mohammad Ayoub K… DK

# Plot non-normalized confusion matrix title='Confusion matri
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['PAIDOFF(1)','COLLECTION(2)'],normalize= False,
x')

precision recall f1-score support

1 0.90 1.00 0.95 55
15
2 1.00 0.60 0.75

avg / total 0.92 0.91 0.91 70

Confusion matrix, without normalization
[[9 0]

[0 0]]

In [114]: from sklearn.metrics import f1_score 27/39
f1_score(y_test, yhat, average='weighted')

Out[114]: 0.90578817733990147

https://dataIplatfo[rm11.c5lo]ud.ibmf.com/anaklyltics/notebookts/vi2/9336ifa5d-ddtb2-j4c18-b57de-048i1ai3dl20ei2dt/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

In [115]: from sklearn.metrics import jaccard_similarity_score

IBM WatsonjSatcucdairod_similarity_score(y_test, yhat) Upgrade Dr Mohammad Ayoub K… DK

Out[115]: 0.91428571428571426

Logistic Regression

In [116]: import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt

In [117]: !wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_train.csv

--2019-01-30 11:36:35-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E
Nv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’

100%[======================================>] 23,101 --.-K/s in 0.002s

2019-01-30 11:36:35 (12.6 MB/s) - ‘loan_train.csv’ saved [23101/23101]

In [118]: loan_df = pd.read_csv("loan_train.csv")
loan_df.head()

Out[118]:

Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender

00 0 PAIDOFF 1000 30 9/8/2016 10/7/2016 45 High School or Below male

12 2 PAIDOFF 1000 30 9/8/2016 10/7/2016 33 Bechalor female

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 28/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson

23 3 PAIDOFF 1000 15 9/8/2016 9/22/2016 27 college male
IBM Watson 3Stu4dio 4 PAIDOFF
6 PAIDOFF 1000 30 9/9/2016 Up1g0ra/8d/e2016 28 college Dr Mohammad AfeymouableK… DK
46
1000 30 9/9/2016 10/8/2016 29 college male 29/39

In [119]: import pandas as pd
gender = {'male': 1,'female': 2}
loan_df.Gender = [gender[item] for item in loan_df.Gender]
loanstatus = {'PAIDOFF': 1,'COLLECTION': 2}
loan_df.loan_status = [loanstatus[item] for item in loan_df.loan_status]
education1 = {'High School or Below': 1, 'college': 2, 'Bechalor': 3, 'Master or Above': 4}
loan_df.education = [education1[item] for item in loan_df.education]
loan_df.drop(loan_df.columns[[5,6]], axis=1, inplace=True)
print(loan_df)

Unnamed: 0 Unnamed: 0.1 loan_status Principal terms age education \

0 0 0 1 1000 30 45 1

1 2 2 1 1000 30 33 3

2 3 3 1 1000 15 27 2

3 4 4 1 1000 30 28 2

4 6 6 1 1000 30 29 2

5 7 7 1 1000 30 36 2

6 8 8 1 1000 30 28 2

7 9 9 1 800 15 26 2

8 10 10 1 300 7 29 2

9 11 11 1 1000 15 39 1

10 12 12 1 1000 30 26 2

11 13 13 1 900 7 26 2

12 14 14 1 1000 7 27 1

13 15 15 1 800 15 26 2

14 16 16 1 1000 30 40 1

15 17 17 1 1000 15 32 1

16 18 18 1 1000 30 32 1

17 19 19 1 800 30 26 2

18 20 20 1 1000 30 26 2

19 22 22 1 1000 30 25 1

20 23 23 1 1000 15 26 2

21 25 25 1 1000 30 29 1

22 26 26 1 800 15 39 3

23 27 27 1 1000 15 34 3

24 28 28 1 1000 30 31 2

https://dataplatform.cloud.ibm2.5com/analytics/noteb2o9oks/v2/9336fa5d-ddb22-94c18-b57e-0481a3d210e2d/view?a1c0ce0s0s_token=f3105d66bb38392f52891c7e4c5d2d4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

25 29 29 1 1000 30 33 2
IBM Watson2S6tudio 30
31 30 1 800 15 33 Upgrade 1 Dr Mohammad Ayoub K… DK
27 32 31 1 1000 30 37 2
28 33 30/39
29 ... 32 1 1000 30 27 2
.. 367
316 368 33 1 1000 30 37 2
317 371
318 372 ... ... ... ... ... ...
319 373
320 374 367 2 800 15 28 3
321 375
322 376 368 2 1000 30 24 2
323 377
324 378 371 2 1000 30 18 2
325 379
326 380 372 2 800 15 25 1
327 381
328 382 373 2 1000 15 40 1
329 383
330 384 374 2 1000 30 29 2
331 385
332 386 375 2 800 15 26 1
333 387
334 388 376 2 1000 15 30 2
335 389
336 390 377 2 1000 30 33 2
337 391
338 392 378 2 1000 30 30 2
339 393
340 394 379 2 1000 30 32 2
341 395
342 397 380 2 1000 30 25 1
343 398
344 399 381 2 800 15 35 1
345
382 2 1000 15 30 3

383 2 1000 30 26 1

384 2 1000 30 29 1

385 2 1000 30 26 1

386 2 800 15 46 1

387 2 1000 30 36 1

388 2 1000 15 38 3

389 2 1000 15 32 1

390 2 1000 15 30 2

391 2 800 15 35 1

392 2 1000 30 29 2

393 2 1000 30 26 2

394 2 800 15 32 1

395 2 1000 30 25 1

397 2 800 15 39 2

398 2 1000 30 28 2

399 2 1000 30 26 2

Gender
01
12
21
32
41

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

41

IBM Watson65Studio 1 Upgrade Dr Mohammad Ayoub K… DK
1
31/39
71

81

91

10 1

11 2

12 1

13 1

14 1

15 1

16 1

17 1

18 1

19 1

20 1

21 1

22 1

23 1

24 1

25 1

26 1

27 1

28 1

29 1

.. ...

316 1

317 1

318 1

319 1

320 1

321 1

322 2

323 1

324 1

325 1

326 1

327 1

328 1

329 1

330 1

331 1

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 331 1 Machine Learning - Project Notebook - IBM Watson
1
IBM Watson33S33t23udio 1 Upgrade Dr Mohammad Ayoub K… DK
1
334 1 32/39
1
335 1
1
336 2
1
337 1
1
338 1
1
339 1

340

341

342

343

344

345

[346 rows x 8 columns]

In [120]: loan_df.dtypes

Out[120]: Unnamed: 0 int64
Unnamed: 0.1 int64
loan_status int64
Principal int64
terms int64
age int64
education int64
Gender int64
dtype: object

In [121]: X = np.asarray(loan_df[['Principal', 'terms', 'age', 'education', 'Gender']])
X[0:5]

Out[121]: array([[1000, 30, 45, 1, 1],
[1000, 30, 33, 3, 2],
[1000, 15, 27, 2, 1],
[1000, 30, 28, 2, 2],
[1000, 30, 29, 2, 1]])

In [122]: y = np.asarray(loan_df['loan_status'])
y [0:5]

https://dataplatform.cloud.ibm.com/a(nalytics/notebooks/v2/9336)fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

Out[122]: array([1, 1, 1, 1, 1])

IBM Watson Studio Upgrade Dr Mohammad Ayoub K… DK

In [123]: from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/sklearn/utils/validation.py:475: DataConversionWarning:
Data with input dtype int64 was converted to float64 by StandardScaler.

warnings.warn(msg, DataConversionWarning)

Out[123]: array([[ 0.52, 0.92, 2.33, -1. , -0.42],
[ 0.52, 0.92, 0.34, 1.84, 2.38],
[ 0.52, -0.96, -0.65, 0.42, -0.42],
[ 0.52, 0.92, -0.49, 0.42, 2.38],
[ 0.52, 0.92, -0.32, 0.42, -0.42]])

In [124]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)

Train set: (276, 5) (276,)
Test set: (70, 5) (70,)

In [125]: from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

Out[125]: LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

In [126]: yhat = LR.predict(X_test)
yhat

Out[126]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1])

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 33/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson

In [127]: yhat_prob = LR.predict_proba(X_test) Upgrade Dr Mohammad Ayoub K… DK
IBM WatsonyShtautd_iporob
34/39
Out[127]: array([[ 0.56, 0.44],
[ 0.62, 0.38],
[ 0.6 , 0.4 ],
[ 0.55, 0.45],
[ 0.58, 0.42],
[ 0.59, 0.41],
[ 0.58, 0.42],
[ 0.59, 0.41],
[ 0.55, 0.45],
[ 0.58, 0.42],
[ 0.56, 0.44],
[ 0.57, 0.43],
[ 0.68, 0.32],
[ 0.56, 0.44],
[ 0.63, 0.37],
[ 0.66, 0.34],
[ 0.54, 0.46],
[ 0.6 , 0.4 ],
[ 0.56, 0.44],
[ 0.58, 0.42],
[ 0.63, 0.37],
[ 0.57, 0.43],
[ 0.55, 0.45],
[ 0.61, 0.39],
[ 0.67, 0.33],
[ 0.56, 0.44],
[ 0.56, 0.44],
[ 0.7 , 0.3 ],
[ 0.56, 0.44],
[ 0.67, 0.33],
[ 0.6 , 0.4 ],
[ 0.61, 0.39],
[ 0.61, 0.39],
[ 0.58, 0.42],
[ 0.68, 0.32],
[ 0.61, 0.39],
[ 0.56, 0.44],
[ 0.62, 0.38],
[ 0.62, 0.38],

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 0.39], Machine Learning - Project Notebook - IBM Watson Dr Mohammad Ayoub K… DK
0.44],
[ 0.61, 0.42], Upgrade 35/39
IBM Watson Studio [ 0.56, 0.38],
0.44],
[ 0.58, 0.4 ],
[ 0.62, 0.43],
[ 0.56, 0.4 ],
[ 0.6 , 0.43],
[ 0.57, 0.39],
[ 0.6 , 0.39],
[ 0.57, 0.37],
[ 0.61, 0.39],
[ 0.61, 0.39],
[ 0.63, 0.42],
[ 0.61, 0.36],
[ 0.61, 0.33],
[ 0.58, 0.41],
[ 0.64, 0.36],
[ 0.67, 0.4 ],
[ 0.59, 0.44],
[ 0.64, 0.35],
[ 0.6 , 0.43],
[ 0.56, 0.39],
[ 0.65, 0.46],
[ 0.57, 0.42],
[ 0.61, 0.43],
[ 0.54, 0.43],
[ 0.58, 0.34],
[ 0.57, 0.38],
[ 0.57, 0.42]])
[ 0.66,
[ 0.62,
[ 0.58,

In [128]: from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)

Out[128]: 0.7857142857142857

In [131]: from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix'

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

IBM Watson Studio title= Confusion matrix ,

cmap=plt.cm.Blues): Upgrade Dr Mohammad Ayoub K… DK

if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")

else:
print('Confusion matrix, without normalization')

print(cm)

plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)

fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):

plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
print(confusion_matrix(y_test, yhat, labels=[1,0]))

[[55 0]
[ 0 0]]

In [132]: cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)

plt.figure() 36/39
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')

Confusion matrix, without normalization
[[55 0]

[ 0 0]]

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 [ ]] Machine Learning - Project Notebook - IBM Watson

IBM Watson Studio Upgrade Dr Mohammad Ayoub K… DK

In [137]: print (classification_report(y_test, yhat))
precision recall f1-score support

1 0.79 1.00 0.88 55
15
2 0.00 0.00 0.00

avg / total 0.62 0.79 0.69 70

/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricW
arning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

'precision', 'predicted', average, warn_for)

In [136]: from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)

Out[136]: 0.60097718399940614

Model Evaluation using Test set 37/39

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592

2/19/2019 Machine Learning - Project Notebook - IBM Watson

IInBM[1 W33a]ts:onfSrtoumdisoklearn.metrics import jaccard_similarity_score Upgrade Dr Mohammad Ayoub K… DK
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

First, download and load the test set:

In [134]: !wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv
3/labs/loan_test.csv

--2019-01-30 11:46:37-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101E
Nv3/labs/loan_test.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.19
3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3642 (3.6K) [text/csv]
Saving to: ‘loan_test.csv’

100%[======================================>] 3,642 --.-K/s in 0s

2019-01-30 11:46:37 (616 MB/s) - ‘loan_test.csv’ saved [3642/3642]

Load Test set for evaluation

In [135]: test_df = pd.read_csv('loan_test.csv')
test_df.head()

Out[135]:

Unnamed: 0 Unnamed: 0.1 loan_status Principal terms effective_date due_date age education Gender

01 1 PAIDOFF 1000 30 9/8/2016 10/7/2016 50 Bechalor female

15 5 PAIDOFF 300 7 9/9/2016 9/15/2016 35 Master or Above male

2 21 21 PAIDOFF 1000 30 9/10/2016 10/9/2016 43 High School or Below female

3 24 24 PAIDOFF 1000 30 9/10/2016 10/9/2016 26 college male

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 38/39

2/19/2019 Machine Learning - Project Notebook - IBM Watson

4 35 35 PAIDOFF 800 15 9/11/2016 9/25/2016 29 Bechalor male
IBM Watson Studio Upgrade Dr Mohammad Ayoub K… DK

In [ ]:

In [ ]:

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/9336fa5d-ddb2-4c18-b57e-0481a3d20e2d/view?access_token=f15d66bb892f52891c7e4c5dd4dd45f0f697d9ffc0e65bb4bb7d9199ebad4592 39/39


Click to View FlipBook Version