Machine Learning Basics with Classification Algorithms

We will not be using classification algorithms in the final code, but anyone using Machine Learning tools should know them. We will explore some of these algorithms here to bring across the general concepts of Machine Learning before moving on to regression algorithms.

Decision Trees

Decision Trees are, as the name suggests, trees of decisions. In the standard Scikit-Learn library implementation, the decisions made to classify your data are binary (yes/no, greater than or less than). Let's try out the Scikit-Learn Decision Tree classifier on data with two dimensions (two variables) so you can see how it works visually. Bear in mind this algorithm can work on many more dimensions if desired. The Scikit-Learn library is well documented: you can see all the specifics for the Decision Tree, as well as the other predictors and tools, in the online reference if you need to find a particular setting. For the Decision Tree it is: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.

We will be using fabricated data based on Altman Z-scores to estimate the probability of bankruptcy for a company. The Altman Z-score takes a variety of metrics on a company (such as EBIT/Total Assets) and weights them to get a final score; the lower the score, the higher the risk of going bankrupt. The training data we will use is the Altman_Z_2D.csv file from the GitHub repository. This data has three columns: the first states whether a company has gone bankrupt, and the other two are the company ratios EBIT/Total Assets and MktValEquity/Debt (before bankruptcy). This is easy to see by loading the data with the Pandas read_csv function and looking at the head of the data:

In[1]:

import pandas as pd # Importing modules for use.

import numpy as np

import matplotlib.pyplot as plt # For plotting scatter plot

data = pd.read_csv('Altman_Z_2D.csv', index_col=0) # Load the .csv data

data.head(5) # Taking a look at the data.

Out[1]:
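As an aside, if you don't have Altman_Z_2D.csv to hand, you can fabricate data of the same shape to follow along. The column names match the file above, but the distributions and sample sizes below are purely illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded for repeatability
n = 50  # companies per class (an arbitrary choice)

# Healthy companies: higher ratios on average (illustrative values only)
healthy = pd.DataFrame({
    'Bankrupt': False,
    'EBIT/Total Assets': rng.normal(15, 8, n),
    'MktValEquity/Debt': rng.normal(6, 3, n),
})

# Bankrupt companies: lower ratios on average (illustrative values only)
bankrupt = pd.DataFrame({
    'Bankrupt': True,
    'EBIT/Total Assets': rng.normal(-5, 8, n),
    'MktValEquity/Debt': rng.normal(-2, 3, n),
})

data = pd.concat([bankrupt, healthy], ignore_index=True)
print(data.head(5))
```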

Plotting this data visually in a scatter plot makes the relationship to bankruptcy more obvious. Scatterplots are available from the Pandas DataFrame directly with DataFrame.plot.scatter():

In[2]:

# Bankruptcy mask (list of booleans)

bankrupt_mask = data['Bankrupt'] == True

# Plot the bankrupt points

plt.scatter(data['EBIT/Total Assets'][bankrupt_mask],
            data['MktValEquity/Debt'][bankrupt_mask],
            marker='x')

# Plot the non-bankrupt points

plt.scatter(data['EBIT/Total Assets'][~bankrupt_mask],
            data['MktValEquity/Debt'][~bankrupt_mask],
            marker='o')

# Formatting

plt.xlabel('EBIT/Total Assets')

plt.ylabel('MktValEquity/Debt')

plt.grid()

plt.legend(['Bankrupt','Non bankrupt'])
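As mentioned above, a scatter plot is also available directly from the DataFrame via DataFrame.plot.scatter(). A minimal sketch with a few stand-in rows (the colour mapping is an assumption for illustration, not from the text):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# A few stand-in rows in the same shape as the CSV
data = pd.DataFrame({
    'Bankrupt': [True, True, False, False],
    'EBIT/Total Assets': [-10.0, -5.0, 12.0, 20.0],
    'MktValEquity/Debt': [-3.0, -1.0, 4.0, 8.0],
})

# One call: colour each point by its bankruptcy label
ax = data.plot.scatter(x='EBIT/Total Assets', y='MktValEquity/Debt',
                       c=data['Bankrupt'].map({True: 'red', False: 'blue'}).tolist())
ax.grid(True)
```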

Let's use the Scikit-Learn library to make a Decision Tree from this data to identify companies that will go bust. First, we have to split the data into a matrix, X, containing all the feature values, and a vector, Y, containing the row classification labels: the True/False values for bankruptcy. We will also want to import the Decision Tree classifier from Scikit-Learn.

In[3]:

# Split up the data for the classifier to be trained.

# X is data

# Y is the answer we want our classifier to replicate.

X = data[['EBIT/Total Assets','MktValEquity/Debt']]

Y = data['Bankrupt']

# Import Scikit-Learn

from sklearn.tree import DecisionTreeClassifier

We can now create a Decision Tree object. Here we set the maximum depth of the Decision Tree to 2; we will get more into this value later. As with all Scikit-Learn models, the latest full documentation is available on the internet (https://scikit-learn.org/).

In[4]:

# Create a DecisionTreeClassifier object first

tree_clf = DecisionTreeClassifier(max_depth=2)

# Fit the Decision Tree to our training data of X and Y.

tree_clf.fit(X, Y)

Out[4]: DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=2, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best')

After fitting (or training) your Decision Tree classifier, it is imbued with rules that you can use on new data. If, for instance, you have a company with known EBIT/Total Assets and MktValEquity/Debt, your classifier should be able to predict future bankruptcy, making a prediction drawn from what your training data would suggest:

In[5]:

# Let's see if it predicts bankruptcy for a bad company

print('Low EBIT/Total Assets and MktValEquity/Debt company go bust?',
      tree_clf.predict([[-20, -10]]))

# Let's try this for a highly valued, high-earning company

print('High EBIT/Total Assets and MktValEquity/Debt company go bust?',
      tree_clf.predict([[20, 10]]))

Out[5]: Low EBIT/Total Assets and MktValEquity/Debt company go bust? [ True]
High EBIT/Total Assets and MktValEquity/Debt company go bust? [False]

You can also pass a DataFrame with several rows and two columns for X if you want your model to give an answer for a large number of companies. Let's see what a contour plot of our tree's predictions in the 2D space looks like.
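A sketch of that batch prediction, using a tiny stand-in training set since the CSV data isn't reproduced here (the numbers are illustrative assumptions):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in training set: low ratios go bust, high ratios don't
X = pd.DataFrame({'EBIT/Total Assets': [-20, -10, -5, 5, 10, 20],
                  'MktValEquity/Debt': [-10, -5, -2, 2, 5, 10]})
Y = pd.Series([True, True, True, False, False, False], name='Bankrupt')

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, Y)

# One row per company; same column names as the training data
companies = pd.DataFrame({'EBIT/Total Assets': [-15.0, 25.0],
                          'MktValEquity/Debt': [-8.0, 12.0]})
print(tree_clf.predict(companies))  # one True/False per row
```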

In[6]:

# Contour plot.

from matplotlib.colors import ListedColormap

x1s = np.linspace(-30, 40, 100)

x2s = np.linspace(-10, 15, 100)

x1, x2 = np.meshgrid(x1s, x2s)

X_new = np.c_[x1.ravel(), x2.ravel()]

y_pred = tree_clf.predict(X_new).astype(int).reshape(x1.shape)

custom_cmap = ListedColormap(['#2F939F','#D609A8'])

plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)

plt.plot(X['EBIT/Total Assets'][Y==False],
         X['MktValEquity/Debt'][Y==False], "bo",
         X['EBIT/Total Assets'][Y==True],
         X['MktValEquity/Debt'][Y==True], "rx")

plt.xlabel('EBIT/Total Assets')

plt.ylabel('MktValEquity/Debt')

It seems our Decision Tree's rules for predicting bankruptcy are quite simple: if each feature ratio (the x and y axes) is above a certain value, the company isn't likely to go bust. Earlier we fixed our Decision Tree depth with the setting max_depth=2, which limits the depth of the Decision Tree's rules. With a value of 2, the boundary between bust/not bust is not going to be that complex. Our Decision Tree has made a binary split on the x axis at around 5, and then another split in one of the remaining domains at a y-axis value of around 0. There are only two splits, which is the maximum depth we specified. We can see the Decision Tree visually with Scikit-Learn using the plot_tree function:

In[7]:

from sklearn import tree # Need this to see Decision Tree.

plt.figure(figsize=(5,5), dpi=300) # set figsize so we can see it

tree.plot_tree(tree_clf,
               feature_names=['EBIT/Total Assets','MktValEquity/Debt'],
               class_names=['bust', 'nobust'],
               filled=True); # semicolon here to suppress output
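The split thresholds shown in the plotted tree can also be read programmatically from the fitted classifier's tree_ attribute. A sketch on toy stand-in data (not the book's dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy two-feature data standing in for the two company ratios
X = np.array([[-20, -10], [-10, -5], [-2, -1], [3, -4], [8, 2], [20, 10]])
Y = np.array([True, True, True, True, False, False])

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, Y)

t = clf.tree_
for node in range(t.node_count):
    if t.children_left[node] == -1:  # -1 marks a leaf node
        print(f'node {node}: leaf')
    else:
        print(f'node {node}: split on feature {t.feature[node]} '
              f'at threshold {t.threshold[node]:.2f}')
```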

Neat: we have an algorithm that predicts bankruptcy, and we didn't explicitly program it. Furthermore, it can be trained with more data if we have it. Sure, it is an algorithm so simple that a human could follow it by just looking up two numbers, but if we increase the depth of the tree and include many more than just two features, the classifier can easily become more complex than what a human could carry out.

How does a Decision Tree work? A Decision Tree splits the data along one of the feature dimensions at each level. If we limit the max_depth of our Decision Tree to 1, we get a single split, resulting in a far simpler tree, and one that isn't that good at predicting bankruptcy:
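We can sketch that comparison with toy data deliberately chosen so that no single axis-aligned split separates the classes (illustrative numbers, not the book's dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Companies go bust unless BOTH features are positive,
# so a single split can never separate the classes
X = np.array([[-5, -5], [-5, 5], [5, -5], [5, 5],
              [-4, -4], [-4, 4], [4, -4], [4, 4]])
Y = np.array([True, True, True, False, True, True, True, False])

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, Y)
deeper = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, Y)

print('depth-1 accuracy:', stump.score(X, Y))   # imperfect: one split isn't enough
print('depth-2 accuracy:', deeper.score(X, Y))  # two splits separate the classes
```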

So how does the tree know where to make a split along an axis when it is fitting our data? We need a quantitative measure of how good a split is, after which the best one can be chosen to make a new branch on our tree. This algorithm is called the Classification And Regression Tree (CART) algorithm. We'll walk through exactly how it works here, though if you are comfortable just using the Scikit-Learn library Decision Tree you can skip this part.
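As a preview, the default split-quality measure in Scikit-Learn's CART implementation is Gini impurity, and a split's cost is the impurity of each side weighted by its share of the samples. A minimal sketch (the function names here are my own, not Scikit-Learn API):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_cost(left, right):
    """CART cost of a split: per-side impurity weighted by subset size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = np.array([True, True, True, False, False, False])
print(gini(labels))                        # 0.5: maximally mixed for two classes
print(split_cost(labels[:3], labels[3:]))  # 0.0: a perfect split
```

The fitting algorithm simply tries candidate thresholds along each feature and keeps the split with the lowest cost.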