Classification using higher order interactions: a hybrid approach
In this post, we use the German credit data to classify applicants as good or bad credit risks, using interactions among the input variables through a hybrid approach. We first build a decision tree to identify the significant interactions among the input variables, and then build a feed-forward neural network that takes these interactions as its inputs. The proposed method improves classification accuracy and provides a mathematical model for classifying future applicants.
If we are interested only in the best possible classification accuracy, it may be difficult or impossible to find a single classifier that performs as well as a good ensemble of classifiers. Researchers generally concentrate on the main effects of the input variables in a classification method. In many cases, however, interactions as well as main effects yield a significant improvement in classification. We therefore integrated two methods based on their respective merits, aiming to improve classification accuracy as well as interpretability and prediction. A decision tree does not provide a mathematical model for classifying a new object into the defined groups, but it does identify suitable predictors. A neural network, by contrast, provides a mathematical model in the form of weight matrices, but leaves open the problem of selecting the necessary input variables.
We used the decision tree to reduce dimensionality, keeping only the significant explanatory variables and interactions for the classification problem. This in turn reduces the number of inputs to the neural network in the proposed method. Combining decision trees and neural networks gives a mathematical model for classification that uses only the variables identified by the decision tree. Because the model is built on interactions, management can keep an eye on combinations of variables instead of individual variables, and can use those combinations to make better decisions and improve profitability. The proposed procedure is explained in detail below with an illustration.
A publicly available data set known as the German credit data contains observations on 20 variables for 1000 past applicants for credit. In addition, the resulting credit rating (Good or Bad) was recorded for each applicant. The objective is to develop a credit classification rule that can determine whether a new applicant is a good or a bad credit risk, based on the values of one or more of the 20 explanatory variables, namely: 1) Status of existing checking account, 2) Duration in months, 3) Credit history, 4) Purpose, 5) Credit amount, 6) Savings account/bonds, 7) Present employment since, 8) Instalment rate in percentage of disposable income, 9) Personal status and sex, 10) Other debtors/guarantors, 11) Present residence since, 12) Property, 13) Age in years, 14) Other instalment plans, 15) Housing, 16) Number of existing credits at this bank, 17) Job, 18) Number of people being liable to provide maintenance for, 19) Telephone and 20) Foreign worker (see Johnson and Wichern (2002)). Essentially, we must develop a function of several variables that allows us to classify a new applicant into one of two categories: good or bad credit risk.
We developed classification procedures using a decision tree alone and using a hybrid of the decision tree and a neural network, with 90% of the data used for modelling and the remaining 10% (holdout data) used for validating the model.
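The 90/10 split described above can be sketched as follows. This is only an illustration under assumptions: the source does not say how the holdout set was drawn, so the random shuffle, the seed, and the function name `split_holdout` are hypothetical, and row indices stand in for the actual applicant records.

```python
import random

def split_holdout(records, holdout_fraction=0.10, seed=0):
    """Shuffle the records and return (modelling, holdout) lists."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)           # copy, so the input is left untouched
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Stand-in for the 1000 applicants: plain row indices, not real data.
applicants = list(range(1000))
modelling, holdout = split_holdout(applicants)
print(len(modelling), len(holdout))
```

The holdout rows are never used while fitting, so accuracy measured on them gives a fairer picture of how the model would treat new applicants.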
Classification using DT
This section discusses the classification of good or bad credit risk using a decision tree. The CHAID method is used to construct the tree from all 20 explanatory variables. The resulting tree contains 15 nodes in total, 9 of which are terminal nodes. The CHAID analysis retains the following independent variables, which give the minimum classification risk: 1) Status of existing checking account, 2) Credit history, 3) Property, 4) Personal status and sex, and 5) Other instalment plans. The resulting classification accuracy is 74.30% for training and 72.30% for testing.
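CHAID chooses its splits with chi-square tests of independence between each candidate predictor and the target. The sketch below shows only that core statistic, not the category merging, p-value adjustment, or tree growing of a full CHAID implementation. The counts are made up for illustration and are not taken from the German credit data.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the predictor and the target were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Made-up counts. Rows: categories of a candidate predictor; columns: (good, bad).
strong_predictor = [[90, 10], [50, 50]]   # good/bad mix differs by category
weak_predictor = [[70, 30], [70, 30]]     # identical mix in every category

print(chi_square(strong_predictor), chi_square(weak_predictor))
```

A larger statistic means the predictor separates good from bad applicants more sharply, which is why a variable like the first one would be preferred for a split over the second.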
Classification using Proposed Method
From the decision tree above, nine significant interactions (the terminal nodes) emerge for classifying good or bad credit risk. The nine interaction effects are as follows:
X1 = status of existing checking account is >= 200 DM.
X2 = checking account < 0 DM and critical account/other credits existing (not at this bank).
X3 = checking account < 0 DM and other than critical account.
X4 = checking account between 0 and 200 DM and property is real estate.
X5 = checking account between 0 and 200 DM, property other than real estate, and personal status is male and single.
X6 = checking account between 0 and 200 DM, property other than real estate, and personal status is female divorced/separated.
X7 = no checking account, no instalment plans and credit history.
X8 = no checking account, no instalment plans and other than critical account.
X9 = no checking account and other instalment plans (banks/stores).
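Each interaction above is a conjunction of conditions, so it can be coded as a 0/1 indicator. The sketch below covers only X1 to X4 to stay short. The field names (`checking`, `history`, `property`) and category labels are hypothetical, since the source names the conditions but not how they are coded in the data file.

```python
# Hypothetical field names and labels; only X1-X4 are shown for brevity.
RULES = {
    "X1": lambda a: a["checking"] == ">=200",
    "X2": lambda a: a["checking"] == "<0" and a["history"] == "critical",
    "X3": lambda a: a["checking"] == "<0" and a["history"] != "critical",
    "X4": lambda a: a["checking"] == "0-200" and a["property"] == "real estate",
}

def interaction_indicators(applicant):
    """Return a 0/1 indicator for each interaction rule."""
    return {name: int(rule(applicant)) for name, rule in RULES.items()}

# A made-up applicant: overdrawn checking account with a critical credit history.
applicant = {"checking": "<0", "history": "critical", "property": "car"}
print(interaction_indicators(applicant))
```

Because the terminal nodes of a tree do not overlap, an applicant switches on at most one indicator, and the full set of nine indicators forms the input vector for the neural network.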
We built a feed-forward neural network for classifying credit risk, using the nine non-overlapping categories above (the terminal nodes of the decision tree) as inputs. We divided the total sample into training, testing and holdout sets and obtained classification accuracies of 74.5% and 78%.
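To show what a model "in the form of weight matrices" looks like, here is a minimal forward pass for a feed-forward network with nine indicator inputs. The source does not report the architecture, activation function or fitted weights, so the single hidden layer of two units, the sigmoid activations and every number below are placeholder assumptions, not the model from the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One hidden layer: sigmoid(W2 . sigmoid(W1 x + b1) + b2)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)

# Placeholder weights for illustration only; NOT the fitted values.
W1 = [[0.5, -1.0, -0.3, 0.8, 0.1, -0.2, 0.9, 0.4, -0.6],
      [-0.4, 0.7, 0.2, -0.5, 0.3, 0.6, -0.8, 0.1, 0.5]]
b1 = [0.0, 0.0]
W2 = [1.2, -1.1]
b2 = 0.1

x = [0, 1, 0, 0, 0, 0, 0, 0, 0]        # applicant falls in interaction X2
p_good = forward(x, W1, b1, W2, b2)    # score in (0, 1), read as "good" likelihood
label = "good" if p_good >= 0.5 else "bad"
print(round(p_good, 3), label)
```

Once the weights are fitted, classifying a new applicant needs only this arithmetic, which is the practical sense in which the network supplies a mathematical function.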
The study above indicates that the proposed hybrid approach performs better than the decision tree alone. The hybrid approach also performs well with a limited number of variables, which lets the decision maker concentrate on the combinations (interactions) of the key variables identified by the decision tree. Finally, the hybrid approach gives a mathematical function for classifying good or bad credit risk, which the decision tree does not.