Classification using higher order interactions: a hybrid approach

Hybrid Modeling

For this blog post, we use the German credit data to classify applicants as good or bad credit risks, using interactions among the input variables through a hybrid approach. First, we develop a decision tree to identify the significant interactions of the input variables; then we develop a feed-forward neural network that takes these significant interactions as its input variables. The proposed method improves classification accuracy and provides a mathematical model for classifying future respondents.

If we are interested only in the best possible classification accuracy, it may be difficult or impossible to find a single classifier that performs as well as a good ensemble of classifiers. Researchers generally concentrate on the main effects of the input variables in any classification method. In many cases, however, it is observed that interactions as well as main effects yield a significant improvement in classification. We integrated two or more methods, based on their respective merits, to improve classification accuracy as well as interpretability and prediction. A decision tree does not provide a mathematical model for classifying a new object into the defined groups, but it does identify suitable predictors for the classification. A neural network, by contrast, provides a mathematical model in the form of weight matrices, but leaves open the problem of selecting the necessary input variables.

We used the decision tree method to reduce dimensionality, so that only the significant explanatory variables and interactions enter the classification problem. This in turn reduces the dimensionality of the input to the neural network in the proposed method. We combined decision trees and neural networks to obtain a mathematical model for classification that uses only the necessary variables identified by the decision tree. Because the model is built on interactions, management can keep an eye on interactions (combinations of variables) instead of individual variables, and by considering these combinations it can make better-informed decisions to improve profitability. The proposed procedure is explained in detail with an illustration in the next section.

Empirical Study

A publicly available data set known as the German credit data [4] contains observations on 20 variables for 1000 past applicants for credit. In addition, the resulting credit rating (Good or Bad) was recorded for each applicant. The objective was to develop a credit classification rule that can be used to determine whether a new applicant is a good or a bad credit risk, based on values for one or more of the 20 explanatory variables, namely: 1) Status of existing checking account, 2) Duration in month, 3) Credit history, 4) Purpose, 5) Credit amount, 6) Savings account/bonds, 7) Present employment since, 8) Instalment rate in percentage of disposable income, 9) Personal status and sex, 10) Other debtors/guarantors, 11) Present residence since, 12) Property, 13) Age in years, 14) Other instalment plans, 15) Housing, 16) Number of existing credits at this bank, 17) Job, 18) Number of people being liable to provide maintenance for, 19) Telephone and 20) Foreign worker (see Johnson and Wichern (2002)). In essence, we must develop a function of several variables that allows us to classify a new applicant into one of two categories: good or bad credit risk.

We developed two classification procedures, one using a decision tree and one using a hybrid approach that combines decision trees and neural networks, with 90% of the data used for modelling and the remaining 10% (holdout data) used for validation of the model.
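A 90/10 split of this kind could be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `holdout_split`, the seed, and the use of row indices in place of the actual applicant records are all assumptions, and the post does not say how the split was drawn.

```python
import random


def holdout_split(n_rows, holdout_fraction=0.10, seed=42):
    """Return (modelling_indices, holdout_indices) for a random split.

    The seed and the function name are illustrative assumptions.
    """
    indices = list(range(n_rows))
    rng = random.Random(seed)  # local generator, so the split is reproducible
    rng.shuffle(indices)
    n_holdout = round(n_rows * holdout_fraction)
    # The first n_holdout shuffled rows are held out; the rest are for modelling.
    return indices[n_holdout:], indices[:n_holdout]


# The German credit data has 1000 applicants.
modelling_idx, holdout_idx = holdout_split(1000)
print(len(modelling_idx), len(holdout_idx))
```

Fixing the seed keeps the holdout set the same across runs, so the decision tree and the hybrid model can be compared on identical data.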

Classification using DT

This section discusses the classification of good or bad credit risk using the decision tree technique. The CHAID method is used to construct a decision tree for classifying good or bad credit risk based on all 20 explanatory variables. The resulting tree contains 15 nodes in total, of which 9 are terminal nodes. The CHAID analysis retains the following independent variables, with minimum classification risk: 1) Status of existing checking account, 2) Credit history, 3) Property, 4) Personal status and sex, and 5) Other instalment plans. The resulting classification accuracy is 74.30% for training and 72.30% for testing.
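CHAID chooses its splits with chi-square tests of independence between each categorical predictor and the target. The sketch below computes only the Pearson chi-square statistic for one contingency table. The counts are made up for illustration and are not taken from the German credit data, and a full CHAID implementation would also merge similar categories and compare adjusted p-values, which is omitted here.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts. Every row total and
    column total is assumed to be positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of predictor and target.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows are categories of one predictor,
# columns are (good, bad) credit outcomes.
example_table = [
    [60, 40],
    [80, 20],
    [90, 10],
]
stat, dof = chi_square_statistic(example_table)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

A larger statistic for the same degrees of freedom indicates a stronger association with the credit outcome, which is what makes a predictor a candidate for the next split.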

Classification using Proposed Method

From the above decision tree approach, nine significant interactions (terminal nodes) emerged to classify good or bad credit risk. The nine interaction effects are as follows:

X1 = status of existing checking account is >= 200 DM.
X2 = the combination of existing checking account < 0 DM and critical account/other credits existing (not at this bank).
X3 = the combination of account < 0 DM and other than critical account.
X4 = the combination of account between 0 and 200 DM and real estate.
X5 = the combination of checking account between 0 and 200 DM, property other than real estate, and personal status male and single.
X6 = the combination of checking account between 0 and 200 DM, property other than real estate, and female divorced/separated.
X7 = the combination of no checking account, no instalment plans and credit history.
X8 = the combination of no checking account, no instalment plans and other than critical account.
X9 = the combination of no checking account, other instalment plans, banks/stores.
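Each interaction above is a conjunction of conditions on the original variables, so it can be coded as a 0/1 indicator. The following sketch does this for X1 to X4 only. The dictionary keys and category labels are hypothetical and do not correspond to the coding used in the data set.

```python
def interaction_indicators(applicant):
    """Return 0/1 indicators for interactions X1 to X4.

    `applicant` is a dict with hypothetical keys and labels:
      checking:        "<0", "0-200", ">=200" or "none"
      credit_history:  "critical" or any other label
      property:        "real_estate" or any other label
    """
    checking = applicant["checking"]
    critical = applicant["credit_history"] == "critical"
    real_estate = applicant["property"] == "real_estate"
    return {
        "X1": int(checking == ">=200"),
        "X2": int(checking == "<0" and critical),
        "X3": int(checking == "<0" and not critical),
        "X4": int(checking == "0-200" and real_estate),
    }


example = {"checking": "<0", "credit_history": "critical", "property": "car"}
print(interaction_indicators(example))
```

Because terminal nodes of a tree do not overlap, at most one indicator is 1 for any applicant, which is what allows the nodes to serve as a compact set of inputs.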

We built a feed-forward neural network model for the classification of credit risk, using the nine non-overlapping categories above (the terminal nodes of the decision tree) as inputs. We divided the total sample into three sets (training, testing and holdout) and obtained classification accuracies of 74.5% and 78%.
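To make the idea of a mathematical model in the form of weight matrices concrete, here is a minimal sketch of the forward pass of a small feed-forward network that takes the nine indicators as input. The architecture (one hidden layer with two tanh units and a sigmoid output) and all weight values are assumptions chosen for illustration. The post does not report the fitted network.

```python
import math


def predict_good_probability(x, hidden_weights, hidden_biases,
                             output_weights, output_bias):
    """Forward pass: inputs -> tanh hidden layer -> sigmoid output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights: 2 hidden units, each with 9 input weights.
hidden_weights = [
    [0.8, -0.5, -1.2, 0.6, 0.1, -0.3, 1.0, 0.4, -0.7],
    [-0.2, 0.9, 0.3, -0.4, 0.5, 0.2, -0.6, 0.1, 0.8],
]
hidden_biases = [0.1, -0.1]
output_weights = [1.5, -1.0]
output_bias = 0.2

# An applicant falling in terminal node X4 (one-hot input).
x = [0, 0, 0, 1, 0, 0, 0, 0, 0]
p = predict_good_probability(x, hidden_weights, hidden_biases,
                             output_weights, output_bias)
print(f"estimated probability of good credit: {p:.3f}")
```

Once the weights are fitted, classifying a new applicant only requires finding the terminal node the applicant falls into and evaluating this function.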

Conclusion

From the above study, it is clear that the proposed hybrid approach performs better than the decision tree method. It is also evident that the hybrid approach performs well with a limited number of variables, which lets the decision maker concentrate on the combinations (interactions) of the key variables identified by the decision tree. The hybrid approach also gives a mathematical function for classifying good or bad credit risk, which the decision tree alone does not.

 

Venugopala Rao Manneni

A doctor in statistics from Osmania University. I have been working in the fields of data analysis and research for the last 14 years. My expertise is in data mining and machine learning – in these fields I’ve also published papers. I love to play cricket and badminton.
