## Wednesday, 5 October 2011

### Modelling with R: part 3

The previous posts, part 1 and part 2, detailed the procedure to successfully import the data and transform the data so that we can extract some useful information from them. Now it's time to get our hands dirty with some predictive modelling.

The dependent variable here is a binary variable taking values "0" and "1", indicating whether the customer defaulted or not. There are specific, but many, modelling techniques that help us with binary dependent variables. Let's start with the one of the simplest, the Logistic Regression model. And this time, let's just dive straight into it and try to make sense as we go along.

##########-------------2.1: Logistic regression------------###############

m1.logit <- glm(default ~
amt.fac +
age.fac +
duration +
chk_acct +
history +
purpose +
sav_acct +
employment +
install_rate +
pstatus +
other_debtor +
time_resid +
property +
other_install +
housing +
other_credits +
job +
num_depend +
telephone +
foreign
, family = binomial(link = "logit"), data = dev)
# The glm command is for "Generalized Linear Models"
# "~" is the separator between the dependent (or target variable) and the independent variables
# Independent variables are separated by the "+" sign
# "data" requires the data frame in the which the variables are stored
# "family" is used to specify the assumed distribution.

The above code is pretty much self-explanatory. We are running a logistic model on our transformed data. We have specified the "logit" link which means that the estimates (or coefficient or weights) will be for the log of odds ratio and not the direct probabilities. Additionally, note that in case we had to regress the dependent variable, "default" on all the independent variables, we could have simply written
# m1.logit <- glm(default ~ ., family = binomial(link = "logit"), data = dev)
Here the "." indicates all the variables in the data set except for the dependent variable. We are not using the command here is because we have the original "amount" and "age" variables as well as the transformed "amt.fac" and "age.fac" variables as well and we should use only one form, either the continuous form or the factor form but not both.

We can check the output of the model as
summary(m1.logit)

You will see an output that looks like this
Call:
```glm(formula = default ~ amt.fac + age.fac + duration + chk_acct +
history + purpose + sav_acct + employment + install_rate +
pstatus + other_debtor + time_resid + property + other_install +
housing + other_credits + job + num_depend + telephone +
foreign, family = binomial(link = "logit"), data = dev)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.9970  -0.7118  -0.3821   0.6864   2.5393

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)        0.36717    1.28298   0.286  0.77474
amt.fac2600-5000  -0.39145    0.28742  -1.362  0.17322
amt.fac5000+       0.25237    0.40440   0.624  0.53258
age.fac30-40      -0.12406    0.26915  -0.461  0.64486
age.fac40+        -0.40640    0.30975  -1.312  0.18950
duration           0.03454    0.01152   2.998  0.00272 **
chk_acctA12       -0.26207    0.26204  -1.000  0.31725
chk_acctA13       -0.71081    0.43527  -1.633  0.10246
chk_acctA14       -1.66798    0.27841  -5.991 2.08e-09 ***
historyA31         0.51820    0.68122   0.761  0.44684
historyA32        -0.44583    0.54524  -0.818  0.41354
historyA33        -0.36003    0.58949  -0.611  0.54137
historyA34        -0.89606    0.54235  -1.652  0.09850 .
purposeA41        -1.30443    0.44483  -2.932  0.00336 **
purposeA410       -0.68765    0.79700  -0.863  0.38825
purposeA42        -0.50703    0.31109  -1.630  0.10314
purposeA43        -0.95899    0.29629  -3.237  0.00121 **
purposeA44        -1.07335    0.95318  -1.126  0.26013
purposeA45        -0.40166    0.64852  -0.619  0.53569
purposeA46         0.07051    0.48321   0.146  0.88399
purposeA48        -0.79836    1.32004  -0.605  0.54531
purposeA49        -0.63881    0.40523  -1.576  0.11493
sav_acctA62       -0.20567    0.35346  -0.582  0.56064
sav_acctA63       -0.29441    0.47750  -0.617  0.53752
sav_acctA64       -2.03887    0.74864  -2.723  0.00646 **
sav_acctA65       -0.95937    0.32147  -2.984  0.00284 **
employmentA72      0.24498    0.51552   0.475  0.63463
employmentA73      0.01948    0.50334   0.039  0.96913
employmentA74     -0.41741    0.54075  -0.772  0.44016
employmentA75      0.01317    0.50928   0.026  0.97936
install_rate       0.22383    0.10671   2.098  0.03594 *
pstatusA92        -0.71859    0.53450  -1.344  0.17882
pstatusA93        -1.19509    0.52355  -2.283  0.02245 *
pstatusA94        -0.95261    0.60277  -1.580  0.11402
other_debtorA102   0.71012    0.48442   1.466  0.14267
other_debtorA103  -0.84087    0.51928  -1.619  0.10538
time_resid         0.06092    0.10684   0.570  0.56851
propertyA122       0.43538    0.31755   1.371  0.17035
propertyA123       0.45504    0.29062   1.566  0.11740
propertyA124       0.95268    0.52675   1.809  0.07051 .
other_installA142  0.54490    0.47279   1.153  0.24911
other_installA143 -0.45903    0.28664  -1.601  0.10929
housingA152       -0.58427    0.29480  -1.982  0.04749 *
housingA153       -0.95369    0.59035  -1.615  0.10621
other_credits     -0.08412    0.24145  -0.348  0.72753
jobA172            0.51201    0.80390   0.637  0.52419
jobA173            0.56316    0.77175   0.730  0.46556
jobA174            0.09850    0.77019   0.128  0.89823
num_depend         0.38802    0.29445   1.318  0.18758
telephoneA192     -0.19916    0.24766  -0.804  0.42131
foreignA202       -1.35004    0.85752  -1.574  0.11541
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 853.51  on 699  degrees of freedom
Residual deviance: 628.93  on 649  degrees of freedom
AIC: 730.93

Number of Fisher Scoring iterations: 5```

We can see that we have five variables which are significant at 1 percent level of significance and one variable significant at 0.1 percent level of significance. Great! Or not so great! How do we judge how good/bad our model is? Have no fear, R is here.

One of the most popular methods to check the performance of a model, which has a binary independent variable, is by plotting the Receiver Operating Characteristic (ROC) curve which is a plot of the sensitivity, or the true positive rate vs. the false positive rate. In simpler terms, it plots the no. of 1s in the model correctly identified as 1 vs. the no. of 0s in the model which are incorrectly identified as 1.

Also, it makes more sense to plot the ROC curve on the validation sample. For this, we would need to score that validation data set. By "scoring" we mean that with the values of the independent variables given in the validation sample, we will run the model equation on each row of the data set and get the probability of the customer defaulting. Simply, use all the Xs and the Y-hats or the predicted Ys.
val\$m1.yhat <- predict(m1.logit, val, type = "response")
# The predict command runs the regression model on the "val" dataset and stores the estimated  y-values, i.e, the yhat.

library(ROCR)
m1.scores <- prediction(val\$m1.yhat, val\$default)
# The prediction function of the ROCR library basically creates a structure to validate our predictions, "val\$yhat" with respect to the actual y-values "val\$default"

We can then plot the ROC curve by plotting the performance fuction which checks how our model is performing.
plot(performance(m1.scores, "tpr", "fpr"), col = "red")
abline(0,1, lty = 8, col = "grey")
# "tpr" and "fpr" are arguments of the "performance" function indicating that the plot is between the true positive rate and the false positive rate.
# "col" is an argument to the plot function and indicates the colour of the line
# "abline" plots the diagonal, "lty" is the line type which is used to create the dashed line

The ROC curve's performance is judged by the area under the curve formed by the ROC curve and the 45' line. The idea is that if we randomly try to classify the customers who will default, we will get the 45' line. And if we use a predictive approach, then the approach should try to classify all the 1s and 1, i.e, correctly identify all the people who defaulted. Such a curve would then be a right triangle with vertices at (0,0), (0,1), (1,1). But since predictive modelling is usually accompanied with some error, we get a curve like we see in the graph. And the rule of thumb is, more the area under the curve, better the model.

In addition to the ROC curve, the KS statistic is also a popular metric used to judge model performance, the Kolmogorov-Smirnov statistic. It is the maximum difference between the cumulative true positive rate and the cumulative false positive rate

For this let's store the output of the "performance" function above to an R object
m1.perf <- performance(m1.scores, "tpr", "fpr")
ks1.logit <- max(attr(m1.perf, "y.values")[[1]] - (attr(m1.perf, "x.values")[[1]]))
ks1.logit

Well, the ROC shows that the model is not so great but not so bad either. So what do we do to improve our model? Do some more transformations, tweak it a bit, or just try a different technique?

I'll leave this as an open question to make it interactive. Let's try all the "feasible" approaches that people suggest that will help us improve the model.

1. Hi,

I have a problem. I am told

"Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1"

I am using the original "response" as the dependent variable. Do we need to change that to 0:1 from the current 1:2?

Regards,

Rob

2. Hi Rob

I am really sorry for the late reply. I had been travelling for the past 3 days and did not have access to an internet connection.

Anyway, coming back to your problem, we don't need to convert the "response" variable so that it lies between 0 and 1. R will do that for us. But we do need to ensure that the "response" variable is a factor variable and not numeric.

I think you are getting the problem because by default R read the "response" variable to be numeric instead of factor and then it is tried to run a logistic regression on a numeric variable instead of a factor variable. Just change it to factor and you should be good to go. You can use

g.data\$response <- as.factor(g.data\$response)

Hope this helps!

1. Obviously a late reply, but this right here saved me 2 days work. Thanks mate.

2. > str(dry_glm)
'data.frame': 104 obs. of 6 variables:
\$ wt_cu : Factor w/ 2 levels "CU","WT": 2 2 2 2 1 1 1 1 2 2 ...
Hello I am trying to perform a glm() following your example, but I am getting the same errors as Rob. Below is my str(data), can you see the bug? Thank you very much in advance,
G

\$ media : Factor w/ 3 levels "ME","PD","W": 1 1 1 1 1 1 1 1 1 1 ...
\$ yeast : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 2 ...
\$ peptone : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
\$ type : Factor w/ 2 levels "poor","rich": 2 2 2 2 2 2 2 2 2 2 ...
\$ dry_weight: num 75.3 78.2 76.9 81.5 111.1 ...

3. Hey,

1. A very basic doubt. The place where you listed down all the variables in the model equation. Is there a way in R to get a list of all the variables present in the data set. I mean do we have to manually enter the variable name separated by a "+" sign. In case where there are too many variables, this might be a tedious task. For instance, I have about 200 variables in my model, out of which I want to use say 170; how do we go about it?

2. This is more of a possible solution/comment motivated by one of the projects I m working on!!So, towards the end, you talk about "performance" of the model. I was just wondering if we can draw upon the concepts from very basic Econometrics. As we remember, a simple transformation of variables (squared/root/logistic/cube etc.) can sometimes improve the performance of the model. When we are faced with huge data, it can be a herculean task to use intuition for each variable and decide the form in which the variable enters the model. What we did in one of our projects is designed an automated process (read: macro) such that we could transform each variable and check the best form in which it can enter the model equation. What I want to ask is whether there is a way to automate this kind of a process in R. I m sure there is but forgive my ignorance if this is a very silly question! I hope I have made myself clear in my doubt.

Of course this is a very basic and elementary kind of tweaking that I can think of (and also limited in applicability since it can't be used on categorical variables!). After reading your next post on decision trees, I m convinced that you have many more things up your sleeve to help us resolve the "performace" issues.

I want to commend for the succint and simple rendering of complicated modelling for dummies like myself.

Thanks
Esha

1. Hey Esha,

I think you can use this " ls() " to list all the variables in that data set

ls(g.data)

[1] "age" "age.fac" "amount" "amt.fac" "chkacct"
[6] "default" "duration" "employment" "foreign" "history"
[11] "housing" "installrate" "job" "numdepend" "othercredits"
[16] "otherdebtor" "otherinstall" "property" "pstatus" "purpose"
[21] "response" "savacct" "telephone" "timeresid"

ARVIND

(But i guess by now you would've figured that out)

2. colnames(data)
this will also gives you the list of names of attributes present in dataset

4. Hey Esha

Thanks. Good to know that that somebody is finding the blog useful.

1. We don't necessarily have to list down all the variables manually. Like mentioned in the post as well, in case you wanted all the two hundred variables to be included you can simply write
lm(y ~ .), where "." stands for all the variables except the dependent one.

Other than that if you wanted to use, say, 170 of these variables then there a couple of convenient options. I'll list the ones I know.

A. All 170 variables are arranged in one order, say, from 20 to 190:
lm(y ~ data[, c(20:190)]

B. Some are clubbed together:
lm(y ~ data[, c(1:50, 60:100, 120:200)]

C. No proper order:
This is a little tedious but is still better than writing the variable names. You specify each column that it should take
lm(y ~ data[, c(1,2,4,5,11,17,25)]

2. I'll have to look in to your second question. Will update the post as soon as I find some valuable information.

5. Thanks a lot! That really helps!

6. Followed your example up to:
val\$m1.yhat <- predict(m1.logit, val,type="response")

R gives me the following error:
"Error in terms.default(object, data = data) :
no terms component nor attribute"

I have checked the terms and they are present, so I don't know what's going wrong.

7. Hi MK,
I am facing issues with the attach()
attach(g.data)
The following object(s) are masked from 'g.data (position 3)':

Regards,
Neha

8. Hi MK,
thanks a ton for taking so much efforts and putting this .. love it !
this is really really helpful for a beginner like me..

I ran the whole piece of codes you have written and then this struck when I ran the logistic regression part..
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels

I got confused and started searching for an answer, tried googling to understand but it was in vain, and then i stumbled upon this
g.data\$amt.fac <- as.factor(ifelse(amount <= 2500, "0-2500",
ifelse(amount <= 5000, "2600-5000", "5000+")))

if you look at the amount variable, we have the following values
[1] 5 5 5 13 7 11

So when we are actually binning this variable, we are creating a single bin 0-2500 and thats what causes the error.

I changed the bins to 0-5,5-10,10+ and the logistic regression code runs without any problem..

thanks a ton again for all the effort you have put in.. great help !!!!

1. Thank you for your kind words Vaibhav.

I had a look at the data again by reloading it and got the following output with the same command as yours

[1] 1169 5951 2096 7882 4870 9055

I guess there could be an issue where names were assigned to columns. Could you please rerun and check. If you still get an error, please do let me know. I'd be happy to run your script and mine side by side and see where the difference is.

2. ok my bad ! thanks for the clarification .. looks like when I converted the txt file to csv file the columns misaligned.. thanks again !

3. You're welcome :)

9. another thing I would want to know is, is there a way to get the variable univariate/composition before creating bins.. ?

1. Try

summary(g.data\$amount)

10. Hi MK,
I am back with a basic question again :)

How do you do variable selection? suppose I have 100 variables, how do you decide which variables should be input in the logistic step.

1. Hi Vaibhav

Variable selection is quite a complex game and usually depends on the problem at hand. For simple problem using a stepwise algorithm is sufficient. In R, it can be implemented using the stepAIC() function from the MASS package.

To use the function, first run a logisitic regression using all the variables. Then pass the created object through the stepAIC() function. It will return an object with the selected variables.

Hope this helps.

2. Thanks for the prompt response. This should definitely help.. Will give it a try and come back with more :)
thanks a ton again !

11. hai i have one problem, that i have only one independent (continuous) variable and corresponding dependent (categorical) variable. can it be modelled by logit function

12. Hi MK,

Great blog, thanks for this information.

In your example, we have many variables here; however not all are working well. If we decide to build a model of, say, 5 variables, what should we do?

I saw that you already recommended to use Stepwise function in R, however how to treat this for the categorical variables. For some of the dummy variables created, we have good significance (p-value), while for others we do not. What would you recommend to do in R?

Thanks!

1. Hi SY

We can use stepwise to determine which variables are insignificant and this will work for both continuous and dummy variables.

Typically, for a categorical variable with n categories, we put n-1 dummy variables in the regression model to prevent multi-collinearity. The remaining category is then part of the intercept. There are cases where the coefficients of some of the n-1 dummy variables come out to insignificant. We can remove these variables from the model and they will automatically be part of the intercept as well.

That said, stepwise is just one way of selecting features. There are many other ways like regularized regression, genetic algorithms, importance from decision trees to select the best features. Most of these are easily implementable in R.

MK

2. Hi MK,

Thanks for the response. In this case, I have two questions:

1) As per my experience with R, the StepAIC() function eliminates (or adds) variables step by step; however does not specify the categories per each variable. For example, if you have age.fac:0-20, age.fac:20-40 as dummy variables (and 40+ as reference variable), it simply eliminates age.fac in one of the stages if it doesn't fit in the model. Do you think such elimination is done only when none of the dummy variables (categories) work in the model?

2) You mentioned there are other ways for variable selection. It would be great if you can mention how to do these in R (or maybe in another post).

Many many thanks!
SY

13. Could you elaborate on the significance and how to interpret the out of Kolmogorov-Smirnov statistic ?

I have 0.48 as the output. What does that mean ?

14. Hi MK,

I am struck in data preparation were am trying to do binning for the data.

Is there a way were I can convert all the variables to factors at one go. becoz converting each of these variables to factors is taking a long time.

This is the error that am getting when i try to put all the variables in the model

Error in model.frame.default(formula = GCD\$Creditability ~ ., data = trainl, :
variable lengths differ

Thanks.

15. Thanks, I really appreciate your efforts.

can you please give us the predictive equation of logistic regression for this model.