I apologize for the delay in the second post (just in case anybody was waiting); I have been very involved with work this past week. I shall try to be more regular. Well, in the previous post, we successfully imported data into R and got a basic "feel" of it by looking at the various variables present and their types as well. Now we will try to process the data to make sense of it. Data by themselves are just space-hogging particles, aesthetically challenged, and practically worthless... unless... well... unless... we can get some information out of them. And to get that information, they need to be processed, transformed, and at times coerced. This post will describe how we can start to do that. So, let's grab the data by the throat and make it spit out the ugly truth (excuse me for the histrionics). Also, this last sentence bears no relation to that horrible Gerard Butler movie.
##########----------------1.2: Processing the data------------###############
There are three main steps involved here:
1. Preliminary visualization
2. Data transformation and/or variable creation
3. Development-validation division of data set
Let's start with 1.
Suppose we want to check the distribution of amount in the given data. We can use a simple plot command.
plot(amount, type = "l", col = "royalblue")
plot(age, type = "l", col = "brown")
Now, the plots we have created present the variation in these variables in a manner that is neither easily discernible nor clean.
It would be better if we created histograms to check the frequency distribution of these variables.
# The "hist" command has a lot of options that help extend the features of the plot.
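The hist calls themselves are not shown here, so below is a minimal sketch of what they might look like. The two toy vectors are simulated stand-ins (an assumption on my part, not the real data); with the actual data attached you would pass amount and age directly. The breaks, main and xlab arguments are standard hist options.

```r
# Simulated stand-ins for the real "amount" and "age" columns (assumption:
# the real data are not reproduced here, so we fake something roughly similar).
set.seed(1)
amount.toy <- round(rlnorm(1000, meanlog = 7.8, sdlog = 0.8))
age.toy    <- sample(19:75, 1000, replace = TRUE)

# "breaks" is only a suggestion for the number of bins; hist() may adjust it
# to get "pretty" break points.
h.amount <- hist(amount.toy, breaks = 20, col = "royalblue",
                 main = "Distribution of amount", xlab = "amount")
h.age    <- hist(age.toy, breaks = 10, col = "brown",
                 main = "Distribution of age", xlab = "age")

# hist() returns the bin counts invisibly, which is handy for a quick check:
# the counts should add up to the number of observations.
sum(h.amount$counts)
```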
Now, with the hist command, we are able to see the picture clearly (literally). But it is not always appropriate to plot a histogram. The best approach depends largely on the problem at hand. Suppose we had multiple time series and wanted to check their behaviour; then it would be better to use the plot command, which will not only present the data in a much neater way but also enable us to compare different time series in a single plot. For a simple example, you can check Shreyes' post.
Coming back, we can similarly visualize the pattern, frequency and distributions of other variables as well.
Now, in case you have ever worked on a credit scoring exercise before, you might have heard that it is better to create categories out of continuous variables. This helps a lot while implementing the model that we build, because it is more convenient to come up with strategies for individuals belonging to a particular income group than for all individuals with specific incomes.
For this we need to bin some variables like amount and age. One approach to do this is to run the following code:
DO NOT run this chunk of code. I will explain later why.
# g.data$amount <- as.factor(ifelse(amount <= 2500, "0-2500",
#                            ifelse(amount <= 5000, "2501-5000", "5000+")))
Here we are creating three categories for the variable "amount": one for amounts up to 2500, one for amounts above 2500 and up to 5000, and one for amounts greater than 5000.
There is an important point to note here. Above, while creating the category variable, we overwrote the original variable "amount" in the R object "g.data". This is generally not advisable, because if we later find an error in our code or a flaw in the logic and need to change it, we will have to redo all the steps that we have done up to this point. But there is another side here as well. R, while working, stores all the data and the objects that we create in RAM, and hence if the data set is considerably large, creating additional variables by transformation is not a very wise idea either. This trade-off needs to be balanced.
In this case, the data set is quite small and hence it would be better if we create an additional variable instead of overwriting the original one.
g.data$amt.fac <- as.factor(ifelse(amount <= 2500, "0-2500",
                            ifelse(amount <= 5000, "2501-5000", "5000+")))
Similarly, we can do so for "age".
g.data$age.fac <- as.factor(ifelse(age<=30, '0-30', ifelse(age <= 40, '30-40', '40+')))
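As an aside, base R's cut() function can do the same binning without nesting ifelse calls. The sketch below uses a handful of made-up amounts (not the real data) sitting on and around the cut points, and the labels are illustrative; it simply checks that both approaches put every value in the same bin.

```r
# Toy values chosen to sit on and around the cut points (illustrative only).
amount.toy <- c(250, 2500, 2501, 5000, 5001, 18424)

# Nested ifelse, same logic as the binning described above.
by.ifelse <- as.factor(ifelse(amount.toy <= 2500, "0-2500",
                       ifelse(amount.toy <= 5000, "2501-5000", "5000+")))

# Same bins with cut(): intervals are closed on the right by default,
# i.e. (-Inf, 2500], (2500, 5000], (5000, Inf].
by.cut <- cut(amount.toy, breaks = c(-Inf, 2500, 5000, Inf),
              labels = c("0-2500", "2501-5000", "5000+"))

# Cross-tabulate the two results; everything should fall on the diagonal.
table(by.ifelse, by.cut)
```

With more than three or four bins, the cut() version tends to be easier to read and to change, since the break points and the labels each live in a single vector.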
Here our dependent variable is "response". It is a factor variable and has "1" and "2" as the factor levels. Now, R by itself can handle factor variables and so we do not need to transform them unless we plan to combine categories. But I like to keep the response coded as "0" and "1" (this is just because of habit and nothing else). Hence, I recode level "1" as "0" and level "2" as "1".
g.data$default <- as.factor(ifelse(response == 1, "0", "1"))
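A quick way to convince yourself that a recoding like this did what you intended is to cross-tabulate the old and the new variable. Here is a small sketch on a made-up response vector (an assumption; the real "response" column is not reproduced here).

```r
# Toy response with the same coding as the data: levels "1" and "2".
response.toy <- factor(c(1, 2, 1, 1, 2))

# Same recoding as above: "1" becomes "0", everything else becomes "1".
default.toy <- as.factor(ifelse(response.toy == 1, "0", "1"))

# Every "1" should land in "0" and every "2" in "1";
# the other two cells of the table should be zero.
table(response = response.toy, default = default.toy)
```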
We attach the data again, with attach(g.data), to include the newly created variables.
In the previous post, one of the comments introduced me to the with() command as a substitute for attach(). The with() command also serves the purpose quite well. It spares us the pain of writing the object name with the $ sign before every variable, but it has to be wrapped around every operation that we perform on the object.
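To make the difference concrete, here is a minimal sketch of with() on a tiny made-up data frame (the name "toy" and its values are hypothetical, not the real data).

```r
# Small stand-in data frame (hypothetical values).
toy <- data.frame(amount = c(1200, 4800, 7600, 2300),
                  age    = c(25, 37, 52, 44))

# with() evaluates the expression inside the data frame,
# so we need neither "$" nor attach().
mean.amount <- with(toy, mean(amount))

# Creating a new variable still needs the "$" on the left-hand side,
# but the right-hand side can use with().
toy$age.fac <- with(toy, as.factor(ifelse(age <= 30, "0-30",
                                   ifelse(age <= 40, "30-40", "40+"))))
table(toy$age.fac)
```

One practical upside is that nothing needs to be attached again after adding a variable, since with() always looks at the current state of the data frame.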
Now, we saw that there are a lot of categorical variables present in the data. R provides many functions to plot categorical data.
Let's see an example.
mosaicplot(default ~ age.fac, color = TRUE)
mosaicplot(default ~ job, color = TRUE)
mosaicplot(default ~ chk_acct, color = TRUE)
We can also use a spine plot.
spineplot(default ~ age.fac)
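If you want the numbers behind such plots, a contingency table with row proportions gives roughly the same information in text form. A small sketch on made-up factors (hypothetical stand-ins for "age.fac" and "default"):

```r
# Hypothetical stand-ins for "age.fac" and "default".
age.fac.toy <- factor(c("0-30", "0-30", "30-40", "40+", "40+", "0-30"))
default.toy <- factor(c("1", "0", "0", "0", "1", "1"))

# Raw counts for each combination of age group and default level.
counts <- table(age.fac.toy, default.toy)

# margin = 1 gives row proportions: the share of each default level
# within an age group, so every row adds up to 1.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```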
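We can also check the relations between variables using xyplot from the "lattice" package. In case you don't have "lattice" installed, you can download it by running install.packages("lattice"); then load it with library(lattice) before calling xyplot.
xyplot(amount ~ age)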
We can also condition on a variable and see the interaction
xyplot(amount ~ age | default)
The "lattice" package also has barchart and histogram functions, which let you plot bar charts and histograms for factor variables as well.
barchart(age.fac, col = "grey")
barchart(amt.fac, col = "grey")
histogram(employment, col = "grey")
histogram(sav_acct, col = "grey")
As a last step in this stage, we need to create a development sample and a validation sample. We take about 70% of the data as the development sample and 30% as the validation sample.
d <- sort(sample(nrow(g.data), nrow(g.data)*0.7))
# The sample command here draws, without replacement, a random sample of 70% of the row numbers
# of "g.data" and assigns it to object "d".
Note that here the sample is being generated from the row numbers and not the exact rows of data, so that if you see the object "d", you will see 700 randomly selected natural numbers between 1 and 1000, which are nothing but the row numbers in the data frame "g.data".
# The "sort" command in the beginning just sorts these randomly generated row numbers in ascending order.
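Since the draw is random, you will get a different set of rows every time you run it. If you want to be able to reproduce the same split later, one option is to fix the seed first. A small sketch on a stand-in data frame (the name "toy" and the seed value 42 are arbitrary choices of mine):

```r
# Stand-in for g.data: any data frame with 1000 rows will do for the illustration.
toy <- data.frame(id = 1:1000)

set.seed(42)  # arbitrary seed; fixes the random draw
d1 <- sort(sample(nrow(toy), nrow(toy) * 0.7))

set.seed(42)  # same seed again, so the same row numbers are drawn
d2 <- sort(sample(nrow(toy), nrow(toy) * 0.7))

identical(d1, d2)  # both draws pick exactly the same rows
length(d1)         # number of row numbers drawn
```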
Then to create the development sample, we use the vector properties of R and assign the "d" rows to the R object "dev", and the remaining to the R object "val".
After creating the sample, we can check the size of the two samples vis-a-vis the original data.
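The two steps just described might look like the sketch below. It runs on a made-up 1000-row data frame called "toy" (an assumption, standing in for "g.data"); the indexing itself is plain base R.

```r
# Stand-in for g.data (hypothetical 1000-row data frame).
set.seed(1)
toy <- data.frame(id = 1:1000, amount = round(runif(1000, 250, 18000)))

# Row numbers for the development sample, as in the step above.
d <- sort(sample(nrow(toy), nrow(toy) * 0.7))

dev <- toy[d, ]    # rows whose numbers are in d
val <- toy[-d, ]   # negative indexing drops those rows, leaving the rest

# Compare the sizes of the two samples with the original data.
dim(toy)
dim(dev)
dim(val)
```

Because "val" is built from everything that is not in "d", the two samples never share a row and together they cover the whole data set.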
Finally we have been able to domesticate the data. We have sliced and diced them according to our needs. In the next post we will try to cook them in the modelling pan.