When I started work about three months ago, I didn't know much more than how to load data and run standard econometric commands in R. Now I feel much, much more confident using R for work, for research, for puzzles, and sometimes just for fun. I have learnt more R, statistics, and real-world application of econometrics than I ever did at school (and this in spite of graduating from a school with a very strong foundation in the field).
Much of the R I have learnt has come from the many R enthusiasts who share their knowledge of, and creativity with, this elegant language over the internet. As I grow with R, I find it compelling to share my own learning journey as well. So I have decided to describe some of the modelling techniques available in R, and along the way I shall give a few insights into the foundations of each technique and the need for it. Since the techniques are too big for one blog post, I have decided to divide the journey into small, channelized, understandable and presumably fun exercises that will help the reader navigate through the models easily and efficiently.
To illustrate these techniques, I have taken the quite popular German credit data set, which can be downloaded from here and here. Additionally, most of the techniques that I describe here are taken from this excellent guide; these blog posts can be thought of as an extension to that guide. The German credit data has 1000 rows and 21 columns including the dependent variable, which in this case is binary: 1 means "good credit" and 2 means "bad credit". We need to predict whether a given case will be a "good credit" or a "bad credit".
Now, there are many techniques that can be used when the dependent variable is dichotomous, and we will go through some of them in the following posts. For starters, here we will simply import the data and try to understand what kinds of variables we have.
###########================================================###############
# Section 1: Data preparation #
###########================================================###############
##########----------------1.1: Reading the data------------###############
Set the working directory to where the German data file is located
setwd("D:/Softwares/R/Training/")
Read the data by importing the CSV file
g.data <- read.csv("german.data.csv", header = F)
Note that by default read.csv has the header option set to TRUE. In this case, however, since we don't have variable names and hence no header row, we need to change the default and set it to FALSE; otherwise the first row of data will be taken as the header row.
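As a quick sanity check (a minimal sketch; g.data.wrong is a throwaway name of my own, not part of the analysis), reading with the wrong header setting silently consumes the first data row as column names, which nrow reveals:
g.data.wrong <- read.csv("german.data.csv", header = TRUE)
nrow(g.data.wrong)
# one row fewer than expected, because the first record became the header
nrow(g.data)
# should be 1000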
# There is a slight mistake in the above code, which I realized when it was pointed out in one of the comments. As it happens, the data set available at the two links above is space-delimited, so the usual read.csv command will not work on the raw file. I imported it as a CSV because I had converted the delimited file to CSV using a spreadsheet (quite "uncool", I know, but I didn't know much R then). Anyway, thanks to another comment, we can use this command to import the raw file directly.
g.data <- read.delim("german.data", header = FALSE, sep = " ")
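Equivalently (a small sketch, assuming the raw UCI file), we can use read.table, the general-purpose reader of which read.csv and read.delim are just convenience wrappers; its default separator is any whitespace, so it handles the space-delimited file without extra options.
g.data <- read.table("german.data", header = FALSE)
# read.table splits on whitespace by default, so sep need not be specified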
It's always good practice to check the first few observations and see that the data were read in correctly.
head(g.data)
# "head" displays the first six observations
We can and should also check the number of rows of data and the number of variables present.
dim(g.data)
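Equivalently, nrow and ncol return the two dimensions separately, which is handy inside scripts:
nrow(g.data)
# number of observations; should be 1000
ncol(g.data)
# number of variables; should be 21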
Now, in order to make our analysis easier to understand and implement, it is advisable to identify all the variables by specifying the column names. The column names can be found in the data dictionary.
names(g.data) <- c("chk_acct", "duration", "history", "purpose", "amount", "sav_acct", "employment", "install_rate", "pstatus", "other_debtor", "time_resid", "property", "age", "other_install", "housing", "other_credits", "job", "num_depend", "telephone", "foreign", "response")
head(g.data)
# The "names" function gives the column names of the data frame. Here we are assigning names using this function in a serial manner.
Well, if I may use some corporate jargon, it is advisable to do a data quality (DQ) and data integrity (DI) check. For starters, it is important to find out the characteristics of all the variables present in the data. We can start by checking the type of each variable. For variables that we expect to be numeric or factor, we can check as follows:
is.numeric(g.data$property)
is.factor(g.data$property)
is.numeric(g.data$age)
is.double(g.data$amount)
is.numeric(g.data$amount)
This process becomes quite a tedious exercise if we have a large number of variables, but R can make the task easier for us.
str(g.data)
# "str" will compactly display the structure of any R object
Additionally, to make the code more legible and easier to write, we can attach the object "g.data".
Then we can refer to g.data$amount simply as amount. But we need to make sure that no two attached objects share a column name: if we have, say, two data sets data1 and data2 that both contain a variable "amount", and we attach both, then a bare "amount" will refer to the version in the most recently attached data set, because attach() places each object nearer the front of R's search path.
attach(g.data)
head(amount)
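To see the masking caveat in action, here is a hedged sketch with two throwaway data frames (data1 and data2 are hypothetical, not part of the credit data):
data1 <- data.frame(amount = 1:3)
data2 <- data.frame(amount = 4:6)
attach(data1)
attach(data2)
# R warns that 'amount' is masked from data1
amount
# resolves to data2$amount, the most recently attached
detach(data2)
detach(data1)
# detach when done, or later bare names may silently pick up stale copies
For this reason many R users avoid attach altogether and stick to the g.data$amount form, or use with(g.data, ...) for one-off expressions.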
Well, this completes the data import. We have successfully managed to import the data and correctly identify all the variables present. We will continue playing with the data in the next post.