Friday 30 September 2011

Modelling with R: part 1

When I started work about 3 months ago, I didn't know much more than loading data and executing standard econometric commands in R. But now I feel much, much more confident in using R for work, for research, for puzzles, and sometimes just for fun. I have learnt more R, statistics, and real-world application of econometrics than I ever did at school (and this in spite of the fact that I graduated from a school with a very strong foundation in the field).

Much of the R that I have learnt has come from various R enthusiasts who have shared their knowledge of, and creativity with, this elegant language on the internet. As I grow with R, I find it quite compelling to share my own learning journey as well. So, I have decided to describe some of the modelling techniques that are available in R. Along with this, I shall share a few insights into the foundations of, and the need for, each technique. Since all the techniques together are too big for one blog post, I have decided to divide the journey into small, focused, understandable and presumably fun exercises that will help the reader navigate through the models easily and efficiently.

For illustrating these techniques, I have taken the quite popular German credit data set, which can be downloaded from here and here. Additionally, most of the techniques that I describe here are taken from this excellent guide, so these blog posts can be thought of as an extension to it. The German credit data has 1000 rows and 21 columns, including the dependent variable, which in this case is binary: 1 means "good credit" and 2 means "bad credit". We need to predict whether a given case will be a "good credit" or a "bad credit".

Now, there are many techniques that can be used for cases like this with a dichotomous dependent variable. We will go through some of them in the following posts. For starters, here we will just try to import the data successfully and understand what kind of variables we have.



###########================================================###############
#                       Section 1: Data preparation                      #
###########================================================###############


##########----------------1.1: Reading the data----------------##########


Set the working directory to where the German credit data file is located
setwd("D:/Softwares/R/Training/")

Read the data by importing the csv file 
g.data <- read.csv("german.data.csv", header = F)
Note that by default read.csv has the header option set to TRUE. In this case, however, since we don't have the variable names and hence no header row, we need to change the default and set it to FALSE; otherwise the first row of data will be taken as the header row.

# There is a slight mistake in the above code. I came to realize this when it was pointed out in one of the comments. As it happens, the data set available at the two links above is in a space-delimited format, so the usual read.csv command will not work. I imported it as a CSV because I had converted the tab-delimited format to CSV using a spreadsheet (quite "uncool", I know, but I didn't know much R then). Anyway, thanks to another comment, we can use the following command to import the file directly.
g.data <- read.delim("german.data", header = F, sep = " ")


It's always good practice to check the first few observations and see that the data were read in correctly.
head(g.data)
# "head" displays the first six observations


We can and should also check the number of rows of data and the number of variables present.
dim(g.data)


Now, in order to make our analysis easier to understand and implement, it is advisable to identify all the variables by specifying the column names. The column names can be found in the data dictionary.
names(g.data) <- c("chk_acct", "duration", "history", "purpose", "amount", "sav_acct", "employment", "install_rate", "pstatus", "other_debtor", "time_resid", "property", "age", "other_install", "housing", "other_credits", "job", "num_depend", "telephone", "foreign", "response")
head(g.data)
# The "names" function gives the column names of the data frame. Here we are assigning names using this function in a serial manner.

Well, if I may use some corporate jargon, it is advisable to do a DQ (data quality) and DI (data integrity) check. For starters, it is important to find out the characteristics of all the variables present in the data. We can start with checking the type of each variable. For variables that we expect to be numeric or factor, we can check them as follows:
is.numeric(g.data$property)
is.factor(g.data$property)

is.numeric(g.data$age)

is.double(g.data$amount)
is.numeric(g.data$amount)

This process becomes quite a tedious exercise if we have a large number of variables, but R can make the task easier for us.
str(g.data)
# "str" will compactly display the structure of any R object

Additionally, to make the code more legible and easier to write, we can attach the object "g.data".
Then we can refer to g.data$amount simply as amount. But we need to make sure that no more than one attached object has a variable with the same name: if you have, say, two data sets data1 and data2, both containing a variable "amount", and you attach both and then call "amount", it will refer to the copy from the data set you attached most recently.
attach(g.data)
head(amount)
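
To see why the caveat above matters, here is a purely hypothetical sketch: an object called "amount" created in the workspace would be found before the attached column of the same name.
amount <- 0
head(amount)
# The workspace is searched first, so this returns 0, not the credit amounts
rm(amount)
head(amount)
# With the masking object removed, "amount" again refers to g.data$amount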


Well, this completes the data import. We have successfully managed to import the data and correctly identify all the variables present. We will continue playing with the data in the next post.

14 comments:

  1. Waiting for the next Post !! Good job.

  2. R newbie here, thanks for the post.

    While loading a data set some time back, I ran into a problem. I had two data sets: the first contained 1,000,000 observations of a single variable of integer type, and the second also contained 1,000,000 observations of a single variable, but of long type. While the first data set loaded correctly, only about 650,000 observations of the second were loaded. Any idea why this happens? I think this is because R allocates the memory at the beginning of loading, and I don't think I have enough memory for 1,000,000 long-type values.

  3. Hi Debajyoti

    So first, I don't think there are "long" type variables in R. For simple numeric variables, there are two types, integer and real (also known as numeric).

    Additionally, I don't think there should be a problem while trying to import a million observations of a single variable.

    How is the data stored on disk, i.e., in which format? And how are you trying to import it? Can you please share the code?

    I have checked again and it seems to be working. The bug was in my data-set-generating code. Apologies for not being clear. I am generating the data set in VBA; that's why the Long and Integer types came up.
    Here is the VBA code:

    Option Base 1
    Option Explicit


    Sub generate()
        Dim total(1000000) As Long
        Dim i As Long
        Dim k As Long

        Randomize
        For i = 1 To 1000000
            total(i) = 0
            Randomize
            For k = 1 To 5
                If Rnd > (5 / 26) Then
                    If Rnd <= (8 / 26) Then total(i) = total(i) + 5
                End If

                If Rnd > (7 / 26) Then
                    If Rnd <= (8 / 26) Then total(i) = total(i) + 5
                End If

                If Rnd <= (8 / 26) Then total(i) = total(i) + 10
            Next k

            Call Print_Test(total(i)) ' write the value to the output file
        Next i

    End Sub

    Function Print_Test(output As Long)
        Dim record As String
        Open "D:\data.txt" For Append As #1
        record = "" & output & vbCrLf
        Print #1, record
        Close #1
    End Function

    I am storing it in a text file and loading the data using the Import Dataset option in RStudio. It's working now. Thanks for your time!

  5. Dude, don't use attach. It can lead to all manner of problems. Use with() instead.

    Used like so: with(dataframe, function(variable)).

    It takes a bit of getting used to, but attach has caused me serious pain in the past.
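
    For example, with the data frame from this post, that pattern would look something like the line below (just a sketch; it should be equivalent to head(g.data$amount), without attaching anything).

    with(g.data, head(amount))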

  6. Oh, I didn't know about this command...

    Will definitely check it out and include it in the next post if it serves the purpose better.

    Thank you.

  7. I just realized the indecency in my behaviour.

    Thank you Venki... I appreciate it.

  8. I must be an idiot or something. I saw a couple of new things I wanted to try out in this post, but I am having a hard time finding the data you are using. The data you link to is not in comma-delimited format, and the german.data-numeric file (which looks like the one you are using) has 25 columns. Could you provide a better link to the data you are using? Thanks.

  9. I got the data to load using the german.data file from the site with the following command:

    g.data <- read.delim("german.data",header=F,sep=" ")

    A great article; I'm following the series.
    Cheers
    Brett

  10. Hey Dave... Thanks for bringing this up. I hadn't realized my mistake. I had downloaded the data from a website, which unfortunately I can't recollect, in a CSV format. At that time I didn't know that I was going to put it up online, so I didn't keep a record of the link. :(


    Regarding the files, I am using the "german.data" file (with 21 columns) and NOT the "german.data-numeric" file (with 25 columns). I really don't know what the additional four columns represent, since there is no description of them in the data dictionary. Hope this helps.

    Additionally, for importing the file into R, you can use the command suggested by CrankyMax.

    Kindly let me know if you still have a problem while importing the data.

  11. Thank you CrankyMax...

    The command's quite helpful; will add it to the post.

  12. Just getting back to this. Been busy finishing grad school. Thanks to both of you. I was able to get the data loaded. Thanks for the excellent tutorial!

  13. Can someone please upload the data set... I have tried all the commands and R still can't read the data.
