Sunday, 2 October 2011

Modelling with R: part 2

I apologize for the delay in the second post (just in case anybody was waiting); I had been very involved with work the past week. I shall try to be more regular. Well, in the previous post, we successfully imported data into R and got a basic "feel" of it by looking at the various variables present and their types. Now we will try to process the data to make sense of it. Data by themselves are just space-hogging particles, aesthetically challenged and practically worthless... unless... well... unless... we can get some information out of them. And to get that information, they need to be processed, transformed, and at times coerced. This post will describe how we can start to do that. So, let's grab the data's throat and make it spit out the ugly truth (excuse me for the histrionics). Also, that last sentence bore no relation to that horrible Gerard Butler movie.

##########----------------1.2: Processing the data------------###############

 There are three main steps involved here:
      1. Preliminary visualization
      2. Data transformation and/or variable creation
      3. Development-validation division of the data set

Let's start with 1.
Suppose we want to check the distribution of amount in the given data. We can use a simple plot command.
plot(amount, type = "l", col = "royalblue")
plot(age, type = "l", col = "brown")


Now the plots we have created present the variation in these variables in a manner that is neither easily discernible nor clean.
It would be better to create histograms to check the frequency distributions of these variables.
hist(amount)
hist(age)
# The "hist" command has a lot of options that help extend the features of the plot.

Now, with the hist command, we are able to see the picture clearly (literally). But it is not always appropriate to plot a histogram; the best choice depends largely on the problem at hand. Suppose we had multiple time series and wanted to check their behaviour; then it would be better to use the plot command, which not only presents the data more neatly but also lets us compare different time series in a single plot. For a simple example, you can check Shreyes' post.
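For instance, here is a minimal sketch with simulated data (the two random-walk series below are made up purely for illustration and have nothing to do with g.data):

set.seed(1)
ts1 <- cumsum(rnorm(100))
ts2 <- cumsum(rnorm(100))
plot(ts1, type = "l", col = "royalblue", xlab = "Time", ylab = "Value",
     ylim = range(c(ts1, ts2)))
lines(ts2, col = "brown")
# lines() overlays the second series on the same axes, making comparison easy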

Coming back, we can similarly visualize the pattern, frequency and distributions of other variables as well.

Now, in case you have ever worked on a credit scoring exercise before, you might have heard that it is better to create categories out of continuous variables. This helps a lot while implementing the model that we build, because it is more convenient to come up with strategies for individuals belonging to a particular income group than for all individuals with specific incomes.

For this we need to bin some variables like amount and age. One approach to do this is to run the following code
DO NOT run this chunk of code. I will explain why later.
# g.data$amount <- as.factor(ifelse(amount <= 2500, "0-2500",
#                                   ifelse(amount <= 5000, "2600-5000", "5000+")))
Here we are creating three categories for the variable "amount": one for amounts up to 2500, one for amounts between 2500 and 5000, and one for amounts greater than 5000.

There is an important point to note here. Above, while creating the category variable, we overwrote the original variable "amount" in the R object "g.data". Ideally this is not well advised, because if we later find an error in our code or a flaw in our logic and need to change it, we will have to redo all the steps up to this point. But there is another side here as well: R stores all the data and objects that we create in RAM, and hence, if the data set is considerably large, creating additional variables by transformation is not a very wise idea either. This trade-off needs to be balanced.
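If you want to keep an eye on this trade-off, base R can tell you how much memory an object occupies; a quick sketch:

object.size(g.data)
# the same figure, printed in kilobytes
print(object.size(g.data), units = "Kb")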

In this case, the data set is quite small, and hence it would be better to create an additional object instead of overwriting the original one.

g.data$amt.fac <- as.factor(ifelse(amount <= 2500, "0-2500", 
                                  ifelse(amount <= 5000, "2600-5000", "5000+")))
head(g.data$amt.fac)
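As an aside, base R's cut() function achieves the same binning a bit more compactly and returns a factor directly; a sketch equivalent to the ifelse() version above:

g.data$amt.fac <- cut(amount, breaks = c(0, 2500, 5000, Inf),
                      labels = c("0-2500", "2600-5000", "5000+"))

By default cut() uses intervals that are closed on the right, so the breakpoints match the "<=" comparisons used above.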

Similarly, we can do so for "age".
g.data$age.fac <- as.factor(ifelse(age<=30, '0-30', ifelse(age <= 40, '30-40', '40+')))
head(g.data$age.fac)

Here our dependent variable is "response". It is a factor variable and has "1" and "2" as the factor levels. Now, R by itself can handle factor variables, so we do not need to transform them unless we plan to combine categories. But I like to keep the response category coded as "1" (this is just out of habit and nothing else). Hence, I reassign the levels to "0" and "1".
g.data$default <- as.factor(ifelse(response == 1, "0", "1"))
is.factor(g.data$default)
contrasts(g.data$default)
head(g.data$default)

We attach the data again to include the newly created variables.
attach(g.data)

In the previous post, one of the comments introduced me to the with() command as a substitute for attach(). The with() command also serves the purpose quite well: it spares us writing the object name with the $ sign before each variable, but it needs to be wrapped around every operation that we perform on the object.
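For instance, the two calls below are equivalent ways of computing the mean of "amount" (a throwaway example):

mean(g.data$amount)
with(g.data, mean(amount))
# with() evaluates the expression inside the environment of g.data, but only for that one call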

Now, we saw that there are a lot of categorical variables present in the data. R provides many functions to plot categorical data.

Let's see an example.
mosaicplot(default ~ age.fac, col = T)
mosaicplot(default ~ job, col = T)
mosaicplot(default ~ chk_acct, col = T)

We can also use a spine plot.
spineplot(default ~ age.fac)

We can also check the relationships between variables.
library(lattice)
xyplot(amount ~ age)

In case you don't have the "lattice" library installed, you can install it by running
install.packages("lattice")

We can also condition on a variable and see the interaction.
xyplot(amount ~ age | default)

"lattice" package also has the option for a barchart and it lets you plot the barchart and a histogram for factor variable type as well.
barchart(age.fac, col = "grey")
barchart(amt.fac, col = "grey")

histogram(employment, col = "grey")
histogram(sav_acct, col = "grey")


As a last step in this stage, we need to create a development sample and a validation sample. We take about 70% of the data as the development sample and 30% as the validation sample.
d <- sort(sample(nrow(g.data), nrow(g.data)*0.7))
# The sample command creates a random sample of the row numbers of "g.data",
# selects 70% of them, and assigns the result to the object "d".
Note that the sample is drawn from the row numbers and not the rows of data themselves, so if you inspect the object "d", you will see 700 randomly selected natural numbers between 1 and 1000, which are nothing but row numbers of the data frame "g.data". The "sort" command at the beginning just arranges these randomly generated row numbers in ascending order.

Then, to create the development sample, we use R's vector indexing and assign the rows numbered in "d" to the R object "dev", and the remaining rows to the R object "val".
dev <- g.data[d, ]
val <- g.data[-d, ]

After creating the sample, we can check the size of the two samples vis-a-vis the original data.
dim(g.data)
dim(dev)
dim(val)     

Finally we have been able to domesticate the data. We have sliced and diced them according to our needs. In the next post we will try to cook them in the modelling pan. 

Friday, 30 September 2011

Modelling with R: part 1

When I started work about 3 months ago, I didn't know much more than loading data and executing standard econometric commands in R. But now I feel much, much more confident in using R for work, for research, for puzzles, and sometimes just for fun. I have learnt more R, statistics, and real-world application of econometrics in these months than I ever did at school (this in spite of the fact that I graduated from a school with a very strong foundation in the field).

Much of the R that I have learnt has come from various R enthusiasts who have shared their knowledge of, and creativity with, this elegant language over the internet. And as I grow with R, I have found it quite compelling to share my learning journey as well. So, I have decided to describe some of the modelling techniques that are available in R. Along with this, I shall keep giving a few insights into the foundation of, and the need for, each technique applied. Since all the techniques together are too big for one blog post, I have decided to divide the entire journey into small, channelized, understandable and presumably fun exercises that will help the reader navigate through the models easily and efficiently.

For illustrating these techniques, I have taken the quite popular German credit data set, which can be downloaded from here and here. Additionally, most of the techniques that I describe here are taken from this excellent guide; these blog posts can be thought of as an extension to it. The German credit data has 1000 rows and 21 columns including the dependent variable, which in this case is binary: 1 means "good credit" and 2 means "bad credit". We need to predict whether a given case will be a "good credit" or a "bad credit".

Now, there are a lot of techniques that can be used for such cases with a dichotomous dependent variable. We will go through some of them in the following posts. For starters, here we will just try to import the data successfully and understand what kind of variables we have.



###########================================================###############
#                      Section 1: Data preparation                       #
###########================================================###############


                   ##########----------------1.1: Reading the data------------############### 


Set the working directory to where the German data file is located.
setwd("D:/Softwares/R/Training/")

Read the data by importing the csv file 
g.data <- read.csv("german.data.csv", header = F)
Note that, by default, read.csv has the header option set to TRUE. In this case, however, since we don't have the variable names and hence no header row, we need to change the default and set it to FALSE; otherwise the first row of data will be taken as the header row.

# There is a slight mistake in the above code. I came to realize this when it was pointed out in one of the comments. As it happens, the data set available at the two links above is in a space-delimited format, so the usual csv command will not work. I imported it as a csv because I had converted the space-delimited file to csv using a spreadsheet (quite "uncool", I know, but I didn't know much R then). Anyway, thanks to another comment, we can use this command to import the file directly.
g.data <- read.delim("german.data",header=F,sep=" ")


It's always good practice to check the first few observations and see that the data were read in correctly.
head(g.data)
# "head" displays the first six observations


We can and should also check the number of rows of data and the number of variables present.
dim(g.data)


Now, in order to make our analysis easier to understand and implement, it is advisable to identify all the variables by specifying the column names. The column names can be found in the data dictionary.
names(g.data) <- c("chk_acct", "duration", "history", "purpose", "amount", "sav_acct", "employment", "install_rate", "pstatus", "other_debtor", "time_resid", "property", "age", "other_install", "housing", "other_credits", "job", "num_depend", "telephone", "foreign", "response")
head(g.data)
# The "names" function gives the column names of the data frame. Here we are assigning names using this function in a serial manner.

Well, if I may use some corporate jargon, it is advisable to do a DQ (data quality) and DI (data integrity) check. For starters, it is important to find out the characteristics of all the variables present in the data. We can start by checking the type of each variable. For variables that we expect to be numeric or factor, we can check them as follows.
is.numeric(g.data$property)
is.factor(g.data$property)

is.numeric(g.data$age)

is.double(g.data$amount)
is.numeric(g.data$amount)

This process becomes quite a tedious exercise if we have a large number of variables. R can make the task easier for us.
str(g.data)
# "str" will compactly display the structure of any R object

Additionally, to make the code more legible and easier to write, we can attach the object "g.data".
Then we can refer to g.data$amount simply as amount. But we need to ensure that no two attached objects share a variable name; i.e., if you have, say, two data sets data1 and data2, both containing the variable "amount", and you attach both, then calling "amount" will refer to the copy in the data set you attached most recently, which masks the earlier one.
attach(g.data)
head(amount)
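To see this masking concretely, here is a small throwaway illustration (data1 and data2 are made-up objects, not part of our exercise):

data1 <- data.frame(amount = 1:3)
data2 <- data.frame(amount = 4:6)
attach(data1)
attach(data2)
head(amount)   # returns 4 5 6: the copy from data2, the most recently attached object
detach(data2)
detach(data1)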


Well, this completes the data import. We have successfully managed to import the data and correctly identify all the variables present. We will continue playing with the data in the next post.

Tuesday, 27 September 2011

Project Euler: problem 6


The sum of the squares of the first ten natural numbers is,
1^2 + 2^2 + ... + 10^2 = 385
The square of the sum of the first ten natural numbers is,
(1 + 2 + ... + 10)^2 = 55^2 = 3025
Hence the difference between the sum of the squares of the first ten natural numbers and the square of the sum is 3025 - 385 = 2640.
Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

This one is quite simple.

# Create a vector with the first hundred natural numbers

x.1 <- 1:100
# Square each element of the vector, i.e., square each natural number, and store the result in another vector
y.1 <- x.1^2
head(y.1)
# Sum the squares, square the sum of the first hundred natural numbers, and take the difference
a.1 <- sum(y.1)
b.1 <- (sum(x.1))^2
b.1 - a.1


Answer: 25164150



After solving a couple of these problems, and after reading some solutions posted by aatrujillo here and here, I realize that my solutions are not general, i.e., they only cater to the problem at hand, and hence their scope is very limited. For example, consider the above problem: had the question asked for the same analysis on the first 200 natural numbers, I would have to rewrite the entire script. I understand that in this case it does not involve much more than changing the size of the x.1 vector, but for a problem that involves more than one loop, that seems very "uncool". As a result, I have decided to orient my solutions towards general forms and then solve each problem by specifying the parameters. Let's see how that goes. :)
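As a first step in that direction, the whole computation above can be wrapped into a function of n (the name sq.sum.diff is my own):

sq.sum.diff <- function(n) {
    x <- 1:n
    # square of the sum minus sum of the squares
    (sum(x))^2 - sum(x^2)
}
sq.sum.diff(100)   # 25164150, as above
sq.sum.diff(200)   # the 200-number case mentioned earlier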



Wednesday, 21 September 2011

Simple time series plot using R : Part 1


As a task for my financial economics assignment, I had to plot a simple time series of the overnight MIBOR (Mumbai Interbank Offer Rate) for the past one year. The job could very well have been done easily in MS-Excel, but I chose to plot it in R instead, and the quality of the graph, in terms of pixels and neatness, was way better than what I could have obtained with MS-Excel. All this at the cost of a minimal 3 lines of code:

# The overnight MIBOR rates were stored in a file named "Call_Rates_2011.csv"; this is just a normal Excel file saved in CSV (comma-separated) format that R can read.
# The way in which R conceptualizes the data is similar to that in Excel; to draw a simple analogy, you can assume that the variable "a" now stores the entire Excel spreadsheet in it.
# You will have to make sure that the working directory is the one that contains the file "Call_Rates_2011.csv".

a <- read.csv("Call_Rates_2011.csv")

# The 2 column headers in my CSV file were "date" and "mibor", so the code below plots "date" on the x-axis and "mibor" on the y-axis. The as.Date() call tells R that the column "date" contains dates in the format "day-month-year" ('%d-%b-%y').
# a$(column header) is the standard way of referring to a column in the "spreadsheet" contained in "a"
# xlab : x-axis label
# ylab : y-axis label
# type : line(l)
# col : color of the line

plot(as.Date(a$date,'%d-%b-%y'), a$mibor, xlab= "Months", ylab= "MIBOR overnight rates(percentage)", type='l', col='red')

# This is to get the titles in place
# main : main title
# col.main : color of the main title
# font.main : font style of the title (here 4, i.e. bold italic)

title(main="Overnight MIBOR rates for last one year", col.main="black", font.main=4)

And the plot thus obtained looks like this:



In case you can't make out the difference in the quality of the plot obtained, just drop in your comments and email address and I will mail you the PDF and the JPG image of the plot. You can pull/stretch it to see that the pixels don't get distorted; it looks way neater if you present it in a slide in a presentation.
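In case you want to produce such a PDF yourself, you can send the plot straight to a file instead of the screen; a minimal sketch (the output file name is my own choice):

pdf("mibor_plot.pdf")
plot(as.Date(a$date, '%d-%b-%y'), a$mibor, xlab = "Months",
     ylab = "MIBOR overnight rates(percentage)", type = 'l', col = 'red')
title(main = "Overnight MIBOR rates for last one year", col.main = "black", font.main = 4)
dev.off()
# dev.off() closes the PDF device and writes the file to the working directory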

Project Euler: problem 3


The prime factors of 13195 are 5, 7, 13 and 29.
What is the largest prime factor of the number 600851475143 ?


This one was quite easy, and much easier in R as it turns out.

The GNU Multi-Precision Library (GMP) is available as a package in R. So the only thing I had to do was install the library. The rest... well... not much...

library(gmp)
factorize(600851475143) 

# The factorize function lists down all the prime factors of the number in the parentheses in ascending order.


Answer: 6857


Well, I understand I did not do much here other than being aware of something called GMP. Don't blame me, blame them or him. But out of due respect for the only concrete language known to humankind (math, not R), I shall try to come up with a more genuine approach.
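For the record, here is one sketch of what a more genuine approach might look like: plain trial division, dividing out each factor as it is found (the function name is my own, and this is not an optimized algorithm):

largest.prime.factor <- function(n) {
    f <- 2
    while (f * f <= n) {
        if (n %% f == 0) {
            n <- n / f   # divide out the factor f, one hit at a time
        } else {
            f <- f + 1
        }
    }
    n   # whatever remains is the largest prime factor
}
largest.prime.factor(600851475143)   # 6857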

Saturday, 17 September 2011

Introduction to Beamer

A friend of mine, who is quite smart by the way (she is a PhD student in Physics at Cambridge), recently asked me for some help with Beamer. Well, most of my knowledge and code came from Utkarsh when I had started about a year ago. Initially, I had just asked him for some help with Ubuntu, then a little bit with R, and then some with LaTeX, followed by a bit of Python. This was all accompanied by constant questions from my side regarding mathematics, programming, web scraping, regular expressions, etc. Most of the time he ended up writing the entire code for me. Though this was more or less like spoon-feeding a kid, he made it a point to ensure that I understood what the code was doing. After writing a lot of code for me and fixing a lot of bugs in the code that I wrote, I finally started to get some confidence while writing my own scripts. A year later, I realize that much of his teaching did not emphasize that I get the "grammar" of a script right, but that I got its "approach" right. To be more clear, it's more important to know exactly what we are trying to achieve through the code than the particular way to do it. If you know what you want, you'll figure out a way to get it.

Well, when my friend asked for some help with Beamer, I didn't know exactly how to go about it. Should I forward her some documentation stored on my computer which I had never read myself? Well, I did do that. But the idea was a bit discomforting. Also, all the time that Utkarsh spent with an idiot like me should not go to waste. So I tried to recollect some of the things he taught me, went through some of the Beamer presentations I had made in college, and organized it all into a TeX document.

Hope it helps...

% This is a very humble attempt to introduce fellow students to the amazing world of LaTeX, and more specifically Beamer.
% This by no means is a guide or a reference manual for Beamer. It is simply a beginning. If you are planning your first presentation, this should serve the purpose and at the same time arouse curiosity to delve deeper.
% Also, I would like to point out that most of the code here has been acquired from various secondary sources and is not
% my creation. I have just assembled some information that I have into an organized document. I sincerely thank all those creative individuals who were generous enough to share their solutions with the world.

\documentclass[]{beamer}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{tabls}
\usepackage{booktabs}
\usepackage{float}
\usepackage{array}
\usepackage[english]{babel}
\usepackage{hyperref}
\usepackage{subfigure} % This package is a very neat way to accommodate more than one graphic on one slide. A sample command is included for reference.

\newenvironment{wideitemize}{\itemize\addtolength{\itemsep}{10pt}}{\enditemize}
% Typically, the itemize environment does not provide adequate space between bullet points
% and hence can look a bit cluttered at times. The new "wideitemize" environment solves this issue.
% To use it, just put your regular "\item" lines between "\begin{wideitemize}" and "\end{wideitemize}".

\hypersetup{% This can be left blank if information is not available
pdfauthor = {Author's name},
pdftitle = {Title for the pdf file},
pdfsubject = {},
pdfkeywords = {},
pdfcreator = {LaTeX with hyperref package},
pdfproducer = {dvips + ps2pdf}}

\mode<presentation>
{ % This is a representative list of themes available in Beamer. Shreyes prefers the Warsaw theme; I prefer Pittsburgh, it's much neater.
  % Also, CambridgeUS is quite popular among academics.
     %\usetheme{Warsaw}
     %\usetheme{Rochester}
     %\usetheme{Madrid}
     %\usetheme{Pittsburgh}
     %\usetheme{Antibes}
     %\usetheme{Montpellier}
     %\usetheme{Berkeley}
     %\usetheme{PaloAlto}
     %\usetheme{Goettingen}
     %\usetheme{Marburg}
     %\usetheme{Hannover}
     %\usetheme{Berlin}
     %\usetheme{Ilmenau}
     %\usetheme{Dresden}
     %\usetheme{Darmstadt}
     %\usetheme{Frankfurt}
     %\usetheme{Singapore}
     %\usetheme{Szeged}
     %\usetheme{Copenhagen}
     %\usetheme{Malmoe}
      \usetheme{CambridgeUS}
      \setbeamercovered{transparent = 28}
}


\usefonttheme{professionalfonts}
\usecolortheme[rgb={0.01,0.18,0.42}]{structure}
\useinnertheme{rounded}


\title{Sample Presentation}
\author{Author} \institute{My Institute}

\setbeamertemplate{footline}[frame number] 
% This shows the slide number at the bottom of the page. May or may not be required depending on the theme you use.

\beamertemplatenavigationsymbolsempty 
% This disables the navigation button in the lower right corner of the generated pdf file.


\begin{document}
\begin{frame} % Just so we are clear, a frame here means a slide.
\titlepage
\end{frame}


\AtBeginSection[] % This will display the list of contents before the beginning of each section
{
\begin{frame}
      \frametitle{Outline}
      \tableofcontents[currentsection]
\end{frame}
}


\begin{frame} % You can use this if you just want the list of contents to be displayed once at the beginning.
      \tableofcontents
\end{frame}


\section{Introduction}


\begin{frame}
      \frametitle{A simple sample frame}
            \begin{wideitemize}
                 \item Point 1
                 \pause
                 \item Point 2
                 \pause
                 \item Notice the space between the points.
                 \pause
                 \item Comparatively, more generous than the normal ``itemize'' environment
                 \pause
                \item Also, the sequential appearance of points is handled with the \textbackslash pause command.
            \end{wideitemize}
\end{frame}


\section{Part I}


\begin{frame}{Frames you can replicate} % You can include the frame title here as well.
      \begin{itemize}
            \item 1
            \item 2
            \item 3
            \item 4
            \item 5
      \end{itemize}
\end{frame}


\begin{frame}{Another one you can replicate} % This one uses wideitemize
      \begin{wideitemize}
            \item 1
            \item 2
            \item 3
            \item 4
            \item 5
      \end{wideitemize}
\end{frame}


\begin{frame}{Nested environments}
      \begin{wideitemize}
             \item This is wideitemize environment
             \item You can nest environments to formulate a 
hierarchical structure
              \item See below

                   \begin{itemize}
                       \item This is itemize
                       \pause
                       \item Notice the space between bullets
                       \pause
                       \item The bullets can be changed to number using the enumerate environment
                       \pause
                       \item \LaTeX{} is cool
                       \pause
                       \item I am glad you chose to give it a
thought
                       \pause
                       \item It will change the way you work
                   \end{itemize}

      \end{wideitemize}
\end{frame}


\section{Part II}


\begin{frame}{subfigure example}
      \begin{figure}
            \centering
            \subfigure[Figure 1]{
            \includegraphics[scale=0.4]{name_of_figure1_file.pdf}
            \label{model1}}  % Try putting labels for every figure, table, and equation. It will become easier to navigate if the document is huge.           
            \subfigure[Figure 2]{
            \includegraphics[scale=0.4]{name_of_figure2_file.pdf}
            \label{model2}}
      \end{figure}
\end{frame}


\begin{frame}{Math in \LaTeX}
      \begin{enumerate} 
          \item Use the ``align" or ``gather" environment for math
             \item They are neater
             \item You can also use the ``equation" and ``eqnarray" envrionments
             \item Also, notice that this is the enumerate environment
             \item Hence the numbers instead of bullets
       \end{enumerate}
\end{frame}


\centering
     \frame[plain]{
           \vspace{2cm}
                  {\huge Thank you} % Huge makes the text size, well, Huge
           \vspace{3cm}
           \begin{flushright}
                  Author name and contact
           \end{flushright}
}


\end{document}  
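If you are new to this, save the code above in a file, say presentation.tex (the name is arbitrary), and compile it from the command line with pdflatex:

pdflatex presentation.tex

You may need to run it twice for the table of contents frames to come out right.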
   

Friday, 16 September 2011

Project Euler: problem 2




Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms.



# Initiate a vector x with two values, 1 and 2, the starting points for the Fibonacci series
x <- c(1,2)
length(x)


# Take an object "i" with a starting value of 1.
# This object will be used as an index for the vector "x".
# We keep adding the (n - 1)th term and the (n - 2)th term
# to get the nth term, and the process continues until the
# element of x at index "i" crosses the 4,000,000 mark.
# (The loop overshoots slightly and appends a couple of terms
# above 4,000,000, but since those extra terms happen to be odd,
# the final sum of even terms is unaffected.)
i <- 1
while (x[i] < 4000000) {
    i <- i + 1
    x.index <- length(x)
    x[x.index + 1] <- x[x.index] + x[x.index - 1]
}
x

# Sum the even values of the Fibonacci series thus obtained
sum(x[x %% 2 == 0])
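The computation can also be wrapped into a reusable function of the limit; a minimal sketch (the name even.fib.sum is my own):

even.fib.sum <- function(limit) {
    x <- c(1, 2)
    # keep appending terms while the next Fibonacci number stays within the limit
    while (sum(tail(x, 2)) <= limit) {
        x <- c(x, sum(tail(x, 2)))
    }
    sum(x[x %% 2 == 0])
}
even.fib.sum(4000000)   # 4613732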


Answer : 4613732