Friday 30 September 2011

Modelling with R: part 1

When I started work about 3 months ago, I didn't know much more than loading data and executing standard Econometric commands in R. But now I feel much much much more confident in using R for work, for research, for puzzles, and sometimes just for fun. I have learnt a lot of R, statistics, and real world application of Econometrics than I ever did at school (this in spite of the fact that I graduated from a school which has a very strong foundation in the field).

Much R that I have learnt has come from various R enthusiasts who have shared their knowledge and creativity of this elegant language over the internet. And as I grow with R, I found it quite compelling to share my learning journey as well. So, I have decided to describe some of the modelling techniques that are available in R. Along with this, I shall keep giving a few insights regarding the foundation and the need for the technique applied. Since all the techniques are quite big for one blog post, I have decided to divide entire journey into small, channelized, understandable and presumably fun exercises that will help the reader navigate through the models easily and efficiently.

For illustrating these techniques, I have taken the quite popular German credit data set which can be downloaded from here and here. Additionally, most of the techniques that I describe here are taken from this excellent guide. These blog posts can be thought of an extension to this guide. The German credit data has 1000 rows and 21 columns including the dependent variable, which in this case is binary- 1 means "good credit" and 2 means "bad credit". We need to predict whether a given case example will be a "good credit" or a "bad credit". 

Now, there are a lot of techniques that can be used for such cases which have a dichotomous dependent variable. We will go through some of them in the following posts. For starters, here we will just try to import the data successfully and try to understand what kind of variables do we have.



###########================================================###############
#                                                     Section 1: Data preparation                                                      #
###########================================================###############


                   ##########----------------1.1: Reading the data------------############### 


Set the directory to where the german data file is located
setwd("D:/Softwares/R/Training/")

Read the data by importing the csv file 
g.data <- read.csv("german.data.csv", header = F)
Note that by default the read.csv file has header the option set to TRUE. In this case however, since we don't have the variable names and hence the header row, we need to change the default option and set it to FALSE otherwise the first row of data will be taken as the header row.

# There is a slight mistake in the above code. I came to realize this when it was pointed out in one of the comments. As it happens, the data set available at the two links above is in a "space" delimited format. So the usual csv command will not work. I imported it as a csv because I had converted the tab delimited format to csv using a spread sheet (quite "uncool" I know, but I didn't know much R then). Anyway, thanks to another comment, we can use this command to import the file.
g.data <- read.delim("german.data",header=F,sep=" ")


It's always good practice to check the first few observations and see that the data were read in correctly.
head(g.data)
# "head" displays the first six observations


We can and should also check the number of rows of data and the number of variables present.
dim(g.data)


Now, in order to make our analysis easier to understand and implement, it is advisable to identify all the variables by specifying the column names. The column names can be found in the data dictionary.
names(g.data) <- c("chk_acct", "duration", "history", "purpose", "amount", "sav_acct", "employment", "install_rate", "pstatus", "other_debtor", "time_resid", "property", "age", "other_install", "housing", "other_credits", "job", "num_depend", "telephone", "foreign", "response")
head(g.data)
# The "names" function gives the column names of the data frame. Here we are assigning names using this function in a serial manner.

Well, if I may use some corporate jargon, it is advisable to do a DQ and DI check. For starters, it is important to find out the characteristic of all the variables present in the data. We can start with checking the type for each variable. For variables that we expect to be numeric or factor we can check them as
is.numeric(g.data$property)
is.factor(g.data$property)

is.numeric(g.data$age)

is.double(g.data$amount)
is.numeric(g.data$amount)

This process becomes quite a tedious exercise if we have a large number of variables R can make the task easier for us.
str(g.data)
# "str" will compactly display the structure of any R object

Additionally, to make the code more legible and easier to write, we can attach the object "g.data".
Then we can refer to g.data$amount simply as amount. But we need to ensure that there are not more than one objects with same name, i.e, if you have say two data sets data1 and data2, and both the data sets have the variable "amount", then if we attach both the data sets and call "amount", then "amount" will refer to the former object we attached. 
attach(g.data)
head(amount)


Well, this completes the data import. We have successfully manged to import the data and correctly identify all the variables present. We will continue playing with data in the next post.     

Tuesday 27 September 2011

Project Euler: problem 6


The sum of the squares of the first ten natural numbers is,
12 + 22 + ... + 102 = 385
The square of the sum of the first ten natural numbers is,
(1 + 2 + ... + 10)2 = 552 = 3025
Hence the difference between the sum of the squares of the first ten natural numbers and the square of the sum is 3025 - 385 = 2640.
Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

This is one quite simple.

# Create a vector with the first hundred natural numbers

x.1 <- (1:100)
# Square each element of the vector of natural numbers, i.e, square each natural number and store in another vector
y.1 <- x.1^2
head(y.1)
# Sum the first hundred natural numbers and the squares of each of the first hundred natural numbers and take the difference of these sums
a.1 <- sum(y.1)
b.1 <- (sum(x.1))^2
b.1 - a.1
b.1


Answer: 25164150



After solving a couple of these problems, and after reading some solutions posted by aatrujillo here and here, I realize that my solutions are not general, i.e., they only cater to the problem at hand and hence their scope is very limited. For example, consider the above problem. Had the question asked to do the same analysis on the first 200 natural numbers, I would have to rewrite the entire loop again. I understand that in this case it does not involve much more than changing the size of the x.1 vector, but for a problem that involves more than one loop, it seems to be very "uncool". As a result, I have decided to orient my results towards general solutions and then solve the problem by specifying the parameters. Let's see how that goes. :)



Wednesday 21 September 2011

Simple time series plot using R : Part 1


As a task for my Financial eco assignment I had to plot a simple time series of the overnight MIBOR(Mumbai interbank offer rates) for the past one year . The job could very well have been done easily in MS-Excel but I choose to plot it in R instead and the quality of the graph, pixel-wise and neatness wise, was way better than what I could have obtained with MS-Excel. All this at the cost of a minimal 3 lines of code:

# The overnight MIBOR rates were stored in a file name "Call_Rates_2011.csv", this is just a normal Excel file saved in a CSV(comma separated delimited) format that R can read.
# The way in which R conceptualizes the data is similar to that in Excel, to draw a simple analogy you can assume that the variable "a" now stores the entire Excel spreed sheet in it.
# You will have to make sure that the working directory is the one that contains the file "Call_Rates_2011.csv"

a <- read.csv("Call_Rates_2011.csv")

# The 2 column headers in my CSV file were "date" and "mibor", so the below code plots "date" on the x-axis and "mibor" on the y-axis. The as.Date() tells R that the column "date" contains dates in the format "day-month-year"('%d-%b-%y').
# a$(column header) is the standard way of referring to a column in the "spreadsheet" contained in "a"
# xlab : x-axis label
# ylab : y- axis label
# type : line(l)
# col : color of the line

plot(as.Date(a$date,'%d-%b-%y'), a$mibor, xlab= "Months", ylab= "MIBOR overnight rates(percentage)", type='l', col='red')

# This is to get the titles in place
# main : main title
# col.main : color of the main title
#font.main : font size of the title

title(main="Overnight MIBOR rates for last one year", col.main="black", font.main=4)

And the plot hence obtained thus looks like:



Incase you can't make out the difference in the quality of the plot obtained just drop in your comments and email address and I will mail you the pdf and the jpg image of the plot. You can pull/stretch it to see that the pixels don't get distorted and it looks way neater if you present it in your slide in a presentation.

Project Euler: problem 3


The prime factors of 13195 are 5, 7, 13 and 29.
What is the largest prime factor of the number 600851475143 ?


This one was quite easy, and much easier in R as it turns out.

The GNU Multi-Precision Library (GMP) is available as a package in R. So the only thing I had to do is install the library. Rest... well... not much...

library(gmp)
factorize(600851475143) 

# The factorize function lists down all the prime factors of the number in the parentheses in ascending order.


Answer: 6857


Well, I understand I did not do much here than just being aware of something called the GMP. Don't blame me, blame them or him. But in due respect for the only concrete language known to humankind (Math not R), I shall try to come up with a more genuine approach.

Saturday 17 September 2011

Introduction to Beamer

A friend of mine, who is quite smart by the way (she is a PhD. student in Physics at Cambridge), recently asked me for some help with Beamer. Well most of my knowledge and code came from Utkarsh when I had started about a year ago. Initially, I had just asked him for some help with Ubuntu, then a little bit with R, and then some with LaTeX, followed a bit by Python. This was all accompanied with constant questions from my side regarding mathematics, programming, web scarping, regular expressions etc. Most of the times he ended up writing the entire code for me. Though this was more or less like spoon-feeding a kid, he made it a point to ensure that I understood what the code was doing. After writing a lot of code for me and fixing a lot of bugs in the code that I wrote, I finally started to get some confidence while writing my own scripts. A year later I realize that much of teaching much did not emphasize that I get the "grammar" of script right, but that I got the "approach" of the script right. To be more clear, it's more important to know exactly what we want are trying achieve through the code rather than the way to do it. If you know what you want, you'll figure out a way to get it.

Well, when my friend asked for some help with Beamer, I didn't know how to exactly go about it. Should I forward her some documentation that I have stored in my computer which I never read myself? Well, I did do that. But the idea was a bit discomforting. Also, all the time that Utkarsh spent with an idiot like me should not go waste. So I tried to recollect some of the things that he taught me, scrapped through some of my beamer presentations that I had made in college and organized it into a TeX document.

Hope it helps...

% This is a very humble attempt to introduce fellow students to amazing world of LaTeX, and more specifically Beamer.
% This by no means is a guide or a reference manual for Beamer. It is simply a beginning. If you are planning your first presentation, this should serve the purpose and at the same time arouse curiosity to delve deeper.
% Also, I would like to point out that most of the code here has been acquired from various secondary sources and are not
% my creation. I have just assembled some information that I have into a organized document. I sincerely thank all those creative individuals you were generous enough to share their solutions with the world.

\documentclass[]{beamer}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{tabls}
\usepackage{booktabs}
\usepackage{float}
\usepackage{array}
\usepackage[english]{babel}
\usepackage{hyperref}
\usepackage{subfigure} % This package is a very neat way to accommodate more than one graphic on one slide. A sample command is included for reference.

\newenvironment{wideitemize}{\itemize\addtolength{\itemsep}{10pt}}{\enditemize}
% Typically, the itemize enviroment does not provide adequate space between bullet points
% and hence can look a bit cluttered at times. The new "wideitemize" problem solves this issue.
% To use it just put your regular "\item" between "\begin{wideitemize}" and "\end{wideitemize}"

\hypersetup{% This can be left blank if information is not available
pdfauthor = {Author's name},
pdftitle = {Title for the pdf file},
pdfsubject = {},
pdfkeywords = {},
pdfcreator = {LaTeX with hyperref package},
pdfproducer = {dvips + ps2pdf}}

\mode<presentation> % This is a representative list of themes available is Beamer.        Shreyes prefers the Warsaw theme,  I prefer Pittsburg; it's much neater. 
                                   Also, CambridgeUS is quite popular among academics.
     %\usetheme{Warsaw}
     %\usetheme{Rochester}
     %\usetheme{Madrid}
     %\usetheme{Pittsburgh}
     %\usetheme{Antibes}
     %\usetheme{Montpellier}
     %\usetheme{Berkeley}
     %\usetheme{PaloAlto}
     %\usetheme{Goettingen}
     %\usetheme{Marburg}
     %\usetheme{Hannover}
     %\usetheme{Berlin}
     %\usetheme{Ilmenau}
     %\usetheme{Dresden}
     %\usetheme{Darmstadt}
     %\usetheme{Frankfurt}
     %\usetheme{Singapore}
     %\usetheme{Szeged}
     %\usetheme{Copenhagen}
     %\usetheme{Malmoe}
      \usetheme{CambridgeUS}
      \setbeamercovered{transparent = 28}
}


\usefonttheme{professionalfonts}
\usecolortheme[rgb={0.01,0.18,0.42}]{structure}
\useinnertheme{rounded}


\title{Sample Presentation}
\author{Author} \institute{My Institute}

\setbeamertemplate{footline}[frame number] 
% This shows the slide number at the bottom of the page. May or may not be required depending on the theme you use.

\beamertemplatenavigationsymbolsempty 
% This disables the navigation button in the lower right corner of the generated pdf file.


\begin{document}
\begin{frame} % Just so we are clear, a frame here means a slide.
\titlepage
\end{frame}


\AtBeginSection[] % This will display the list of contents before the beginning of each section
{
\begin{frame}
      \frametitle{Outline}
      \tableofcontents[currentsection]
\end{frame}
}


\begin{frame} % You can use this if you just want the list of contents to be displayed once at the beginning.
      \tableofcontents
\end{frame}


\section{Introduction}


\begin{frame}
      \frametitle{A simple sample frame}
            \begin{wideitemize}
                 \item Point 1
                 \pause
                 \item Point 2
                 \pause
                 \item Notice the space between the points.
                 \pause
                 \item Comparatively, more generous than the    normal ``itemize" enviroment
                 \pause
                \item Also, the sequential appearance of points is handeled with \textbackslash pause command.
            \end{wideitemize}
\end{frame}


\section{Part I}


\begin{frame}{Frames you can replicate} % You can include the frame title here as well.
      \begin{itemize}
            \item 1
            \item 2
            \item 3
            \item 4
            \item 5
      \end{itemize}
\end{frame}


\begin{frame}{Another one you can replicate} % This one uses wideitemize
      \begin{wideitemize}
            \item 1
            \item 2
            \item 3
            \item 4
            \item 5
      \end{wideitemize}
\end{frame}


\begin{frame}{Nested environments}
      \begin{wideitemize}
             \item This is wideitemize environment
             \item You can nest environments to formulate a 
hierarchical structure
              \item See below

                   \begin{itemize}
                       \item This is itemize
                       \pause
                       \item Notice the space between bullets
                       \pause
                       \item The bullets can be changed to number using the enumerate environment
                       \pause
                       \item \LaTeX is cool
                       \pause
                       \item I am glad you chose to give it a
thought
                       \pause
                       \item It will change the way you work
                   \end{itemize}

      \end{wideitemize}
\end{frame}


\section{Part II}


\begin{frame}{subfigure example}
      \begin{figure}
            \centering
            \subfigure[Figure 1]{
            \includegraphics[scale=0.4]{name_of_figure1_file.pdf}
            \label{model1}}  % Try putting labels for every figure, table, and equation. It will become easier to navigate if the document is huge.           
            \subfigure[Figure 2]{
            \includegraphics[scale=0.4]{name_of_figure2_file.pdf}
            \label{model2}}
      \end{figure}
\end{frame}


\begin{frame}{Math in \LaTeX}
      \begin{enumerate} 
          \item Use the ``align" or ``gather" environment for math
             \item They are neater
             \item You can also use the ``equation" and ``eqnarray" envrionments
             \item Also, notice that this is the enumerate environment
             \item Hence the numbers instead of bullets
       \end{enumerate}
\end{frame}


\centering
     \frame[plane]{
           \vspace{2cm}
                  {\huge Thank you} % Huge makes the text size, well, Huge
           \vspace{3cm}
           \begin{flushright}
                  Author name and contact
           \end{flushright}
}


\end{document}  
   

Friday 16 September 2011

Project Euler: problem 2




Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms.



# Inititae a vector x with two values 1 and 2, the starting points for the Fibonacci series 
x <- c(1,2)
length(x)


# Take an object "i", with a starting value of 1. 
# This object will be used to as an index for the vector "x". 
# We continue to add# the (n - 1)th term  and the (n - 2)th term
# to get the nth term. 
# This process continues as long as an element of vector x with 
# index value "i" just crosses the 4,000,000 mark.
i <- 1
while (x[i] < 4000000){i <- i + 1
                        x.index <- length(x)
                        x[x.index + 1] <- x[x.index] + x[x.index - 1]}
x

# Sum the even values of the Fibonacci series thus obtained
sum(x[x %% 2 == 0])


Answer : 4613732

Thursday 15 September 2011

Project Euler: problem 1

To be fairly honest (assuming there are degrees of honesty), I do know a little about math and programming but I don't know much math or any programming. I've loved math for a long time, but started to learn and understand fairly recently. So during the process of learning and understanding math and a little bit of programming, Shreyes and I thought of sharing our procedures. Most of these procedures aren't the most elegant solutions and at times are plain clumsy and inefficient. We plan to improve our programming skills as we go along.

We have recently started with Project Euler problems and will be posting some of the methods that we have used to arrive at a solution for each of the problems. We know only one language, R and hence our solutions are written in R.

So let's start with problem 1

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.

I tried two approaches to solve this problem and have detailed both of them below. By the time I was through with the problem, I found another approach.

First, let's see the one that I thought was quite efficient.


# Approach 1

# Create an empty vector of desired length where the results of the division will be stored
x.1 <- rep(NA, 999) 
#NA is replicated 999 times

# For each integer from 1 to 999 (we want numbers below 1000), divide the integer # by either 3 or 5 and take the modulus. 
# If the integer, say "i" is completely divisible by either 3 or 5, assign  that value to the ith element of vector x; otherwise assign i'th value of x to equal to zero
for (i in 1:999){
                if (i %% 3 == 0 || i %% 5 == 0) {x.1[i] <- i} else {x.1[i] <- 0}
# "%%" is the modulus function, "||" is the symbol for the "or" command, x.1[i] calls the ith element of vector x.1

# Take all the non-zero values of x, i.e. those values for which the integer was perfectly divisible by either 3 or 5 and assign it to a separate vector
y.1 <- x.1[x.1 != 0]
# This assigns all the nonzero rows of vector x.1 to a vector y.1

# Take a summation of these values and this is the desired output.
sum(y.1)
# "sum" finds the sum of all the elements of vector y.1

Answer: 233168

Also, just to make sure that there is no confusion, the ".1" in x.1 and y.1 does not denote anything special, it is merely an assigning convention indicating that the variables were created for the first approach.

# Approach 2


# Take numbers from 1 to 333 and multiply each by 3 to get multiples of 3.
# We only take numbers till 333 because we have to find the sum of multiples of 3 (and 5) that are less than 1000.
x.2 <- 0
for (i in 1:333){x.2[i] <- i*3}
head(x.2) 
# head shows the first few rows of the object

# For multiples of 5, we take numbers till 199 and the last multiple of 5 below 1000 comes to be 995
y.2 <- 0
for (j in 1:199){y.2[j] <- j*5}
head(y.2)

# 15 is a common multiple of 3 and 5, and hence it will get included twice - once when we add the numbers that are multiples of 3 and once when we add numbers that are multiples of 5. So we need to subtract these 15 and its multiples.
z.2 <- 0
for (k in 1:66){z.2[k] <- k*15}
head(z.2)

# Final sum
sum(x.2) + sum(y.2) - sum(z.2)

Answer: 233168

There is another approach, which uses the funda of arithmetic progressions. Let's see if you can figure that out.