Tuesday 4 December 2012

R FAQs for the fresh starters

R, long predominant in the academic world, has started picking up in businesses as well. At least that is what I am witnessing among my colleagues. A lot of people have started experimenting with R, choosing the path to enlightenment. With increased usage, however, has come an increased number of queries. What's interesting is that although people are working on very different projects, the queries are largely the same, with most of them relating to data handling in R.

Keeping that in mind, I thought it would be nice to have a repository of Frequently Asked Questions, so that I can refer inquirers directly to this page. Below are the queries that I have been asked most often, though not necessarily in order of frequency.

We will use the classic iris data set to illustrate the code with examples.

data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

NOTE: In case you plan to go through the queries in sequence, please run data(iris) before each query, unless otherwise specified.

1. How to name or rename a column in a data frame?
Ans: 
We can use the names() function to do this.

To find the column names of a data frame, do
names(iris)

To change the name of a column, say the 4th column, do
names(iris)[4] <- 'new.name'

To change the names of multiple columns, say 2nd and 4th, do
names(iris)[c(2, 4)] <- c('new.name.1', 'new.name.2')

To change the names of all the columns, do
names(iris) <- c('new.name.1', 'new.name.2', ....)
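
Renaming by position breaks silently if the column order ever changes. As a minimal sketch, a column can also be renamed by matching its current name; the new name 'petal.width.cm' below is only an illustrative choice.

```r
data(iris)

# Rename by matching the current name, so the code does not depend on column order
names(iris)[names(iris) == 'Petal.Width'] <- 'petal.width.cm'

# Check the result
names(iris)
```

If no column matches the old name, nothing is renamed and no error is raised, so it is worth checking names(iris) afterwards.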


2. How to determine the column information like names, type, missing values etc. in R? Similar to proc contents in SAS.
Ans:
There are two easy functions to do this.

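To see each column's name and type along with its first few values, do
str(iris)

To get summary statistics for each column, including the number of missing values (NA) where there are any, do
summary(iris)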
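
When only the column types and missing-value counts are needed, they can also be collected into one small table. This is a rough sketch using base R only; the names col.types and na.counts are made up for the example.

```r
data(iris)

# Type (class) of each column
col.types <- sapply(iris, class)

# Number of missing values (NA) in each column
na.counts <- colSums(is.na(iris))

# One row per column of iris
data.frame(type = col.types, n.missing = na.counts)
```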


3. How to export a data frame so that it can be used in other applications?
Ans:
The best way is to export a csv file since most applications accept that format.
write.csv(iris, 'iris.csv', row.names= FALSE)
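
To confirm that the export worked, one option is to read the file back and compare it with the original. The sketch below writes to a temporary file (an assumption made here so that nothing in the working directory is overwritten); in practice the path would be wherever the file is needed.

```r
data(iris)

# Write to a temporary file so nothing in the working directory is overwritten
csv.path <- tempfile(fileext = '.csv')
write.csv(iris, csv.path, row.names = FALSE)

# Read the file back and compare dimensions and column names with the original
iris.check <- read.csv(csv.path)
dim(iris.check)
names(iris.check)
```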


4. How to select a particular row/column in a data frame?
Ans:
The easiest way to do this is to use the indexing notation [].

To select the first column only
iris[, 1]

To select first column and put contents in a new vector
new.vec <- iris[, 1]

To select multiple columns, say 1st, 2nd and 5th,  and put them in a data frame
new.data <- iris[, c(1, 2, 5)]

To select the first row only
iris[1, ]

To select first row and 3rd column
iris[1, 3]

To select multiple rows from the 3rd column
iris[c(1, 4, 10, 111), 3]
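
The same [] notation also accepts column names and logical conditions, which is often easier to read than positions. A short sketch follows; the object names are only illustrative.

```r
data(iris)

# Select columns by name instead of position
sepal.data <- iris[, c('Sepal.Length', 'Sepal.Width')]

# Select the rows that satisfy a condition, keeping all columns
setosa.rows <- iris[iris$Species == 'setosa', ]

# Combine both: rows with long sepals, two named columns
long.sepal <- iris[iris$Sepal.Length > 7, c('Sepal.Length', 'Species')]

# drop = FALSE keeps a single column as a data frame rather than a vector
one.col.df <- iris[, 1, drop = FALSE]
```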


5. How to aggregate a data set based on a variable? Similar to group by in proc sql.
Ans:
Say we want to aggregate the entire iris data set by Species such that the new data set will have only 3 rows and the columns will have the mean value of the respective column.

We can use the aggregate function to do this.
iris.agg <- aggregate(iris[, c(1, 2, 3, 4)], by= list(iris$Species), FUN= mean)

iris.agg

     Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
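
The grouping column above is labelled Group.1 because the list passed to by is unnamed. As a sketch of one alternative, the formula interface of aggregate() keeps the original column name; iris.agg.f is just a name chosen for this example.

```r
data(iris)

# '.' stands for all columns other than the grouping variable
iris.agg.f <- aggregate(. ~ Species, data = iris, FUN = mean)

# The grouping column is now called Species
names(iris.agg.f)
```

Note that the formula interface drops rows containing missing values by default, which makes no difference for iris but may for other data.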


In case a different function needs to be applied to different columns, we can use the powerful ddply() function from the plyr package.

library(plyr)

iris.agg <- ddply(iris, .variables= .(Species), summarise,
                  sl.mean = mean(Sepal.Length),
                  sw.median = median(Sepal.Width),
                  pl.max = max(Petal.Length),
                  pw.sd = sd(Petal.Width), .progress= 'text')

iris.agg

     Species sl.mean sw.median pl.max     pw.sd
1     setosa   5.006       3.4    1.9 0.1053856
2 versicolor   5.936       2.8    5.1 0.1977527
3  virginica   6.588       3.0    6.9 0.2746501




6. How to create deciles of a particular variable? Similar to proc rank in SAS.
Ans:
We can use the cut() function for this.
For example, to create deciles or 10 bins from Sepal.Length, do


iris$sp.decile <- cut(iris$Sepal.Length,
                      breaks= quantile(iris$Sepal.Length,
                                       probs= seq(0, 1, by= 0.1)),
                      include.lowest= TRUE, labels= 1:10)

table(iris$sp.decile)


 1  2  3  4  5  6  7  8  9 10 
16 16 13 20 15 15 13 12 17 13

Here we create 'sp.decile' as another column in the iris data set. If we then need, say, the mean of each variable within each decile, we can use the aggregate function as below.


var.means.by.spdecile <- aggregate(iris[, c(1, 2, 3, 4)],
                                   by= list(iris$sp.decile), FUN= mean)
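
The cut() approach gives bins of unequal size, and it fails with a complaint that the breaks are not unique when two quantiles coincide, which can happen with heavily tied data. One possible workaround, sketched below, is to rank the values first and split the ranks into 10 equal groups. This is a different technique from the one above: tied values can land in different deciles. The column name sl.rank.decile is made up for the example.

```r
data(iris)

x <- iris$Sepal.Length

# Rank the values (ties broken by order of appearance),
# then map ranks 1..n onto groups 1..10 of equal size
iris$sl.rank.decile <- ceiling(10 * rank(x, ties.method = 'first') / length(x))

# With 150 rows, every decile holds exactly 15 observations
table(iris$sl.rank.decile)
```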

7. How to deal with missing values? 
Ans:
Dealing with missing values in R is not very difficult, provided we use the correct notation.

Suppose we know the form of missing values in our file and it is . (period), i.e., for each observation that has a missing value, there is a . (period) in that cell. Then while importing the data, do

data.set <- read.csv('filename.csv', na.strings= '.')

In case the missing value is #N/A, then do
data.set <- read.csv('filename.csv', na.strings= '#N/A')

Similarly for other cases, we can substitute the missing value notation in the na.strings argument.

In case we are not sure how missing values are coded in the file, we first need to import the data and have a look at the values to decide.
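
The na.strings argument also accepts a vector, so several notations can be handled in one import. The sketch below builds a small made-up file (the columns id, height and weight are hypothetical) to show what the data looks like before and after the markers are declared.

```r
# Hypothetical file contents with two different missing-value markers
csv.path <- tempfile(fileext = '.csv')
writeLines(c('id,height,weight',
             '1,170,65',
             '2,.,72',
             '3,165,#N/A'), csv.path)

# Without na.strings, the markers cause these columns to be read as text
# (or factors in older versions of R) rather than numbers
raw <- read.csv(csv.path)
sapply(raw, class)

# With both markers declared, the columns are numeric and the gaps become NA
clean <- read.csv(csv.path, na.strings = c('.', '#N/A'))
colSums(is.na(clean))
```

Looking at sapply(raw, class) is a quick way to spot the problem: a column that should be numeric but is not usually contains some undeclared missing-value marker.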

These are some of the common doubts that I have come across. I'll keep adding to the list as I get newer ones. Please do let me know if there is something you believe should be added to this. I'll do it right away.



Monday 8 October 2012

CrowdANALYTIX - Ideation Contest - Warranty Pricing

I recently completed an ideation contest on CrowdANALYTIX where the participants had to build an approach towards warranty pricing and fraud detection.

Ideation contests are quite different from the usual data mining contests where the objective is solely to minimize the error (or maximize the accuracy). They are centered more around defining the problem and conceptualizing it with a framework.

In the contest, we had to structure the problem first with respect to the business and then extend it to provide possible analytical solutions for optimizing warranty prices and detecting fraud, including potential data that we would require. Having no experience in either of these areas (warranty and fraud), I tried to draw parallels between insurance pricing and warranty pricing keeping the fundamental differences between them in mind.

Luckily, my approach was chosen to be one of the finalists and was posted on their website.

Sunday 26 August 2012

Kaggle Prospect - Harvard Business Review

This post is meant for submitting a visual analysis for the Harvard Business Review contest on Kaggle.

I used the subject lines of all the articles across all the years and mapped each article into one of the following 18 categories:


  1.  Business Ethics
  2.  Business Management
  3.  Crisis
  4.  Emerging Markets
  5.  Financial Performance
  6.  Health Care
  7.  Information Technology
  8.  Labor
  9.  Leadership
  10.  Management Systems
  11.  Marketing Strategy
  12.  Regulation
  13.  Social Media
  14.  Stock Market
  15.  Strategic Planning
  16.  Supply Chain
  17.  United States & World
  18.  Women & Management


Changes in popularity of these topics were visualized using the googleVis package for R. This visualization is available here (I could not figure out how to upload it on Kaggle).

Observations:


  1. The average number of pages per article has gone down steadily since the 1950s, falling below 5 pages per article for the first time in 1981 and then staying pretty much below that mark. Could this partly be attributed to the internet revolution?
  2. Recent trending topics are related to Emerging Markets & China, and Social Media.
  3. Some evergreen topics in HBR include - Business Management, Employees/Workforce, Labor, Marketing Strategy, and Strategic Planning.
  4. The lengthiest articles are on issues concerning United States & the World, Regulation, and Management Systems.