We think therefore we R: 2012

R, which was largely confined to the academic world, has started picking up in businesses as well. At least that is what I am witnessing among my colleagues. A lot of people have started experimenting with R, choosing the path to enlightenment. With increased usage, however, has come an increased number of queries. What's interesting is that although people are working on very different projects, the queries are largely the same, with most of them relating to data handling in R.

Keeping that in mind, I thought it would be nice to have a repository of Frequently Asked Questions. I can then directly refer the inquirers to the webpage. Below are the queries that I have been asked most often, though not necessarily in order of frequency.

We will use the classic iris data set to illustrate the code with examples.

data(iris)
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

NOTE: In case you plan to go through the queries in sequence, please run data(iris) before each query, unless otherwise specified.

1. How to name or rename a column in a data frame?

Ans:

We can use the names() function to do this.

To find the column names of a data frame, do

names(iris)

To change the name of a column, say the 4th column, do

names(iris)[4] <- 'new.name'

To change the names of multiple columns, say 2nd and 4th, do

names(iris)[c(2, 4)] <- c('new.name.1', 'new.name.2')

To change the names of all the columns, supply one name per column

names(iris) <- c('new.name.1', 'new.name.2' ....)
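Positions can shift when columns are added or dropped, so it is sometimes safer to rename a column by matching its current name. A minimal sketch, where 'petal.w' is only a made-up name for illustration:

```r
data(iris)

# Rename by matching the current name instead of relying on the position
names(iris)[names(iris) == 'Petal.Width'] <- 'petal.w'

names(iris)
```

If no column has the name being matched, nothing is renamed and no error is raised, so it is worth checking names(iris) afterwards.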

2. How to determine the column information like names, type, missing values etc. in R? Similar to proc contents in SAS.

Ans:

There are two easy functions to do this.

To get brief info (column names, types and the first few values), do

str(iris)

To get detailed info (summary statistics for each column, including the number of missing values when there are any), do

summary(iris)
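When only one piece of information is needed, a few base R helpers return it directly. The sketch below is one possible way to do it; the variable names col.types and na.counts are just illustrative.

```r
data(iris)

# Type (class) of each column, as a named character vector
col.types <- sapply(iris, class)
col.types

# Number of missing values in each column (iris has none)
na.counts <- colSums(is.na(iris))
na.counts

# Number of rows and columns
dim(iris)
```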

3. How to export a data frame so that it can be used in other applications?
Ans:
A convenient way is to export a csv file, since most applications accept that format. Setting row.names= FALSE keeps R's row numbers out of the file.
write.csv(iris, 'iris.csv', row.names= FALSE)
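A quick way to check the export is to read the file back and compare it with the original. The sketch below writes to a temporary folder rather than the working directory; the path out.file and the name iris.back are assumptions made for the example.

```r
data(iris)

# Write to a temporary location (hypothetical path, used only for this check)
out.file <- file.path(tempdir(), 'iris.csv')
write.csv(iris, out.file, row.names = FALSE)

# Read the file back; stringsAsFactors = TRUE turns Species into a factor again
iris.back <- read.csv(out.file, stringsAsFactors = TRUE)

dim(iris.back)
names(iris.back)
```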

4. How to select a particular row/column in a data frame?
Ans:
The easiest way to do this is to use the indexing notation [].

To select the first column only
iris[, 1]

To select first column and put contents in a new vector
new.vec <- iris[, 1]

To select multiple columns, say 1st, 2nd and 5th, and put them in a data frame
new.data <- iris[, c(1, 2, 5)]

To select the first row only
iris[1, ]

To select first row and 3rd column
iris[1, 3]

To select multiple rows from the 3rd column
iris[c(1, 4, 10, 111), 3]
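The same [] notation also accepts column names and logical conditions, which is often easier to read than positions. A small sketch (the object names and the 5.5 cut-off are made up for illustration):

```r
data(iris)

# Select columns by name
sub.cols <- iris[, c('Sepal.Length', 'Species')]

# Select the rows that satisfy a condition
setosa.rows <- iris[iris$Species == 'setosa', ]

# Combine a row condition with a column selection
long.setosa <- iris[iris$Species == 'setosa' & iris$Sepal.Length > 5.5,
                    c('Sepal.Length', 'Species')]

# Keep a single column as a data frame instead of a vector
one.col <- iris[, 1, drop = FALSE]
```

Note that selecting a single column with iris[, 1] returns a vector; drop = FALSE is what keeps the data frame structure.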

5. How to aggregate a data set based on a variable? Similar to group by in proc sql.
Ans:
Say we want to aggregate the entire iris data set by Species such that the new data set will have only 3 rows and the columns will have the mean value of the respective column.

We can use the aggregate function to do this.
iris.agg <- aggregate(iris[, c(1, 2, 3, 4)], by= list(iris$Species), FUN= mean)

iris.agg

     Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
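aggregate() also has a formula interface, which names the grouping column in the result instead of calling it Group.1. A short sketch of the same aggregation (iris.agg2 is just an illustrative name):

```r
data(iris)

# '. ~ Species' means: summarise every other column, grouped by Species
iris.agg2 <- aggregate(. ~ Species, data = iris, FUN = mean)

iris.agg2
```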

In case different functions need to be applied to different columns, we can use the powerful ddply() function from the plyr package.

library(plyr)

iris.agg <- ddply(iris, .variables= .(Species), summarise,
                  sl.mean = mean(Sepal.Length),
                  sw.median = median(Sepal.Width),
                  pl.max = max(Petal.Length),
                  pw.sd = sd(Petal.Width), .progress= 'text')

iris.agg

     Species sl.mean sw.median pl.max     pw.sd
1     setosa   5.006       3.4    1.9 0.1053856
2 versicolor   5.936       2.8    5.1 0.1977527
3  virginica   6.588       3.0    6.9 0.2746501
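If installing plyr is not an option, roughly the same result can be put together in base R by splitting the data by Species and summarising each piece. This is only a sketch of one alternative, not a replacement for ddply(); by.species and iris.agg.base are names chosen for the example.

```r
data(iris)

# Split the data frame into one piece per species
by.species <- split(iris, iris$Species)

# Summarise each piece with a different function per column,
# then stack the one-row results into a single data frame
iris.agg.base <- do.call(rbind, lapply(by.species, function(d) {
  data.frame(Species   = d$Species[1],
             sl.mean   = mean(d$Sepal.Length),
             sw.median = median(d$Sepal.Width),
             pl.max    = max(d$Petal.Length),
             pw.sd     = sd(d$Petal.Width))
}))

# Drop the row names that rbind() takes from the list names
rownames(iris.agg.base) <- NULL

iris.agg.base
```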

6. How to create deciles of a particular variable? Similar to proc rank in SAS.
Ans:
We can use the cut() function for this.
For example, to create deciles or 10 bins from Sepal.Length, do

iris$sp.decile <- cut(iris$Sepal.Length,
                      breaks= quantile(iris$Sepal.Length,
                                       probs= seq(0, 1, by= 0.1)),
                      include.lowest= TRUE, labels= c(1:10))

table(iris$sp.decile)

 1  2  3  4  5  6  7  8  9 10
16 16 13 20 15 15 13 12 17 13

Here we create 'sp.decile' as another column in the iris data set. After this, in case we need to determine, say, the mean of each variable within each decile, we can use the aggregate function as below.

var.means.by.spdecile <- aggregate(iris[, c(1, 2, 3, 4)],
                                   by= list(iris$sp.decile), FUN= mean)
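One caveat with the cut() approach: cut() stops with an error when two break points are equal, which can happen when a variable has many tied values. A rank-based sketch avoids this by assigning bins from the rank of each observation. The choice of ties.method = 'first' is an assumption made here; it spreads tied values across neighbouring bins in order of appearance, which may or may not be what is wanted.

```r
data(iris)

x <- iris$Petal.Width

# Check whether the quantile breaks contain repeated values
decile.breaks <- quantile(x, probs = seq(0, 1, by = 0.1))
anyDuplicated(decile.breaks) > 0

# Rank-based deciles: ranks 1..n are mapped onto bins 1..10
pw.decile <- ceiling(10 * rank(x, ties.method = 'first') / length(x))

table(pw.decile)
```

With 150 observations this puts exactly 15 rows in each bin, at the cost of sometimes placing equal values in different bins.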

7. How to deal with missing values?
Ans:
Dealing with missing values in R is not very difficult, provided we use the correct notation.

Suppose we know the form of missing values in our file and it is . (period), i.e., for each observation that has a missing value, there is a . (period) in that cell. Then while importing the data, do

data.set <- read.csv('filename.csv', na.strings= '.')

In case the missing value is #N/A, then do
data.set <- read.csv('filename.csv', na.strings= '#N/A')

Similarly for other cases, we can substitute the missing value notation in the na.strings argument.

In case we are not sure how the missing values are coded, we first need to import the data and have a look at the values to decide.
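The na.strings argument also accepts several codes at once, and once the data is imported the missing values can be counted, skipped or dropped. The sketch below uses a tiny made-up csv supplied as text, so the column names and values are purely illustrative.

```r
# A small made-up csv, passed as text instead of a file name
csv.text <- 'id,score,grade
1,10,A
2,.,B
3,#N/A,
4,7,C'

# Treat '.', '#N/A' and empty cells as missing
data.set <- read.csv(text = csv.text, na.strings = c('.', '#N/A', ''))

# Number of missing values in each column
colSums(is.na(data.set))

# Ignore missing values while computing a statistic
score.mean <- mean(data.set$score, na.rm = TRUE)

# Keep only the rows that have no missing value in any column
complete.rows <- data.set[complete.cases(data.set), ]
```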

These are some of the common doubts that I have come across. I'll keep adding to the list as I keep getting newer ones. Please do let me know if there is something you believe should be added to this. I'll do it right away.

This post is meant for submitting visual analysis for the Harvard Business Review Contest on Kaggle

I used the subject lines for all the articles and all the years and mapped the articles into one of the following 18 categories

Business Ethics
Business Management
Crisis
Emerging Markets
Financial Performance
Health Care
Information Technology
Labor
Leadership
Management Systems
Marketing Strategy
Regulation
Social Media
Stock Market
Strategic Planning
Supply Chain
United States & World
Women & Management

Changes in popularity of these topics were visualized using the googleVis package for R. This visualization is available here (I could not figure out how to upload it on Kaggle).

Observations:

The average number of pages per article has gone down steadily since the 1950s, falling below 5 pages per article for the first time in 1981 and then staying pretty much below that mark. Could this partly be attributed to the internet revolution?
Recent trending topics are related to Emerging Markets & China, and Social Media.
Some evergreen topics in HBR include - Business Management, Employees/Workforce, Labor, Marketing Strategy, and Strategic Planning.
The lengthiest articles are on issues concerning United States & the World, Regulation, and Management Systems.

Tuesday, 4 December 2012

R FAQs for the fresh starters

Monday, 8 October 2012

CrowdANALYTIX - Ideation Contest - Warranty Pricing

Sunday, 26 August 2012

Kaggle Prospect - Harvard Business Review