Tuesday 4 December 2012

R FAQs for the fresh starters

R, which was largely predominant in the academic world, has started picking up a lot in businesses as well. At least that is what I am witnessing among my colleagues. Lot of people have started experimenting with R, choosing the path to enlightenment. With increase usage, however, have come an increased number of queries as well. What's interesting to see that though people are working on very different projects, the queries are largely the same, with most of them relating to data handling in R.

Keeping that in mind, I thought it would be nice to have a repository of Frequently Asked Questions. I can then directly refer the inquirers to the webpage. Below are the queries that I have been asked most often, not necessarily in order of number of queries though.

We will use the classic iris data set to illustrate the code with examples.

data(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

NOTE: In case you plan to go through the queries in sequence, please run data(iris) before each query, unless otherwise specified.

1. How to name or rename a column in a data frame?
Ans: 
We can use the names() function to do this.

To find the column  names of a data frame, do
names(iris)

To change the name of a column, say the 4th column, do
names(iris)[4] <- 'new.name'

To change the names of multiple columns, say 2nd and 4th, do
names(iris)[c(2, 4)] <- c('new.name.1, 'new.name.2')

To change the names of all the columns, do
names(iris) <- c('new.name.1, 'new.name.2' ....)


2. How to determine the column information like names, type, missing values etc. in R? Similar to proc contents in SAS.
Ans:
There are two easy functions to do this.

To get brief info, do 
str(iris)

To get detailed info, do
summary(iris) 


3. How to export a data frame so that it can be used in other applications?
Ans:
The best way is to export a csv file since most applications accept that format.
write.csv(iris, 'iris.csv', row.names= FALSE)


4. How to select a particular row/column in a data frame?
Ans:
The easiest way to do this is to use the indexing notation [].

To select the first column only
iris[, 1]

To select first column and put contents in a new vector
new.vec <- iris[, 1]

To select multiple columns, say 1st, 2nd and 5th,  and put them in a data frame
new.data <- iris[, c(1, 2, 5)]

To select the first row only
iris[1, ]

To select first row and 3rd column
iris[1, 3]

To select multiple rows from the 3rd column
iris[c(1, 4, 10, 111), 3]


5. How to aggregate a data set based on a variable? Similar to group by in proc sql.
Ans:
Say we want to aggregate the entire iris data set by Species such that the new data set will have only 3 rows and the columns will have the mean value of the respective column.

We can use the aggregate function to do this.
iris.agg <- aggregate(iris[, c(1, 2, 3, 4)], by= list(iris$Species), FUN= mean)

iris.agg

     Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026


In case a different function needs to be applied to different columns, we need to use the powerful ddply() function from plyr.

library(plyr)

iris.agg <- ddply(iris, .variables= .(Species), summarise,
                  sl.mean = mean(Sepal.Length),
                  sw.median = median(Sepal.Width),
                  pl.max = max(Petal.Length),
                  pw.sd = sd(Petal.Width), .progress= 'text')

iris.agg

     Species sl.mean sw.median pl.max     pw.sd
1     setosa   5.006       3.4    1.9 0.1053856
2 versicolor   5.936       2.8    5.1 0.1977527
3  virginica   6.588       3.0    6.9 0.2746501




6. How to create deciles of a particular variable? Similar to proc rank in SAS.
Ans:
We can use the cut() function for this.
For example, to create deciles or 10 bins from Sepal.Length, do


iris$sp.decile <- cut(iris$Sepal.Length, 
                      breaks= quantile(iris$Sepal.Length, 
                      probs= seq(0, 1, by= 0.1)),
                      include.lowest= TRUE, labels= c(1:10))

table(iris$sp.decile)


 1  2  3  4  5  6  7  8  9 10 
16 16 13 20 15 15 13 12 17 13

Here we create 'sp.decile' as another column in the iris data set. After this, in case, we need to determine, say, the mean of each variable, we can use the aggregate function as below.


var.means.by.spdecile <- aggregate(iris[, c(1, 2, 3, 4)], by= 
list(iris$sp.decile), FUN= mean)

7. How to deal with missing values? 
Ans:
Dealing with missing values in R is not very difficult, provided we use the correct notation.

Suppose we know the form of missing values in our file and it is . (period), i.e., for each observation that has a missing value, there is a . (period) in that cell. Then while importing the data, do

data.set <- read.csv('filename.csv', na.strings= '.')

In case the missing value is #N/A, then do
data.set <- read.csv('filename.csv', na.strings= '#N/A')

Similarly for other cases, we can substitute the missing value notation in the na.strings argument.

In case we are not sure of the missing values, then we first need to import the data and have a look at the values to decide.

These are some of the common doubts that I have come across. I'll keep adding to the list as I keep getting newer ones. Please do let me know if there something you believe should be added to this. I'll do it right away.