While working on my financial economics project I came across an elegant tool called principal component analysis (PCA), which is extremely powerful when it comes to reducing the dimensionality of a data set comprising highly correlated variables. This tool finds much of its application in genetic research, which deals with data sets having many variables that are highly correlated.

I will try to be as explicit as possible and refrain from using statistical/mathematical jargon to explain the what and how of this tool. To state a few stylized facts, PCA is used mainly for:

- compressing the data
- filtering some of the noise in the data

**Problem at hand:**

I was trying to investigate the factors that affect the returns of stocks in the Indian equity market; however, I wanted to take into account all the S&P CNX 500 companies. What would be really nice is if I could somehow squeeze the 500 companies into, say, not more than 2-3 variables that are representative of the entire set of 500 companies. This is precisely where PCA comes into play and does a fantastic job. What it gives me is just 1 variable that I can use instead of all the 500 companies!!!

Hats off and a bow of respect to the contributors/donors of packages to the CRAN servers: the above simplification can be achieved using just one line of script in R. Sounds easy, but what one really needs to do is understand what PCA does and how the output from this script can be interpreted. Again, at the risk of oversimplification (while trying hard to keep my commandment of simplicity), I will illustrate in a crude manner the working of PCA.

**What PCA does:**

Let me explain this in relation to the above example. If I do a PCA on the returns data for the 500 companies, I obtain 500 principal components. These components are nothing but linear combinations of the existing 500 variables (companies), arranged in decreasing order of their variance. So the 1st principal component (PC) has the maximum variance and the 500th PC has the least. The variance of a PC is nothing but the part of the variance in the data that it captures, so the 1st PC explains the maximum amount of variance in my data. One magical feature of PCA is that all these 500 components are orthogonal to each other, meaning that they are uncorrelated with each other. So essentially, if we look at PCA as a black box, it takes as input a data set of highly correlated variables and gives as output PCs that explain the variance in the input data and are uncorrelated with each other. (I don't leverage this feature in this particular problem; I will illustrate its use in another part of this blog.)
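To make the black box a little less black, here is a minimal sketch on simulated data (not the CNX 500 returns). The five "stocks", the common factor and the noise level are all assumptions made up for illustration. It only checks the two properties described above: the components come out in decreasing order of variance, and their scores are uncorrelated.

```r
set.seed(1)
n <- 200
common <- rnorm(n)                       # hypothetical common "market" factor

# five hypothetical stocks: common factor plus idiosyncratic noise
x <- sapply(1:5, function(j) common + rnorm(n, sd = 0.5))
colnames(x) <- paste0("stock", 1:5)

pca <- princomp(x)

# variances of the components, stored in decreasing order
pc.var <- pca$sdev^2
print(round(pc.var, 3))

# correlation matrix of the component scores: off-diagonal entries are ~0
score.cor <- cor(pca$scores)
print(round(score.cor, 3))

# the component variances add up to the total variance of the inputs
# (princomp() uses divisor n, var() uses n - 1, hence the rescaling)
total.var <- sum(apply(x, 2, var)) * (n - 1) / n
print(c(sum(pc.var), total.var))
```

Because the simulated stocks share one factor, the first variance printed should dwarf the rest, which is the situation the post assumes for the real returns.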

**How PCA does it:**

Since I have taken a vow of simplicity, I don't have much to say here. :-) However, for the mathematically inclined and certainty freaks like Madhav, this paper does a brilliant job of illustrating the matrix algebra that goes behind PCA computations. There are essentially 2 methods of calculating PCA: one is eigenvalue decomposition (done using the princomp() command in R) and the other is singular value decomposition (done using the prcomp() command in R).
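For the curious, the two routes can be compared directly. The sketch below uses simulated data (the mixing matrix and sample size are arbitrary assumptions) and relies on two documented differences: princomp() divides by n while prcomp() divides by n - 1, and the sign of each loading vector is arbitrary.

```r
set.seed(42)
n <- 100
# hypothetical correlated data: independent noise mixed by a random matrix
x <- matrix(rnorm(n * 4), ncol = 4) %*% matrix(runif(16), 4, 4)

pc.eig <- princomp(x)   # eigen decomposition of the covariance matrix (divisor n)
pc.svd <- prcomp(x)     # singular value decomposition of the centred data (divisor n - 1)

# standard deviations agree once the divisor is reconciled
sd.eig <- unname(pc.eig$sdev)
sd.svd <- pc.svd$sdev * sqrt((n - 1) / n)
print(rbind(princomp = sd.eig, prcomp = sd.svd))

# loadings agree up to the sign of each column
load.eig  <- unclass(loadings(pc.eig))
load.svd  <- pc.svd$rotation
load.diff <- max(abs(abs(load.eig) - abs(load.svd)))
print(load.diff)
```

So for a problem like this one the choice between the two commands should not change the story, only the scaling of the reported standard deviations.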

**How this can be done in R:**

####### Calculating Principal component of returns of S&P CNX 500 companies ########

## Access the relevant file ##

returns <- read.csv("Returns_CNX_500.csv")

One caveat that you need to keep in mind is that there should be no "NA" values in your data set. The presence of an NA would impede the computation of the variance-covariance matrix and hence its eigenvectors (i.e. the factor loadings).

## Dealing with missing values in the returns data for companies

returns1 <- returns ## work on a copy so that the raw data stays untouched

for(i in 2:ncol(returns))

{

returns1[, i] <- approx(returns$Year, returns[ ,i], returns$Year)$y ## approx() linearly interpolates between the available data points, and $y holds the interpolated values.

}
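A small sketch of what approx() does to a single column may help here. The years and returns below are made-up numbers. Note that by default approx() only interpolates between known points, so a gap at the very start (or end) of a series stays NA unless rule = 2 is passed, in which case the nearest known value is carried over.

```r
year <- 2001:2008                                     # hypothetical time index
ret  <- c(NA, 0.02, NA, NA, 0.08, 0.01, NA, -0.03)    # hypothetical returns with gaps

# default behaviour: interior gaps are filled linearly, the leading NA remains
filled <- approx(year, ret, xout = year)$y
print(filled)

# rule = 2 fills gaps at the ends with the nearest available value
filled2 <- approx(year, ret, xout = year, rule = 2)$y
print(filled2)
```

Whether carrying a value over at the ends is acceptable depends on the data; dropping those rows is the other obvious option.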

## Convert the data into matrix ##

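ret <- as.matrix(returns1[, names(returns1) != "Year"]) ## drop the Year column so that only the returns enter the PCA; as.matrix() takes the dimensions from the data frame itself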

##Computing the principal component using eigenvalue decomposition ##

princ.return <- princomp(ret) ## This is it.!!

## Identifying what components to be used ##

barplot(height=princ.return$sdev[1:10]/princ.return$sdev[1]) ## I am plotting the standard deviation of each PC divided by the standard deviation of PC 1; this can help us decide on a benchmark for selecting the relevant components.

Standard deviation of the first 10 components compared to 1st PC

We can clearly see from the above figure that, as expected, the first PC does the majority of the variance explanation in the returns data for the 500 companies. So if I want to identify factors that influence the returns of S&P CNX 500 companies, I can use the 1st PC as a variable in my regression. So far we have calculated the principal components; now we will extract the 1st PC as a numeric variable from the princomp output (princ.return).

## To get the first principal component in a variable ##

load <- loadings(princ.return)[,1] ## loadings() gives the weights with which our input variables are linearly combined to compute the components; this command gives us the loadings for the 1st PC.

pr.cp <- ret %*% load ## Matrix multiplication of the input data with the loading for the 1st PC gives us the 1st PC in matrix form.

pr <- as.numeric(pr.cp) ## Gives us the 1st PC in numeric form in pr.
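A side note on the extraction step above: princomp() also stores the component scores itself, in the $scores element. The sketch below, on simulated data (the dimensions and the common factor are assumptions for illustration), compares the manual multiplication with the stored scores. The two differ only by a constant, because princomp() centres the data before applying the loadings, so either should behave the same way as a regressor apart from the intercept.

```r
set.seed(7)
# hypothetical correlated returns: the same 100 draws are added to each of the 3 columns
x <- matrix(rnorm(300), ncol = 3) + rnorm(100)
p <- princomp(x)

w <- loadings(p)[, 1]                  # weights of the 1st PC
manual <- as.numeric(x %*% w)          # as in the post: raw data times loadings
scores <- as.numeric(p$scores[, 1])    # stored by princomp(): centred data times loadings

# the difference is the same constant for every observation
shift <- manual - scores
print(range(shift))
print(sum(p$center * w))               # that constant: column means times loadings
```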

One question that might be raised is: why not just use the S&P CNX 500 index returns as an input to the regression? The simple answer is that PC 1 gives you a relatively clear signal of the returns, as opposed to the index, which would have a lot of noise. This question would have made sense in the 1900's, when technology was not so efficient in terms of computation. Since computational time and effort now carry minimal weight in any researcher's mind, there is no reason to settle for anything but the best.

There is an important caveat that must be kept in mind while doing analysis using PCA: though PCA has a clear mathematical intuition, it lacks an economic intuition. That is, a one unit change in PC 1 of returns has a mathematical meaning but no economic meaning; you cannot make sense of the statement that PC 1 of returns for the 500 companies has gone up by "x" amount. Therefore the use of this analysis should be limited to factor analysis and not be extended to predictive analysis.

In case you wish to replicate the above exercise the data can be obtained from here.

Good post.

However, I would urge some caution here in your use of PCA. PCA is a great technique (in fact, I went through a phase of thinking it could solve all economic problems), but it's not magic.

Firstly, you need to check the proportion of variance explained by your first component. From eyeballing your plot, it looks to be around 20%. This may or may not be too low for you.

Secondly, PCA is a data reduction tool: it breaks down the matrix in terms of all variance. A similar, but different, technique known as factor analysis only breaks down the variance the items have in common, and I suspect that this will be more useful to you.

Thirdly, you should probably check how well your component can recreate the matrix. This can be done by structural equation modelling, available in the sem, lavaan and OpenMx packages. I highly recommend getting the psych package and reading its vignettes, as they have a host of information on these kinds of techniques. Hope this helps.

Hi Disgruntled PhD,

Thank you for your post.

I think there is a little confusion regarding the plot that I have pasted above. The plot just gives the ratio of the standard deviation of each PC to that of my 1st PC (I have plotted just the first 10). So it should be read as: the 2nd PC has a standard deviation that is just 20% of the standard deviation of the 1st PC. I think I will also have to paste the absolute variance of each of the principal components to make my point clear.

Thanks for the references. I will surely explore what factor analysis has to offer.
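The difference between the two readings of the plot discussed in this exchange can be sketched as follows. The data are simulated with one strong common factor (an assumption for illustration, not the actual returns), so the numbers themselves mean nothing; the point is only that a ratio of standard deviations and a proportion of variance explained are different quantities.

```r
set.seed(3)
# hypothetical data: 6 series sharing one strong common factor
x <- matrix(rnorm(500 * 6), ncol = 6) + 2 * rnorm(500)
p <- princomp(x)

sd.ratio  <- p$sdev / p$sdev[1]          # what the bar plot in the post shows
var.share <- p$sdev^2 / sum(p$sdev^2)    # proportion of total variance per component

print(round(rbind(sd.ratio, var.share, cum.share = cumsum(var.share)), 3))
```

A standard deviation ratio of 0.2 for the 2nd PC corresponds to a variance ratio of 0.04 relative to the 1st PC, which says nothing by itself about how much of the total the 1st PC explains; summary(p) reports the proportions directly.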

Be careful. In psychometrics, PCA and FA are natural fits for data which is fuzzy to begin with. The synthesized independent variables which result are no more or less tied to the Real World than the original data. With econometrics, not so much. The original data is of the real world, and the meaning of the synthesized variables is tenuous.

Thanks for the comment Robert.

Yes, I agree that the caveat of interpretation has to be kept in mind while applying this tool. It is in this specific problem, of dealing with factors affecting stock returns, that this tool finds relevance in financial econometrics. Otherwise the synthesized factors could be difficult to interpret or even misleading.

P.S.: I should have been emphatic about the fact that PCA should NOT be applied to data that has independent variables. A prerequisite for applying PCA is that the variables in your data should be highly correlated.

Hey Shreyes,

Your post was a great read! Very informative and simply put! Thanks for that. I have two comments in this regard:

1. When you use the function that gives you the components, is it ensured that the factors are rotated to maintain orthogonality? I ask this because in the packages I have used, the rotation is not directly carried out. Just confirming.

2. Towards the end you caveat your blog with the lack of economic intuition/interpretability. I would just like to raise one point here. When I had used PCA, the components that I had gotten had (luckily for me) very neatly aligned into groups; what I mean is that the variables which seemed to be related to each other, not only in terms of the covariances but even intuition-wise, had neatly formed groups. So I would expect that in your case of S&P CNX 500, stocks belonging to each sector (say, telecom or IT) would end up in a separate component. This is as per intuition also because, since these tend to move together, they are bound to have high covariances. Maybe I am giving too much credit to PCA; kindly correct me if I am wrong.

Again, would like to commend you for the enlightening posts!

Thanks

Esha

Thank you for the encouragement Esha, really appreciate that.

1) In the function princomp(), which uses eigenvalue decomposition to compute the components, the orthogonality of the components is ensured. I mentioned that in a rather (too) simplistic way when I said that the components are uncorrelated (which essentially follows from the orthogonality of the components).

2) Well, there might be cases wherein PCA makes intuitive economic sense, but in most cases it won't. In the above case, where I have taken returns, it makes some sense, as economic theory suggests that returns on all the stocks are highly correlated. So the principal components do a fair job of explaining the variance in the data, but I doubt that I could say that the components represent different sectors. The fact that PCs are orthogonal and completely uncorrelated makes this relation untenable, as we know that even sector-specific returns are more or less correlated, positively or negatively. The PCs, I would say, are fictitious creatures which can help us spot relationships but are not of much help when we want to quantify the magnitude of difference the factors make.

My apologies Shreyes... With reference to comment 2: I realised that what I was talking about is in fact factor analysis. I got mixed up between the two. Nevertheless, your post further clarified my doubts. Thanks again!

Hey Esha,

I guess in factor analysis the characteristic of the factors is that they explain the variability due to common factors, but the orthogonality of these factors is not necessary. But I am not very convinced that we can intuitively assume the factors to represent different sector returns.

I'll try and dig into this more and see if I can figure this out.

Hey Shreyes

I have a question yaar...

Why has the data frame been converted to a matrix before the princomp() function was applied?

Doesn't the princomp() function work on dataframes?

Good point Madhav, I did the princomp() on data.frames and it did the job.

But I had a problem while extracting the components via matrix multiplication (refer to the "pr.cp <- ret %*%" command above). I am sure there is a workaround for that too (though it might be a little tricky), but the one reason I can think of for inputting a matrix is that the output you obtain may be more amenable to manipulation later. In corporate lingo, this is referred to as standard operating procedures (SOPs).

So I would say that you could use it on data.frames too but I guess the handling becomes a bit of a problem then.

Hope that answers your question.
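The point in this exchange can be sketched on a toy data frame (the column names and values are made up for illustration): princomp() takes the data frame as is, while the %*% operator needs a numeric matrix, so the conversion is only required at the multiplication step.

```r
set.seed(11)
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))   # hypothetical returns
df$b <- df$b + df$a                                             # make two columns correlated

p <- princomp(df)                  # works directly on the data frame
w <- loadings(p)[, 1]

# convert to a matrix only where matrix multiplication is needed
pc1 <- as.numeric(as.matrix(df) %*% w)

# multiplying the data frame itself fails, which is the problem described above
res <- tryCatch(df %*% w, error = function(e) "error")
print(res)

# the PCA itself is the same either way
same.sdev <- all.equal(unname(p$sdev), unname(princomp(as.matrix(df))$sdev))
print(same.sdev)
```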

Hi,

Just a small comment on the use of the variable name load in the example. load is also a built-in function defined in R. Your using it as a variable shadows the old meaning, and this can lead to unintended consequences later on because R does not handle overloading of built-in names very gracefully.

~

ut

Thanks for pointing that out Ut, will keep that in mind.

Another issue with PCA (and factor analysis) is that it was originally used for cross-sectional data. It is not really ideal to use it on time series data, as it assumes no autocorrelation.

I suggest you look into dynamic principal components (and dynamic factor analysis), which allow the latent factors to have a lag structure.

Cheers,

Luke

Thanks for your comment Luke.

It is actually a good point; I have been meaning to look into Dynamic PC for further applications. Will post about it once I get it expedited. :-)

~

Shreyes

How can we know which variable each component (e.g. component 2) refers to?

Dear Anonymous,

load <- loadings(princ.return)[,1] (This command extracts the factor loadings of the 1st principal component.)

If you know the ordering of the variables, then you can simply refer to that row and get the desired weight. For example, "> loadings(princ.return)[2,2]" would give you the loading weight of the 2nd variable (as ordered by you in the input data) in the 2nd principal component.

Hope this answers your question.

~

Shreyes
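The lookup described in this reply can be sketched on a toy example. The variable names A to D are hypothetical, and naming the columns is an addition of this sketch so that entries can also be addressed by name rather than by position.

```r
set.seed(5)
x <- matrix(rnorm(400), ncol = 4) + rnorm(100)   # hypothetical correlated series
colnames(x) <- c("A", "B", "C", "D")             # hypothetical variable names
p <- princomp(x)

L <- unclass(loadings(p))     # plain matrix: rows = variables, columns = components
print(round(L, 3))

by.position <- L[2, 2]            # 2nd variable in the 2nd component
by.name     <- L["B", "Comp.2"]   # the same entry, addressed by name

# variable carrying the largest absolute weight in component 2
top <- names(which.max(abs(L[, "Comp.2"])))
print(top)
```

Every component is a weighted mix of all the variables, so no component "is" a single variable; the loadings only tell you which variables weigh most in it.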

Hi Shreyes, thanks for the sample analysis. I have survey data from 7000 respondents on 45 variables and was trying to eliminate redundant variables. I executed the above commands on my data set. I got the 1st component in numerical form: 7000 numerical values. Now how do I choose which components are important, how do they relate to my variables, and how do I group multiple variables or eliminate redundant ones?
