A financial market is an interesting place: people take positions (buying or selling) based on their expectations of what security prices will be, and they are rewarded or penalized according to the accuracy of those expectations. The beauty of financial markets is that they provide a platform where everyone can come in with their respective expectations, interact, and exchange securities. I emphasize everyone because this includes an auto-rickshaw driver, a clerk, and also sophisticated econometricians and analysts. An obvious point, then, is that if your expectations are consistently correct, i.e. you can predict price movements before they happen on the exchange, you are a rich man. Assuming for all practical purposes that there is no oracle in our universe who can make these predictions with 100% accuracy, the job of prediction rests upon the econometrician/statistician. Let's see if they can do a good job.
I took the stock returns data for INFOSYS (INFY on NSE) for the past one year and tried to see if I could make this data confess its underlying linear/non-linear generating process. I started by employing a rather simple, straightforward and easy-to-interpret runs test. It is a non-parametric statistical test of the null hypothesis that the underlying series is independently and identically distributed (IID). For those not too familiar with statistical parlance, non-parametric in simple terms means that we make no assumptions about what the underlying data should look like. There has been a huge surge in the application of non-parametric statistics to explain various processes, because the biggest deterrent to conducting these tests, namely the computational burden, is no longer a problem in this generation of rapid computation. The idea of empirical analysis is to theorize a null hypothesis and then try your best to bring it down using empirical evidence (analogous to Karl Popper's idea of falsification of a theory: you hang on to a theory so long as it has not betrayed you yet).
## Doing runs test on INFY daily returns
> library(tseries) ## runs.test() comes from the tseries package
> infy <- read.csv("01-10-2010-TO-01-10-2011INFYEQN.csv") ## Reading the stock price data
> infy_ret <- 100*diff(log(infy[,2])) ## Since the second column in the data has the stock prices, I have used [log(Pt) - log(Pt-1)]*100 as the returns.
> runs.test(factor(infy_ret > 0)) ## This creates a two-level factor that is TRUE if infy_ret > 0 and FALSE otherwise, then tests the runs of those signs.
What this does is tell me whether the runs of returns are predictable. If I represent a positive return by + and a negative return by -, my series of returns might look like +, +, -, +, -, -, -, +, ... The test then checks whether I can predict if the next day will be a + or a -.
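Under the hood, the test compares the observed number of runs with its expectation under the IID null. Here is a minimal sketch of that computation using the infy_ret vector created above and the standard Wald-Wolfowitz formulas; the internals of runs.test() in tseries may differ slightly in the details:
> signs <- infy_ret > 0 ## TRUE for up days, FALSE for down days
> n1 <- sum(signs); n2 <- sum(!signs) ## counts of each sign
> n_runs <- 1 + sum(diff(as.numeric(signs)) != 0) ## observed number of runs = 1 + number of sign changes
> ER <- 2*n1*n2/(n1 + n2) + 1 ## expected number of runs under the IID null
> VR <- 2*n1*n2*(2*n1*n2 - n1 - n2) / ((n1 + n2)^2 * (n1 + n2 - 1)) ## variance under the null
> z <- (n_runs - ER)/sqrt(VR) ## approximately standard normal test statistic
> 2*pnorm(-abs(z)) ## two-tailed p-value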
Output:
Runs Test
data: factor(infy_ret > 0)
Standard Normal = 0.1308, p-value = 0.8959 ## High p-value means you cannot trash your null hypothesis.
For those not familiar with statistics, the p-value is nothing but the probability of you rejecting a null hypothesis when it is actually true. So in simple words, it gives me the probability that I might end up rejecting a correct null hypothesis. (Be very careful with the interpretation of the p-value; people often misunderstand it, and many a time even I have fallen prey to this.) Therefore you cannot reject your null hypothesis under such a high probability of committing this error of wrongly rejecting a correct hypothesis; you just don't have enough evidence. Therefore your series is a random walk (you can understand this in the literal English sense, but the definition is not so trivial in time-series parlance).
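As a quick sanity check, you can simulate a series whose increments are IID by construction and confirm that the test produces a similarly large p-value (the seed and sample size below are arbitrary):
> set.seed(42) ## arbitrary seed, for reproducibility
> sim_ret <- rnorm(250) ## roughly one year of IID daily "returns"
> runs.test(factor(sim_ret > 0)) ## expect a large p-value; the null is true by construction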
P.S. In case you want to replicate this exercise, the data can be obtained from here.
Interesting post, but:
The p value is not "the probability of [rejecting] a null hypothesis when it is actually true"
...It is the probability of obtaining the data assuming the null hypothesis is true.
Dear Anonymous,
I think that the statement in the post meant
you cannot reject your null hypothesis under such a high probability of obtaining a test statistic at least as extreme as for this data
This was paraphrased as committing an error [by assuming that the test statistic is unlikely].
Also, saying that the p-value is the probability of obtaining the data assuming the null hypothesis is true is not completely correct. You probably meant obtaining a test statistic at least as extreme as the one observed, instead of just the data.
~
ut
Thanks for helping me out here, Utkarsh. :-)
Dear Anonymous,
As I had mentioned above, the interpretation of the p-value remains an elusive proposition, even for statisticians.
But I agree with Utkarsh; I believe what you were trying to say was "obtaining a test statistic as extreme as for the given data" and not the data itself.
Sir, I have a problem to discuss. I am doing research now but don't know the econometric concepts (I am a student of finance). I am checking the randomness of a data series using the runs test and the ADF test, but I don't know how to interpret them. Please help me out; I am posting the results of the tests below. Could you suggest what the test results state, and tell me why we take the first difference in these tests?
Runs test with first difference
Runs test (first difference)
Number of runs (R) in the variable 'Close' = 64
Under the null hypothesis of independence, R follows N(67.1818, 5.51174)
z-score = -0.577281, with two-tailed p-value 0.56375
Runs test assuming positive and negative values are equiprobable (without differencing)
Runs test (level)
Number of runs (R) in the variable 'Close' = 1
Under the null hypothesis of independence and equal probability of positive
and negative values, R follows N(73, 5.97913)
z-score = -12.0419, with two-tailed p-value 2.14009e-033
ADF test without differencing
Augmented Dickey-Fuller tests, order 1, for Close
sample size 142
unit-root null hypothesis: a = 1
test with constant
model: (1 - L)y = b0 + (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: 0.007
estimated value of (a - 1): -0.0149455
test statistic: tau_c(1) = -1.16104
asymptotic p-value 0.6934
with constant and trend
model: (1 - L)y = b0 + b1*t + (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: 0.007
estimated value of (a - 1): -0.0431525
test statistic: tau_ct(1) = -1.79382
asymptotic p-value 0.7081
ADF with first difference
Augmented Dickey-Fuller tests, order 1, for d_Close
sample size 141
unit-root null hypothesis: a = 1
test with constant
model: (1 - L)y = b0 + (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: 0.004
estimated value of (a - 1): -0.968427
test statistic: tau_c(1) = -8.49099
asymptotic p-value 1.727e-014
with constant and trend
model: (1 - L)y = b0 + b1*t + (a-1)*y(-1) + ... + e
1st-order autocorrelation coeff. for e: 0.004
estimated value of (a - 1): -0.969023
test statistic: tau_ct(1) = -8.46566
asymptotic p-value 5.077e-014
Expecting your quick reply.
Dear Tony,
These are a lot of results to begin with, so let me go through the tests briefly so that you have a fair idea.
Runs test (in levels):
The extremely small p-value suggests that you can easily reject the null hypothesis of independence. This means that your series in levels has some dependence on its previous values and hence some predictability.
Runs test (in differences):
The high p-value of 0.56 indicates that you cannot reject the null hypothesis of independence; thus your series in differences has little dependence on previous values and little or no predictability (a typical property of a stationary series).
The main reason we first-difference the data is to make it stationary. I have mentioned in previous posts how important it is, especially for time-series econometricians, to work with stationary series for analysis, because of their convenient statistical properties.
ADF test:
I am not entirely sure I follow the presentation of the gretl output here, but note that the unit-root null hypothesis a = 1 is equivalent to (a - 1) = 0, so the reported "estimated value of (a - 1)" is being tested against zero. I would be in a better position to help if you could provide the code and the data you used to arrive at these results.
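For reference, the same levels-versus-differences comparison can be run in R with the tseries package. A minimal sketch, assuming a hypothetical vector close_px holding the 'Close' series from your data (adf.test() picks its own default lag order, so the numbers will not match gretl's order-1 output exactly):
> library(tseries) ## provides adf.test()
> adf.test(close_px) ## levels: should fail to reject the unit-root null, as in your output
> adf.test(diff(close_px)) ## first differences: should reject it comfortably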
Also, I would encourage you to read this post for some conceptual clarity: http://programming-r-pro-bro.blogspot.in/2011/12/movement-around-mean-stationary-or-unit.html
Hope this helps.
~
Shreyes
This post has made me realize how complicated it is to predict stock market movements.