Friday, 21 October 2011

Predictability of stock returns: Using runs.test()

A financial market is an interesting place: you find people taking positions (buying/selling) based on their expectations of what security prices will be, and being rewarded/penalized according to the accuracy of those expectations. The beauty of financial markets is that they provide a platform for everyone to come in with their respective expectations and allow them to interact and exchange securities. I emphasize everyone because this everyone includes an auto-rickshaw driver, a clerk, and also sophisticated econometricians and analysts. An obvious point, then, is that if your expectations are consistently correct, i.e. you can predict the price movements before they happen on the exchange, you are a rich man. Assuming for all practical purposes that there is no oracle in our universe who can make these predictions with 100% accuracy, the job of prediction rests upon the econometrician/statistician. Let's see if they can do a good job.

I took the stock returns data for INFOSYS (INFY on NSE) for the past one year and tried to see if I could make this data confess its underlying linear/non-linear generating process. I started by employing a rather simple, straightforward and easy-to-interpret runs test. It is a non-parametric statistical test of the null hypothesis that the underlying series is independently and identically distributed (i.i.d.). For those who are not too familiar with statistical parlance, non-parametric in simple terms means that we have to make no assumptions about what the underlying data should look like. There has been a huge surge in the applications of non-parametric statistics, because the biggest deterrent to conducting these kinds of tests, i.e. the computational burden, is no longer a problem in this generation of rapid computation. The idea of empirical analysis is to theorize a null hypothesis and then try your best to bring it down using empirical evidence (analogous to Karl Popper's idea of falsification of a theory: you hang on to a theory so long as it has not betrayed you yet).

## Doing runs test on INFY daily returns
> library(tseries)  ## runs.test() lives in the tseries package

> infy <- read.csv("01-10-2010-TO-01-10-2011INFYEQN.csv")  ## Reading the stock price data

> infy_ret <- 100*diff(log(infy[,2]))  ## The second column of the data has the stock prices, so I use [log(Pt) - log(Pt-1)]*100 as the returns

> runs.test(factor(infy_ret > 0))  ## This creates a two-level factor that is TRUE if infy_ret > 0 and FALSE otherwise

What this tells me is whether the runs of the returns are predictable. Say I represent a positive return by + and a negative return by -; then my series of returns might look like +, +, -, +, -, -, -, +, ... The test then checks whether I can predict if the next day will bring a + or a -.
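To make the mechanics concrete, here is a hand-rolled sketch of the Wald-Wolfowitz runs statistic on a toy +/- series (the variable names are my own; tseries::runs.test computes essentially the same z-statistic):

```r
## A toy +/- series: +, +, -, +, -, -, -, +
signs <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)

n1 <- sum(signs)    ## number of + days
n2 <- sum(!signs)   ## number of - days

## A "run" is a maximal streak of identical signs;
## a new run starts wherever the sign flips
runs <- 1 + sum(diff(as.numeric(signs)) != 0)

## Expected number of runs and its variance under the i.i.d. null
mu   <- 2 * n1 * n2 / (n1 + n2) + 1
sig2 <- 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) /
        ((n1 + n2)^2 * (n1 + n2 - 1))

z <- (runs - mu) / sqrt(sig2)
p_value <- 2 * pnorm(-abs(z))  ## two-sided p-value
```

Too few runs (long streaks) or too many runs (rapid flipping) both push |z| up and the p-value down, which is exactly the kind of sign-level dependence the test is built to detect.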


Output:

Runs Test

data: factor(infy_ret > 0)
Standard Normal = 0.1308, p-value = 0.8959  ## A high p-value means you cannot reject your null hypothesis.

For those not familiar with statistics, the p-value is nothing but the probability of rejecting a null hypothesis when it is actually true. So in simple words it gives me the probability that I might end up rejecting a correct null hypothesis. (Be very careful with the interpretation of the p-value; people often end up misunderstanding it, and at times even I have fallen prey to this.) Therefore you cannot reject your null hypothesis under such a high probability of wrongly rejecting a correct hypothesis; you just don't have enough evidence. Therefore your series is a random walk (you can understand this in the literal English sense, but the definition is not so trivial in time-series parlance).
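As a sanity check on this interpretation, you can simulate the two extremes: i.i.d. noise, where the test should typically fail to reject, and a strongly persistent AR(1) series, where sign streaks make it reject. (The seed, sample size and AR coefficient below are arbitrary choices of mine.)

```r
library(tseries)
set.seed(42)

## i.i.d. returns: signs are unpredictable, so the p-value is usually large
iid_ret <- rnorm(250)
t_iid <- runs.test(factor(iid_ret > 0))

## Persistent AR(1) series: signs cluster into long streaks,
## so the test rejects independence
ar_ser <- as.numeric(arima.sim(list(ar = 0.9), n = 250))
t_ar <- runs.test(factor(ar_ser > 0))
```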

P.S In case you want to replicate this exercise the data can be obtained from here.

5 comments:

  1. Interesting post, but:
    The p value is not "the probability of [rejecting] a null hypothesis when it is actually true"

    ...It is the probability of obtaining the data assuming the null hypothesis is true.

    ReplyDelete
  2. Dear Anonymous,

    I think that the statement in the post meant

    you cannot reject your null hypothesis under such a high probability of obtaining the test statistic as extreme as for this data


    This was paraphrased as committing an error [by assuming that the test-statistic is unlikely].

    Also, saying that p-value is the probability of obtaining the data assuming the Null hypothesis is true is not completely correct. You probably meant obtaining the test statistic at least as extreme as for this data instead of just the data.

    ~
    ut

    ReplyDelete
  3. Thanks for helping me out here Utkarsh. :-)

    Dear Anonymous,

    As I mentioned above, the interpretation of the p-value remains an elusive proposition even for statisticians.

    But I agree with Utkarsh; I believe what you were trying to say was "obtaining a test statistic at least as extreme as for the given data" and not the data itself.

    ReplyDelete
  4. Sir, I have a problem to discuss. I am doing research now but don't know the econometric concepts; I am a student of finance. I am checking the randomness in the data series using the runs test and the ADF test but don't know how to interpret them. Please help me out; I am posting the results of the tests. What do the test results state? And why do we take the first difference in these tests?

    Runs test with first difference

    Runs test (first difference)

    Number of runs (R) in the variable 'Close' = 64
    Under the null hypothesis of independence, R follows N(67.1818, 5.51174)
    z-score = -0.577281, with two-tailed p-value 0.56375


    Runs test assuming positive and negative are equiprobable (without difference)

    Runs test (level)

    Number of runs (R) in the variable 'Close' = 1
    Under the null hypothesis of independence and equal probability of positive
    and negative values, R follows N(73, 5.97913)
    z-score = -12.0419, with two-tailed p-value 2.14009e-033


    ADF without taking difference
    Augmented Dickey-Fuller tests, order 1, for Close
    sample size 142
    unit-root null hypothesis: a = 1

    test with constant
    model: (1 - L)y = b0 + (a-1)*y(-1) + ... + e
    1st-order autocorrelation coeff. for e: 0.007
    estimated value of (a - 1): -0.0149455
    test statistic: tau_c(1) = -1.16104
    asymptotic p-value 0.6934

    with constant and trend
    model: (1 - L)y = b0 + b1*t + (a-1)*y(-1) + ... + e
    1st-order autocorrelation coeff. for e: 0.007
    estimated value of (a - 1): -0.0431525
    test statistic: tau_ct(1) = -1.79382
    asymptotic p-value 0.7081

    ADF with first difference

    Augmented Dickey-Fuller tests, order 1, for d_Close
    sample size 141
    unit-root null hypothesis: a = 1

    test with constant
    model: (1 - L)y = b0 + (a-1)*y(-1) + ... + e
    1st-order autocorrelation coeff. for e: 0.004
    estimated value of (a - 1): -0.968427
    test statistic: tau_c(1) = -8.49099
    asymptotic p-value 1.727e-014

    with constant and trend
    model: (1 - L)y = b0 + b1*t + (a-1)*y(-1) + ... + e
    1st-order autocorrelation coeff. for e: 0.004
    estimated value of (a - 1): -0.969023
    test statistic: tau_ct(1) = -8.46566
    asymptotic p-value 5.077e-014
    expecting your quick reply

    ReplyDelete
    Replies
    1. Dear Tony,

      These are a lot of results to begin with, so let me walk through the tests so that you have a fair idea.

      Runs test (in levels):

      The extremely small p-value suggests that you can easily reject the null hypothesis of independence. This means that your series in levels has some dependence on previous values and hence some predictability.

      Runs test (in differences):

      The high p-value of 0.56 indicates that you cannot reject the null hypothesis of independence; thus your series in differences has little dependence on previous values and little or no predictability. (This is a typical property of a stationary series.)

      The main reason we first-difference the data is to make it stationary. I have mentioned in my previous posts how important it is, especially for time-series econometricians, to work with stationary series, because of their convenient statistical properties.
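      A quick way to see this is to difference a simulated random walk, the textbook unit-root series (the seed and length here are arbitrary, and I am using adf.test from the same tseries package):

      ```r
      library(tseries)
      set.seed(1)

      ## A random walk: non-stationary in levels
      rw <- cumsum(rnorm(200))

      ## Its first difference is just the i.i.d. innovations, hence stationary
      d_rw <- diff(rw)

      adf_lvl <- adf.test(rw)    ## levels: unit-root null typically not rejected
      adf_dif <- adf.test(d_rw)  ## differences: rejected decisively
      ```

      The differenced series should give a far smaller ADF p-value than the levels series, mirroring your gretl output.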

      ADF test:

      On the ADF test: note that the unit-root null a = 1 corresponds to (a-1) = 0 in the reported regression. So your estimate of (a-1) close to 0 in levels is consistent with a unit root, while the estimate close to -1 in differences (with a tiny p-value) rejects it. I would be in a better position to help if you could share the code and the data you used to arrive at these results.

      Also, I would encourage you to read this post for some conceptual clarity: http://programming-r-pro-bro.blogspot.in/2011/12/movement-around-mean-stationary-or-unit.html

      Hope this helps.

      ~
      Shreyes

      Delete