Wednesday, 4 January 2012

Memoization in R: an illustrative example

I came across a nice problem at Project Euler that gave me an unusual sense of satisfaction, I think because, unlike this time, I don't usually get the solution right on the first attempt. The problem (Problem 14) is about the Collatz iteration: starting from any positive integer, repeatedly halve it if it is even, or replace it with three times itself plus one if it is odd, until you reach 1; the task is to find which starting number under one million produces the longest such chain. Anyhow, I shall try to decode the R code that I used in simple English and mathematics.



Let me first illustrate the brute-force method, which is usually the method used by novice coders like myself. The idea here is to find the starting number below 1 million that gives the maximum number of the above-mentioned iterations.
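
In outline, the brute-force code looked like this (a minimal sketch; the function name chain_length and the bookkeeping variables are illustrative, not necessarily the exact code I ran):

limit <- 1000000

# number of Collatz steps needed to get from n down to 1
chain_length <- function(n) {
  steps <- 0
  while (n != 1) {
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1
  }
  steps
}

best_n <- 1
best_len <- 0
for (n in 1:(limit - 1)) {   # starting numbers below one million
  len <- chain_length(n)
  if (len > best_len) {      # counter keeps track of the longest chain so far
    best_len <- len
    best_n <- n
  }
}
best_n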



So what I have done above is simply perform the iteration for each and every integer from 1 to 1 million and, using a counter variable, keep track of which number gave the largest number of iterations, which is what we needed in the end. The idea was straightforward; the only challenge was to come up with that cute function (which, we now see, is not that challenging after all).

Well, now that the novice part is done, let's get to what Utkarsh (my pro bro) had to say about this. My code took ~701.75 seconds to run (on my Sony VAIO VPCCW12EN), which was completely fine by me. Utkarsh shrugged at my code in his usual nonchalant manner and came up with an awesome algorithm to optimize the above calculation and save some precious time, a technique he referred to as memoization.

The idea he worked on was that since in many cases we would already have computed the number of iterations for a value, there was no need to compute it again. In the example from the question, 13 -> 40 -> 20 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1. Now, in the computation for 13, if we already know that 10 iterates 6 more times, we do not have to go all the way down to 1. Similarly, for 10 itself, if we know that 5 iterates 5 more times, we don't need to walk all the way back to 1. This will become clearer when we look at the code.
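
In outline, the memoized version looks like this (again a sketch with illustrative names; memo[k] == -1 marks a count that has not been computed yet):

limit <- 1000000
memo <- rep(-1, limit)   # memo[k] = number of steps from k down to 1
memo[1] <- 0             # 1 needs no further iterations

best_n <- 1
best_len <- 0
for (n in 2:(limit - 1)) {
  k <- n
  steps <- 0
  # iterate only until we hit a value whose count is already stored
  while (k > limit || memo[k] < 0) {
    k <- if (k %% 2 == 0) k / 2 else 3 * k + 1
    steps <- steps + 1
  }
  memo[n] <- steps + memo[k]   # reuse the stored count instead of walking to 1
  if (memo[n] > best_len) {
    best_len <- memo[n]
    best_n <- n
  }
}
best_n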



Utkarsh's code took ~50 seconds, so my version was roughly 14 times slower than this optimized algorithm. I would be glad to know if there is any other optimization technique (apart from using supercomputers) that might reduce the computation time; please share if you can find a better way of coding this in R.

Comments:

  1. I know, the simplicity and sheer logic are fascinating :-)

  2. Nice post. I can't think of any way to add a really significant improvement, but for a start you could replace:

    memo <- rep(-1, limit)
    memo[1] <- 0

    with this single line:

    memo <- vector("numeric", limit)

  3. Thank you for your comment, tonybreyal.

    I should have done that, thanks for the suggestion. :-)

  4. Thinking about it, a speed-up of roughly 17% is possible (in the code at the end of this comment) if we make more efficient use of memo by recording counts as we go along, so as not to calculate them more than once:

    #            test replications elapsed relative
    # single_call2(n)            1  11.020 1.000000
    #  single_call(n)            1  12.922 1.172595

    The code is here (sorry, I don't know how to post R code in comments):

    https://gist.github.com/1566992
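
    In rough outline, the trick is to remember every value visited on the way down and then backfill their counts once a value with a known count is reached; a sketch of the idea for the main loop (not the gist's exact code):

    for (n in 2:(limit - 1)) {
      k <- n
      path <- integer(0)             # values whose counts we can backfill
      while (k > limit || memo[k] < 0) {
        if (k <= limit) path <- c(path, k)
        k <- if (k %% 2 == 0) k / 2 else 3 * k + 1
      }
      count <- memo[k]
      for (v in rev(path)) {         # each value sits one step further from 1
        count <- count + 1
        memo[v] <- count
      }
    }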

  5. That is quite an elegant way of saving the future values for memo.

    Actually, the problem itself is stated in a recursive manner, which begs for a recursive solution. In that recursive solution, saving the future values of memo is not only easier but also natural. In Python or C, I would implement it recursively, with memo as a global array or passed by reference.

    Unfortunately, because of the functional nature of R, passing and returning an array the size of memo by value becomes prohibitively expensive, and although one can alter global variables in R (I have done it through R.oo), that seems like using a cannon to kill a fly.

    Your code fixes that problem quite elegantly in the iterative version, but I am still wondering whether there is a recursive solution using memoization which runs as fast as your iterative solution without using any deprecated R features.
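
    Perhaps the built-in superassignment operator <<- is lightweight enough for this; a rough sketch (untimed, and the names are illustrative):

    limit <- 1000000
    memo <- rep(-1, limit)
    memo[1] <- 0

    chain_rec <- function(n) {
      # return the cached count if we already have one
      if (n <= limit && memo[n] >= 0) return(memo[n])
      nxt <- if (n %% 2 == 0) n / 2 else 3 * n + 1
      len <- 1 + chain_rec(nxt)
      if (n <= limit) memo[n] <<- len   # <<- writes to the outer memo vector
      len
    }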

    ~
    Ut.

  6. The memory and function call overhead of doing recursion in R is very unfortunate indeed :(

    The compiler package might help speed things up if it does some sort of tail recursion optimisation.

    hmm, there's one quick way of speeding things up in our non-recursive functions - just compile them:

    library(compiler)
    single_call_compiled <- cmpfun(single_call)
    single_call2_compiled <- cmpfun(single_call2)


                            test elapsed relative
    2  single_call_compiled(n)   27.065 1.000000
    4 single_call2_compiled(n)   34.915 1.290042
    3          single_call2(n)  107.220 3.961574
    1           single_call(n)  126.274 4.665583

    The compiled version of your function comes out as the fastest in this case.

  7. Thanks, Utkarsh and tonybreyal, for sharing these useful insights.

    The intricacies of the code above would be a little difficult for an R newbie to understand, but the idea of storing "future memo" values makes perfect intuitive sense.

    Thank you for the code; keep sharing.

    ~
    Shreyes
