This post is about the implementation using the Java API. If you want to see the same example of calculating moving averages with Spark MLlib but using Scala, please check out this post.
Introduction
Since I began learning Spark, one of the things that has given me the most pain is the Java API that wraps the Scala API used in Spark. So, I would like to share a problem I had using the Spark machine learning library, in case it is useful to anybody learning Spark. It is also worth taking a look at the possibilities that sliding windows give us when working with time series data.
The problem
As I wrote in my previous post, I created a new independent module to run studies on stock data using Spark. One of the most interesting (and most basic) studies one can perform on stock data is calculating simple moving averages.
What is a simple moving average (SMA)?
If we take a time window of length N, the SMA is:
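With p_M, p_{M-1}, ..., p_{M-(N-1)} denoting the last N data points (for example, the last N closing prices), the standard definition is:

$$\mathrm{SMA} = \frac{p_M + p_{M-1} + \dots + p_{M-(N-1)}}{N} = \frac{1}{N}\sum_{i=0}^{N-1} p_{M-i}$$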
Source: Wikipedia, https://en.wikipedia.org/wiki/Moving_average
One possible solution: Spark MLlib
Searching around on the internet, I found that the Spark machine learning library (MLlib) has a sliding window operation over an RDD. Using this, it is very easy to calculate the SMA, as described in this StackOverflow answer.
Note: the RDD must be sorted before applying the sliding function!
The caveat is that the previous code is written in Scala. How would it look in Java?
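To give a feel for the translation, here is a minimal sketch of what the sliding-window SMA might look like against the Java API. This is only an illustrative sketch, not necessarily the code discussed in the rest of this post: the class name, the local master and the sample prices are placeholders, and it assumes Java 8 lambdas plus a Spark version in which org.apache.spark.mllib.rdd.RDDFunctions exposes sliding().

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.rdd.RDDFunctions;

public class SimpleMovingAverageSketch {

    // Computes the SMA over a JavaRDD of prices that is already sorted by date.
    static JavaRDD<Double> simpleMovingAverage(JavaRDD<Double> sortedPrices, int windowSize) {
        // sliding() lives in the Scala class RDDFunctions, so from Java we go through
        // fromRDD(...) and pass the RDD's ClassTag explicitly.
        return RDDFunctions.fromRDD(sortedPrices.rdd(), sortedPrices.classTag())
                .sliding(windowSize)
                .toJavaRDD()
                .map(window -> {
                    // sliding() produces a Scala Array[T]; from a JavaRDD (Object-based ClassTag)
                    // it arrives as an Object[] that we have to cast and unbox ourselves.
                    Object[] w = (Object[]) window;
                    double sum = 0.0;
                    for (Object price : w) {
                        sum += (Double) price;
                    }
                    return sum / w.length;
                });
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("sma-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy data: five closing prices, already in chronological order.
        JavaRDD<Double> prices = sc.parallelize(Arrays.asList(10.0, 11.0, 12.0, 13.0, 14.0));

        // Expected output for a window of 3: 11.0, 12.0, 13.0
        simpleMovingAverage(prices, 3).collect().forEach(System.out::println);

        sc.stop();
    }
}
```

The cast to Object[] is exactly the kind of friction the introduction complains about: the Scala API returns Array[T], and from the Java side there is no typed view of it, so the unboxing has to be done by hand.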