Question
I'm new to Spark and need to do an assignment in which I will predict the stock price direction using Random Forests.
To do this I need to calculate certain features, such as a moving average. I have already read in my data (100 CSV files, each with 6 columns: time, open, close, high, low, volume) using wholeTextFiles, so I now have an RDD of (file name, file content) pairs. What is the most efficient way to transform this RDD so that I can calculate the moving average of the close column for each stock? Should I make an RDD for every stock, or should I use a DataFrame, or...?
Thanks in advance for any help provided!
Code snippet:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

object TradingRules {
  val conf = new SparkConf().setAppName("Thesis").setMaster("local")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    /** We import all the CSV files of the different stocks and put them in an RDD
      * with key = file name (stock) and value = file content. */
    val data = sc.wholeTextFiles("C:\\Users\\Giselle\\OneDrive\\Thesis\\DATA")
    /** HOW TO TRANSFORM THE DATA TO BE ABLE TO CALCULATE MOVING AVERAGE PER STOCK FOR CLOSING PRICE */
  }

  /** To calculate the simple moving average we slide a window over the
    * last five time stamps and take the average of each window. */
  def simpleMA(rdd: RDD[Double]): RDD[Double] = {
    // The input is already an RDD, so sc.parallelize is not needed; `sliding`
    // comes from mllib's RDDFunctions import. Returning the mapped RDD (rather
    // than collect()-ing to an Array) matches the declared return type.
    rdd.sliding(5)
      .map(curSlice => curSlice.sum / curSlice.size)
  }
}
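One idea I had: since wholeTextFiles yields exactly one (file name, file content) record per stock, each stock's whole series is local to a single record, so the window average could be computed inside a single map. Below is a minimal sketch of that approach; it assumes comma-separated rows with no header and the close price in the third column (index 2), and the window size of 5 and the variable names are just illustrative:

/** One possible transformation: parse each stock's file content into its
  * close prices and compute a 5-period simple moving average per stock.
  * Assumes comma-separated rows, no header, close price at index 2. */
val movingAverages: RDD[(String, Array[Double])] = data.map { case (file, content) =>
  // Split the file content into rows and parse the close column as a Double.
  val closes = content.split("\n").map(_.split(",")(2).trim.toDouble)
  // Scala's own sliding on the local array suffices here, because each
  // stock's series sits entirely inside one record; no RDD-level sliding
  // is needed.
  val ma = closes.sliding(5).map(w => w.sum / w.size).toArray
  (file, ma)
}

I am not sure whether this beats converting to a DataFrame and using window functions; with only 100 files it at least keeps everything simple, but I would welcome a more scalable alternative.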
Source: https://stackoverflow.com/questions/33650880/spark-structure-of-intraday-data-of-100-stocks-to-efficiently-calculate-moving