Spark: structure of intraday data of 100 stocks to efficiently calculate moving averages etc. per stock

Question


I'm new to Spark and need to do an assignment in which I will predict the stock price direction using Random Forests.

To do this I need to calculate certain features such as the moving average. I have already read in my data (100 CSV files, each with 6 columns: time, open, close, high, low, volume) using wholeTextFiles, so I now have an RDD of file names and file contents. What is the most efficient way to transform this RDD so that I can calculate the moving average of the close column per stock? Should I make an RDD for every stock, or should I use a DataFrame, or something else?
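For concreteness, here is a minimal sketch of the kind of first transformation I have in mind, applied to the data RDD from the snippet below. It assumes my files have no header row and that the close price is the third column; both are assumptions about my files, not something the question guarantees:

val closeByStock: RDD[(String, Array[Double])] =
  data.map { case (fileName, content) =>
    // Split the file body into lines and pull out the close column.
    val closes = content.split("\n")
      .filter(_.trim.nonEmpty)
      .map(_.split(",")(2).toDouble) // index 2 = close (assumed column order)
    (fileName, closes)
  }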

Thanks in advance for any help provided!

Code snippet:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

object TradingRules {
  val conf = new SparkConf().setAppName("Thesis").setMaster("local")
  val sc = new SparkContext(conf)

  def main (args: Array[String]) {
    /**We import all the csv files of the different stocks and put them in an RDD
      * with key=file name(stock) and value=file content*/
    val data = sc.wholeTextFiles("C:\\Users\\Giselle\\OneDrive\\Thesis\\DATA")

   /** HOW TO TRANSFORM THE DATA TO BE ABLE TO CALCULATE THE MOVING AVERAGE PER STOCK FOR THE CLOSING PRICE? */
  }

  /** To calculate the simple moving average we slide a window over the
    * last five time stamps and average the values in each window. */
  def simpleMA(rdd: RDD[Double]): RDD[Double] = {
    rdd.sliding(5) // RDD[Array[Double]], via mllib's RDDFunctions
      .map(curSlice => curSlice.sum / curSlice.size)
  }

}
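
For reference, this is how I would expect to call simpleMA once a single stock's close prices are in an RDD[Double] (toy values, just to illustrate):

val closes: RDD[Double] = sc.parallelize(Seq(10.0, 10.5, 11.0, 10.8, 11.2, 11.5, 11.3))
simpleMA(closes).collect().foreach(println) // prints the three 5-period averages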

Source: https://stackoverflow.com/questions/33650880/spark-structure-of-intraday-data-of-100-stocks-to-efficiently-calculate-moving
