Dropping the first and last row of an RDD with Spark

限于喜欢 submitted on 2020-05-26 09:26:30

Question


I'm reading in a text file using Spark with sc.textFile(fileLocation) and need to quickly drop the first and last rows (they could be a header and a trailer). I've found good ways of returning the first and last rows, but no good way of removing them. Is this possible?


Answer 1:


One way of doing this would be to zipWithIndex, and then filter out the records with indices 0 and count - 1:

// We're going to perform multiple actions on this RDD,
// so it's usually better to cache it so we don't read the file twice
rdd.cache()

// Unfortunately, we have to count() to be able to identify the last index
val count = rdd.count()
val result = rdd.zipWithIndex().collect {
  case (v, index) if index != 0 && index != count - 1 => v
}

Do note that this might be rather costly in terms of performance: if you cache the RDD, you use up memory; if you don't, you read the RDD twice. So, if you have any way of identifying these records based on their contents (e.g. if you know that all records except these contain a certain pattern), using filter would probably be faster.
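As a minimal sketch of the content-based approach, assume (hypothetically) that the header line starts with "HDR" and the trailer line starts with "TRL". These markers and the names isDataLine and dropMarkedLines are illustrative only and do not come from the question:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical markers: assumes the header starts with "HDR" and the trailer with "TRL"
def isDataLine(line: String): Boolean =
  !line.startsWith("HDR") && !line.startsWith("TRL")

// A single narrow transformation: no count(), no caching, no second read of the file
def dropMarkedLines(lines: RDD[String]): RDD[String] =
  lines.filter(isDataLine)
```

Note that this removes every line matching the markers, not just the first and last ones, so it only fits when ordinary records can never look like a header or trailer.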




Answer 2:


This might be a lighter version:

val rdd = sc.parallelize(Array(1,2,3,4,5,6), 3)
val partitions = rdd.getNumPartitions
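// Assumes the first and last partitions are non-empty
val rddFirstLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Drop the first element of the first partition
  val body = if (idx == 0) iter.drop(1) else iter
  // In the last partition, keep the head of each full two-element window,
  // which omits the final element; withPartial(false) covers a one-element partition
  if (idx == partitions - 1) body.sliding(2).withPartial(false).map(_.head)
  else body
}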

scala> rddFirstLast.collect()
res3: Array[Int] = Array(2, 3, 4, 5)
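The sliding(2) trick can be tried on plain Scala iterators, outside Spark. The sketch below (the helper name dropLast is hypothetical) also shows an edge case: with a single element, sliding(2) by default yields one partial window, so map(_.head) would return that element unless withPartial(false) is set.

```scala
// Drops the last element of an iterator lazily, without knowing its length.
// Each full window (a, b) contributes a, so the final element is never emitted.
def dropLast[A](iter: Iterator[A]): Iterator[A] =
  iter.sliding(2).withPartial(false).map(_.head)

val several = dropLast(Iterator(5, 6, 7)).toList  // List(5, 6)
val single = dropLast(Iterator(5)).toList         // List()

// Default behaviour: the lone element survives as a partial window
val defaultSingle = Iterator(5).sliding(2).map(_.head).toList  // List(5)
```

Because only the first and last partitions are touched and no count() is needed, this stays a single pass over the data, at the price of relying on how the rows are laid out across partitions.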



Answer 3:


Here is my take on it. It requires an action (count), but it always gives the expected result, regardless of the number of partitions.

val rddRowCount = rdd.count()
val rddWithIndices = rdd.zipWithIndex()
val finalRdd = rddWithIndices
  .filter { case (_, index) => index != 0 && index != rddRowCount - 1 }
  .map { case (row, _) => row }
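To check the partition-independence claim, the same approach can be wrapped in a function and run with different partition counts. This is a sketch only: the name dropFirstAndLast is hypothetical, and the usage lines assume a live SparkContext named sc, as in spark-shell.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper: drops the rows at index 0 and index count - 1.
// count() and zipWithIndex() each trigger a job, so consider caching the input.
def dropFirstAndLast[T: ClassTag](rdd: RDD[T]): RDD[T] = {
  val lastIndex = rdd.count() - 1
  rdd.zipWithIndex()
    .filter { case (_, index) => index != 0L && index != lastIndex }
    .map { case (row, _) => row }
}

// Usage (assumes `sc` exists): the result is the same for 1, 3, or 6 partitions
// Seq(1, 3, 6).foreach { n =>
//   println(dropFirstAndLast(sc.parallelize(1 to 6, n)).collect().mkString(", "))
// }
```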


Source: https://stackoverflow.com/questions/45105739/dropping-the-first-and-last-row-of-an-rdd-with-spark
