Question
I'm reading in a text file using Spark with sc.textFile(fileLocation) and need to be able to quickly drop the first and last rows (they could be a header or trailer). I've found good ways of returning the first and last rows, but no good way of removing them. Is this possible?
Answer 1:
One way of doing this would be to zipWithIndex, and then filter out the records with indices 0 and count - 1:
// We're going to perform multiple actions on this RDD,
// so it's usually better to cache it so we don't read the file twice
rdd.cache()
// Unfortunately, we have to count() to be able to identify the last index
val count = rdd.count()
// Note: this collect takes a partial function and is a transformation,
// not the parameterless collect() action
val result = rdd.zipWithIndex().collect {
  case (v, index) if index != 0 && index != count - 1 => v
}
Do note that this might be rather costly in terms of performance (if you cache the RDD, you use up memory; if you don't, you read the file twice). So, if you have any way of identifying these records based on their contents (e.g. if you know all records but these should contain a certain pattern), using filter would probably be faster.
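For example, if the header and trailer are the only lines carrying some recognizable marker, a single filter pass needs neither count() nor cache(). A minimal sketch, assuming a purely hypothetical "#" prefix on both the header and the trailer and on no data line:

// Hypothetical: header and trailer start with "#", data lines don't;
// adjust the predicate to whatever actually distinguishes your records
val lines = sc.textFile(fileLocation)
val result = lines.filter(line => !line.startsWith("#"))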
Answer 2:
This might be a lighter version:
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6), 3)
val partitions = rdd.getNumPartitions
val rddFirstLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1)                                   // drop the first element of the first partition
  else if (idx == partitions - 1) iter.sliding(2).map(_.head)  // drop the last element of the last partition
  else iter
}
scala> rddFirstLast.collect()
res3: Array[Int] = Array(2, 3, 4, 5)
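Applied to the text file from the question, the same idea might look like the following sketch (fileLocation is the question's path; it assumes the header lands in the first partition and the trailer in the last, i.e. neither of those partitions is empty):

val lines = sc.textFile(fileLocation)
val numPartitions = lines.getNumPartitions
val body = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1)                                      // header is in the first partition
  else if (idx == numPartitions - 1) iter.sliding(2).map(_.head)  // trailer is in the last partition
  else iter
}

One caveat: if the last partition holds only a single line, iter.sliding(2) may emit a partial window and that line survives, so this approach is safest when the partitions are known to be reasonably full.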
Answer 3:
Here is my take on it: it requires an action (count), but it always gives the expected result, independent of the number of partitions.
val rddRowCount = rdd.count()
val rddWithIndices = rdd.zipWithIndex()
val filteredRddWithIndices = rddWithIndices.filter(eachRow =>
  if (eachRow._2 == 0) false                     // drop the first row
  else if (eachRow._2 == rddRowCount - 1) false  // drop the last row
  else true
)
val finalRdd = filteredRddWithIndices.map(eachRow => eachRow._1)  // strip the indices
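For illustration, the same logic can be written as a single chain with pattern matching (a sketch, behavior unchanged):

val finalRdd = rdd.zipWithIndex()
  .filter { case (_, index) => index != 0 && index != rddRowCount - 1 }
  .map { case (row, _) => row }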
Source: https://stackoverflow.com/questions/45105739/dropping-the-first-and-last-row-of-an-rdd-with-spark