Filter RDD based on row_number

Happy的楠姐 2021-02-08 05:02

sc.textFile(path) lets you read an HDFS file, but it does not accept parameters (such as a number of rows to skip, has_headers, ...).

in the "Learning Spark" O'Reilly e-boo

1 answer
  •  有刺的猬
    2021-02-08 05:38

    Don't worry about loading the rows/lines you don't need. When you do:

    input = sc.textFile(inputFile)
    

    you are not loading the file. You are just getting an object that will allow you to operate on the file. So to be efficient, it is better to think in terms of getting only what you want. For example:

    header = input.take(1)[0]
    rows = input.filter(lambda line: line != header)
    

    Note that here I am not using an index to refer to the line I want to drop, but rather its value. This has the side effect that any other line equal to the header is dropped as well, but it is more in the spirit of Spark: Spark splits your text file into partitions distributed across the nodes, and the concept of a line number gets lost within each partition. This is also why this is not easy to do in Spark (Hadoop): each partition should be considered independent, and a global line number would break this assumption.
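If dropping every line equal to the header is not acceptable, one possible alternative is to drop only the first line of the first partition. The following is a minimal sketch, assuming the header is the first line of partition 0 (which should hold for a single input file). `drop_header` is a hypothetical helper name, and the partitions are modelled as plain Python lists so the snippet runs without a cluster:

```python
from itertools import islice


def drop_header(partition_index, lines):
    """Skip the first line of partition 0; pass every other partition through."""
    if partition_index == 0:
        return islice(lines, 1, None)
    return lines


# Plain-Python stand-in for an RDD split into two partitions.
# Note the data line "id,name" that happens to equal the header.
partitions = [["id,name", "1,a", "id,name"], ["2,b", "3,c"]]
rows = [line
        for index, part in enumerate(partitions)
        for line in drop_header(index, iter(part))]
print(rows)
```

With a real RDD, the same function could presumably be passed as `input.mapPartitionsWithIndex(drop_header)`, since that method calls its function with the partition index and an iterator over the partition's elements. Unlike the value-based filter, this keeps data lines that happen to equal the header.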

    If you really need to work with line numbers, I recommend that you add them to the file outside of Spark and then filter on that column inside Spark.
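As a rough illustration of numbering the lines outside of Spark, the sketch below prefixes each line with its 0-based index using only the Python standard library. The helper name `add_line_numbers`, the tab separator, and the file names are assumptions made for this example:

```python
import os
import tempfile


def add_line_numbers(src_path, dst_path, sep="\t"):
    """Write each line of src_path to dst_path, prefixed with its 0-based index."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for number, line in enumerate(src):
            dst.write(f"{number}{sep}{line}")


# Small demonstration on a temporary file.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "test.txt")
dst_path = os.path.join(workdir, "test_numbered.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("First\nSecond\nThird\n")

add_line_numbers(src_path, dst_path)
with open(dst_path, encoding="utf-8") as f:
    numbered = f.read().splitlines()
print(numbered)
```

Inside Spark, the numbered file could then be filtered on the first column, for example with `filter(lambda l: int(l.split("\t", 1)[0]) != 5)`.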

    Edit: Added zipWithIndex solution as suggested by @Daniel Darabos.

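    # Parentheses replace the backslash continuations: a comment after a
    # trailing backslash is a syntax error in Python.
    (sc.textFile('test.txt')
       .zipWithIndex()                 # [(u'First', 0), (u'Second', 1), ...]
       .filter(lambda x: x[1] != 5)    # drop the line whose index is 5
       .map(lambda x: x[0])            # [u'First', u'Second', ...]
       .collect())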
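To see how a global index can coexist with independent partitions, here is a rough pure-Python model of the idea behind zipWithIndex: count the elements of each partition first, turn the counts into starting offsets, then number each partition on its own from its offset. This is an illustrative sketch, not Spark's actual code; `zip_with_index_model` is a hypothetical name and the partitions are plain lists:

```python
from itertools import accumulate


def zip_with_index_model(partitions):
    """Pair every element with a global index, one partition at a time."""
    # First pass: how many elements does each partition hold?
    sizes = [len(part) for part in partitions]
    # Starting offset of a partition = total size of all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Second pass: each partition numbers its own elements from its offset.
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


partitions = [["First", "Second"], ["Third"], ["Fourth", "Fifth"]]
indexed = zip_with_index_model(partitions)
print(indexed)
```

The extra counting pass is why the PySpark documentation notes that zipWithIndex needs to trigger a Spark job when the RDD has more than one partition.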
    
