How to skip more than one line of header in an RDD in Spark

攒了一身酷 2021-01-14 16:15

The data in my first RDD looks like this:

1253
545553
12344896
1 2 1
1 43 2
1 46 1
1 53 2

Now the first 3 integers are some counters that I need to broadcast.
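
Roughly, I want to pull those three counters out, broadcast them, and be left with an RDD of just the data rows. A minimal sketch of the broadcast part (assuming a SparkContext sc and the data saved as file.txt):

    raw = sc.textFile("file.txt")

    # take the first 3 lines as the counters and broadcast them
    counters = [int(x) for x in raw.take(3)]
    counters_bc = sc.broadcast(counters)
    print(counters_bc.value)
    ## [1253, 545553, 12344896]

What I can't figure out is how to skip those first 3 lines when processing the rest of the RDD.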

3 Answers
  • 2021-01-14 16:19

    First take the header values using the take() method, as zero323 suggested:

    raw = sc.textFile("file.txt")
    headers = raw.take(3)
    

    Then

    final_raw = raw.filter(lambda x: x not in headers)
    

    and done.
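
    A quick sanity check (assuming the same file.txt as in the question):

    print(final_raw.take(2))
    ## [u'1 2 1', u'1 43 2']

    Note that this drops any line equal to one of the three header values wherever it appears, which is fine here because the counters don't reappear in the data; the zipWithIndex approach below avoids even that edge case.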

  • 2021-01-14 16:27
    1. Imports for Python 2:

      from __future__ import print_function
      
    2. Prepare dummy data:

      s = "1253\n545553\n12344896\n1 2 1\n1 43 2\n1 46 1\n1 53 2"
      with open("file.txt", "w") as fw: fw.write(s)
      
    3. Read raw input:

      raw = sc.textFile("file.txt")
      
    4. Extract header:

      header = raw.take(3)
      print(header)
      ### [u'1253', u'545553', u'12344896']
      
    5. Filter lines:

      • using zipWithIndex

        content = raw.zipWithIndex().filter(lambda kv: kv[1] > 2).keys()
        print(content.first())
        ## 1 2 1
        
      • using mapPartitionsWithIndex

        from itertools import islice
        
        content = raw.mapPartitionsWithIndex(
            lambda i, iter: islice(iter, 3, None) if i == 0 else iter)
        
        print(content.first())
        ## 1 2 1
        

    NOTE: All credit goes to pzecevic and Sean Owen (see linked sources).
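
    If you then need the integers themselves rather than raw lines, a small follow-up sketch building on the content RDD from step 5 could be:

      # split each remaining "a b c" line into a tuple of ints
      parsed = content.map(lambda line: tuple(int(x) for x in line.split()))
      print(parsed.first())
      ## (1, 2, 1)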

  • 2021-01-14 16:44

    In my case I had a CSV file like the one below:

    ----- HEADER START -----
    We love to generate headers
    #who needs comment char?
    ----- HEADER END -----
    
    colName1,colName2,...,colNameN
    val__1.1,val__1.2,...,val__1.N
    

    It took me a day to figure this out:

    import spark.implicits._ // encoder needed by createDataset below

    val rdd = spark.read.textFile(pathToFile).rdd
      .zipWithIndex()                                                // pair each line with its 0-based index
      .filter { case (line, index) => index >= numberOfLinesToSkip } // drop the header lines
      .map { case (line, index) => line }                            // get rid of the index
    val ds = spark.createDataset(rdd)                                // convert the RDD back to a Dataset[String]
    val df = spark.read.option("inferSchema", "true").option("header", "true").csv(ds) // parse as CSV
    

    Sorry, the code is in Scala, but it can easily be converted to Python.
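
    For reference, a rough PySpark version of the same idea (an untested sketch; pathToFile and numberOfLinesToSkip are the same placeholder names as above):

    raw = spark.read.text(pathToFile).rdd.map(lambda row: row[0])

    content = (raw.zipWithIndex()                                  # pair each line with its index
                  .filter(lambda kv: kv[1] >= numberOfLinesToSkip) # drop the header lines
                  .map(lambda kv: kv[0]))                          # keep only the line

    # PySpark's DataFrameReader.csv also accepts an RDD of strings
    df = (spark.read
               .option("inferSchema", "true")
               .option("header", "true")
               .csv(content))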
