Is there a way to skip/throw-out/ignore records in Spark during a map?

忘了有多久  2021-02-14 04:13

We have a very standard Spark job which reads log files from S3 and then does some processing on them. Very basic Spark stuff...

val logs = sc.textFile(somePat
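
The snippet is cut off above; presumably the job continues roughly along these lines (the path and the tab-split parse are stand-ins for whatever the real job does, not code from the original post):

// Assumed continuation of the truncated snippet; the path is a placeholder.
val somePath = "s3://some-bucket/logs/*.gz"  // hypothetical location
val logs = sc.textFile(somePath)             // one RDD element per log line
val rows = logs.map(_.split("\t"))           // a plain map has no way to skip malformed lines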
2 Answers
  •  臣服心动
    2021-02-14 04:53

    You could make the parser return an Option[Value] instead of a Value. That way you could use flatMap to map the lines to rows and remove those that were invalid.

    In rough outline, something like this:

    def parseLog(line: String): Option[Array[String]] = {
        val splitted = line.split("\t")                  // split the raw line on tabs
        if (validate(splitted)) Some(splitted) else None // None marks an invalid record
    }

    val validRows = logs.flatMap(OurRowObject.parseLog(_)) // flatMap drops the Nones
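
    As a more complete illustration, here is a self-contained sketch you could paste into a Scala file and run locally. The validate rule (exactly three tab-separated fields), the sample lines, the object name, and the local master setting are all invented for the example:

    import org.apache.spark.{SparkConf, SparkContext}

    object SkipBadRecords {
      // Hypothetical rule: a record is valid iff it has exactly three tab-separated fields.
      def validate(fields: Array[String]): Boolean = fields.length == 3

      def parseLog(line: String): Option[Array[String]] = {
        val splitted = line.split("\t")
        if (validate(splitted)) Some(splitted) else None
      }

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("skip-bad-records").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // Stand-in for sc.textFile(...): one valid line, one malformed one.
        val logs = sc.parallelize(Seq(
          "2021-02-14\tGET\t/index.html",
          "garbage line with no tabs"
        ))

        val validRows = logs.flatMap(parseLog(_))  // the malformed line is silently dropped
        validRows.collect().foreach(r => println(r.mkString(" | ")))

        sc.stop()
      }
    }

    This works because Scala implicitly treats an Option as a zero-or-one-element collection, so flatMap flattens each Some(row) into one output row and each None into nothing.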
    
