Does spark-sql support multiple delimiters in the input data?

Happy的楠姐 2021-01-25 15:16

I have input data that uses multiple single-character delimiters, as follows:

col1data1\"col2data1;col3data1\"col4data1
col1data2\"col2data2;col3data2\"col4data2
         


        
1 Answer

别那么骄傲 2021-01-25 16:01

    The answer is no, spark-sql does not support multiple delimiters. One way around this is to read the file into an RDD and then parse it with regular splitting methods:

    import org.apache.spark.rdd.RDD

    val rdd: RDD[String] = ???   // your input loaded as an RDD of lines
    val s = rdd.first()
    // res1: String = "This is one example. This is another"
    

    Let's say you want to split on spaces and periods.

    We can then apply the split to our s value as follows:

    s.split(" |\\.")
    // res2: Array[String] = Array(This, is, one, example, "", This, is, another)
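    If the empty entries produced by consecutive delimiters are unwanted, one option (just a suggestion, not something the question requires) is to filter them out:

    s.split(" |\\.").filter(_.nonEmpty)
    // Array(This, is, one, example, This, is, another)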
    

    Now we can apply the same function to the whole RDD:

    rdd.map(_.split(" |\\."))
    

    An example with your data:

    scala> val s = "col1data1\"col2data1;col3data1\"col4data1"
    scala> s.split(";|\"")
    res4: Array[String] = Array(col1data1, col2data1, col3data1, col4data1)
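    Applied to the whole file, the same idea could look like this; the path is just a placeholder, and spark is assumed to be a SparkSession as in spark-shell:

    // Read the raw lines, then split each line on ';' or '"'.
    val lines = spark.sparkContext.textFile("/path/to/input.txt")
    val parsed = lines.map(_.split(";|\""))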
    

    More on string splitting:

    • A Scala split String example.
    • How to split String in Scala but keep the part matching the regular expression?

    Just remember that anything you can apply to a regular value you can also apply across a whole RDD; after that, all you have to do is convert your RDD to a DataFrame.
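    As a rough sketch of that last step, building on the parsed RDD from the sketch above and assuming the four-column layout from your sample (the column names are made up for the example):

    import spark.implicits._

    // Keep only well-formed rows, turn them into tuples, then into a DataFrame.
    val df = parsed
      .collect { case Array(c1, c2, c3, c4) => (c1, c2, c3, c4) }
      .toDF("col1", "col2", "col3", "col4")

    df.show()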
