Splitting strings in Apache Spark using Scala

独厮守ぢ 2021-02-03 14:24

I have a dataset, which contains lines in the format (tab separated):

Title<\t>Text

Now for every word in Text, I want to c

4 answers
  • The RDD answer that uses .map( line => line.split("\t") ) is not robust enough. It may cause:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18.0 failed 4 times, most recent failure: Lost task 0.3 in stage 18.0 (TID 1485, ip-172-31-113-181.us-west-2.compute.internal, executor 10): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 14

    when the last column is empty. The best solution is explained in "Split 1 column into 3 columns in spark scala".
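
    The failure comes from the fact that Java's String.split(regex) drops trailing empty strings, so a line whose last column is empty yields a shorter array than expected. A minimal sketch of one possible guard, which passes a negative limit to keep trailing empty fields. The object name, helper names and sample line are hypothetical, not taken from the linked answer:

    ```scala
    object SplitTrailingEmpty {
      // split(regex) silently drops trailing empty strings
      def fieldsDropTrailing(line: String): Array[String] = line.split("\t")

      // split(regex, -1) keeps them, so the field count stays stable
      def fieldsKeepTrailing(line: String): Array[String] = line.split("\t", -1)

      def main(args: Array[String]): Unit = {
        val line = "Some title\t" // hypothetical record whose text column is empty
        println(fieldsDropTrailing(line).length) // prints 1
        println(fieldsKeepTrailing(line).length) // prints 2
      }
    }
    ```

    With the negative limit, indexing the second field returns an empty string instead of throwing.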

  • 2021-02-03 15:15

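    Another version, using the DataFrame API:

    // Outside spark-shell, first import org.apache.spark.sql.functions.split and spark.implicits._
    // read into DataFrame
    val viewsDF = spark.read.text("s3n://file.txt")

    // Split on tabs: col1 is the title, col2 is the text
    val splitedViewsDF = viewsDF
      .withColumn("col1", split($"value", "\\t").getItem(0))
      .withColumn("col2", split($"value", "\\t").getItem(1))
      .drop($"value")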
    

    Sample

    scala> val viewsDF=spark.read.text("spark-labs/data/wiki-pageviews.txt")
    viewsDF: org.apache.spark.sql.DataFrame = [value: string]
    
    scala> viewsDF.printSchema
    root
     |-- value: string (nullable = true)
    
    
    scala> viewsDF.limit(5).show
    +------------------+
    |             value|
    +------------------+
    |  aa Main_Page 3 0|
    |  aa Main_page 1 0|
    |  aa User:Savh 1 0|
    |  aa Wikipedia 1 0|
    |aa.b User:Savh 1 0|
    +------------------+
    
    
    scala> val splitedViewsDF = viewsDF.withColumn("col1", split($"value", "\\s+").getItem(0)).withColumn("col2", split($"value", "\\s+").getItem(1)).withColumn("col3", split($"value", "\\s+").getItem(2)).drop($"value")
    splitedViewsDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
    
    scala>
    
    scala> splitedViewsDF.printSchema
    root
     |-- col1: string (nullable = true)
     |-- col2: string (nullable = true)
     |-- col3: string (nullable = true)
    
    
    scala> splitedViewsDF.limit(5).show
    +----+---------+----+
    |col1|     col2|col3|
    +----+---------+----+
    |  aa|Main_Page|   3|
    |  aa|Main_page|   1|
    |  aa|User:Savh|   1|
    |  aa|Wikipedia|   1|
    |aa.b|User:Savh|   1|
    +----+---------+----+
    
    
    scala>
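
    The sample repeats the same split expression for every column. As a hedged sketch, the expression can be built once and reused. The object and method names are hypothetical, and the local SparkSession and in-memory rows stand in for the real file:

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, split}

    object SplitOnce {
      // Build the split expression once and pick items from it.
      def splitColumns(df: DataFrame): DataFrame = {
        val parts = split(col("value"), "\\s+")
        df.select(
          parts.getItem(0).as("col1"),
          parts.getItem(1).as("col2"),
          parts.getItem(2).as("col3")
        )
      }

      def main(args: Array[String]): Unit = {
        // Local session for illustration only
        val spark = SparkSession.builder().master("local[*]").appName("split-once").getOrCreate()
        import spark.implicits._
        val viewsDF = Seq("aa Main_Page 3 0", "aa.b User:Savh 1 0").toDF("value")
        splitColumns(viewsDF).show()
        spark.stop()
      }
    }
    ```

    Using select instead of chained withColumn calls also removes the need to drop the value column afterwards.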
    
  • 2021-02-03 15:18

    Here is how to solve it with the newer DataFrame API. First, read the data using "\t" as the delimiter:

    val df = spark.read
      .option("delimiter", "\t")
      .option("header", false)
      .csv("s3n://file.txt")
      .toDF("title", "text")
    

    Then, split the text column on space and explode to get one word per row.

    val df2 = df.select($"title", explode(split($"text", " ")).as("words"))
    

    Finally, group on the title column and count the number of words for each.

    val countDf = df2.groupBy($"title").agg(count($"words"))
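
    The split, explode and count steps can be tried end to end without the S3 file. This is only a sketch under stated assumptions: the object name, the word_count alias and the in-memory rows are hypothetical, and a local SparkSession replaces the cluster:

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, count, explode, split}

    object WordsPerTitle {
      // One row per word, then one count per title.
      def wordsPerTitle(df: DataFrame): DataFrame =
        df.select(col("title"), explode(split(col("text"), " ")).as("words"))
          .groupBy(col("title"))
          .agg(count(col("words")).as("word_count"))

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("words-per-title").getOrCreate()
        import spark.implicits._
        // Hypothetical rows in place of the tab-separated file
        val df = Seq(("t1", "a b c"), ("t2", "d e")).toDF("title", "text")
        wordsPerTitle(df).show()
        spark.stop()
      }
    }
    ```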
    
  • 2021-02-03 15:21

    In Spark you work with a distributed data structure called an RDD. RDDs provide functionality similar to Scala's collection types.

    val fileRdd = sc.textFile("s3n://file.txt")
    // RDD[ String ]
    
    val splitRdd = fileRdd.map( line => line.split("\t") )
    // RDD[ Array[ String ] ]
    
    val yourRdd = splitRdd.flatMap( arr => {
      val title = arr( 0 )
      val text = arr( 1 )
      val words = text.split( " " )
      words.map( word => ( word, title ) )
    } )
    // RDD[ ( String, String ) ]
    
    // Now, if you want to print this...
    yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
    
    // if you want to count the words per title (repeated words are counted each time)
    val countRdd = yourRdd
      .groupBy( { case ( word, title ) => title } )  // group by title
      .map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
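
    groupBy shuffles every ( word, title ) pair before counting. A possible alternative, sketched here rather than taken from the answer, emits ( title, 1 ) per word and sums with reduceByKey, which combines partial sums on each partition before the shuffle. The object and helper names are hypothetical. Skipping lines with fewer than two fields and ignoring empty tokens are assumptions, and the latter changes the count when the text contains consecutive spaces:

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountPerTitle {
      // Turn one "Title<TAB>Text" line into (title, 1) pairs, one per word.
      def titleOnes(line: String): Seq[(String, Int)] = {
        val arr = line.split("\t", -1) // -1 keeps a trailing empty text field
        if (arr.length < 2) Seq.empty  // assumption: skip malformed lines
        else arr(1).split(" ").filter(_.nonEmpty).map(_ => (arr(0), 1)).toSeq
      }

      def countPerTitle(lines: RDD[String]): RDD[(String, Int)] =
        lines.flatMap(line => titleOnes(line)).reduceByKey(_ + _)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("word-count-per-title"))
        // Hypothetical lines; "t3" has an empty text, so it yields no pairs
        val lines = sc.parallelize(Seq("t1\ta b c", "t2\td e", "t3\t"))
        countPerTitle(lines).collect().foreach(println)
        sc.stop()
      }
    }
    ```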
    