How to output data with Hive-style directory structure in Scalding?

执笔经年 2021-01-21 10:18

We are using Scalding to do ETL and generate the output as a Hive table with partitions. Consequently, we want the directory names for partitions to be something like "state=CA

1 Answer
  • 2021-01-21 11:11

    Sorry, the previous example was pseudocode. Below is a small, complete example together with sample input data.

    Please note that this only works with Scalding version 0.12.0 or above.

    Let's imagine we have the input below, which defines some purchase data:

    user1   1384034400  6   75
    user1   1384038000  6   175
    user2   1383984000  48  3
    user3   1383958800  48  281
    user3   1384027200  9   7
    user3   1384027200  9   11
    user4   1383955200  37  705
    user4   1383955200  37  15
    user4   1383969600  36  41
    user4   1383969600  36  21
    

    The input is tab-separated, and the 3rd column is a state number. Here the states are integers, but the code is easy to adapt to string-based states.

    This code reads the input and writes each row into an output folder named 'State=stateid', one folder per state.

    import com.twitter.scalding._
    import cascading.tap.SinkMode
    
    class TemplatedTsvExample(args: Args) extends Job(args) {
    
      val purchasesPath = args("purchases")
      val outputPath    = args("output")
    
      // Shared input and output schema; define separate ones if they differ.
      val ioSchema = ('USERID, 'TIMESTAMP, 'STATE, 'PURCHASE)
    
      val Purchases =
         Tsv(purchasesPath, ioSchema)
         .read
         // Build the folder name, e.g. "State=6"; change the prefix or the type here as needed.
         .map('STATE -> 'STATENAME) { state: Int => "State=" + state }
         // Optional: groups the rows by folder name before writing.
         .groupBy('STATENAME) { _.pass }
         // "%s" is filled from 'STATENAME; false means no header row; ioSchema lists the fields written.
         .write(TemplatedTsv(outputPath, "%s", 'STATENAME, false, SinkMode.REPLACE, ioSchema))
    }
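
    To check which folder each input row would land in without launching a Hadoop job, the mapping step of the job above can be sketched in plain Scala. This is only an illustration under assumptions: `Purchase`, `PartitionSketch`, `partitionDir`, `parse`, and `bucket` are hypothetical names that are not part of Scalding, and nothing is written to disk.

    ```scala
    // Plain-Scala sketch of the bucketing step (no Scalding or Hadoop required).
    // All names here are illustrative and are not Scalding API.
    final case class Purchase(userId: String, timestamp: Long, state: Int, purchase: Int)

    object PartitionSketch {
      // Builds a Hive-style "column=value" directory name.
      def partitionDir(column: String, value: Any): String = s"$column=$value"

      // Parses one tab-separated input line into a Purchase.
      def parse(line: String): Purchase = {
        val cols = line.split("\t")
        Purchase(cols(0), cols(1).toLong, cols(2).toInt, cols(3).toInt)
      }

      // Groups rows by the directory name they would be written under.
      def bucket(rows: Seq[Purchase]): Map[String, Seq[Purchase]] =
        rows.groupBy(p => partitionDir("State", p.state))

      def main(args: Array[String]): Unit = {
        // A few rows taken from the sample input.
        val lines = Seq(
          "user1\t1384034400\t6\t75",
          "user2\t1383984000\t48\t3",
          "user3\t1384027200\t9\t7"
        )
        // Print each directory name with the users whose rows fall into it.
        bucket(lines.map(parse)).foreach { case (dir, rows) =>
          println(s"$dir -> ${rows.map(_.userId).mkString(", ")}")
        }
      }
    }
    ```

    Calling `partitionDir("state", "CA")` gives the Hive-style name `state=CA` from the question, so in the Scalding job the equivalent change is to swap the prefix and the parameter type in the `map` step.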
    

    I hope this is helpful. Please ask me if anything is not clear.

    You can find full code here.
