We are using Scalding to do ETL and generate the output as a Hive table with partitions. Consequently, we want the directory names for partitions to be something like "state=CA
Sorry, the previous example was pseudocode. Below is a small working example with sample input data.
Please note that this only works with Scalding version 0.12.0 or above.
Let's imagine we have the input below, which defines some purchase data:
user1 1384034400 6 75
user1 1384038000 6 175
user2 1383984000 48 3
user3 1383958800 48 281
user3 1384027200 9 7
user3 1384027200 9 11
user4 1383955200 37 705
user4 1383955200 37 15
user4 1383969600 36 41
user4 1383969600 36 21
The input is tab-separated, and the third column is a state number. The states are integers here, but the code is easy to adapt to string-based states.
The code below reads the input and writes the rows into 'State=stateid' output folder buckets.
    import com.twitter.scalding._
    import cascading.tap.SinkMode

    class TemplatedTsvExample(args: Args) extends Job(args) {
      val purchasesPath = args("purchases")
      val outputPath = args("output")

      // defines both the input and output schema; you can also define a separate one for each
      val ioSchema = ('USERID, 'TIMESTAMP, 'STATE, 'PURCHASE)

      val Purchases =
        Tsv(purchasesPath, ioSchema)
          .read
          // build the bucket name from the state column; change the prefix or type here as needed
          .map('STATE -> 'STATENAME) { state: Int => "State=" + state }
          .groupBy('STATENAME) { _.pass } // this is optional
          // "%s" is filled with 'STATENAME to form the sub-directory; only the ioSchema fields are written
          .write(TemplatedTsv(outputPath, "%s", 'STATENAME, false, SinkMode.REPLACE, ioSchema))
    }
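To make the bucketing easier to picture, here is a rough plain-Scala sketch (no Scalding involved) of which folder name each input row would map to. The names PartitionSketch, Purchase, parse, partitionDir and bucket are made up for this illustration and are not part of Scalding. It assumes the template "%s" is simply filled with the STATENAME value to form a sub-directory under the output path, and it keeps the state as a String so the same logic covers values like "CA".

```scala
// Plain-Scala illustration only: mimics the map + bucket step, not the actual Scalding sink.
object PartitionSketch {
  // Hypothetical row type mirroring the ('USERID, 'TIMESTAMP, 'STATE, 'PURCHASE) schema.
  final case class Purchase(userId: String, timestamp: Long, state: String, purchase: Int)

  // Parse one tab-separated input line into a Purchase.
  def parse(line: String): Purchase = line.split("\t") match {
    case Array(user, ts, state, amount) => Purchase(user, ts.toLong, state, amount.toInt)
    case _ => throw new IllegalArgumentException("Expected 4 tab-separated columns: " + line)
  }

  // Same idea as the map step in the job: derive the folder name from the state value.
  def partitionDir(state: String): String = "State=" + state

  // Group the rows by the folder they would be written under.
  def bucket(lines: Seq[String]): Map[String, Seq[Purchase]] =
    lines.map(parse).groupBy(p => partitionDir(p.state))

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "user1\t1384034400\t6\t75",
      "user1\t1384038000\t6\t175",
      "user2\t1383984000\t48\t3",
      "user3\t1384027200\t9\t7",
      "user4\t1383955200\t37\t705"
    )
    // Print each folder name with the number of rows that fall into it.
    bucket(sample).toSeq.sortBy(_._1).foreach { case (dir, rows) =>
      println(dir + " -> " + rows.size + " row(s)")
    }
  }
}
```

For Hive-style lowercase names such as state=CA, the only change in this sketch would be the prefix in partitionDir; in the Scalding job the equivalent change is in the function passed to map.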
I hope this is helpful. Please ask me if anything is not clear.
You can find the full code here.