scalding

Compress Output Scalding / Cascading TsvCompressed

Submitted by 点点圈 on 2019-12-10 10:47:16
Question: People have been having problems compressing the output of Scalding jobs, myself included. After googling, I get the odd whiff of an answer in some obscure forum, but nothing suitable for people's copy-and-paste needs. I would like an output like Tsv, but one that writes compressed output.

Answer 1: Anyway, after much faffing about I managed to write a TsvCompressed output which seems to do the job (you still need to set the Hadoop job system configuration properties, i.e. set compress to true, ...
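For reference, here is a sketch of what such a source can look like, assuming a Scalding 0.8-era Source API and Cascading 2.x's TextDelimited/TextLine.Compress; constructor overloads vary between versions, so treat this as a starting point rather than a drop-in:

```scala
import cascading.scheme.Scheme
import cascading.scheme.hadoop.{ TextDelimited => CHTextDelimited, TextLine }
import cascading.scheme.local.{ TextDelimited => CLTextDelimited }
import cascading.tuple.Fields
import com.twitter.scalding._
import org.apache.hadoop.mapred.{ JobConf, OutputCollector, RecordReader }

// A Tsv-like source whose Hadoop sink asks Cascading to compress output.
// The codec itself still comes from the job configuration
// (mapred.output.compress=true, mapred.output.compression.codec=...).
case class TsvCompressed(p: String) extends FixedPathSource(p) with DelimitedSchemeCompressed

trait DelimitedSchemeCompressed extends Source {
  val fields: Fields = Fields.ALL
  val types: Array[Class[_]] = null
  val separator = "\t"

  // Local mode: plain tab-delimited text, no compression.
  override def localScheme =
    new CLTextDelimited(fields, separator, types)

  // Hadoop mode: tab-delimited text with sink compression enabled.
  override def hdfsScheme = {
    val scheme = new CHTextDelimited(fields, TextLine.Compress.ENABLE, separator, types)
    scheme.asInstanceOf[Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _]]
  }
}
```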

Alternatives to scalding for HBase access from Scala (or Java)

Submitted by 我是研究僧i on 2019-12-07 13:33:26
Question: Could anybody recommend a good solution (framework) for accessing HBase on a Hadoop cluster from a Scala (or Java) application? For now I'm moving in the Scalding direction. The prototypes I obtained let me combine the Scalding library with Maven and separate the Scalding job JAR from the 'library' code packages. This in turn allowed me to run Scalding-based Hadoop jobs from outside the cluster with minimal per-job overhead ('library' code is pushed to the cluster's distributed cache only when it changes, which is ...
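As a baseline, the plain HBase client API is usable from Scala with no extra framework at all. A minimal read sketch, assuming an HBase 0.94-era client and hypothetical table and column names:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{ Get, HTable }
import org.apache.hadoop.hbase.util.Bytes

object HBaseGetExample {
  def main(args: Array[String]): Unit = {
    // Picks up hbase-site.xml (ZooKeeper quorum etc.) from the classpath.
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "my_table") // hypothetical table name
    try {
      val result = table.get(new Get(Bytes.toBytes("row-1")))
      val raw = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))
      println(Option(raw).map(Bytes.toString).getOrElse("<no value>"))
    } finally {
      table.close()
    }
  }
}
```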

Can I output a collection instead of a tuple in Scalding map method?

Submitted by 南笙酒味 on 2019-12-07 01:48:40
Question: If you want to create a pipe with more than 22 fields from a smaller one in Scalding, you are limited by Scala tuples, which cannot have more than 22 items. Is there a way to use collections instead of tuples? I imagine something like the following example, which sadly doesn't work:

    input.read.mapTo('line -> aLotOfFields) { line: String =>
      (1 to 24).map(_.toString)
    }.write(output)

Answer 1: Actually you can. It's in the FAQ: https://github.com/twitter/scalding/wiki/Frequently-asked-questions#what ...
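The FAQ's workaround is to emit a cascading.tuple.Tuple, which has no arity limit, instead of a Scala tuple. A sketch of the example above rewritten that way (input, output, and aLotOfFields as in the question):

```scala
import cascading.tuple.{ Tuple => CTuple }

input.read
  .mapTo('line -> aLotOfFields) { line: String =>
    // One entry per output field; a Cascading Tuple has no 22-item cap.
    val t = new CTuple()
    (1 to 24).foreach(i => t.add(i.toString))
    t
  }
  .write(output)
```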

Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-05 00:04:03
Question: How can you write to multiple outputs, dependent on the key, using Scalding (/Cascading) in a single map-reduce job? I could of course use .filter for all the possible keys, but that is a horrible hack which will fire up many jobs.

Answer 1: There is TemplatedTsv in Scalding (from version 0.9.0rc16 and up), exactly the same as Cascading's TemplateTsv:

    Tsv(args("input"), ('COUNTRY, 'GDP))
      .read
      .write(TemplatedTsv(args("output"), "%s", 'COUNTRY))
      // it will create a directory for each country under "output" ...

How to declare dependency on Scalding in sbt project?

Submitted by 不羁的心 on 2019-12-03 17:33:44
I am trying to figure out how to create a build.sbt file for my own Scalding-based project. The Scalding source tree has no build.sbt file; instead it has a project/Build.scala build definition. What would be the right way to integrate my own sbt project with Scalding, so that I could also import it later into Eclipse with the sbt-eclipse plugin?

Update: for the following code:

    import cascading.tuple.Fields
    import com.twitter.scalding._

    class Scan(args: Args) extends Job(args) {
      val output = TextLine("tmp/out.txt")
      val wordsList = List(("john"), ("liza"), ("nina"), ("x"))
      val orderedPipe = ...
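A minimal build.sbt sketch; scalding-core is published to Maven Central under these coordinates, but the version numbers here are assumptions, so match them to your Scala version and cluster:

```scala
// build.sbt (sketch; versions are assumptions)
name := "my-scalding-job"

version := "0.1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core" % "0.13.1",
  // Hadoop is provided by the cluster at runtime.
  "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"
)

// Cascading and some transitive artifacts are hosted on conjars.
resolvers += "conjars" at "http://conjars.org/repo"
```

With a plain build.sbt like this, the sbt-eclipse plugin's `eclipse` command can generate the Eclipse project files.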

How to bucket outputs in Scalding

Submitted by 我只是一个虾纸丫 on 2019-12-02 07:06:13
Question: I'm trying to output a pipe into different directories such that the output of each directory is bucketed based on some ids. In plain map-reduce code I would use the MultipleOutputs class and do something like this in the reducer:

    protected void reduce(final SomeKey key,
        final Iterable<SomeValue> values,
        final Context context) {
      ...
      for (SomeValue value : values) {
        String bucketId = computeBucketIdFrom(...);
        multipleOutputs.write(key, value, folderName + "/" + bucketId);
      ...
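In Scalding the same routing can be expressed with TemplatedTsv (see the previous entry). A sketch assuming hypothetical field names and the question's computeBucketIdFrom helper:

```scala
// Sketch: route each row into a per-bucket directory in one job.
pipe
  .map(('key, 'value) -> 'bucketId) { kv: (String, String) =>
    computeBucketIdFrom(kv) // hypothetical helper from the question
  }
  .write(TemplatedTsv(args("output"), "%s", 'bucketId))
```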

How to output data with Hive-style directory structure in Scalding?

Submitted by 江枫思渺然 on 2019-12-02 00:55:34
We are using Scalding to do ETL and to generate the output as a Hive table with partitions. Consequently, we want the directory names for partitions to be something like "state=CA", for example. We are using TemplatedTsv as follows:

    pipe // some other ETL
      .map('STATE -> 'hdfs_state) { state: Int => "State=" + state }
      .groupBy('hdfs_state) { _.pass }
      .write(TemplatedTsv(baseOutputPath, "%s", 'hdfs_state,
        writeHeader = false,
        sinkMode = SinkMode.UPDATE,
        fields = ('all except 'hdfs_state)))

We adopted the code sample from "How to bucket outputs in Scalding" (above). Here are the two issues we have: except can't be ...
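If `except` turns out not to be usable there, one workaround, sketched under the assumption that the remaining field names are known up front ('CITY and 'POPULATION are hypothetical placeholders), is to enumerate the output fields explicitly:

```scala
pipe
  .map('STATE -> 'hdfs_state) { state: Int => "state=" + state }
  .write(TemplatedTsv(baseOutputPath, "%s", 'hdfs_state,
    writeHeader = false,
    sinkMode = SinkMode.UPDATE,
    // Explicit field list instead of ('all except 'hdfs_state).
    fields = ('STATE, 'CITY, 'POPULATION)))
```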

Create Scalding Source like TextLine that combines multiple files into single mappers

Submitted by 时光总嘲笑我的痴心妄想 on 2019-11-30 22:34:59
We have many small files that need combining. In Scalding you can use TextLine to read files as text lines. The problem is that we get one mapper per file, but we want to combine multiple files so that they are processed by a single mapper. I understand we need to change the input format to an implementation of CombineFileInputFormat, and this may involve using Cascading's CombinedHfs. We cannot work out how to do this, but it should be just a handful of lines of code to define our own Scalding source called, say, CombineTextLine. Many thanks to anyone who can provide the code to do this. As a side ...
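One low-code route, sketched under two assumptions (Cascading 2.2+, whose Hfs tap can wrap inputs in a combining format when asked via HfsProps, and a Scalding version where Job exposes an overridable config), is to flip the relevant property for the whole job instead of writing a new source; verify the property name against your Cascading version:

```scala
import com.twitter.scalding._

class CombineSmallFilesJob(args: Args) extends Job(args) {
  // Ask Cascading's Hfs tap to use a combined input format so that
  // many small files can share one mapper (Cascading 2.2+).
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map("cascading.hadoop.hfs.combine.files" -> "true")

  TextLine(args("input"))
    .read
    .write(TextLine(args("output")))
}
```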

Scalding: How to retain the other field, after a groupBy('field){.size}?

Submitted by 二次信任 on 2019-11-30 05:07:18
Question: My input data has two fields/columns, id1 and id2, and my code is the following:

    TextLine(args("input"))
      .read
      .mapTo('line -> ('id1, 'id2)) { line: String =>
        val fields = line.split("\t")
        (fields(0), fields(1))
      }
      .groupBy('id2) { _.size }
      .write(Tsv(args("output")))

The output results in (what I assume are) two fields: id2 * size. I'm a little stuck on finding out whether it is possible to retain the id1 value that was also grouped with id2 and add it as another field.

Answer 1: You can't do this in a nice way. I ...
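One common workaround (a sketch; note that it materializes every id1 of a group in memory, so it only suits modest group sizes) is to carry the id1 values along as a list while counting:

```scala
// Sketch: keep both the group size and the id1 values per id2 group.
.groupBy('id2) { group =>
  group
    .size('count)
    .toList[String]('id1 -> 'id1List) // collects all id1s for this id2
}
```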

uncompress and read gzip file in scala

Submitted by 我的梦境 on 2019-11-27 16:20:34
Question: In Scala, how does one uncompress the text contained in file.gz so that it can be processed? I would be happy either with having the contents of the file stored in a variable, or with saving it as a local file so that it can be read by the program afterwards. Specifically, I am using Scalding to process compressed log data, but Scalding does not define a way to read them in FileSource.scala.

Answer 1: Here's my version:

    import java.io.BufferedReader
    import java.io.InputStreamReader
    import java.util.zip ...
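The snippet is cut off above; the java.util.zip import presumably continues with GZIPInputStream. A self-contained sketch along those lines:

```scala
import java.io.{ BufferedReader, FileInputStream, InputStreamReader }
import java.util.zip.GZIPInputStream

// Sketch: stream a gzipped text file line by line.
def gzipLines(path: String): Iterator[String] = {
  val reader = new BufferedReader(new InputStreamReader(
    new GZIPInputStream(new FileInputStream(path)), "UTF-8"))
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}

// Usage: gzipLines("file.gz").foreach(println)
```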