scalding

Cascading + libjars = ClassNotFoundException. Sometimes

放肆的年华 submitted on 2020-01-06 12:16:40
Question: I am running a Cascading (actually Scalding) Hadoop job that uses DistributedCache for dependent jars. The first time it works fine (meaning that the classpath is set up correctly), but then it starts failing with a ClassNotFoundException:

    java.io.IOException: Split class cascading.tap.hadoop.io.MultiInputSplit not found
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:387)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
        at org.apache.hadoop.mapred.MapTask.run(MapTask

Execution monad

感情迁移 submitted on 2020-01-05 09:10:34
Question: I know that a monad is a general concept. What about the Execution monad? Is it a general concept or a design pattern that can be used outside Scalding too? I have seen that the new version of Scalding has Execution monads.

Answer 1: It is just one specific monad, which is part of Scalding, so it is not a general concept. You could use similar monads in other contexts outside of Scalding, but the exact monad would be different, and the term "execution monad" doesn't seem to be commonly used except to refer
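
For context, a minimal sketch of what Scalding's Execution looks like in use; the paths and data are illustrative assumptions, but writeExecution and flatMap are the Execution-producing APIs, so verify against your Scalding version:

    import com.twitter.scalding._

    object ExecutionSketch {
      // Describe a word count; nothing runs until the Execution is executed.
      val wordCount: Execution[Unit] =
        TypedPipe.from(TextLine("input.txt"))
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .sumByKey
          .toTypedPipe
          .writeExecution(TypedTsv[(String, Long)]("counts.tsv"))

      // Monadic composition: the second step starts only after the first finishes.
      val pipeline: Execution[Unit] =
        wordCount.flatMap { _ =>
          TypedPipe.from(TypedTsv[(String, Long)]("counts.tsv"))
            .filter { case (_, count) => count > 10 }
            .writeExecution(TypedTsv[(String, Long)]("frequent.tsv"))
        }
    }

Running it is a separate step, e.g. pipeline.waitFor(Config.default, Local(true)); deferring execution this way is exactly what makes Execution a value you can compose before anything touches the cluster.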

Create Scalding Source like TextLine that combines multiple files into single mappers

坚强是说给别人听的谎言 submitted on 2019-12-30 06:57:12
Question: We have many small files that need combining. In Scalding you can use TextLine to read files as text lines. The problem is that we get one mapper per file, but we want to combine multiple files so that they are processed by a single mapper. I understand we need to change the input format to an implementation of CombineFileInputFormat, and this may involve using Cascading's CombinedHfs. We cannot work out how to do this, but it should be just a handful of lines of code to define our own Scalding source
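
One hedged sketch of an alternative (an assumption, not the accepted answer): Cascading 2.2+ can combine small files at the Hfs level itself when its combine properties are enabled, which a Scalding Job can switch on by overriding config. The property keys below are believed to come from Cascading's HfsProps; verify them against your Cascading version:

    import com.twitter.scalding._

    class CombineSmallFilesJob(args: Args) extends Job(args) {
      // Ask Cascading's Hfs taps to combine small input files into fewer splits.
      override def config: Map[AnyRef, AnyRef] =
        super.config ++ Map(
          "cascading.hadoop.hfs.combine.files" -> "true",
          // cap each combined split at roughly 128 MB
          "cascading.hadoop.hfs.combine.max.size" -> (128L * 1024 * 1024).toString
        )

      TextLine(args("input"))
        .read
        .write(Tsv(args("output")))
    }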

How to declare dependency on Scalding in sbt project?

徘徊边缘 submitted on 2019-12-21 05:44:12
Question: I am trying to figure out how to create a build.sbt file for my own Scalding-based project. The Scalding source tree has no build.sbt file; instead it has a project/Build.scala build definition. What would be the right way to integrate my own sbt project with Scalding, so that I could also import it later into Eclipse with the sbt-eclipse plugin? Update: For the following code:

    import cascading.tuple.Fields
    import com.twitter.scalding._

    class Scan(args: Args) extends Job(args) {
      val output = TextLine(
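
For reference, a minimal build.sbt along these lines is usually enough; the version numbers are illustrative assumptions and should be pinned to match your Scala version and cluster:

    name := "my-scalding-project"

    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      // Scalding is published to Maven Central under com.twitter
      "com.twitter" %% "scalding-core" % "0.17.4",
      // "provided" keeps Hadoop out of the assembly jar, since the cluster supplies it
      "org.apache.hadoop" % "hadoop-client" % "2.7.3" % "provided"
    )

With that in place, the sbt-eclipse plugin's eclipse task generates the .project and .classpath files that Eclipse needs for the import.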

How to output data with Hive-style directory structure in Scalding?

巧了我就是萌 submitted on 2019-12-20 03:25:12
Question: We are using Scalding to do ETL and to generate the output as a Hive table with partitions. Consequently, we want the directory names for the partitions to be something like "state=CA", for example. We are using TemplatedTsv as follows:

    pipe // some other ETL
      .map('STATE -> 'hdfs_state) { state: Int => "State=" + state }
      .groupBy('hdfs_state) { _.pass }
      .write(TemplatedTsv(baseOutputPath, "%s", 'hdfs_state,
        writeHeader = false, sinkMode = SinkMode.UPDATE,
        fields = ('all except 'hdfs_state)))

We adopt
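
A minimal self-contained sketch of the partition-naming piece, assuming a hypothetical stateName lookup from numeric code to postal abbreviation (and using lowercase "state=" to match Hive's usual convention):

    import com.twitter.scalding._
    import cascading.tap.SinkMode

    class HivePartitionJob(args: Args) extends Job(args) {
      // Hypothetical code-to-abbreviation lookup; replace with your real mapping.
      val stateName: Map[Int, String] = Map(6 -> "CA", 36 -> "NY").withDefault(_.toString)

      Tsv(args("input"), ('STATE, 'city, 'population))
        .read
        .map('STATE -> 'hdfs_state) { code: Int => "state=" + stateName(code) } // e.g. "state=CA"
        .write(TemplatedTsv(args("output"), "%s", 'hdfs_state,
          writeHeader = false, sinkMode = SinkMode.UPDATE,
          fields = ('STATE, 'city, 'population)))
    }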

Transforming matrix format, scalding

家住魔仙堡 submitted on 2019-12-13 01:22:50
Question: OK, so in Scalding we can easily work with a matrix using the Matrix API, like this:

    val matrix = Tsv(path, ('row, 'col, 'val))
      .read
      .toMatrix[Long,Long,Double]('row, 'col, 'val)

But how can I transform a matrix into that (row, col, val) format from the dense format we usually write matrices in? Are there elegant ways to do this? For example, from

    1 2 3
    3 4 5
    5 6 7

to

    1 1 1
    1 2 2
    1 3 3
    2 1 3
    2 2 4
    2 3 5
    3 1 5
    3 2 6
    3 3 7

I need this to operate on matrices of huge size, and I don't know the number of rows and columns (it is
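
A hedged sketch of one way to do this without knowing the dimensions up front: explode each dense line into triples with flatMapTo, using TextLine's 'offset field (the byte offset of the line) as a surrogate row id. The ids are monotonically increasing but not a contiguous 1..n, which is usually acceptable for sparse-matrix operations:

    import com.twitter.scalding._

    class DenseToTriples(args: Args) extends Job(args) {
      TextLine(args("input"))
        .read
        .flatMapTo(('offset, 'line) -> ('row, 'col, 'v)) { in: (Long, String) =>
          val (rowId, line) = in
          line.trim.split("\\s+").zipWithIndex.map { case (cell, j) =>
            (rowId, (j + 1).toLong, cell.toDouble) // 1-based column index
          }.toList
        }
        .write(Tsv(args("output")))
    }

The resulting ('row, 'col, 'v) output can then be fed to toMatrix exactly as in the question.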

Scalding, flatten fields after groupBy

烈酒焚心 submitted on 2019-12-12 22:17:14
Question: I see this: Scalding: How to retain the other field, after a groupBy('field){.size}? It's a real pain and a mess compared to Apache Pig... What am I doing wrong? Can I do the same as Pig's GENERATE(FLATTEN())? I'm confused. Here is my Scalding code:

    def takeTop(topAmount: Int): Pipe = self
      .groupBy(person1) { _.sortedReverseTake[Long](activityCount -> top, topAmount) }
      .flattenTo[(Long, Long, Long)](top -> (person1, person2, activityCount))

And my test: "Take top 3" should "return most active
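
For comparison, a self-contained sketch of that pattern, which is essentially Scalding's analogue of Pig's GENERATE(FLATTEN()): sortedReverseTake collects the k largest tuples per group into a single list-valued field, and flattenTo explodes that list back into rows. Field names follow the question; note the sort key goes first in the tuple, since tuples sort by their first element:

    import com.twitter.scalding._

    class TopPerPerson(args: Args) extends Job(args) {
      Tsv(args("input"), ('person1, 'person2, 'activityCount))
        .read
        .groupBy('person1) {
          // 'top becomes a list of the 3 largest tuples within each person1 group
          _.sortedReverseTake[(Long, Long, Long)](
            ('activityCount, 'person1, 'person2) -> 'top, 3)
        }
        .flattenTo[(Long, Long, Long)]('top -> ('activityCount, 'person1, 'person2))
        .write(Tsv(args("output")))
    }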

Can I run spark unit tests within eclipse

大憨熊 submitted on 2019-12-11 12:16:39
Question: Recently we moved from using Scalding to Spark. I used Eclipse and the Scala IDE for Eclipse to write code and tests. The tests ran fine with Twitter's JobTest class; any class using JobTest would automatically be available to run as a Scala unit test within Eclipse. I've lost that ability now. The Spark test cases are perfectly runnable using sbt, but the run configuration in Eclipse for these tests lists 'none applicable'. Is there a way to run Spark unit tests within Eclipse?

Answer 1: I think
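
One common workaround (an assumption on my part, since the answer above is truncated): annotate the ScalaTest suite with JUnitRunner so that Eclipse's built-in JUnit launcher can discover and run it:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.junit.runner.RunWith
    import org.scalatest.FunSuite
    import org.scalatest.junit.JUnitRunner

    // The JUnitRunner bridge lets Eclipse treat this ScalaTest suite as a JUnit test.
    @RunWith(classOf[JUnitRunner])
    class WordCountSuite extends FunSuite {
      test("counts values with a local SparkContext") {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
        try {
          val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
          assert(counts("a") === 2L)
        } finally sc.stop() // always release the local context
      }
    }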

How to override setup and cleanup methods in spark map function

橙三吉。 submitted on 2019-12-11 06:58:01
Question: Suppose there is the following MapReduce job:

    Mapper:
      setup()   initializes some state
      map()     adds data to the state, no output
      cleanup() outputs the state to the context

    Reducer:
      aggregates all the states into one output

How could such a job be implemented in Spark? Additional question: how could such a job be implemented in Scalding? I'm looking for an example which somehow makes the method overloadings...

Answer 1: Spark's map doesn't provide an equivalent of Hadoop's setup and cleanup. It assumes that each call is independent
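
Continuing in that spirit, a minimal sketch of the usual Spark stand-in, mapPartitions: code before the iterator is consumed plays the role of setup(), folding over the iterator plays map(), and emitting once at the end plays cleanup():

    import org.apache.spark.{SparkConf, SparkContext}

    object SetupCleanupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))
        val perPartitionSums = sc.parallelize(1 to 100, 4).mapPartitions { records =>
          var state = 0L               // "setup": initialize per-partition state
          records.foreach(state += _)  // "map": accumulate into the state, no output
          Iterator(state)              // "cleanup": emit the state once at the end
        }
        println(perPartitionSums.reduce(_ + _)) // "reducer": aggregate the partition states
        sc.stop()
      }
    }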