apache-spark-1.5

Spark 1.5.0 spark.app.id warning

谁说我不能喝 submitted on 2020-01-16 01:52:07
Question: I updated my CDH cluster to use Spark 1.5.0. When I submit a Spark application, the system shows a warning about spark.app.id: "Using default name DAGScheduler for source because spark.app.id is not set." I have searched for spark.app.id but found no documentation about it. I read this link and I think it is used for REST API calls. I don't see this warning in Spark 1.4. Could someone explain it to me and show how to set it? Answer 1: It's not necessarily used for the REST API, but rather for monitoring
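For reference, a minimal Scala sketch of a commonly suggested workaround, which is simply to set spark.app.id yourself before the SparkContext is created; the application name and id values are placeholders, and the warning itself is harmless:

import org.apache.spark.{SparkConf, SparkContext}

// spark.app.id is normally assigned by the cluster manager after the context starts;
// the metrics system registers its sources slightly earlier, hence the warning.
// Setting the id explicitly up front (placeholder value here) silences it.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.app.id", "MyApp")

val sc = new SparkContext(conf)

The same setting can also be passed on the command line, e.g. spark-submit --conf spark.app.id=MyApp.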

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

爱⌒轻易说出口 submitted on 2019-12-17 07:41:51
Question: I have designed a simple job to read data from MySQL and save it to Elasticsearch with Spark. Here is the code: JavaSparkContext sc = new JavaSparkContext( new SparkConf().setAppName("MySQLtoEs") .set("es.index.auto.create", "true") .set("es.nodes", "127.0.0.1:9200") .set("es.mapping.id", "id") .set("spark.serializer", KryoSerializer.class.getName())); SQLContext sqlContext = new SQLContext(sc); // Data source options Map<String, String> options = new HashMap<>(); options.put("driver", MYSQL
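The excerpt is cut off above. As a point of comparison, here is a hedged Scala sketch of the same flow (read over JDBC, write with the elasticsearch-spark connector); the MySQL URL, table, and index names are placeholders, not taken from the question, and the elasticsearch-hadoop artifact for Spark 1.x is assumed to be on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._   // provides saveToEs on DataFrame

val sc = new SparkContext(
  new SparkConf()
    .setAppName("MySQLtoEs")
    .set("es.index.auto.create", "true")
    .set("es.nodes", "127.0.0.1:9200")
    .set("es.mapping.id", "id")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
val sqlContext = new SQLContext(sc)

// Read the source table over JDBC (placeholder URL and table name)
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://localhost/mydb",
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> "mydb.my_table"
)).load()

// Write the DataFrame to Elasticsearch (placeholder index/type)
df.saveToEs("my_index/my_type")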

How to transpose dataframe in Spark 1.5 (no pivot operator available)?

喜夏-厌秋 submitted on 2019-12-08 22:28:28
Question: I want to transpose the following table using Spark Scala without the pivot function. I am using Spark 1.5.1, and the pivot function is not supported in 1.5.1. Please suggest a suitable method to transpose the following table:
Customer Day Sales
1 Mon 12
1 Tue 10
1 Thu 15
1 Fri 2
2 Sun 10
2 Wed 5
2 Thu 4
2 Fri 3
Output table:
Customer Sun Mon Tue Wed Thu Fri
1 0 12 10 0 15 2
2 10 0 0 5 4 3
The following code does not work because I am using Spark 1.5.1 and the pivot function is only available from Spark 1.6: var Trans = Cust
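The excerpt is cut off above. A hedged Scala sketch of one pivot-free approach that works on Spark 1.5 is to build one conditional sum per day value; the DataFrame and column names follow the question, and the day list is written out by hand:

import org.apache.spark.sql.functions._

// One aggregation column per day: sum Sales only where Day matches, else 0
val days = Seq("Sun", "Mon", "Tue", "Wed", "Thu", "Fri")
val aggs = days.map(d =>
  sum(when(col("Day") === d, col("Sales")).otherwise(0)).alias(d))

// Group by customer and apply all the conditional sums at once
val Trans = Cust_Sales
  .groupBy("Customer")
  .agg(aggs.head, aggs.tail: _*)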

Options to read large files (pure text, xml, json, csv) from hdfs in RStudio with SparkR 1.5

廉价感情. submitted on 2019-12-07 05:53:51
Question: I am new to Spark and would like to know whether there are options other than the ones below for reading data stored in HDFS from RStudio using SparkR, or whether I am using them correctly. The data could be of any kind (pure text, CSV, JSON, XML, or any database containing relational tables) and of any size (1 KB to several GB). I know that textFile(sc, path) should no longer be used, but are there other possibilities for reading such data besides the read.df function? The following code uses read.df and

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

回眸只為那壹抹淺笑 submitted on 2019-12-06 10:59:32
Question: Suppose I have a Pipeline like this: val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words") val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features") val idf = new IDF().setInputCol("features").setOutputCol("idffeatures") val nb = new org.apache.spark.ml.classification.NaiveBayes() val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb)) val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures,
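The excerpt is cut off above. A hedged sketch of how the evaluator is typically wired into the CrossValidator on Spark 1.5 follows; the label/prediction column names and the fold count are assumptions, and "precision"/"recall" were valid metric names in the 1.5 ml API (later replaced by "accuracy"):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Drive model selection by precision (could equally be "recall" or "f1")
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("precision")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(evaluator)
  .setNumFolds(3)

val cvModel = cv.fit(trainingData)   // trainingData is a placeholder DataFrame name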

Options to read large files (pure text, xml, json, csv) from hdfs in RStudio with SparkR 1.5

余生颓废 submitted on 2019-12-05 10:54:16
I am new to Spark and would like to know whether there are options other than the ones below for reading data stored in HDFS from RStudio using SparkR, or whether I am using them correctly. The data could be of any kind (pure text, CSV, JSON, XML, or any database containing relational tables) and of any size (1 KB to several GB). I know that textFile(sc, path) should no longer be used, but are there other possibilities for reading such data besides the read.df function? The following code uses read.df and jsonFile, but jsonFile produces an error: Sys.setenv(SPARK_HOME = "C:\\Users\\--\\Downloads\\spark-1.5.0

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

二次信任 submitted on 2019-12-04 18:05:24
Suppose I have a Pipeline like this: val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words") val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features") val idf = new IDF().setInputCol("features").setOutputCol("idffeatures") val nb = new org.apache.spark.ml.classification.NaiveBayes() val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb)) val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build() val cv = new CrossValidator()
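Once the CrossValidator has been fitted (with an evaluator set on it), per-class precision and recall can also be pulled out of its predictions with the older mllib MulticlassMetrics class. A hedged sketch, where trainingData and testData are placeholder DataFrames:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

val cvModel = cv.fit(trainingData)   // assumes cv already has an evaluator and param maps set

// Turn (prediction, label) columns into the RDD[(Double, Double)] that MulticlassMetrics expects
val predictionAndLabels = cvModel.transform(testData)
  .select("prediction", "label")
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach { l =>
  println(s"class $l: precision = ${metrics.precision(l)}, recall = ${metrics.recall(l)}")
}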

Saving / exporting transformed DataFrame back to JDBC / MySQL

﹥>﹥吖頭↗ submitted on 2019-12-04 11:32:33
Question: I'm trying to figure out how to use the new DataFrameWriter to write data back to a JDBC database. I can't seem to find any documentation for this, although looking at the source code it seems like it should be possible. A trivial example of what I'm trying to do looks like this: sqlContext.read.format("jdbc").options(Map( "url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar") ).select("some_column", "another_column") .write.format("jdbc").options(Map( "url" -> "jdbc:mysql://localhost/foo",
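The excerpt is cut off above. A hedged Scala sketch of the same round trip using the DataFrameWriter.jdbc method available in Spark 1.4+/1.5; the credentials and the target table name are placeholders, and the MySQL driver jar is assumed to be on the classpath:

import java.util.Properties

// Connection properties for the write side (placeholder credentials)
val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")

// Read a subset of columns over JDBC
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://localhost/foo",
  "dbtable" -> "foo.bar"
)).load().select("some_column", "another_column")

// Write it back to another table (placeholder name), appending rows
df.write.mode("append").jdbc("jdbc:mysql://localhost/foo", "foo.bar_copy", props)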

“INSERT INTO …” with SparkSQL HiveContext

拥有回忆 submitted on 2019-12-03 14:56:35
Question: I'm trying to run an insert statement with my HiveContext, like this: hiveContext.sql('insert into my_table (id, score) values (1, 10)') The Spark 1.5.2 SQL documentation doesn't explicitly state whether this is supported, although it does support "dynamic partition insertion". This leads to a stack trace like: AnalysisException: Unsupported language features in query: insert into my_table (id, score) values (1, 10) TOK_QUERY 0, 0,20, 0 TOK_FROM 0, -1,20, 0 TOK_VIRTUAL_TABLE 0, -1,20, 0
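A hedged Scala sketch of the usual workaround on 1.5.x: the parser rejects the "INSERT INTO ... VALUES" form (as the exception above shows), but "INSERT INTO TABLE ... SELECT" works, so the new row can be staged in a temporary table first. Table and column names follow the question; the temporary table name is made up:

import hiveContext.implicits._   // hiveContext is the existing HiveContext

// Build a one-row DataFrame with the values to insert and expose it as a temp table
val newRows = Seq((1, 10)).toDF("id", "score")
newRows.registerTempTable("new_rows")

// INSERT ... SELECT is accepted by the HiveContext parser
hiveContext.sql("INSERT INTO TABLE my_table SELECT id, score FROM new_rows")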

How to transpose dataframe in Spark 1.5 (no pivot operator available)?

女生的网名这么多〃 submitted on 2019-11-29 08:05:37
I want to transpose the following table using Spark Scala without the pivot function. I am using Spark 1.5.1, and the pivot function is not supported in 1.5.1. Please suggest a suitable method to transpose the following table:
Customer Day Sales
1 Mon 12
1 Tue 10
1 Thu 15
1 Fri 2
2 Sun 10
2 Wed 5
2 Thu 4
2 Fri 3
Output table:
Customer Sun Mon Tue Wed Thu Fri
1 0 12 10 0 15 2
2 10 0 0 5 4 3
The following code does not work because I am using Spark 1.5.1 and the pivot function is only available from Spark 1.6: var Trans = Cust_Sales.groupBy("Customer").Pivot("Day").sum("Sales") Not sure how efficient that is, but you can use
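The answer is cut off above. One possible approach on Spark 1.5 (not necessarily the one the truncated answer goes on to describe) is to do the pivot at the RDD level, collecting each customer's (Day -> Sales) pairs into a map and emitting one column per day. A hedged sketch, assuming Customer and Sales are integer columns and sqlContext is the existing SQLContext:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val days = Seq("Sun", "Mon", "Tue", "Wed", "Thu", "Fri")

// Group the (Day, Sales) pairs per customer and fill missing days with 0
val rows = Cust_Sales
  .select("Customer", "Day", "Sales")
  .map(r => (r.getInt(0), (r.getString(1), r.getInt(2))))
  .groupByKey()
  .map { case (cust, daySales) =>
    val m = daySales.toMap
    Row.fromSeq(cust +: days.map(d => m.getOrElse(d, 0)))
  }

// Rebuild a DataFrame with one column per day
val schema = StructType(
  StructField("Customer", IntegerType) +: days.map(d => StructField(d, IntegerType)))

val Trans = sqlContext.createDataFrame(rows, schema)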