apache-spark-1.5

Spark 1.5.0 spark.app.id warning

谁说我不能喝 submitted on 2020-01-16 01:52:07
Question: I updated my CDH cluster to use Spark 1.5.0. When I submit a Spark application, the system shows a warning about spark.app.id: "Using default name DAGScheduler for source because spark.app.id is not set." I have searched for spark.app.id but found no documentation about it. I read this link and I think it is used for REST API calls. I don't see this warning in Spark 1.4. Could someone explain it to me and show how to set it? Answer 1: It's not necessarily used for the REST API, but rather for monitoring
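For reference, a minimal Scala sketch of a commonly suggested workaround, which is simply to set spark.app.id yourself before the SparkContext is created; the application name and id values are placeholders, and the warning itself is harmless:

import org.apache.spark.{SparkConf, SparkContext}

// spark.app.id is normally assigned by the cluster manager after the context starts;
// the metrics system registers its sources slightly earlier, hence the warning.
// Setting the id explicitly up front (placeholder value here) silences it.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.app.id", "MyApp")

val sc = new SparkContext(conf)

The same setting can also be passed on the command line, e.g. spark-submit --conf spark.app.id=MyApp.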

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

爱⌒轻易说出口 submitted on 2019-12-17 07:41:51
Question: I have designed a simple job to read data from MySQL and save it to Elasticsearch with Spark. Here is the code: JavaSparkContext sc = new JavaSparkContext( new SparkConf().setAppName("MySQLtoEs") .set("es.index.auto.create", "true") .set("es.nodes", "127.0.0.1:9200") .set("es.mapping.id", "id") .set("spark.serializer", KryoSerializer.class.getName())); SQLContext sqlContext = new SQLContext(sc); // Data source options Map<String, String> options = new HashMap<>(); options.put("driver", MYSQL
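The excerpt is cut off above. As a point of comparison, here is a hedged Scala sketch of the same flow (read over JDBC, write with the elasticsearch-spark connector); the MySQL URL, table, and index names are placeholders, not taken from the question, and the elasticsearch-hadoop artifact for Spark 1.x is assumed to be on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._   // provides saveToEs on DataFrame

val sc = new SparkContext(
  new SparkConf()
    .setAppName("MySQLtoEs")
    .set("es.index.auto.create", "true")
    .set("es.nodes", "127.0.0.1:9200")
    .set("es.mapping.id", "id")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
val sqlContext = new SQLContext(sc)

// Read the source table over JDBC (placeholder URL and table name)
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://localhost/mydb",
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> "mydb.my_table"
)).load()

// Write the DataFrame to Elasticsearch (placeholder index/type)
df.saveToEs("my_index/my_type")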

How to transpose dataframe in Spark 1.5 (no pivot operator available)?

喜夏-厌秋 submitted on 2019-12-08 22:28:28
Question: I want to transpose the following table using Spark Scala without the pivot function. I am using Spark 1.5.1, and the pivot function is not supported in 1.5.1. Please suggest a suitable method to transpose the following table:
Customer Day Sales
1 Mon 12
1 Tue 10
1 Thu 15
1 Fri 2
2 Sun 10
2 Wed 5
2 Thu 4
2 Fri 3
Output table:
Customer Sun Mon Tue Wed Thu Fri
1 0 12 10 0 15 2
2 10 0 0 5 4 3
The following code does not work because I am using Spark 1.5.1 and the pivot function is only available from Spark 1.6: var Trans = Cust
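The excerpt is cut off above. A hedged Scala sketch of one pivot-free approach that works on Spark 1.5 is to build one conditional sum per day value; the DataFrame and column names follow the question, and the day list is written out by hand:

import org.apache.spark.sql.functions._

// One aggregation column per day: sum Sales only where Day matches, else 0
val days = Seq("Sun", "Mon", "Tue", "Wed", "Thu", "Fri")
val aggs = days.map(d =>
  sum(when(col("Day") === d, col("Sales")).otherwise(0)).alias(d))

// Group by customer and apply all the conditional sums at once
val Trans = Cust_Sales
  .groupBy("Customer")
  .agg(aggs.head, aggs.tail: _*)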

Options to read large files (pure text, xml, json, csv) from hdfs in RStudio with SparkR 1.5

廉价感情. submitted on 2019-12-07 05:53:51
Question: I am new to Spark and would like to know whether there are options other than the ones below for reading data stored in HDFS from RStudio using SparkR, or whether I am using them correctly. The data could be of any kind (pure text, CSV, JSON, XML, or any database containing relational tables) and of any size (1 KB to several GB). I know that textFile(sc, path) should no longer be used, but are there other possibilities for reading such data besides the read.df function? The following code uses read.df and

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

回眸只為那壹抹淺笑 submitted on 2019-12-06 10:59:32
Question: Suppose I have a Pipeline like this: val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words") val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features") val idf = new IDF().setInputCol("features").setOutputCol("idffeatures") val nb = new org.apache.spark.ml.classification.NaiveBayes() val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb)) val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures,
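The excerpt is cut off above. A hedged sketch of how the evaluator is typically wired into the CrossValidator on Spark 1.5 follows; the label/prediction column names and the fold count are assumptions, and "precision"/"recall" were valid metric names in the 1.5 ml API (later replaced by "accuracy"):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Drive model selection by precision (could equally be "recall" or "f1")
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("precision")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(evaluator)
  .setNumFolds(3)

val cvModel = cv.fit(trainingData)   // trainingData is a placeholder DataFrame name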

Options to read large files (pure text, xml, json, csv) from hdfs in RStudio with SparkR 1.5

余生颓废 submitted on 2019-12-05 10:54:16
I am new to Spark and would like to know whether there are options other than the ones below for reading data stored in HDFS from RStudio using SparkR, or whether I am using them correctly. The data could be of any kind (pure text, CSV, JSON, XML, or any database containing relational tables) and of any size (1 KB to several GB). I know that textFile(sc, path) should no longer be used, but are there other possibilities for reading such data besides the read.df function? The following code uses read.df and jsonFile, but jsonFile produces an error: Sys.setenv(SPARK_HOME = "C:\\Users\\--\\Downloads\\spark-1.5.0

How to get Precision/Recall using CrossValidator for training NaiveBayes Model using Spark

二次信任 submitted on 2019-12-04 18:05:24
Suppose I have a Pipeline like this: val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words") val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features") val idf = new IDF().setInputCol("features").setOutputCol("idffeatures") val nb = new org.apache.spark.ml.classification.NaiveBayes() val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb)) val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(nb.smoothing, Array(0.01, 0.1, 1)).build() val cv = new CrossValidator()
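Once the CrossValidator has been fitted (with an evaluator set on it), per-class precision and recall can also be pulled out of its predictions with the older mllib MulticlassMetrics class. A hedged sketch, where trainingData and testData are placeholder DataFrames:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

val cvModel = cv.fit(trainingData)   // assumes cv already has an evaluator and param maps set

// Turn (prediction, label) columns into the RDD[(Double, Double)] that MulticlassMetrics expects
val predictionAndLabels = cvModel.transform(testData)
  .select("prediction", "label")
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.labels.foreach { l =>
  println(s"class $l: precision = ${metrics.precision(l)}, recall = ${metrics.recall(l)}")
}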

Saving / exporting transformed DataFrame back to JDBC / MySQL

﹥>﹥吖頭↗ submitted on 2019-12-04 11:32:33
Question: I'm trying to figure out how to use the new DataFrameWriter to write data back to a JDBC database. I can't seem to find any documentation for this, although looking at the source code it seems like it should be possible. A trivial example of what I'm trying to do looks like this: sqlContext.read.format("jdbc").options(Map( "url" -> "jdbc:mysql://localhost/foo", "dbtable" -> "foo.bar") ).select("some_column", "another_column") .write.format("jdbc").options(Map( "url" -> "jdbc:mysql://localhost/foo",
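The excerpt is cut off above. A hedged Scala sketch of the same round trip using the DataFrameWriter.jdbc method available in Spark 1.4+/1.5; the credentials and the target table name are placeholders, and the MySQL driver jar is assumed to be on the classpath:

import java.util.Properties

// Connection properties for the write side (placeholder credentials)
val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")

// Read a subset of columns over JDBC
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://localhost/foo",
  "dbtable" -> "foo.bar"
)).load().select("some_column", "another_column")

// Write it back to another table (placeholder name), appending rows
df.write.mode("append").jdbc("jdbc:mysql://localhost/foo", "foo.bar_copy", props)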

“INSERT INTO …” with SparkSQL HiveContext

拥有回忆 submitted on 2019-12-03 14:56:35
Question: I'm trying to run an insert statement with my HiveContext, like this: hiveContext.sql('insert into my_table (id, score) values (1, 10)') The Spark 1.5.2 SQL documentation doesn't explicitly state whether this is supported, although it does support "dynamic partition insertion". This leads to a stack trace like: AnalysisException: Unsupported language features in query: insert into my_table (id, score) values (1, 10) TOK_QUERY 0, 0,20, 0 TOK_FROM 0, -1,20, 0 TOK_VIRTUAL_TABLE 0, -1,20, 0
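A hedged Scala sketch of the usual workaround on 1.5.x: the parser rejects the "INSERT INTO ... VALUES" form (as the exception above shows), but "INSERT INTO TABLE ... SELECT" works, so the new row can be staged in a temporary table first. Table and column names follow the question; the temporary table name is made up:

import hiveContext.implicits._   // hiveContext is the existing HiveContext

// Build a one-row DataFrame with the values to insert and expose it as a temp table
val newRows = Seq((1, 10)).toDF("id", "score")
newRows.registerTempTable("new_rows")

// INSERT ... SELECT is accepted by the HiveContext parser
hiveContext.sql("INSERT INTO TABLE my_table SELECT id, score FROM new_rows")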

How to transpose dataframe in Spark 1.5 (no pivot operator available)?

女生的网名这么多〃 submitted on 2019-11-29 08:05:37
I want to transpose the following table using Spark Scala without the pivot function. I am using Spark 1.5.1, and the pivot function is not supported in 1.5.1. Please suggest a suitable method to transpose the following table:
Customer Day Sales
1 Mon 12
1 Tue 10
1 Thu 15
1 Fri 2
2 Sun 10
2 Wed 5
2 Thu 4
2 Fri 3
Output table:
Customer Sun Mon Tue Wed Thu Fri
1 0 12 10 0 15 2
2 10 0 0 5 4 3
The following code does not work because I am using Spark 1.5.1 and the pivot function is only available from Spark 1.6: var Trans = Cust_Sales.groupBy("Customer").Pivot("Day").sum("Sales") Not sure how efficient that is, but you can use
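The answer is cut off above. One possible approach on Spark 1.5 (not necessarily the one the truncated answer goes on to describe) is to do the pivot at the RDD level, collecting each customer's (Day -> Sales) pairs into a map and emitting one column per day. A hedged sketch, assuming Customer and Sales are integer columns and sqlContext is the existing SQLContext:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val days = Seq("Sun", "Mon", "Tue", "Wed", "Thu", "Fri")

// Group the (Day, Sales) pairs per customer and fill missing days with 0
val rows = Cust_Sales
  .select("Customer", "Day", "Sales")
  .map(r => (r.getInt(0), (r.getString(1), r.getInt(2))))
  .groupByKey()
  .map { case (cust, daySales) =>
    val m = daySales.toMap
    Row.fromSeq(cust +: days.map(d => m.getOrElse(d, 0)))
  }

// Rebuild a DataFrame with one column per day
val schema = StructType(
  StructField("Customer", IntegerType) +: days.map(d => StructField(d, IntegerType)))

val Trans = sqlContext.createDataFrame(rows, schema)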