apache-spark-2.0

Spark 2.2 Thrift server: NumberFormatException on DataFrame when querying a Hive table

Submitted by 落花浮王杯 on 2019-12-10 11:43:19
Question: I have Hortonworks HDP 2.6.3 running Spark2 (v2.2). My test case is very simple: create a Hive table with some random values (Hive listens on port 10000), start the Spark Thrift server on port 10016, then run pyspark and query the Hive table through port 10016. However, I was unable to get the data from Spark due to a NumberFormatException. Here is my test case. Create the Hive table with sample rows:

beeline> !connect jdbc:hive2://localhost:10000/default hive hive
create table test1 (id int, desc varchar(40));
insert into table …
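For reference, a minimal Scala sketch of the kind of read the question attempts (the original uses pyspark; Scala is used here to match the other snippets). It assumes the Hive JDBC driver org.apache.hive.jdbc.HiveDriver is on the classpath and reuses the table name test1 from the excerpt:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Read the Hive table back through the Spark Thrift server's JDBC endpoint.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://localhost:10016/default")  // Thrift server port from the question
      .option("driver", "org.apache.hive.jdbc.HiveDriver")    // assumed driver class
      .option("dbtable", "test1")
      .option("user", "hive")
      .option("password", "hive")
      .load()

    df.show()  // this is the step that fails with NumberFormatException in the question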

Saving an RDD pair in a particular format in the output file

Submitted by 大城市里の小女人 on 2019-12-10 11:33:10
Question: I have a JavaPairRDD, let's say data, of type <Integer, List<Integer>>. When I do data.saveAsTextFile("output"), the output contains the data in the following format: (1,[1,2,3,4]) etc. I want something like this in the output file instead: 1 1,2,3,4, i.e. 1\t1,2,3,4. Any help would be appreciated. Answer 1: You need to understand what's happening here. You have an RDD[T,U] where T and U are some object types; read it as an RDD of tuples of T and U. When you call saveAsTextFile() on this RDD, it essentially …
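The usual approach (a sketch only, shown in Scala rather than the question's Java) is to map each pair to a tab-separated string before saving, instead of relying on the default tuple toString:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Sample data shaped like the question's <Integer, List<Integer>> pairs.
    val data = sc.parallelize(Seq((1, List(1, 2, 3, 4)), (2, List(5, 6))))

    // Format each pair as "key<TAB>v1,v2,..." so the text output becomes "1\t1,2,3,4".
    data.map { case (k, vs) => s"$k\t${vs.mkString(",")}" }
      .saveAsTextFile("output")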

Apache Spark vs Apache Spark 2 [closed]

Submitted by 无人久伴 on 2019-12-10 01:24:09
Question (closed as needing to be more focused): What improvements does Apache Spark 2 bring compared to Apache Spark 1.x, from an architecture perspective, from an application point of view, or otherwise? Answer 1: The Apache Spark 2.0.0 APIs have stayed largely similar to 1.X, but Spark 2.0.0 does have API-breaking changes. Apache Spark 2.0.0 is the first …

How to mask columns using Spark 2?

Submitted by 喜欢而已 on 2019-12-10 00:30:40
Question: I have some tables in which I need to mask some of the columns. The columns to be masked vary from table to table, and I read them from an application.conf file. For example, take the employee table shown below:

+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1  | abcd | 21  | India   |
+----+------+-----+---------+
| 2  | qazx | 42  | Germany |
+----+------+-----+---------+

If we want to mask the name and age columns, then I get these columns in an …
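One way to apply such a mask (a sketch only; in the question the column names come from application.conf, and the mask value here is an assumption):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val employee = Seq((1, "abcd", 21, "India"), (2, "qazx", 42, "Germany"))
      .toDF("id", "name", "age", "address")

    // In the question these names are read from application.conf; hardcoded here.
    val columnsToMask = Seq("name", "age")

    // Overwrite each listed column with a constant mask value.
    val masked = columnsToMask.foldLeft(employee) { (df, c) =>
      df.withColumn(c, lit("*****"))
    }

    masked.show()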

Spark SQL issue with columns specified

Submitted by 雨燕双飞 on 2019-12-08 21:24:30
We are trying to replicate an Oracle DB into Hive. We get the queries from Oracle and run them in Hive, so we get them in this format: INSERT INTO schema.table(col1,col2) VALUES ('val','val'); While this query works in Hive directly, when I run it through spark.sql I get the following error:

org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'emp_id' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 20)
== SQL ==
insert into ss.tab(emp_id,firstname,lastname) values ('1','demo','demo')
--------------------^^^
at org.apache.spark.sql.catalyst …
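The caret points at the column list after the table name, which the Spark 2.x SQL parser does not accept in INSERT INTO. A hedged sketch of the usual workaround is to rewrite the statement without the column list (this assumes the VALUES are already in the table's column order):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Same statement without the "(emp_id,firstname,lastname)" column list,
    // assuming the values match the table's column order.
    spark.sql("INSERT INTO ss.tab VALUES ('1', 'demo', 'demo')")

    // An equivalent form using SELECT, which the parser also accepts.
    spark.sql("INSERT INTO ss.tab SELECT '1', 'demo', 'demo'")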

Livy Server: return a dataframe as JSON?

Submitted by ∥☆過路亽.° on 2019-12-08 16:33:33
Question: I am executing a statement on the Livy server with an HTTP POST call to localhost:8998/sessions/0/statements, with the following body:

{ "code": "spark.sql(\"select * from test_table limit 10\")" }

I would like an answer in the following format:

(...) "data": { "application/json": "[ {"id": "123", "init_date": 1481649345, ...}, {"id": "133", "init_date": 1481649333, ...}, {"id": "155", "init_date": 1481642153, ...} ]" } (...)

but what I'm getting is:

(...) "data": { "text/plain": "res0: org.apache …
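A sketch of one workaround: inside the submitted code, serialize the rows to a JSON string yourself rather than letting the REPL echo the DataFrame reference as text/plain. Here spark is the SparkSession that Livy pre-binds in the session; whether your Livy version also offers output magics such as %json is worth checking against its documentation:

    // Submitted as the Livy statement's "code": build a JSON array string from the rows.
    val json = spark.sql("select * from test_table limit 10")
      .toJSON                      // Dataset[String], one JSON object per row
      .collect()
      .mkString("[", ",", "]")

    println(json)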

Transforming Spark SQL AST with extraOptimizations

Submitted by 我怕爱的太早我们不能终老 on 2019-12-08 08:49:16
Question: I want to take a SQL string as user input and transform it before execution. In particular, I want to modify the top-level projection (the select clause), injecting additional columns to be retrieved by the query. I was hoping to achieve this by hooking into Catalyst using sparkSession.experimental.extraOptimizations. I know that what I'm attempting isn't strictly speaking an optimisation (the transformation changes the semantics of the SQL statement), but the API still seems suitable.
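For reference, registering an extra rule looks roughly like this (a minimal, do-nothing sketch; a real version would pattern-match on the top-level Project node and add the extra columns there):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // A no-op rule that only prints the plan it is handed.
    object InspectPlan extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = {
        println(s"extraOptimizations saw:\n$plan")
        plan
      }
    }

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    spark.experimental.extraOptimizations = Seq(InspectPlan)

    spark.sql("SELECT 1 AS a").show()  // the rule runs when the query plan is optimized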

Launching Apache Spark SQL jobs from multi-threaded driver

Submitted by 爷,独闯天下 on 2019-12-07 20:58:43
Question: I want to pull data from about 1,500 remote Oracle tables with Spark, and I want a multi-threaded application that picks up one table per thread (or maybe 10 tables per thread) and launches a Spark job to read from its respective tables. From the official Spark documentation, https://spark.apache.org/docs/latest/job-scheduling.html, it is clear that this can work: ...cluster managers that Spark runs on provide facilities for scheduling across applications. Second, within each Spark application, …
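A hedged sketch of that pattern using Scala Futures over a fixed thread pool (the JDBC URL, credentials, table list, and output path are all placeholders). Within one application, the jobs submitted from different threads are handled by Spark's internal scheduler, FIFO by default or FAIR if configured, which is what the quoted documentation refers to:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Placeholders: the real list would hold the ~1500 Oracle table names.
    val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/service"
    val tables  = Seq("EMPLOYEES", "DEPARTMENTS", "LOCATIONS")

    // One thread per concurrent table read; each thread triggers its own Spark jobs.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

    val jobs = tables.map { table =>
      Future {
        spark.read
          .format("jdbc")
          .option("url", jdbcUrl)
          .option("dbtable", table)
          .option("user", "scott")          // placeholder credentials
          .option("password", "tiger")
          .load()
          .write
          .mode("overwrite")
          .parquet(s"/data/oracle/$table")  // placeholder output path
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)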

How to use a Dataset to group by

Submitted by 痴心易碎 on 2019-12-07 05:39:43
Question: I have a requirement that I currently solve with an RDD:

val test = Seq(("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"), ("Houston", "John"),
               ("Detroit", "Michael"), ("Chicago", "Andrew"), ("Detroit", "Peter"), ("Detroit", "George"))
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)

The result is:

(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))

How do I do it using a Dataset with …
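A sketch of two Dataset/DataFrame equivalents of that RDD code:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.collect_list

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val test = Seq(("New York", "Jack"), ("Los Angeles", "Tom"), ("Chicago", "David"),
                   ("Houston", "John"), ("Detroit", "Michael"), ("Chicago", "Andrew"),
                   ("Detroit", "Peter"), ("Detroit", "George"))

    // Typed Dataset API: group by the city and collect the names per group.
    test.toDS()
      .groupByKey(_._1)
      .mapGroups { case (city, rows) => (city, rows.map(_._2).toList) }
      .show(false)

    // Untyped DataFrame API: groupBy + collect_list gives the same shape.
    test.toDF("city", "name")
      .groupBy("city")
      .agg(collect_list("name").as("names"))
      .show(false)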

Out of memory error when reading a large file in Spark 2.1.0

Submitted by 会有一股神秘感。 on 2019-12-07 05:35:56
Question: I want to use Spark to read a large (51 GB) XML file (on an external HDD) into a DataFrame (using the spark-xml plugin), do simple mapping/filtering, reorder it, and then write it back to disk as a CSV file. But I always get java.lang.OutOfMemoryError: Java heap space no matter how I tweak this. I want to understand why increasing the number of partitions doesn't stop the OOM error. Shouldn't it split the task into more parts, so that each individual part is smaller and doesn't cause memory …
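For context, the pipeline being described looks roughly like this sketch (the row tag, column names, and paths are assumptions). Note that the heap sizes themselves are set at submit time, e.g. via --driver-memory and --executor-memory; repartitioning after the read does not raise those limits:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Assumed row tag and path; the 51 GB source file sits on an external HDD.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "row")
      .load("file:///mnt/external/big.xml")

    df.repartition(200)              // more partitions for the downstream work
      .select("id", "desc")          // placeholder mapping/filtering
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("file:///mnt/external/out_csv")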