apache-spark-2.0

Pass system property to spark-submit and read file from classpath or custom path

Submitted by 拈花ヽ惹草 on 2019-12-05 03:27:39
I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and spark-submit). However, the last piece is missing. The issue is that Spark tries very hard not to see logback.xml settings on its classpath. I have already found a way to load it during local execution.

What I have so far: basically, I check for the system property logback.configurationFile, but load logback.xml from my /src/main/resources/ just in case:

// the same as default: https://logback.qos.ch/manual/configuration.html
private val LogbackLocation = Option(System.getProperty("logback…
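A minimal sketch of the fallback idea described above, assuming logback-classic is on the classpath and a logback.xml sits under src/main/resources; the object name LogbackBootstrap and the method setupLogger are hypothetical:

import ch.qos.logback.classic.LoggerContext
import ch.qos.logback.classic.joran.JoranConfigurator
import org.slf4j.LoggerFactory

object LogbackBootstrap {
  // Same property name that logback itself honours by default.
  private val LogbackLocation = Option(System.getProperty("logback.configurationFile"))

  def setupLogger(): Unit = {
    val context = LoggerFactory.getILoggerFactory.asInstanceOf[LoggerContext]
    val configurator = new JoranConfigurator()
    configurator.setContext(context)
    context.reset()

    LogbackLocation match {
      case Some(path) =>
        // Explicit file passed via -Dlogback.configurationFile=...
        configurator.doConfigure(path)
      case None =>
        // Fall back to the logback.xml bundled under src/main/resources
        val resource = getClass.getResourceAsStream("/logback.xml")
        configurator.doConfigure(resource)
    }
  }
}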

Timeout Exception in Apache-Spark during program Execution

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-05 02:40:48
I am running a Bash script on macOS. This script calls a Spark method written in Scala a large number of times; I am currently trying to call it 100,000 times using a for loop. The code exits with the following exception after a small number of iterations, around 3,000:

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc…
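One common mitigation is to raise the heartbeat and network timeouts (both are real Spark configuration keys; the values below are only illustrative). A sketch of how that might be set when building the session:

import org.apache.spark.sql.SparkSession

// Raise the heartbeat interval (default 10s) and the network timeout,
// which must stay larger than the heartbeat interval.
val spark = SparkSession.builder()
  .appName("long-running-loop")
  .config("spark.executor.heartbeatInterval", "60s")
  .config("spark.network.timeout", "600s")
  .getOrCreate()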

Apache Spark vs Apache Spark 2 [closed]

Submitted by 你。 on 2019-12-05 01:05:57
What improvements does Apache Spark 2 bring compared to Apache Spark, from an architecture perspective, from an application point of view, or otherwise?

Apache Spark 2.0.0 APIs have stayed largely similar to 1.x, but Spark 2.0.0 does have API-breaking changes. Apache Spark 2.0.0 is the first release on the 2.x line. The major updates are API usability, SQL 2003 support, performance improvements, structured streaming, R UDF support, as well as operational improvements.

New in Spark 2: the biggest change that I can see is that the Dataset and DataFrame APIs will be merged. The latest and greatest from Spark will…
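A small sketch of what the merged API looks like in practice, assuming a Spark 2.x build (spark-shell style; the Person case class and values are made up for illustration):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

// In 2.x, SparkSession replaces SQLContext/HiveContext as the single entry point,
// and DataFrame is a type alias for Dataset[Row].
val spark = SparkSession.builder()
  .appName("spark2-api-example")
  .enableHiveSupport()   // optional: HiveContext-style behaviour
  .getOrCreate()

import spark.implicits._

case class Person(name: String, age: Int)

val ds: Dataset[Person] = Seq(Person("abcd", 21), Person("qazx", 42)).toDS()
val df: Dataset[Row]    = ds.toDF()   // DataFrame == Dataset[Row] in Spark 2.x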

spark off heap memory config and tungsten

Submitted by 心已入冬 on 2019-12-05 00:37:48
I thought that with the integration of project Tungsten, Spark would automatically use off-heap memory. What are spark.memory.offHeap.size and spark.memory.offHeap.enabled for? Do I need to manually specify the amount of off-heap memory for Tungsten here?

Spark/Tungsten uses Encoders/Decoders to represent JVM objects as highly specialized Spark SQL Types objects, which can then be serialized and operated on in a highly performant way. The internal format representation is highly efficient and friendly to GC memory utilization. Thus, even operating in the default on-heap mode, Tungsten alleviates…
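For reference, off-heap allocation is opt-in. A sketch of enabling it explicitly (both keys are real Spark settings; the size is only an example):

import org.apache.spark.sql.SparkSession

// Tungsten only uses off-heap allocation when you opt in explicitly.
val spark = SparkSession.builder()
  .appName("offheap-example")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")   // must be > 0 when off-heap is enabled
  .getOrCreate()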

How to mask columns using Spark 2?

Submitted by 丶灬走出姿态 on 2019-12-04 21:21:11
I have some tables in which I need to mask some of the columns. The columns to be masked vary from table to table, and I am reading those columns from the application.conf file. For example, for the employee table shown below:

+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1  | abcd | 21  | India   |
+----+------+-----+---------+
| 2  | qazx | 42  | Germany |
+----+------+-----+---------+

if we want to mask the name and age columns, then I get these columns in a sequence:

val mask = Seq("name", "age")

Expected values after masking are:

+----+----------------+-----------…
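One minimal sketch of masking a configurable list of columns with a constant marker, assuming the table has already been loaded into a DataFrame (employeeDf is a hypothetical name; the "*****" marker is just an example):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Column names would normally come from application.conf.
val mask = Seq("name", "age")

def maskColumns(df: DataFrame, columns: Seq[String]): DataFrame =
  columns.foldLeft(df)((acc, c) => acc.withColumn(c, lit("*****")))

val masked = maskColumns(employeeDf, mask)   // employeeDf is assumed to be loaded already
masked.show()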

dynamically bind variable/parameter in Spark SQL?

Submitted by 回眸只為那壹抹淺笑 on 2019-12-04 18:07:39
Question: How do you bind a variable in Apache Spark SQL? For example:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)

Answer 1: Spark SQL (as of the 1.6 release) does not support bind variables.

P.S. What Ashrith is suggesting is not a bind variable: you're constructing a string every time, so every time Spark will parse the query, create an execution plan, etc. The purpose of bind variables (in RDBMS systems, for example) is to…
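For completeness, a sketch of the string-construction workaround the answer refers to (explicitly not a true bind variable; val1 is a hypothetical value and sqlContext is the one from the question):

// Not a bind variable: the query string is rebuilt and re-parsed on each call.
val val1 = "some_value"
sqlContext.sql(s"SELECT * FROM src WHERE col1 = '$val1'").collect().foreach(println)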

How to write Spark Structured Streaming Data into Hive?

Submitted by 自古美人都是妖i on 2019-12-04 13:50:59
How do you write Spark Structured Streaming data into Hive? There is df.write().saveAsTable(tablename), however I am not sure whether it writes streaming data. I normally do df.writeStream().trigger(new ProcessingTime(1000)).foreach(new KafkaSink()).start() to write streaming data into Kafka, but I don't see anything similar for writing streaming data into the Hive data warehouse. Any ideas?

df.createOrReplaceTempView("mytable")
spark.sql("create table mytable as select * from mytable");

or

df.write().mode(SaveMode.Overwrite).saveAsTable("dbName.tableName");

Source: https://stackoverflow.com/questions/45796006/how…
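One common workaround (not an official Hive sink) is to stream Parquet files into a directory that an external Hive table points at. A sketch, assuming Spark 2.2+; the paths are hypothetical:

import org.apache.spark.sql.streaming.Trigger

// Write the streaming DataFrame as Parquet into a directory backing an external Hive table.
val query = df.writeStream
  .format("parquet")
  .option("path", "/warehouse/mydb.db/mytable")          // hypothetical warehouse path
  .option("checkpointLocation", "/checkpoints/mytable")  // required for file sinks
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

query.awaitTermination()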

Apache Spark join with dynamic re-partitioning

Submitted by 五迷三道 on 2019-12-04 06:10:05
Question: I'm trying to do a fairly straightforward join on two tables, nothing complicated. Load both tables, do a join, and update columns, but it keeps throwing an exception. I noticed the task is stuck on the last partition, 199/200, and eventually crashes. My suspicion is that the data is skewed, causing all the data to be loaded into the last partition, 199. SELECT COUNT(DISTINCT report_audit) FROM ReportDs returns 1.5 million, while SELECT COUNT(*) FROM ReportDs returns 57 million. Cluster details: CPU: 40 cores…
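A frequently used remedy for this kind of skew is key salting. A sketch under the assumption that one side of the join is the skewed 57-million-row table; largeDf, smallDf and the bucket count are hypothetical:

import org.apache.spark.sql.functions.{array, explode, floor, lit, rand}

val numBuckets = 20

// Spread rows of the skewed table across numBuckets salted keys.
val saltedLarge = largeDf
  .withColumn("salt", floor(rand() * numBuckets).cast("int"))

// Replicate the other table once per salt value so the salted keys still match.
val saltedSmall = smallDf
  .withColumn("salt", explode(array((0 until numBuckets).map(lit): _*)))

// Join on the original key from the question plus the salt column.
val joined = saltedLarge.join(saltedSmall, Seq("report_audit", "salt"))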

java.lang.IllegalStateException: Error reading delta file, spark structured streaming with kafka

Submitted by 跟風遠走 on 2019-12-03 08:56:10
I am using Structured Streaming + Kafka for real-time data analytics in our project, with Spark 2.2 and Kafka 0.10.2. I am facing an issue during streaming query recovery from a checkpoint at application startup. There are multiple streaming queries derived from a single Kafka streaming source, and each streaming query has a different checkpoint directory. So in case of job failure, when we restart the job, some of the streaming queries fail to recover from their checkpoint location and hence throw an "Error reading delta file" exception. Here are the logs:

Job aborted due to…
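For context, the setup described above usually looks like the sketch below: each query gets its own stable checkpoint directory, which must survive restarts unchanged. The DataFrames, sinks and paths are hypothetical:

// Each streaming query needs its own, stable checkpoint directory; reusing or
// deleting one between restarts is a common cause of "Error reading delta file".
val query1 = parsedDf.writeStream
  .format("parquet")
  .option("path", "/data/output/query1")
  .option("checkpointLocation", "/checkpoints/query1")
  .start()

val query2 = aggregatedDf.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/checkpoints/query2")
  .start()

spark.streams.awaitAnyTermination()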

Apache Spark Dataframe - Load data from nth line of a CSV file

Submitted by 丶灬走出姿态 on 2019-12-02 04:04:51
I would like to process a huge order CSV file (5 GB) with some metadata rows at the start of the file. The header columns are represented in row 4 (starting with "h,"), followed by another metadata row describing optionality. Data rows start with "d,":

m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100

Is it possible to skip a specified number of rows when loading the file and still use the 'inferSchema' option for the Dataset?

Dataset<Row> df = spark.read()
  .format("csv")…
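One possible approach is to read the file as plain text first, keep only the header and data rows, and let the CSV reader infer the schema from what is left. A sketch, assuming Spark 2.2+ (where spark.read.csv accepts a Dataset[String]); the path is hypothetical:

import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("skip-metadata-rows").getOrCreate()
import spark.implicits._

// Load every line of the file as raw text.
val raw: Dataset[String] = spark.read.textFile("/data/orders.csv")

// Keep only the "h," header row and the "d," data rows, stripping their prefixes.
val cleaned: Dataset[String] = raw
  .filter(line => line.startsWith("h,") || line.startsWith("d,"))
  .map(line => line.drop(2))

// Parse the remaining lines as CSV with header and schema inference.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(cleaned)

df.printSchema()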