spark-streaming

How to initialize the spark shell with a specific user to save data to HDFS with Apache Spark

…衆ロ難τιáo~ submitted on 2020-05-17 07:10:14
Question: I'm using Ubuntu, with the Spark dependency in IntelliJ. Typing spark in the shell gives: Command 'spark' not found, but can be installed with: .. I have two users, amine and hadoop_amine (under which Hadoop HDFS is set up). When I try to save a dataframe to HDFS (Spark Scala): procesed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json") I get this error: Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied:
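
A common fix (a sketch, not quoted from the original thread) is to make the shell act as the HDFS-owning user, either by exporting HADOOP_USER_NAME=hadoop_amine before launching spark-shell or by setting the same property from Scala before the first HDFS access; alternatively, loosen the permissions on the target directory with hdfs dfs -chmod/-chown. A minimal Scala sketch, assuming an unsecured (non-Kerberos) cluster and the path from the question:

// Must run before Hadoop's UserGroupInformation is initialized,
// i.e. before the first read/write against HDFS in this session.
System.setProperty("HADOOP_USER_NAME", "hadoop_amine")

procesed.write
  .format("json")
  .save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")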

Dependencies for Spark-Streaming and Twitter-Streaming in SBT

安稳与你 submitted on 2020-05-16 03:11:11
Question: I was trying to use the following dependencies in my build.sbt, but it keeps giving an "unresolved dependency" issue. libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter_2.11" % "2.2.0.1.0.0-SNAPSHOT" libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0" I'm using Spark 2.2.0. What are the correct dependencies? Answer 1: The question was posted a while ago, but I ran into the same problem this week. Here is the solution for those who still have the problem: As
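
For reference, a build.sbt sketch that typically resolves for Spark 2.2.x. With the %% operator sbt appends the Scala suffix itself, so the artifact id should not carry _2.11, and the Bahir version should be a released one rather than a SNAPSHOT (2.2.0 below is an assumption; check Maven Central for the version matching your Spark):

// build.sbt -- versions are assumptions, verify against Maven Central
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"          % "2.2.0" % "provided",
  "org.apache.bahir"  %% "spark-streaming-twitter" % "2.2.0"
)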

How to match an RDD[ParentClass] with RDD[Subclass] in Apache Spark?

余生颓废 submitted on 2020-05-13 07:46:10
Question: I have to match an RDD against its element types. trait Fruit case class Apple(price:Int) extends Fruit case class Mango(price:Int) extends Fruit Now a stream of type DStream[Fruit] is coming in, and each element is either an Apple or a Mango. How do I perform an operation based on the subclass? Something like the below (which doesn't work): dStream.foreachRDD{rdd:RDD[Fruit] => rdd match { case rdd: RDD[Apple] => //do something case rdd: RDD[Mango] => //do something case _ => println(rdd.count() + "<<<< not matched anything") }
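
Because the JVM erases an RDD's type parameter at runtime, case rdd: RDD[Apple] and case rdd: RDD[Mango] compile to the same check, so the match above cannot work. A sketch of the usual alternative is to match on the elements inside the RDD (class names mirror the question):

dStream.foreachRDD { rdd =>
  // Split the RDD[Fruit] by element type instead of matching on the
  // (erased) type parameter of the RDD itself.
  val apples  = rdd.collect { case a: Apple => a }   // RDD[Apple]
  val mangoes = rdd.collect { case m: Mango => m }   // RDD[Mango]

  // do something with each subtype
  println(s"apples=${apples.count()}, mangoes=${mangoes.count()}")
}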

How to calculate the size of a dataframe in Spark Scala

邮差的信 submitted on 2020-05-12 07:57:35
Question: I want to write one large dataframe with repartitioning, so I want to calculate the number of partitions for my source dataframe: numberofpartition = {size of dataframe / default_blocksize}. So please tell me how to calculate the size of a dataframe in Spark Scala. Thanks in advance. Answer 1: Using spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes we can get the size of the actual dataframe once it is loaded into memory; for example you can
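
A sketch of the full calculation (the sizeInBytes expression is the one quoted above; the block-size constant and the ceiling division are assumptions):

// estimated size of the dataframe in bytes, from the optimized logical plan
val sizeInBytes = spark.sessionState
  .executePlan(df.queryExecution.logical)
  .optimizedPlan
  .stats(spark.sessionState.conf)
  .sizeInBytes

// assumed HDFS default block size of 128 MB; adjust to your cluster's dfs.blocksize
val defaultBlockSize = 128L * 1024 * 1024

// ceiling division, with at least one partition
val numberOfPartitions =
  math.max(1, ((sizeInBytes.toLong + defaultBlockSize - 1) / defaultBlockSize).toInt)

df.repartition(numberOfPartitions)
  .write.format("parquet")
  .save("/tmp/output")   // illustrative output path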

Queries with streaming sources must be executed with writeStream.start();

我与影子孤独终老i submitted on 2020-05-10 07:24:28
Question: I'm trying to read messages from Kafka (version 10) in Spark and print them. import spark.implicits._ val spark = SparkSession .builder .appName("StructuredNetworkWordCount") .config("spark.master", "local") .getOrCreate() val ds1 = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", "localhost:9092") .option("subscribe", "topicA") .load() ds1.collect.foreach(println) ds1.writeStream .format("console") .start() ds1.printSchema() I'm getting an error: Exception in thread
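
The exception is raised because collect is a batch action and cannot be applied to a streaming DataFrame; a streaming query may only be materialized through writeStream.start(). A sketch of the corrected flow (assumes the spark-sql-kafka-0-10 artifact is on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredNetworkWordCount")
  .config("spark.master", "local")
  .getOrCreate()

val ds1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicA")
  .load()

// cast the binary key/value columns so the console output is readable
val messages = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

messages.printSchema()             // fine: the schema is known before the query starts

val query = messages.writeStream   // no collect/foreach on a streaming DataFrame
  .format("console")
  .start()

query.awaitTermination()           // keep the application running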

Spark Streaming checkpoint recovery is very, very slow

此生再无相见时 submitted on 2020-05-10 07:23:07
Question: Goal: read from Kinesis and store the data in S3 in Parquet format via Spark Streaming. Situation: the application runs fine initially, running 1-hour batches with an average processing time of less than 30 minutes. Now suppose the application crashes and we try to restart from the checkpoint. The processing now takes forever and does not move forward. We tried the same thing at a batch interval of 1 minute; the processing runs fine and takes 1.2 minutes for a batch to
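
For context, the standard DStream checkpoint-recovery pattern the question is describing looks roughly like the sketch below (paths, names, and the batch interval are assumptions); on restart, getOrCreate rebuilds the context and all pending batches from the checkpoint, which is where the slow recovery is observed:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "s3a://my-bucket/checkpoints"   // hypothetical location

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("KinesisToParquet")
  val ssc  = new StreamingContext(conf, Seconds(3600))   // 1-hour batches, as in the question
  ssc.checkpoint(checkpointDir)
  // ... build the Kinesis DStream and write each batch out as Parquet here ...
  ssc
}

// Recover from the checkpoint if one exists, otherwise create a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()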

How to define Kafka (data source) dependencies for Spark Streaming?

对着背影说爱祢 submitted on 2020-05-08 08:11:14
Question: I'm trying to consume a Kafka 0.8 topic using spark-streaming 2.0.0 and to identify the required dependencies. I have tried using these dependencies in my build.sbt file: libraryDependencies += "org.apache.spark" %% "spark-streaming_2.11" % "2.0.0" When I run sbt package I get unresolved dependencies for all three of these jars, but the jars do exist: https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8_2.11/2.0.0 Please help in debugging this issue; I'm new
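
For what it's worth, a build.sbt sketch for Spark 2.0.0 with the Kafka 0.8 connector. Note that %% already appends the Scala suffix, so combining it with an explicit _2.11 in the artifact name is a frequent cause of "unresolved dependency":

// build.sbt -- a sketch; verify versions against Maven Central
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming"           % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
)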

How to handle this use-case (running-window data) in Spark

一曲冷凌霜 submitted on 2020-04-18 06:10:43
Question: I am using spark-sql-2.4.1v with Java 1.8. I have source data as below: val df_data = Seq( ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"), ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"), ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"), ("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"), ("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"), ("Indus_3","Indus_3_Name","Country1", "State3"
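
The excerpt is cut off before the expected output, but "running-window data" usually maps to a window function over a date-ordered frame. A sketch under that assumption (column names are guesses inferred from the sample rows):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._   // needed for .toDF on the local Seq

val df = df_data.toDF("industry_id", "industry_name", "country", "state", "amount", "date")

// cumulative ("running") sum per industry, ordered by date
val w = Window
  .partitionBy("industry_id")
  .orderBy(col("date"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("running_amount", sum("amount").over(w))
  .orderBy("industry_id", "date")
  .show(false)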