spark-streaming

How to initialize the spark shell with a specific user to save data to HDFS with Apache Spark

…衆ロ難τιáo~ submitted on 2020-05-17 07:10:14
Question: I'm using Ubuntu, with the Spark dependency in IntelliJ. Typing spark in the shell gives: Command 'spark' not found, but can be installed with: .. I have two users, amine and hadoop_amine (under which Hadoop HDFS is set up). When I try to save a dataframe to HDFS (Spark Scala): procesed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json") I get this error: Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied:
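
A common fix (a sketch, not quoted from the original thread) is to make the shell act as the HDFS-owning user, either by exporting HADOOP_USER_NAME=hadoop_amine before launching spark-shell or by setting the same property from Scala before the first HDFS access; alternatively, loosen the permissions on the target directory with hdfs dfs -chmod/-chown. A minimal Scala sketch, assuming an unsecured (non-Kerberos) cluster and the path from the question:

// Must run before Hadoop's UserGroupInformation is initialized,
// i.e. before the first read/write against HDFS in this session.
System.setProperty("HADOOP_USER_NAME", "hadoop_amine")

procesed.write
  .format("json")
  .save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")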

Dependencies for Spark-Streaming and Twitter-Streaming in SBT

安稳与你 submitted on 2020-05-16 03:11:11
Question: I was trying to use the following dependencies in my build.sbt, but it keeps giving an "unresolved dependency" issue. libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter_2.11" % "2.2.0.1.0.0-SNAPSHOT" libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0" I'm using Spark 2.2.0. What are the correct dependencies? Answer 1: The question was posted a while ago, but I ran into the same problem this week. Here is the solution for those who still have the problem: As
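
For reference, a build.sbt sketch that typically resolves for Spark 2.2.x. With the %% operator sbt appends the Scala suffix itself, so the artifact id should not carry _2.11, and the Bahir version should be a released one rather than a SNAPSHOT (2.2.0 below is an assumption; check Maven Central for the version matching your Spark):

// build.sbt -- versions are assumptions, verify against Maven Central
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"          % "2.2.0" % "provided",
  "org.apache.bahir"  %% "spark-streaming-twitter" % "2.2.0"
)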

How to match an RDD[ParentClass] with RDD[Subclass] in Apache Spark?

余生颓废 submitted on 2020-05-13 07:46:10
Question: I have to match an RDD against its element types. trait Fruit case class Apple(price:Int) extends Fruit case class Mango(price:Int) extends Fruit Now a stream of type DStream[Fruit] is coming in, and each element is either an Apple or a Mango. How do I perform an operation based on the subclass? Something like the below (which doesn't work): dStream.foreachRDD{rdd:RDD[Fruit] => rdd match { case rdd: RDD[Apple] => //do something case rdd: RDD[Mango] => //do something case _ => println(rdd.count() + "<<<< not matched anything") }
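
Because the JVM erases an RDD's type parameter at runtime, case rdd: RDD[Apple] and case rdd: RDD[Mango] compile to the same check, so the match above cannot work. A sketch of the usual alternative is to match on the elements inside the RDD (class names mirror the question):

dStream.foreachRDD { rdd =>
  // Split the RDD[Fruit] by element type instead of matching on the
  // (erased) type parameter of the RDD itself.
  val apples  = rdd.collect { case a: Apple => a }   // RDD[Apple]
  val mangoes = rdd.collect { case m: Mango => m }   // RDD[Mango]

  // do something with each subtype
  println(s"apples=${apples.count()}, mangoes=${mangoes.count()}")
}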

How to calculate the size of a dataframe in Spark Scala

邮差的信 submitted on 2020-05-12 07:57:35
Question: I want to write one large dataframe with repartitioning, so I want to calculate the number of partitions for my source dataframe: numberofpartition = {size of dataframe / default_blocksize}. So please tell me how to calculate the size of a dataframe in Spark Scala. Thanks in advance. Answer 1: Using spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes we can get the size of the actual dataframe once it is loaded into memory; for example you can
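
A sketch of the full calculation (the sizeInBytes expression is the one quoted above; the block-size constant and the ceiling division are assumptions):

// estimated size of the dataframe in bytes, from the optimized logical plan
val sizeInBytes = spark.sessionState
  .executePlan(df.queryExecution.logical)
  .optimizedPlan
  .stats(spark.sessionState.conf)
  .sizeInBytes

// assumed HDFS default block size of 128 MB; adjust to your cluster's dfs.blocksize
val defaultBlockSize = 128L * 1024 * 1024

// ceiling division, with at least one partition
val numberOfPartitions =
  math.max(1, ((sizeInBytes.toLong + defaultBlockSize - 1) / defaultBlockSize).toInt)

df.repartition(numberOfPartitions)
  .write.format("parquet")
  .save("/tmp/output")   // illustrative output path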

Queries with streaming sources must be executed with writeStream.start();

我与影子孤独终老i submitted on 2020-05-10 07:24:28
Question: I'm trying to read messages from Kafka (version 10) in Spark and print them. import spark.implicits._ val spark = SparkSession .builder .appName("StructuredNetworkWordCount") .config("spark.master", "local") .getOrCreate() val ds1 = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", "localhost:9092") .option("subscribe", "topicA") .load() ds1.collect.foreach(println) ds1.writeStream .format("console") .start() ds1.printSchema() I'm getting an error: Exception in thread
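
The exception is raised because collect is a batch action and cannot be applied to a streaming DataFrame; a streaming query may only be materialized through writeStream.start(). A sketch of the corrected flow (assumes the spark-sql-kafka-0-10 artifact is on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredNetworkWordCount")
  .config("spark.master", "local")
  .getOrCreate()

val ds1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicA")
  .load()

// cast the binary key/value columns so the console output is readable
val messages = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

messages.printSchema()             // fine: the schema is known before the query starts

val query = messages.writeStream   // no collect/foreach on a streaming DataFrame
  .format("console")
  .start()

query.awaitTermination()           // keep the application running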

Spark Streaming checkpoint recovery is very, very slow

此生再无相见时 submitted on 2020-05-10 07:23:07
Question: Goal: read from Kinesis and store the data in S3 in Parquet format via Spark Streaming. Situation: the application runs fine initially, running 1-hour batches with an average processing time of less than 30 minutes. Now suppose the application crashes and we try to restart from the checkpoint. The processing now takes forever and does not move forward. We tried the same thing at a batch interval of 1 minute; the processing runs fine and takes 1.2 minutes for a batch to
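
For context, the standard DStream checkpoint-recovery pattern the question is describing looks roughly like the sketch below (paths, names, and the batch interval are assumptions); on restart, getOrCreate rebuilds the context and all pending batches from the checkpoint, which is where the slow recovery is observed:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "s3a://my-bucket/checkpoints"   // hypothetical location

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("KinesisToParquet")
  val ssc  = new StreamingContext(conf, Seconds(3600))   // 1-hour batches, as in the question
  ssc.checkpoint(checkpointDir)
  // ... build the Kinesis DStream and write each batch out as Parquet here ...
  ssc
}

// Recover from the checkpoint if one exists, otherwise create a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()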

How to define Kafka (data source) dependencies for Spark Streaming?

对着背影说爱祢 submitted on 2020-05-08 08:11:14
Question: I'm trying to consume a Kafka 0.8 topic using spark-streaming 2.0.0 and to identify the required dependencies. I have tried using these dependencies in my build.sbt file: libraryDependencies += "org.apache.spark" %% "spark-streaming_2.11" % "2.0.0" When I run sbt package I get unresolved dependencies for all three of these jars, but the jars do exist: https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8_2.11/2.0.0 Please help in debugging this issue; I'm new
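
For what it's worth, a build.sbt sketch for Spark 2.0.0 with the Kafka 0.8 connector. Note that %% already appends the Scala suffix, so combining it with an explicit _2.11 in the artifact name is a frequent cause of "unresolved dependency":

// build.sbt -- a sketch; verify versions against Maven Central
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming"           % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
)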

How to handle this use-case (running-window data) in Spark

一曲冷凌霜 submitted on 2020-04-18 06:10:43
Question: I am using spark-sql-2.4.1v with Java 1.8. I have source data as below: val df_data = Seq( ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"), ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"), ("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"), ("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"), ("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"), ("Indus_3","Indus_3_Name","Country1", "State3"
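
The excerpt is cut off before the expected output, but "running-window data" usually maps to a window function over a date-ordered frame. A sketch under that assumption (column names are guesses inferred from the sample rows):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._   // needed for .toDF on the local Seq

val df = df_data.toDF("industry_id", "industry_name", "country", "state", "amount", "date")

// cumulative ("running") sum per industry, ordered by date
val w = Window
  .partitionBy("industry_id")
  .orderBy(col("date"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("running_amount", sum("amount").over(w))
  .orderBy("industry_id", "date")
  .show(false)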