spark-streaming

Real-time log processing using Apache Spark Streaming

淺唱寂寞╮ submitted on 2019-12-20 12:38:54

Question: I want to create a system where I can read logs in real time and use Apache Spark to process them. I am confused about whether I should use something like Kafka or Flume to pass the logs to the Spark stream, or whether I should pass the logs using sockets. I have gone through a sample program in the Spark Streaming documentation (the Spark stream example), but I would be grateful if someone could point me to a better way to pass logs to Spark Streaming. It's new turf for me.

Answer 1: Apache Flume may help to read the logs in
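The answer excerpt above is cut off. As a rough sketch of the simplest way to feed logs into Spark Streaming without Kafka or Flume (assuming the logs are piped to a local TCP socket, e.g. with tail -f app.log | nc -lk 9999; host, port, and the ERROR filter are placeholders), something like the following Scala program could work:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LogStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LogStream").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Read lines pushed to a local TCP socket.
        val logLines = ssc.socketTextStream("localhost", 9999)

        // Count ERROR lines in each 5-second batch and print the result.
        logLines.filter(_.contains("ERROR")).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

For anything beyond a prototype, a receiver-based source such as Flume or a Kafka topic is the more robust choice, since a plain socket offers no buffering or replay if the streaming job falls behind.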

Drop Spark dataframe from cache

≯℡__Kan透↙ submitted on 2019-12-20 11:14:42

Question: I am using Spark 1.3.0 with the Python API. While transforming huge dataframes, I cache many DFs for faster execution: df1.cache() df2.cache() Once a certain dataframe is no longer needed, how can I drop the DF from memory (or un-cache it)? For example, df1 is used throughout the code, while df2 is utilized for only a few transformations and is never needed after that. I want to forcefully drop df2 to release more memory space.

Answer 1: Just do the following: df1.unpersist() df2.unpersist(
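The answer is truncated above. A minimal sketch of the cache/unpersist lifecycle, written here in Scala against a newer API than the Spark 1.3.0 mentioned in the question (the call has the same name in the Python API; the input path and column are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CacheDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = spark.read.parquet("/data/big_table")            // placeholder input
    val df2 = df1.filter($"status" === "active")

    df1.cache()
    df2.cache()
    df2.count()                       // an action materialises the cached blocks

    // ... transformations that reuse df2 ...

    // df2 is no longer needed: drop its blocks from memory/disk.
    // blocking = true waits until the blocks are actually freed.
    df2.unpersist(blocking = true)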

Spark Task Memory allocation

妖精的绣舞 submitted on 2019-12-20 10:57:49

Question: I am trying to find out the best way to configure the memory on the nodes of my cluster. However, I believe there are some things I need to understand further for that, such as how Spark handles memory across tasks. For example, let's say I have 3 executors, and each executor can run up to 8 tasks in parallel (i.e. 8 cores). If I have an RDD with 24 partitions, this means that theoretically all partitions can be processed in parallel. However, if we zoom into one executor here, this assumes
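The question text is cut off. As a hedged illustration of the knobs involved, the sizing described (3 executors, 8 cores each) could be expressed in a SparkConf like the one below; the memory values are placeholders. Roughly speaking, the execution-memory pool of one executor is shared dynamically by the tasks running concurrently on it, so with 8 concurrent tasks each task can count on only a fraction of the executor heap, not all of it:

    import org.apache.spark.SparkConf

    // Hypothetical sizing: 3 executors x 8 cores, 16g heap per executor.
    val conf = new SparkConf()
      .setAppName("MemorySizingDemo")
      .set("spark.executor.instances", "3")
      .set("spark.executor.cores", "8")
      .set("spark.executor.memory", "16g")
      .set("spark.memory.fraction", "0.6")   // share of heap used for execution + storage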

Connection pooling in a streaming pyspark application

烂漫一生 submitted on 2019-12-20 04:15:48

Question: What is the proper way of using connection pools in a streaming pyspark application? I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-session-object-created-in-fo.html and understand that the proper way is to use a singleton for Scala/Java. Is this possible in Python? A small code example would be greatly appreciated. I believe creating a connection per partition will be very inefficient for a streaming application.

Answer 1: Long story short, connection pools will be
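The answer is cut off above. The singleton pattern referenced in the linked thread can be sketched in Scala as follows (the question asks about Python, where a lazily created module-level pool plays the same role); the JDBC URL and credentials are illustrative, and a real pool such as HikariCP could replace the single connection:

    import java.sql.{Connection, DriverManager}

    // One lazily initialised connection per executor JVM, reused across batches.
    object ConnectionPool {
      lazy val connection: Connection =
        DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password")
    }

    // Inside the streaming job:
    // dstream.foreachRDD { rdd =>
    //   rdd.foreachPartition { records =>
    //     val conn = ConnectionPool.connection   // created once per executor, not per record
    //     records.foreach { r => /* write r using conn */ }
    //   }
    // }

Because the object is initialised on the executor the first time it is referenced, the connection is never serialised from the driver, which is what usually breaks naive attempts to share a connection.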

How to pass column names in selectExpr through one or more string parameters in spark using scala?

☆樱花仙子☆ submitted on 2019-12-20 04:06:27

Question: I am using a script for CDC Merge in Spark Streaming. I wish to pass column values to selectExpr through a parameter, as the column names for each table would change. When I pass the columns and struct field through a string variable, I get the error ==> mismatched input ',' expecting. Below is the piece of code I am trying to parameterize: var filteredMicroBatchDF = microBatchOutputDF .selectExpr("col1","col2","struct(offset,KAFKA_TS) as otherCols" ) .groupBy("col1","col2").agg(max("otherCols")
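The code excerpt is truncated. A common cause of the "mismatched input ','" parser error is passing all the columns to selectExpr as one comma-joined string; a sketch of the usual fix is to keep the expressions in a Seq[String] and splat it with : _*. Column names are illustrative and microBatchOutputDF is assumed to be the DataFrame from the question:

    import org.apache.spark.sql.functions.{col, max}

    // Column lists supplied per table, e.g. from configuration.
    val keyCols   = Seq("col1", "col2")
    val otherCols = "struct(offset, KAFKA_TS) as otherCols"
    val exprs: Seq[String] = keyCols :+ otherCols

    val filteredMicroBatchDF = microBatchOutputDF
      .selectExpr(exprs: _*)               // varargs, not one big comma-joined string
      .groupBy(keyCols.map(col): _*)
      .agg(max("otherCols"))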

Prepare batch statement to store all the RDDs generated from Spark Streaming into MySQL

流过昼夜 submitted on 2019-12-20 03:55:27

Question: I am trying to insert the batch RDDs generated from a DStream by Spark Streaming into MySQL. The following code works fine, but the problem is that I am creating one connection for storing each tuple. To avoid that, I created the connection outside the foreachRDD, but it gave me the following error: Code: realTimeAgg.foreachRDD { x => if (x.toLocalIterator.nonEmpty) { x.foreachPartition { it => val conn = DriverManager.getConnection("jdbc:mysql://IP:Port/DbName", "UserName", "Password") val
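The code and error message are truncated above. Connections are not serialisable, which is what typically fails when one is created outside foreachRDD on the driver; a common pattern is one connection and one batched PreparedStatement per partition. The sketch below assumes realTimeAgg is a DStream of (String, Long) pairs and uses a placeholder table and schema; adjust the pattern match and SQL to the real data:

    import java.sql.DriverManager

    realTimeAgg.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        if (partition.nonEmpty) {
          // One connection per partition, created on the executor.
          val conn = DriverManager.getConnection("jdbc:mysql://IP:Port/DbName", "UserName", "Password")
          val stmt = conn.prepareStatement("INSERT INTO my_table (k, v) VALUES (?, ?)")
          partition.foreach { case (k, v) =>
            stmt.setString(1, k)
            stmt.setLong(2, v)
            stmt.addBatch()
          }
          stmt.executeBatch()   // one round trip per partition instead of per tuple
          stmt.close()
          conn.close()
        }
      }
    }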

Print the content of streams (Spark streaming) in Windows system

廉价感情. submitted on 2019-12-20 03:33:16

Question: I just want to print the content of the streams to the console. I wrote the following code, but it does not print anything. Can anyone help me read a text file as a stream in Spark? Is there a problem related to the Windows system? public static void main(String[] args) throws Exception { SparkConf sparkConf = new SparkConf().setAppName("My app") .setMaster("local[2]") .setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6") .set("spark.executor.memory", "2g"); JavaStreamingContext jssc = new
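The Java code above is cut off. As a minimal sketch of watching a directory as a text stream and printing it (written in Scala for brevity; the directory path is a placeholder), note two common pitfalls on Windows: textFileStream only picks up files that are moved into the watched directory after the context has started, and a HADOOP_HOME with winutils.exe is commonly needed for local filesystem access:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("My app").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Only files created/moved into this directory *after* start() are picked up.
    val lines = ssc.textFileStream("C:/logs/incoming")

    lines.print()   // prints the first 10 elements of each batch to the console

    ssc.start()
    ssc.awaitTermination()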

Scala fat jar dependency issue when submitting a job

岁酱吖の submitted on 2019-12-20 03:04:56

Question: I have written a simple Kafka stream using Scala. It works fine locally. I built a fat jar and submitted it to the cluster, and I get a class-not-found error after submitting the job. If I extract the fat jar, all the dependencies are inside it. Why am I getting a class-not-found error, and how do I solve it? Note: if I deploy (copy) the fat jar into the Spark/jars folder manually, I don't see any issue, but that is not the correct approach. I am using Windows 7 and running the master and worker node on
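The question text is truncated. One frequent cause of ClassNotFoundException with fat jars is bundling a Spark or Scala version different from the one the cluster runs. A common approach, sketched below for a build assembled with the sbt-assembly plugin (artifact names are real, versions are placeholders), is to mark Spark itself as provided so only the application's own dependencies, such as the Kafka connector, travel inside the fat jar:

    // build.sbt (sketch; versions are illustrative)
    name := "kafka-stream-job"
    scalaVersion := "2.11.12"   // must match the Scala version Spark was built with

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                 % "2.4.8" % "provided",
      "org.apache.spark" %% "spark-streaming"            % "2.4.8" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.8"   // shipped inside the fat jar
    )

The assembled jar is then passed to spark-submit with --class pointing at the main class, so Spark's own classes come from the cluster installation rather than from the jar.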

How to filter dstream using transform operation and external RDD?

大憨熊 submitted on 2019-12-20 01:04:22

Question: I used the transform method in a use case similar to the one described in the Transform Operation section of Transformations on DStreams: spamInfoRDD = sc.pickleFile(...) # RDD containing spam information # join data stream with spam information to do data cleaning cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...)) My code is as follows: sc = SparkContext("local[4]", "myapp") ssc = StreamingContext(sc, 5) ssc.checkpoint('hdfs://localhost:9000/user/spark/checkpoint/') lines =
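The code above is truncated. For reference, the transform-plus-join pattern the question is based on can be sketched end to end as follows (in Scala, whereas the question uses PySpark; the socket source, spam words, and filtering rule are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("myapp").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val sc = ssc.sparkContext

    // Static RDD of spam words, joined against every micro-batch.
    val spamInfoRDD = sc.parallelize(Seq(("spamword1", true), ("spamword2", true)))

    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // transform exposes the underlying RDD of each batch, so ordinary RDD
    // operations (join, filter, ...) against an external RDD are available.
    val cleanedDStream = wordCounts.transform { rdd =>
      rdd.leftOuterJoin(spamInfoRDD)
        .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }   // keep non-spam words
        .map { case (word, (count, _)) => (word, count) }
    }
    cleanedDStream.print()

    ssc.start()
    ssc.awaitTermination()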

How to infer schema of JSON files?

不问归期 submitted on 2019-12-19 12:08:45

Question: I have the following String in Java: { "header": { "gtfs_realtime_version": "1.0", "incrementality": 0, "timestamp": 1528460625, "user-data": "metra" }, "entity": [{ "id": "8424", "vehicle": { "trip": { "trip_id": "UP-N_UN314_V1_D", "route_id": "UP-N", "start_time": "06:17:00", "start_date": "20180608", "schedule_relationship": 0 }, "vehicle": { "id": "8424", "label": "314" }, "position": { "latitude": 42.10085, "longitude": -87.72896 }, "current_status": 2, "timestamp": 1528460601 } } ] }
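The question is cut off after the JSON payload. If the goal, as the title says, is to infer a schema from such JSON strings, one possible sketch in Scala (the Java API is analogous) is to wrap the strings in a Dataset[String] and let spark.read.json infer the nested schema; the payload below is abbreviated from the one in the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("InferJsonSchema").master("local[*]").getOrCreate()
    import spark.implicits._

    // Abbreviated sample record; in practice this would be the full feed payload.
    val json = """{"header":{"gtfs_realtime_version":"1.0","timestamp":1528460625},"entity":[]}"""

    // read.json on a Dataset[String] samples the records and infers a nested schema.
    val df = spark.read.json(Seq(json).toDS())
    df.printSchema()
    df.show(false)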