apache-spark-2.0

Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)

↘锁芯ラ submitted on 2019-11-28 23:53:49
I have a Spark 2.0 application that reads messages from Kafka using Spark Streaming (with spark-streaming-kafka-0-10_2.11). Structured Streaming looks really cool, so I wanted to try to migrate the code, but I can't figure out how to use it. In regular streaming I used KafkaUtils.createDirectStream, and one of the parameters I passed was the value deserializer. In Structured Streaming the docs say I should deserialize using DataFrame functions, but I can't figure out exactly what that means. I looked at examples such as this example, but my Avro object in Kafka is quite complex and cannot be
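A minimal sketch of what "deserialize using DataFrame functions" looks like in practice: Kafka rows arrive with a binary value column, and you turn it into typed columns yourself, for example with a UDF. The topic name, bootstrap servers, and the decodeAvro helper below are hypothetical placeholders, not the asker's code; a real decoder would use an Avro DatumReader for the actual schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object AvroFromKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AvroFromKafka").getOrCreate()
    import spark.implicits._

    // Structured Streaming Kafka source: yields rows with a binary `value` column.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092") // placeholder
      .option("subscribe", "my-topic")                // placeholder
      .load()

    // Hypothetical decoder: replace the body with real Avro decoding
    // (e.g. a DatumReader for your record schema).
    val decodeAvro = udf { (bytes: Array[Byte]) => new String(bytes) }

    val decoded = raw.select(decodeAvro($"value").as("payload"))

    val query = decoded.writeStream.format("console").start()
    query.awaitTermination()
  }
}
```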

Spark parquet partitioning : Large number of files

自作多情 submitted on 2019-11-27 18:01:28
I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I try to read from the root directory. To avoid that I tried data.coalesce(numPart).write.partitionBy("key").parquet("/location") This, however, creates numPart parquet files in each partition. Now my partition sizes are different, so I would ideally like to have a separate coalesce per partition. This, however, doesn't look like an easy thing. I need to visit all
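A minimal sketch under the asker's setup (a DataFrame data with a "key" column): repartitioning by the partition column before the write sends each key to a single task, so the write produces one file per key instead of one file per key per task. The input and output paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PartitionedWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PartitionedWrite").getOrCreate()

    // Placeholder input; `data` stands in for the asker's DataFrame.
    val data = spark.read.parquet("/input")

    data
      .repartition(col("key"))  // shuffle so each key lands in one task
      .write
      .partitionBy("key")       // one output directory per key value
      .parquet("/location")     // placeholder output path
  }
}
```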

What are the various join types in Spark?

和自甴很熟 submitted on 2019-11-27 03:57:21
Question: I looked at the docs and it says the following join types are supported: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti. I looked at the StackOverflow answer on SQL joins and the top couple of answers do not mention some of the joins listed above, e.g. left_semi and left_anti. What do they mean in Spark? Answer 1: Here is a simple illustrative experiment: import org.apache.spark.sql._ object
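A minimal sketch (not the truncated answer's original code) illustrating the two join types the question asks about: left_semi keeps the left-side rows that have a match on the right, dropping the right-side columns, while left_anti keeps the left-side rows that have no match.

```scala
import org.apache.spark.sql.SparkSession

object JoinTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("JoinTypes").master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

    // Rows with id 1 and 3 from `left`, only the left-side columns.
    left.join(right, Seq("id"), "left_semi").show()

    // Row with id 2 from `left`: the ids that never appear in `right`.
    left.join(right, Seq("id"), "left_anti").show()
  }
}
```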
