apache-spark-2.0

Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)

↘锁芯ラ submitted on 2019-11-28 23:53:49
I have a Spark 2.0 application that reads messages from Kafka using Spark Streaming (with spark-streaming-kafka-0-10_2.11). Structured Streaming looks really cool, so I wanted to try to migrate the code, but I can't figure out how to use it. In regular streaming I used KafkaUtils.createDirectStream, and one of the parameters I passed was the value deserializer. In Structured Streaming the docs say I should deserialize using DataFrame functions, but I can't figure out exactly what that means. I looked at examples such as this example, but my Avro object in Kafka is quite complex and cannot be
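A minimal sketch of what "deserialize using DataFrame functions" looks like in practice: Kafka rows arrive with a binary value column, and you turn it into typed columns yourself, for example with a UDF. The topic name, bootstrap servers, and the decodeAvro helper below are hypothetical placeholders, not the asker's code; a real decoder would use an Avro DatumReader for the actual schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object AvroFromKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AvroFromKafka").getOrCreate()
    import spark.implicits._

    // Structured Streaming Kafka source: yields rows with a binary `value` column.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092") // placeholder
      .option("subscribe", "my-topic")                // placeholder
      .load()

    // Hypothetical decoder: replace the body with real Avro decoding
    // (e.g. a DatumReader for your record schema).
    val decodeAvro = udf { (bytes: Array[Byte]) => new String(bytes) }

    val decoded = raw.select(decodeAvro($"value").as("payload"))

    val query = decoded.writeStream.format("console").start()
    query.awaitTermination()
  }
}
```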

Spark parquet partitioning : Large number of files

自作多情 submitted on 2019-11-27 18:01:28
I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I try to read from the root directory. To avoid that I tried data.coalesce(numPart).write.partitionBy("key").parquet("/location") This, however, creates numPart parquet files in each partition. Now my partition sizes are different, so I would ideally like to have a separate coalesce per partition. This, however, doesn't look like an easy thing. I need to visit all
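A minimal sketch under the asker's setup (a DataFrame data with a "key" column): repartitioning by the partition column before the write sends each key to a single task, so the write produces one file per key instead of one file per key per task. The input and output paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PartitionedWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PartitionedWrite").getOrCreate()

    // Placeholder input; `data` stands in for the asker's DataFrame.
    val data = spark.read.parquet("/input")

    data
      .repartition(col("key"))  // shuffle so each key lands in one task
      .write
      .partitionBy("key")       // one output directory per key value
      .parquet("/location")     // placeholder output path
  }
}
```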

What are the various join types in Spark?

和自甴很熟 submitted on 2019-11-27 03:57:21
Question: I looked at the docs and it says the following join types are supported: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti. I looked at the StackOverflow answer on SQL joins and the top couple of answers do not mention some of the joins listed above, e.g. left_semi and left_anti. What do they mean in Spark? Answer 1: Here is a simple illustrative experiment: import org.apache.spark.sql._ object
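A minimal sketch (not the truncated answer's original code) illustrating the two join types the question asks about: left_semi keeps the left-side rows that have a match on the right, dropping the right-side columns, while left_anti keeps the left-side rows that have no match.

```scala
import org.apache.spark.sql.SparkSession

object JoinTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("JoinTypes").master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

    // Rows with id 1 and 3 from `left`, only the left-side columns.
    left.join(right, Seq("id"), "left_semi").show()

    // Row with id 2 from `left`: the ids that never appear in `right`.
    left.join(right, Seq("id"), "left_anti").show()
  }
}
```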
