Optimal way of creating a cache in the PySpark environment
Question: I am using Spark Streaming to build a system that enriches incoming data with records from a Cloudant database.

Example:

Incoming message: `{"id": 123}`
Outgoing message: `{"id": 123, "data": "xxxxxxxxxxxxxxxxxxx"}`

My code for the driver class is as follows:

```python
from Sample.Job import EnrichmentJob
from Sample.Job import FunctionJob
import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.streaming import StreamingContext
from pyspark
```
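For context on the caching the title asks about, below is a minimal sketch of one common enrichment pattern: a per-partition memo cache inside `mapPartitions`, so repeated ids within a batch hit the database only once. The `fetch_from_cloudant` helper is a hypothetical stand-in for a real Cloudant client call and is not part of the question's code.

```python
import json


def fetch_from_cloudant(doc_id):
    # Hypothetical stand-in for a real Cloudant lookup; in a real job this
    # would call the Cloudant client and return the document for doc_id.
    return "xxxxxxxxxxxxxxxxxxx"


def enrich_partition(messages):
    """Enrich each message in a partition, memoizing lookups so that
    repeated ids within the partition query Cloudant only once."""
    local_cache = {}  # one cache per partition, per batch
    for msg in messages:
        doc_id = msg["id"]
        if doc_id not in local_cache:
            local_cache[doc_id] = fetch_from_cloudant(doc_id)
        msg["data"] = local_cache[doc_id]
        yield msg


# Usage inside a streaming job (sketch; kafka_stream is assumed to be a
# DStream of (key, value) pairs from KafkaUtils.createDirectStream):
# enriched = kafka_stream.map(lambda kv: json.loads(kv[1])) \
#                        .mapPartitions(enrich_partition)
```

When the lookup table is small enough to ship to every executor, a broadcast variable (refreshed between batches) is the usual alternative to per-partition lookups.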