Question
I'm using PySpark 1.6.0.
I have existing PySpark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds one record of data.
In PySpark I read the binary files using: sc.binaryFiles("s3n://.......")
This works great, as it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark Streaming API to read the binary files as a stream (and hopefully the filename too, if possible).
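For reference, here is a minimal sketch of my current batch approach; the bucket path and the record layout (a 4-byte big-endian int followed by a 1-byte boolean) are just placeholders for illustration:
import struct
from pyspark import SparkContext

sc = SparkContext(appName="binary-batch")
# (path, content) pairs, one per file; each file holds one full record
raw = sc.binaryFiles("s3n://my-bucket/data/")  # hypothetical bucket/prefix
# unpack a 4-byte big-endian int and a 1-byte boolean from each record
parsed = raw.mapValues(lambda data: struct.unpack(">i?", data[:5]))
print(parsed.take(5))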
I tried: binaryRecordsStream(directory, recordLength)
but I couldn't get this working...
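For reference, this is roughly the shape of the call I attempted (the directory and record length are placeholders); my guess is that it only fits my case if every file is exactly recordLength bytes, and it also seems to yield only the raw bytes, not the filename:
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches
# each new file under the directory is split into fixed 512-byte records
records = ssc.binaryRecordsStream("s3n://my-bucket/stream/", 512)
records.map(lambda rec: len(rec)).pprint()  # e.g. inspect record sizes
ssc.start()
ssc.awaitTermination()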
Can anyone shed some light on how PySpark Streaming can read a binary data file?
Answer 1:
In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java but not in Python, as noted in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the files you are reading can be read as text, you can use the textFileStream API.
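For example, a minimal PySpark sketch of that text-based route (the directory is a placeholder, and this assumes the files really are newline-delimited text, which binary data generally is not):
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)  # sc: an existing SparkContext; 10s batches
# each line of each new file in the directory becomes one stream element
lines = ssc.textFileStream("s3n://my-bucket/incoming/")  # hypothetical path
lines.pprint()
ssc.start()
ssc.awaitTermination()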
Answer 2:
I had a similar question for Java Spark, where I wanted to stream updates from S3, and there was no trivial solution, since the binaryRecordsStream(<path>, <recordLength>) API is only for fixed-length byte records and I couldn't find an obvious equivalent of JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was to do this:
import org.apache.spark.input.PortableDataStream;
import org.apache.spark.input.StreamInputFormat;
// sc here is a JavaStreamingContext, not a JavaSparkContext
JavaPairInputDStream<String, PortableDataStream> rawAuctions =
    sc.fileStream("s3n://<bucket>/<folder>",
        String.class, PortableDataStream.class, StreamInputFormat.class);
Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PySpark.
Source: https://stackoverflow.com/questions/38091728/spark-streaming-processing-binary-data-file