spark-streaming

pyspark : ml + streaming

喜你入骨 submitted on 2019-12-23 05:17:20
Question: According to Combining Spark Streaming + MLlib, it is possible to make a prediction over a stream of input in Spark. The issue with the given example (which works on my cluster) is that the testData is already given in the correct format. I am trying to set up a client <-> server TCP exchange based on strings of data. I can't figure out how to transform the string into the correct format. While this works: sep = ";" str_recue = '0.0;0.1;0.2;0.3;0.4;0.5' rdd = sc.parallelize([str_recue]) chemin
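For reference, a minimal sketch of that transformation step, written in Scala rather than the asker's PySpark and with a hypothetical host and port: each incoming line is split on the separator and turned into an MLlib dense vector that a streaming model's predictOn could consume.

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingPredictSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-predict-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical TCP source sending lines such as "0.0;0.1;0.2;0.3;0.4;0.5"
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split on the separator and build the dense vectors MLlib models expect
    val features = lines.map(line => Vectors.dense(line.split(";").map(_.toDouble)))

    // 'model' is assumed to be a streaming-capable MLlib model trained elsewhere
    // (e.g. StreamingLinearRegressionWithSGD); it is not defined in this sketch.
    // model.predictOn(features).print()
    features.print()

    ssc.start()
    ssc.awaitTermination()
  }
}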

Spark Streaming not performing operations on read blocks

丶灬走出姿态 submitted on 2019-12-23 04:47:11
Question: I am a newbie to the Spark Streaming concept and have been stuck for the last two days trying to understand Spark Streaming from a socket. I see that Spark is able to read blocks passed to the socket; however, it does not perform any operation on the read blocks. Here is the Spark code: package foo; import java.io.File; import java.util.Arrays; import java.util.LinkedList; import java.util.List; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java
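Since the question's code is truncated, here is a hedged baseline in Scala that does process what the socket sends. The two usual culprits for "blocks are received but nothing happens" are running locally with fewer than two threads (the receiver occupies one core) and having no output operation registered before ssc.start(). Host and port are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("socket-word-count")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    counts.print()      // an output operation is required, otherwise nothing is executed

    ssc.start()         // transformations stay lazy until the context is started
    ssc.awaitTermination()
  }
}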

Scala Spark streaming fileStream

纵饮孤独 submitted on 2019-12-23 02:41:13
Question: Similar to this question, I'm trying to use fileStream but am receiving a compile-time error about the type arguments. I'm trying to ingest XML data using org.apache.mahout.text.wikipedia.XmlInputFormat, provided by mahout-examples, as my InputFormat type. val fileStream = ssc.fileStream[LongWritable, Text, XmlInputFormat](WATCHDIR) The compilation errors are: Error:(39, 26) type arguments [org.apache.hadoop.io.LongWritable,scala.xml.Text,org.apache.mahout.text.wikipedia.XmlInputFormat] conform to
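The error message itself shows Text resolving to scala.xml.Text rather than org.apache.hadoop.io.Text, which is what fileStream's type bounds expect. A sketch with the Hadoop types spelled out (ssc and WATCHDIR are assumed to be defined as in the question):

import org.apache.hadoop.io.{LongWritable, Text}   // Hadoop's Text, not scala.xml.Text
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def xmlStream(ssc: StreamingContext, WATCHDIR: String): DStream[String] =
  ssc.fileStream[LongWritable, Text, XmlInputFormat](WATCHDIR)
    .map { case (_, xml) => xml.toString }   // unwrap the Writable into a plain String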

Two node DSE spark cluster error setting up second node. Why?

巧了我就是萌 submitted on 2019-12-23 01:52:27
Question: I have a DSE Spark cluster with 2 nodes. One DSE Analytics node with Spark cannot start after I install it; without Spark it starts just fine. But on my other node Spark is enabled and it starts and works just fine. Why is that, and how can I solve it? Thanks. Here is my error log: ERROR [main] 2016-02-27 20:35:43,353 CassandraDaemon.java:294 - Fatal exception during initialization org.apache.cassandra.exceptions.ConfigurationException: Cannot start node if snitch's data center (Analytics)

Spark batch-streaming of kafka into single file

£可爱£侵袭症+ submitted on 2019-12-23 01:42:13
Question: I am streaming data from Kafka using batch streaming (maxRatePerPartition 10,000), so in each batch I process 10,000 Kafka messages. Within this batch run I process each message by creating a DataFrame out of the RDD. After processing, I save each processed record to the same file using dataFrame.write.mode(SaveMode.Append), so it appends all messages to the same file. This is OK as long as it is running within one batch run. But after the next batch run is executed (the next 10,000 messages are
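For context, a minimal sketch of the write pattern being described, with a hypothetical output path and a hypothetical single-column schema: each micro-batch's RDD becomes a DataFrame and is appended under one location. Note that append mode adds new part files under that path on every batch rather than growing a single physical file.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// 'stream' is assumed to be the DStream of Kafka message values from the question.
def writeBatches(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._

      // Hypothetical one-column schema; the real schema comes from the messages.
      val df = rdd.toDF("value")

      // Appends new part files under the same directory on every batch run.
      df.write.mode(SaveMode.Append).parquet("/tmp/kafka-output")
    }
  }
}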

Spark Streaming - obtain batch-level performance stats

邮差的信 submitted on 2019-12-23 01:05:01
Question: I'm setting up an Apache Spark cluster to perform real-time streaming computations and would like to monitor the performance of the deployment by tracking various metrics like batch sizes, batch processing times, etc. My Spark Streaming program is written in Scala. Questions: The Spark monitoring REST API description lists the various endpoints available. However, I couldn't find endpoints that expose batch-level info. Is there a way to get a list of all the Spark batches that have been run
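Besides the REST API, batch-level numbers can also be captured in-process by registering a StreamingListener; a minimal sketch (the logging format is my own choice):

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch=${info.batchTime} records=${info.numRecords} " +
      s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
      s"processingTimeMs=${info.processingDelay.getOrElse(-1L)}")
  }
}

// Register before ssc.start(); 'ssc' is the application's StreamingContext.
def register(ssc: StreamingContext): Unit =
  ssc.addStreamingListener(new BatchStatsListener)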

Task Not Serializable exception when trying to write a rdd of type Generic Record

别说谁变了你拦得住时间么 submitted on 2019-12-23 00:32:20
Question: val file = File.createTempFile("temp", ".avro") val schema = new Schema.Parser().parse(st) val datumWriter = new GenericDatumWriter[GenericData.Record](schema) val dataFileWriter = new DataFileWriter[GenericData.Record](datumWriter) dataFileWriter.create(schema, file) rdd.foreach(r => { dataFileWriter.append(r) }) dataFileWriter.close() I have a DStream of type GenericData.Record which I am trying to write to HDFS in the Avro format, but I'm getting this Task Not Serializable error: org
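The DataFileWriter built on the driver gets captured by the foreach closure, and it is not serializable. A common restructuring, sketched here with a local temp file per partition rather than the question's HDFS target, is to create the writer inside foreachPartition so it only ever exists on the executor:

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.spark.rdd.RDD

// 'st' (the schema string) is taken from the question; the output location is a placeholder.
def writeAvro(rdd: RDD[GenericData.Record], st: String): Unit = {
  rdd.foreachPartition { records =>
    // Everything below runs on the executor, so nothing non-serializable
    // needs to be shipped from the driver.
    val schema = new Schema.Parser().parse(st)
    val datumWriter = new GenericDatumWriter[GenericData.Record](schema)
    val dataFileWriter = new DataFileWriter[GenericData.Record](datumWriter)
    val file = File.createTempFile("part", ".avro")   // one file per partition
    dataFileWriter.create(schema, file)
    records.foreach(dataFileWriter.append)
    dataFileWriter.close()
  }
}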

Convert a DStream to JavaDStream

我与影子孤独终老i submitted on 2019-12-22 18:19:10
Question: I know we have an option for an RDD: JavaRDD<String> javaRDD = coreRdd.toJavaRDD(); Is it possible to convert a DStream to a JavaDStream? Answer 1: Yes, you can use the static JavaDStream<T>.fromDStream: JavaDStream<String> javaDStream = JavaDStream$.MODULE$.fromDStream(dStream, scala.reflect.ClassTag$.MODULE$.apply(String.class)); Another option would be to use the class constructor, which takes an existing DStream: JavaDStream<String> javaDStream = new JavaDStream<String>(dStream, scala.reflect.ClassTag$.MODULE$.apply(String.class));
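For completeness, a sketch of the same conversion done from Scala, where the ClassTag is supplied implicitly (dStream is assumed to exist already):

import org.apache.spark.streaming.api.java.JavaDStream
import org.apache.spark.streaming.dstream.DStream

def toJava(dStream: DStream[String]): JavaDStream[String] =
  JavaDStream.fromDStream(dStream)   // the ClassTag[String] argument is filled in implicitly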

_spark_metadata causing problems

社会主义新天地 submitted on 2019-12-22 13:54:06
Question: I am using Spark with Scala and I have a directory where I have multiple files. In this directory I have Parquet files generated by Spark and other files generated by Spark Streaming, and Spark Streaming generates a directory _spark_metadata. The problem I am facing is that when I read the directory with Spark (sparksession.read.load), it reads only the data generated by Spark Streaming, as if the other data did not exist. Does someone know how to resolve this issue? I think there should be
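One workaround sketch, under the assumption that listing the Parquet files explicitly keeps Spark from consulting the streaming sink's _spark_metadata log (the directory name is hypothetical):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical directory written to by both a batch job and a streaming file sink.
val dir = new Path("/data/mixed")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Enumerate the Parquet files ourselves, which skips the _spark_metadata directory,
// and hand Spark the explicit file list instead of the parent directory.
val files = fs.listStatus(dir)
  .filter(s => s.isFile && s.getPath.getName.endsWith(".parquet"))
  .map(_.getPath.toString)

val allDF = spark.read.parquet(files: _*)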

Not Serializable exception when integrating Spark SQL and Spark Streaming

点点圈 submitted on 2019-12-22 12:27:39
Question: This is my source code, in which I'm getting some data from the server side, which keeps on generating a stream of data. Then, for each RDD, I'm applying the SQL schema, and once this table is created I'm trying to select something from this DStream. List<String> males = new ArrayList<String>(); JavaDStream<String> data = streamingContext.socketTextStream("localhost", (port)); data.print(); System.out.println("Socket connection established to read data from Subscriber Server"); JavaDStream
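A frequent cause in this setup is a closure that captures something non-serializable, such as the StreamingContext or an SQL context created on the driver. The pattern from the Spark streaming guide, sketched here in Scala with a hypothetical Subscriber schema, is to obtain the SparkSession lazily inside foreachRDD and register the temp view per batch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

case class Subscriber(name: String, gender: String)   // hypothetical schema

def processStream(parsed: DStream[Subscriber]): Unit = {
  parsed.foreachRDD { rdd =>
    // Get (or create) the session inside the closure instead of capturing one
    // built on the driver, so nothing non-serializable is shipped to the executors.
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    rdd.toDF().createOrReplaceTempView("subscribers")
    spark.sql("SELECT name FROM subscribers WHERE gender = 'male'").show()
  }
}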