spark-streaming

pyspark : ml + streaming

喜你入骨 submitted on 2019-12-23 05:17:20
Question: According to Combining Spark Streaming + MLlib, it is possible to make a prediction over a stream of input in Spark. The issue with the given example (which works on my cluster) is that the testData is already given in the correct format. I am trying to set up a client <-> server TCP exchange based on strings of data. I can't figure out how to transform the string into the correct format. While this works: sep = ";" str_recue = '0.0;0.1;0.2;0.3;0.4;0.5' rdd = sc.parallelize([str_recue]) chemin
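For reference, a minimal sketch of that transformation step, written in Scala rather than the asker's PySpark and with a hypothetical host and port: each incoming line is split on the separator and turned into an MLlib dense vector that a streaming model's predictOn could consume.

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingPredictSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-predict-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical TCP source sending lines such as "0.0;0.1;0.2;0.3;0.4;0.5"
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split on the separator and build the dense vectors MLlib models expect
    val features = lines.map(line => Vectors.dense(line.split(";").map(_.toDouble)))

    // 'model' is assumed to be a streaming-capable MLlib model trained elsewhere
    // (e.g. StreamingLinearRegressionWithSGD); it is not defined in this sketch.
    // model.predictOn(features).print()
    features.print()

    ssc.start()
    ssc.awaitTermination()
  }
}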

Spark Streaming not performing operations on read blocks

丶灬走出姿态 submitted on 2019-12-23 04:47:11
Question: I am a newbie to the Spark Streaming concept and have been stuck for the last two days trying to understand Spark Streaming from a socket. I see that Spark is able to read blocks passed to the socket; however, it does not perform any operation on the read blocks. Here is the Spark code: package foo; import java.io.File; import java.util.Arrays; import java.util.LinkedList; import java.util.List; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java
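Since the question's code is truncated, here is a hedged baseline in Scala that does process what the socket sends. The two usual culprits for "blocks are received but nothing happens" are running locally with fewer than two threads (the receiver occupies one core) and having no output operation registered before ssc.start(). Host and port are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("socket-word-count")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    counts.print()      // an output operation is required, otherwise nothing is executed

    ssc.start()         // transformations stay lazy until the context is started
    ssc.awaitTermination()
  }
}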

Scala Spark streaming fileStream

纵饮孤独 submitted on 2019-12-23 02:41:13
Question: Similar to this question, I'm trying to use fileStream but am receiving a compile-time error about the type arguments. I'm trying to ingest XML data using org.apache.mahout.text.wikipedia.XmlInputFormat, provided by mahout-examples, as my InputFormat type. val fileStream = ssc.fileStream[LongWritable, Text, XmlInputFormat](WATCHDIR) The compilation errors are: Error:(39, 26) type arguments [org.apache.hadoop.io.LongWritable,scala.xml.Text,org.apache.mahout.text.wikipedia.XmlInputFormat] conform to
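The error message itself shows Text resolving to scala.xml.Text rather than org.apache.hadoop.io.Text, which is what fileStream's type bounds expect. A sketch with the Hadoop types spelled out (ssc and WATCHDIR are assumed to be defined as in the question):

import org.apache.hadoop.io.{LongWritable, Text}   // Hadoop's Text, not scala.xml.Text
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def xmlStream(ssc: StreamingContext, WATCHDIR: String): DStream[String] =
  ssc.fileStream[LongWritable, Text, XmlInputFormat](WATCHDIR)
    .map { case (_, xml) => xml.toString }   // unwrap the Writable into a plain String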

Two node DSE spark cluster error setting up second node. Why?

巧了我就是萌 submitted on 2019-12-23 01:52:27
Question: I have a DSE Spark cluster with 2 nodes. One DSE Analytics node with Spark cannot start after I install it; without Spark it starts just fine. But on my other node Spark is enabled and it starts and works just fine. Why is that, and how can I solve it? Thanks. Here is my error log: ERROR [main] 2016-02-27 20:35:43,353 CassandraDaemon.java:294 - Fatal exception during initialization org.apache.cassandra.exceptions.ConfigurationException: Cannot start node if snitch's data center (Analytics)

Spark batch-streaming of kafka into single file

£可爱£侵袭症+ submitted on 2019-12-23 01:42:13
Question: I am streaming data from Kafka using batch streaming (maxRatePerPartition 10,000), so in each batch I process 10,000 Kafka messages. Within this batch run I process each message by creating a DataFrame out of the RDD. After processing, I save each processed record to the same file using dataFrame.write.mode(SaveMode.Append), so it appends all messages to the same file. This is OK as long as it is running within one batch run. But after the next batch run is executed (the next 10,000 messages are
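For context, a minimal sketch of the write pattern being described, with a hypothetical output path and a hypothetical single-column schema: each micro-batch's RDD becomes a DataFrame and is appended under one location. Note that append mode adds new part files under that path on every batch rather than growing a single physical file.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// 'stream' is assumed to be the DStream of Kafka message values from the question.
def writeBatches(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._

      // Hypothetical one-column schema; the real schema comes from the messages.
      val df = rdd.toDF("value")

      // Appends new part files under the same directory on every batch run.
      df.write.mode(SaveMode.Append).parquet("/tmp/kafka-output")
    }
  }
}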

Spark Streaming - obtain batch-level performance stats

邮差的信 submitted on 2019-12-23 01:05:01
Question: I'm setting up an Apache Spark cluster to perform real-time streaming computations and would like to monitor the performance of the deployment by tracking various metrics like batch sizes, batch processing times, etc. My Spark Streaming program is written in Scala. Questions: The Spark monitoring REST API description lists the various endpoints available. However, I couldn't find endpoints that expose batch-level info. Is there a way to get a list of all the Spark batches that have been run
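Besides the REST API, batch-level numbers can also be captured in-process by registering a StreamingListener; a minimal sketch (the logging format is my own choice):

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch=${info.batchTime} records=${info.numRecords} " +
      s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
      s"processingTimeMs=${info.processingDelay.getOrElse(-1L)}")
  }
}

// Register before ssc.start(); 'ssc' is the application's StreamingContext.
def register(ssc: StreamingContext): Unit =
  ssc.addStreamingListener(new BatchStatsListener)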

Task Not Serializable exception when trying to write a rdd of type Generic Record

别说谁变了你拦得住时间么 submitted on 2019-12-23 00:32:20
Question: val file = File.createTempFile("temp", ".avro") val schema = new Schema.Parser().parse(st) val datumWriter = new GenericDatumWriter[GenericData.Record](schema) val dataFileWriter = new DataFileWriter[GenericData.Record](datumWriter) dataFileWriter.create(schema, file) rdd.foreach(r => { dataFileWriter.append(r) }) dataFileWriter.close() I have a DStream of type GenericData.Record which I am trying to write to HDFS in the Avro format, but I'm getting this Task Not Serializable error: org
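The DataFileWriter built on the driver gets captured by the foreach closure, and it is not serializable. A common restructuring, sketched here with a local temp file per partition rather than the question's HDFS target, is to create the writer inside foreachPartition so it only ever exists on the executor:

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter}
import org.apache.spark.rdd.RDD

// 'st' (the schema string) is taken from the question; the output location is a placeholder.
def writeAvro(rdd: RDD[GenericData.Record], st: String): Unit = {
  rdd.foreachPartition { records =>
    // Everything below runs on the executor, so nothing non-serializable
    // needs to be shipped from the driver.
    val schema = new Schema.Parser().parse(st)
    val datumWriter = new GenericDatumWriter[GenericData.Record](schema)
    val dataFileWriter = new DataFileWriter[GenericData.Record](datumWriter)
    val file = File.createTempFile("part", ".avro")   // one file per partition
    dataFileWriter.create(schema, file)
    records.foreach(dataFileWriter.append)
    dataFileWriter.close()
  }
}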

Convert a DStream to JavaDStream

我与影子孤独终老i submitted on 2019-12-22 18:19:10
Question: I know we have an option for an RDD: JavaRDD<String> javaRDD = coreRdd.toJavaRDD(); Is it possible to convert a DStream to a JavaDStream? Answer 1: Yes, you can use the static JavaDStream<T>.fromDStream: JavaDStream<String> javaDStream = JavaDStream$.MODULE$.fromDStream(dStream, scala.reflect.ClassTag$.MODULE$.apply(String.class)); Another option would be to use the class constructor, which takes an existing DStream: JavaDStream<String> javaDStream = new JavaDStream<String>(dStream, scala.reflect.ClassTag$.MODULE$.apply(String.class));
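For completeness, a sketch of the same conversion done from Scala, where the ClassTag is supplied implicitly (dStream is assumed to exist already):

import org.apache.spark.streaming.api.java.JavaDStream
import org.apache.spark.streaming.dstream.DStream

def toJava(dStream: DStream[String]): JavaDStream[String] =
  JavaDStream.fromDStream(dStream)   // the ClassTag[String] argument is filled in implicitly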

_spark_metadata causing problems

社会主义新天地 submitted on 2019-12-22 13:54:06
Question: I am using Spark with Scala and I have a directory where I have multiple files. In this directory I have Parquet files generated by Spark and other files generated by Spark Streaming, and Spark Streaming generates a directory _spark_metadata. The problem I am facing is that when I read the directory with Spark (sparksession.read.load), it reads only the data generated by Spark Streaming, as if the other data did not exist. Does someone know how to resolve this issue? I think there should be
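One workaround sketch, under the assumption that listing the Parquet files explicitly keeps Spark from consulting the streaming sink's _spark_metadata log (the directory name is hypothetical):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical directory written to by both a batch job and a streaming file sink.
val dir = new Path("/data/mixed")
val fs = dir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Enumerate the Parquet files ourselves, which skips the _spark_metadata directory,
// and hand Spark the explicit file list instead of the parent directory.
val files = fs.listStatus(dir)
  .filter(s => s.isFile && s.getPath.getName.endsWith(".parquet"))
  .map(_.getPath.toString)

val allDF = spark.read.parquet(files: _*)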

Not Serializable exception when integrating Spark SQL and Spark Streaming

点点圈 submitted on 2019-12-22 12:27:39
Question: This is my source code, in which I'm getting some data from the server side, which keeps on generating a stream of data. Then, for each RDD, I'm applying the SQL schema, and once this table is created I'm trying to select something from this DStream. List<String> males = new ArrayList<String>(); JavaDStream<String> data = streamingContext.socketTextStream("localhost", (port)); data.print(); System.out.println("Socket connection established to read data from Subscriber Server"); JavaDStream
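A frequent cause in this setup is a closure that captures something non-serializable, such as the StreamingContext or an SQL context created on the driver. The pattern from the Spark streaming guide, sketched here in Scala with a hypothetical Subscriber schema, is to obtain the SparkSession lazily inside foreachRDD and register the temp view per batch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

case class Subscriber(name: String, gender: String)   // hypothetical schema

def processStream(parsed: DStream[Subscriber]): Unit = {
  parsed.foreachRDD { rdd =>
    // Get (or create) the session inside the closure instead of capturing one
    // built on the driver, so nothing non-serializable is shipped to the executors.
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    rdd.toDF().createOrReplaceTempView("subscribers")
    spark.sql("SELECT name FROM subscribers WHERE gender = 'male'").show()
  }
}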