Question
EDIT2: Finally I have made my own producer using Java and it works well, so the problem is with the kafka-console-producer. The kafka-console-consumer works well.
EDIT: I have already tried version 0.9.0.1 and it has the same behaviour.
I am working on my bachelor's final project, a comparison between Spark Streaming and Flink. In front of both frameworks I am using Kafka and a script to generate the data (explained below). My first test is to compare the latency between both frameworks with simple workloads, and Kafka is giving me a really high latency (constantly 1 second). For simplicity, for the moment I am running both Kafka and Spark on a single machine.
I have already searched for similar problems and tried the solutions they suggest, but nothing changed. I have also checked all the Kafka configuration options in the official documentation and put the ones relevant to latency in my config files. This is my configuration:
Kafka 0.10.2.1 - Spark 2.1.0
server.properties:
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=2
num.recovery.threads.per.data.dir=1
log.flush.interval.messages=1000
log.flush.interval.ms=50
log.retention.hours=24
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
flush.messages=100
flush.ms=10
producer.properties:
compression.type=none
max.block.ms=200
linger.ms=50
batch.size=0
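For reference, this is roughly what the standalone Java producer mentioned in EDIT2 could look like with the same settings. It is only a minimal sketch: the broker address and the serializers are assumptions, while "test" is the topic used in the Spark code below.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleJavaProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: broker on the same machine
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // same low-latency settings as in producer.properties above
        props.put("compression.type", "none");
        props.put("max.block.ms", "200");
        props.put("linger.ms", "50");
        props.put("batch.size", "0");

        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
        // send one sample record carrying the current time in ms
        producer.send(new ProducerRecord<String, String>("test", "01 " + System.currentTimeMillis()));
        producer.flush();
        producer.close();
    }
}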
Spark Streaming program (which prints the received data and, for each record, the difference between when the data was created and when it is processed by the function):
package com.tfg.spark1.spark1;

import java.util.Map;
import java.util.HashMap;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;
import org.apache.spark.streaming.kafka.*;

public final class Timestamp {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: Timestamp <topics> <numThreads>");
            System.exit(1);
        }

        SparkConf conf = new SparkConf().setMaster("spark://192.168.0.155:7077").setAppName("Timestamp");
        // 100 ms micro-batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(100));

        Map<String, Integer> topicMap = new HashMap<String, Integer>();
        int numThreads = Integer.parseInt(args[1]);
        topicMap.put(args[0], numThreads);

        // Receiver-based Kafka stream, connecting through ZooKeeper
        JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc, "192.168.0.155:2181", "grupo-spark", topicMap); //Map<"test", 2>

        // Keep only the message value, e.g. "01 1496421618634"
        JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
            private static final long serialVersionUID = 1L;

            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });

        // For each record, compute how many ms have passed since the timestamp embedded in the message
        JavaDStream<String> newLine = lines.map(new Function<String, String>() {
            private static final long serialVersionUID = 1L;

            public String call(String line) {
                String[] tuple = line.split(" ");
                String totalTime = String.valueOf(System.currentTimeMillis() - Long.valueOf(tuple[1]));
                //String newLine = line.concat(" " + String.valueOf(System.currentTimeMillis()) + " " + totalTime);
                return totalTime;
            }
        });

        lines.print();
        newLine.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
The generated data has this format:
"Random bits" + " " + "current time in ms"
01 1496421618634
11 1496421619044
00 1496421619451
00 1496421618836
10 1496421619247
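The generator script itself is not shown; a minimal Java sketch of an equivalent generator, printing one record in this format every 200 ms to stdout, would be the following (how the real script feeds Kafka is not shown here; I assume its output is piped into kafka-console-producer):
import java.util.Random;

public class DataGenerator {
    public static void main(String[] args) throws Exception {
        Random random = new Random();
        while (true) {
            // "random bits" + " " + "current time in ms", e.g. "01 1496421618634"
            String bits = "" + random.nextInt(2) + random.nextInt(2);
            System.out.println(bits + " " + System.currentTimeMillis());
            Thread.sleep(200); // one record every 200 ms
        }
    }
}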
Finally, when I run my Spark Streaming program together with the generator script, which produces data every 200 ms, Spark (batch interval = 100 ms) prints 9 empty batches and then, once per second (always at the x900 ms mark, as in this example: Time: 1496421619900 ms), results like these:
-------------------------------------------
Time: 1496421619900 ms
-------------------------------------------
01 1496421618634
11 1496421619044
00 1496421619451
00 1496421618836
10 1496421619247
-------------------------------------------
Time: 1496421619900 ms
-------------------------------------------
1416
1006
599
1214
803
Also, if I run a Kafka command-line producer and a command-line consumer, it always takes some time before the produced data shows up in the consumer.
Thanks in advance for the help!
Answer 1:
I have just updated the JIRA you opened with the reason why you always see the 1000 ms delay.
https://issues.apache.org/jira/browse/KAFKA-5426
I report the reason here: the linger.ms parameter is set through the --timeout command-line option, which defaults to 1000 ms if not specified. At the same time, the batch.size parameter is set through the --max-partition-memory-bytes command-line option, which defaults to 16384 if not specified.
It means that even if you specify linger.ms and batch.size using --producer-property or --producer.config, they will always be overwritten by the above "specific" options.
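So, to test latency with the console producer, those options have to be passed directly. A sketch of such an invocation, assuming a broker on localhost and a topic named test:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test --timeout 0 --max-partition-memory-bytes 16384
With --timeout 0 the console producer no longer lingers waiting for a full batch, so messages should be sent as soon as they are read.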
Source: https://stackoverflow.com/questions/44334304/kafka-spark-streaming-constant-delay-of-1-second