Background
We recently built our own big-data platform, but the schedule was tight and we never load-tested it. Now that the build-out is complete, we plan to stress-test each component of the platform in turn.
Corrections and feedback are welcome!
Objectives
Measure the Kafka cluster's capacity to write and consume messages, and use the results to assess how much load the current cluster layout can sustain. The tests stress both message writes and reads, evaluating how the cluster handles several different message volumes.
Method
On the servers, use Kafka's bundled benchmark scripts to simulate write and read requests at different message volumes, and observe how the cluster copes with each load in terms of messages produced per second, throughput, and message latency.
Environment
System environment

Software | Version | Notes |
---|---|---|
CentOS | 7.6 | 8 cores / 32 GB RAM |
Kafka | 2.11-2.4.0 | 5 brokers |
Test setup
Test data volume: 100 million messages per run.
topic | batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput |
---|---|---|---|---|---|---|---|
test_producer | 10000 | 1 | 512 | none | 4 | 3 | 30000 |
test_producer | 20000 | 1 | 512 | none | 4 | 3 | 30000 |
test_producer | 40000 | 1 | 512 | none | 4 | 3 | 30000 |
test_producer | 60000 | 1 | 512 | none | 4 | 3 | 30000 |
test_producer | 80000 | 1 | 512 | none | 4 | 3 | 30000 |
Environment preparation
Create the topic according to the current Kafka configuration:
```bash
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka --partitions 4 --replication-factor 3
```
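Optionally, verify the resulting layout with the bundled describe command before running any load:

```bash
# Check partition count, replication factor, and leader placement
$ kafka-topics.sh --describe --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka
```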
Load-testing tool
We use the benchmark tool that ships with Kafka: kafka-producer-perf-test.sh.
Parameter descriptions
Parameter | Description |
---|---|
--topic | topic name |
--num-records | number of messages to produce |
--payload-delimiter | delimiter between payloads, defaults to \n |
--throughput | cap on message throughput, in messages/second; set to -1 to remove the limit |
--producer-props | producer settings, e.g. bootstrap.servers, client.id |
--producer.config | producer configuration file (alternative to --producer-props; use one of the two) |
--print-metrics | print metrics when the run finishes; defaults to false |
--transaction-duration-ms | maximum age of each transaction; commitTransaction is called once this time has elapsed. Transactions are enabled only when this value is positive (default: 0) |
--record-size | size of each message in bytes (use either this or --payload-file; one of the two is required) |
For --record-size, you can pull a sample message and measure it. In our data a single message is roughly 473 bytes, so we round up to 512. Note that Python's sys.getsizeof returns the size of the whole Python object, interpreter overhead included; len(s.encode('utf-8')) gives the actual payload size:
```python
>>> import sys
>>> s = 'please answer my question'
>>> sys.getsizeof(s)        # Python object size in bytes, including interpreter overhead
58
>>> len(s.encode('utf-8'))  # actual payload size in bytes
25
```
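Putting the parameter table to work, a minimal smoke-test invocation might look like the following; the 1,000,000-record count here is illustrative, just a quick sanity check before the full 100-million-record runs:

```bash
# Send 1M 512-byte messages with no rate cap and print client-side metrics at the end
$ kafka-producer-perf-test.sh --topic test_kafka \
    --num-records 1000000 \
    --record-size 512 \
    --throughput -1 \
    --print-metrics \
    --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092
```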
Producer parameter notes
The factors most likely to affect producer performance are the following (an example tying them together follows the list):
- thread: the number of test threads per machine;
- batch-size: the producer batch size (producer config batch.size, in bytes); larger batches amortize per-request overhead;
- ack: the acknowledgment policy. Whether a write is acknowledged only by the leader or also by followers matters a great deal for throughput;
- message-size: the size of a single message; a ceiling must be configured on both the producer and the broker, and the size also affects throughput;
- compression-codec: the compression method; the options are none, gzip, snappy, and lz4;
- partition: the number of partitions, tested mainly in combination with thread count;
- replication: the number of replicas;
- throughput: the target throughput, i.e. messages processed per unit time, which can affect message latency;
- linger.ms: how long the producer waits for additional records before flushing a partially full batch.
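As a reference for where each knob lives, here is a sketch of a run that sets them all explicitly; the values are illustrative, not recommendations. Note that kafka-producer-perf-test.sh itself is single-threaded, so the thread dimension is exercised by launching several instances in parallel:

```bash
# acks, batch.size, linger.ms and compression.type are producer configs passed via --producer-props;
# partition and replication counts are properties of the topic itself
$ kafka-producer-perf-test.sh --topic test_kafka \
    --num-records 100000000 \
    --record-size 512 \
    --throughput 30000 \
    --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092 \
        acks=1 batch.size=20000 linger.ms=10 compression.type=none
```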
Broker parameter notes
The broker-side settings most relevant to this workload are the following (a sample server.properties fragment follows the list):
- num.replica.fetchers: the number of threads used to replicate messages from leaders. If replicas keep dropping in and out of the ISR, or followers cannot catch up with the leader, increase this value, but usually not beyond the number of CPU cores + 1;
- num.io.threads: the number of threads the broker uses for disk I/O. These threads can spend time waiting on I/O at peak load, so the value should be generous; a common guideline is twice the number of CPU cores, and at most three times;
- num.network.threads: the maximum number of threads the broker uses to process messages. Much like the producer/consumer threads, they handle network I/O and read/write buffer data with essentially no I/O waits; a common guideline is CPU cores + 1;
- log.flush.interval.messages: flush data to disk after this many messages have been written;
- log.flush.interval.ms: flush data to disk after this much time has elapsed.
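For an 8-core broker like ours, those guidelines translate into roughly the following server.properties fragment. Treat the numbers as a starting point to validate under load, not as tuned values; by default Kafka leaves flushing to the OS page cache, so the flush intervals are shown commented out:

```properties
# 8 cores: network threads ~ cores + 1, io threads ~ 2x cores, fetchers <= cores + 1
num.network.threads=9
num.io.threads=16
num.replica.fetchers=2
# Uncomment to force periodic flushes instead of relying on the page cache
#log.flush.interval.messages=100000
#log.flush.interval.ms=1000
```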
Producer load tests
1. batch-size test
- Test script
```bash
$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=10000 --throughput 30000
100000000 records sent, 29999.895000 records/sec (14.65 MB/sec), 5.69 ms avg latency, 522.00 ms max latency, 1 ms 50th, 2 ms 95th, 208 ms 99th, 349 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 --throughput 30000
100000000 records sent, 29999.895000 records/sec (14.65 MB/sec), 6.44 ms avg latency, 637.00 ms max latency, 1 ms 50th, 2 ms 95th, 228 ms 99th, 353 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=40000 --throughput 30000
100000000 records sent, 29999.868001 records/sec (14.65 MB/sec), 8.12 ms avg latency, 489.00 ms max latency, 1 ms 50th, 4 ms 95th, 252 ms 99th, 354 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 --throughput 30000
100000000 records sent, 29999.877001 records/sec (14.65 MB/sec), 9.03 ms avg latency, 630.00 ms max latency, 1 ms 50th, 13 ms 95th, 261 ms 99th, 357 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=80000 --throughput 30000
100000000 records sent, 29999.904000 records/sec (14.65 MB/sec), 9.84 ms avg latency, 531.00 ms max latency, 1 ms 50th, 34 ms 95th, 267 ms 99th, 355 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
10000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 5.69 |
20000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 6.44 |
40000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 8.12 |
60000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 9.03 |
80000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 9.84 |

- Server load (charts omitted)
- Conclusion
As we increase batch-size, with uncompressed messages the throughput holds steady at 30,000 msg/s (14.65 MB/s); average latency grows with batch size but stays under 10 ms. Server CPU usage ranged from 5%-20% at batch-size 10000 and settled at 5%-15% from batch-size 20000 onward.
2. ack test
- Test script
```bash
$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=0 --throughput 30000
100000000 records sent, 29999.877001 records/sec (14.65 MB/sec), 3.47 ms avg latency, 456.00 ms max latency, 0 ms 50th, 1 ms 95th, 150 ms 99th, 278 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=1 --throughput 30000
100000000 records sent, 29999.886000 records/sec (14.65 MB/sec), 6.48 ms avg latency, 488.00 ms max latency, 1 ms 50th, 2 ms 95th, 226 ms 99th, 349 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=-1 --throughput 30000
100000000 records sent, 29999.886000 records/sec (14.65 MB/sec), 35.50 ms avg latency, 939.00 ms max latency, 2 ms 50th, 308 ms 95th, 631 ms 99th, 763 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
20000 | 0 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 3.47 |
20000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 6.48 |
20000 | -1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 35.50 |

- Server load (charts omitted)
- Conclusion
Under the current configuration, with uncompressed messages: acks=0 is the fastest but least safe; acks=-1 (equivalent to acks=all) is the slowest but safest; acks=1 offers a good balance of safety and performance. Briefly, with acks=0 the producer does not wait for any broker acknowledgment; with acks=1 it waits only for the partition leader to write the record; with acks=-1 it waits for all in-sync replicas, so durability additionally depends on min.insync.replicas.
3. message-size test
- Test script
```bash
$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=-1 --throughput 30000
100000000 records sent, 29999.886000 records/sec (14.65 MB/sec), 34.43 ms avg latency, 913.00 ms max latency, 2 ms 50th, 298 ms 95th, 623 ms 99th, 755 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 386 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=-1 --throughput 30000
100000000 records sent, 29999.913000 records/sec (11.04 MB/sec), 28.42 ms avg latency, 802.00 ms max latency, 2 ms 50th, 252 ms 95th, 527 ms 99th, 631 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
20000 | -1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 34.43 |
20000 | -1 | 386 | none | 4 | 3 | 30000 | 11.04 | 29999 | 28.42 |

- Server load (charts omitted)
- Conclusion
With a 126-byte difference in message size, average latency differs by about 6 ms, and server load is essentially the same.
4. partition test
- Test script
```bash
# 1. Create the topics
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf1 --partitions 1 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf3 --partitions 3 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf5 --partitions 5 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf7 --partitions 7 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf9 --partitions 9 --replication-factor 1

# 2. Produce data
$ kafka-producer-perf-test.sh --topic test_kafka_perf1 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59989.261922 records/sec (29.29 MB/sec), 47.66 ms avg latency, 616.00 ms max latency, 1 ms 50th, 290 ms 95th, 349 ms 99th, 444 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf3 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59992.105039 records/sec (29.29 MB/sec), 36.14 ms avg latency, 632.00 ms max latency, 1 ms 50th, 285 ms 95th, 454 ms 99th, 529 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf5 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59994.408521 records/sec (29.29 MB/sec), 15.82 ms avg latency, 573.00 ms max latency, 1 ms 50th, 140 ms 95th, 324 ms 99th, 397 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf7 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59994.264548 records/sec (29.29 MB/sec), 16.00 ms avg latency, 731.00 ms max latency, 1 ms 50th, 139 ms 95th, 323 ms 99th, 417 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf9 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59999.448005 records/sec (29.30 MB/sec), 16.90 ms avg latency, 870.00 ms max latency, 1 ms 50th, 143 ms 95th, 336 ms 99th, 552 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
60000 | 1 | 512 | none | 1 | 1 | 60000 | 29.29 | 59989 | 47.66 |
60000 | 1 | 512 | none | 3 | 1 | 60000 | 29.29 | 59992 | 36.14 |
60000 | 1 | 512 | none | 5 | 1 | 60000 | 29.29 | 59994 | 15.82 |
60000 | 1 | 512 | none | 7 | 1 | 60000 | 29.29 | 59994 | 16.00 |
60000 | 1 | 512 | none | 9 | 1 | 60000 | 29.30 | 59999 | 16.90 |

- Server load (charts omitted)
- Conclusion
With 5 brokers, average latency improves as partitions are added and is best once the partition count equals the broker count; beyond that point the numbers hold basically steady.
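If a topic was created with too few partitions, the count can be raised in place; note that it can only be increased, and the key-to-partition mapping changes when it grows. A sketch against the zookeeper-based CLI used above:

```bash
# Grow test_kafka_perf1 from 1 to 5 partitions to match the broker count
$ kafka-topics.sh --alter --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf1 --partitions 5
```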
5. replication test
- Test script
```bash
# 1. Create the topics
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf11 --partitions 4 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf22 --partitions 4 --replication-factor 2
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf33 --partitions 4 --replication-factor 3

# 2. Produce data
$ kafka-producer-perf-test.sh --topic test_kafka_perf11 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59999.556003 records/sec (29.30 MB/sec), 22.97 ms avg latency, 649.00 ms max latency, 1 ms 50th, 213 ms 95th, 359 ms 99th, 416 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf22 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59999.664002 records/sec (29.30 MB/sec), 20.35 ms avg latency, 680.00 ms max latency, 1 ms 50th, 187 ms 95th, 338 ms 99th, 416 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf33 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59999.628002 records/sec (29.30 MB/sec), 24.17 ms avg latency, 651.00 ms max latency, 1 ms 50th, 214 ms 95th, 392 ms 99th, 525 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
60000 | 1 | 512 | none | 4 | 1 | 60000 | 29.30 | 59999 | 22.97 |
60000 | 1 | 512 | none | 4 | 2 | 60000 | 29.30 | 59999 | 20.35 |
60000 | 1 | 512 | none | 4 | 3 | 60000 | 29.30 | 59999 | 24.17 |

- Server load (charts omitted)
- Conclusion
Replication is the number of copies we keep of each partition; a factor of 2-4 is generally advisable. We use 3, which guarantees high availability of the data without wasting too much storage.
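Changing the replication factor of an existing topic requires a partition reassignment plan rather than a simple alter. A hedged sketch follows; the broker ids in the replica lists are illustrative and must match your cluster:

```bash
# plan.json maps each partition of test_kafka_perf11 to three replicas
$ cat > plan.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"test_kafka_perf11","partition":0,"replicas":[1,2,3]},
  {"topic":"test_kafka_perf11","partition":1,"replicas":[2,3,4]},
  {"topic":"test_kafka_perf11","partition":2,"replicas":[3,4,5]},
  {"topic":"test_kafka_perf11","partition":3,"replicas":[4,5,1]}
]}
EOF
$ kafka-reassign-partitions.sh --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --reassignment-json-file plan.json --execute
```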
6. throughput test
- Test script
```bash
# Create the topic
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_throughout --partitions 4 --replication-factor 3

# Produce data
$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 100000
100000000 records sent, 99998.700017 records/sec (48.83 MB/sec), 78.71 ms avg latency, 614.00 ms max latency, 1 ms 50th, 337 ms 95th, 390 ms 99th, 483 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 200000
100000000 records sent, 199994.800135 records/sec (97.65 MB/sec), 171.74 ms avg latency, 771.00 ms max latency, 152 ms 50th, 417 ms 95th, 506 ms 99th, 602 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 400000
100000000 records sent, 371560.740892 records/sec (181.43 MB/sec), 160.21 ms avg latency, 684.00 ms max latency, 115 ms 50th, 443 ms 95th, 544 ms 99th, 597 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 600000
100000000 records sent, 370573.499548 records/sec (180.94 MB/sec), 160.82 ms avg latency, 743.00 ms max latency, 117 ms 50th, 429 ms 95th, 530 ms 99th, 581 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 800000
100000000 records sent, 365582.592419 records/sec (178.51 MB/sec), 163.78 ms avg latency, 665.00 ms max latency, 123 ms 50th, 435 ms 95th, 540 ms 99th, 616 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
1000000 | 1 | 512 | none | 4 | 3 | 100000 | 48.83 | 99998 | 78.71 |
1000000 | 1 | 512 | none | 4 | 3 | 200000 | 97.65 | 199994 | 171.74 |
1000000 | 1 | 512 | none | 4 | 3 | 400000 | 181.43 | 371560 | 160.21 |
1000000 | 1 | 512 | none | 4 | 3 | 600000 | 180.94 | 370573 | 160.82 |
1000000 | 1 | 512 | none | 4 | 3 | 800000 | 178.51 | 365582 | 163.78 |

- Server load (charts omitted)
- Conclusion
With partition=4 and replication=3, throughput rises with the requested rate up to 400K msg/s; past that point, raising the target actually lowers throughput slightly, because under high concurrency I/O becomes the bottleneck and requests begin to block. The ceiling in these runs was roughly 370K msg/s (~180 MB/s).
Conclusion
For the producer, acks=1 gives a solid middle ground between performance and reliability. A batch.size around 1,000,000 works well, with message sizes kept to about 2 KB per record (512 bytes per record in these tests); under those settings the cluster peaked at roughly 370K messages/s in our runs. Three to five partitions with 3 replicas deliver good performance while preserving high availability.
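As a convenience, here are the recommendations above condensed into a baseline you could start from; the topic name is hypothetical, and the numbers should be re-validated on your own hardware:

```bash
# 5 partitions (= broker count), 3 replicas, acks=1, ~1 MB batches
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic my_workload --partitions 5 --replication-factor 3
$ kafka-producer-perf-test.sh --topic my_workload --num-records 100000000 --record-size 512 \
    --throughput -1 \
    --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 acks=1 batch.size=1000000
```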