Background
We recently built our own big-data platform, but the schedule was tight and we never load-tested it. Now that the build-out is complete, we plan to stress-test each component of the platform in turn.
Corrections and feedback are welcome!
Objectives
Measure the Kafka cluster's capacity to write and consume messages, and use the results to assess how much load the current cluster layout can sustain. The tests stress both message writes and reads, evaluating how the cluster handles several different message volumes.
Method
On the servers, use Kafka's bundled benchmark scripts to simulate write and read requests at different message volumes, and observe how the cluster copes with each load in terms of messages produced per second, throughput, and message latency.
Environment
System environment

Software | Version | Notes |
---|---|---|
CentOS | 7.6 | 8 cores / 32 GB RAM |
Kafka | 2.11-2.4.0 | 5 brokers |
Test setup
Test data volume: 100 million messages per run.
topic | batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput |
---|---|---|---|---|---|---|---|
test_producer | 10000 | 1 | 512 | none | 4 | 3 | 30000 |
test_producer | 20000 | 1 | 512 | none | 4 | 3 | 30000 |
test_producer | 40000 | 1 | 512 | none | 4 | 3 | 30000 |
test_producer | 60000 | 1 | 512 | none | 4 | 3 | 30000 |
test_producer | 80000 | 1 | 512 | none | 4 | 3 | 30000 |
Environment preparation
Create the topic according to the current Kafka configuration:
```bash
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka --partitions 4 --replication-factor 3
```
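Optionally, verify the resulting layout with the bundled describe command before running any load:

```bash
# Check partition count, replication factor, and leader placement
$ kafka-topics.sh --describe --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka
```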
Load-testing tool
We use the benchmark tool that ships with Kafka: kafka-producer-perf-test.sh.
Parameter descriptions
Parameter | Description |
---|---|
--topic | topic name |
--num-records | number of messages to produce |
--payload-delimiter | delimiter between payloads, defaults to \n |
--throughput | cap on message throughput, in messages/second; set to -1 to remove the limit |
--producer-props | producer settings, e.g. bootstrap.servers, client.id |
--producer.config | producer configuration file (alternative to --producer-props; use one of the two) |
--print-metrics | print metrics when the run finishes; defaults to false |
--transaction-duration-ms | maximum age of each transaction; commitTransaction is called once this time has elapsed. Transactions are enabled only when this value is positive (default: 0) |
--record-size | size of each message in bytes (use either this or --payload-file; one of the two is required) |
For --record-size, you can pull a sample message and measure it. In our data a single message is roughly 473 bytes, so we round up to 512. Note that Python's sys.getsizeof returns the size of the whole Python object, interpreter overhead included; len(s.encode('utf-8')) gives the actual payload size:
```python
>>> import sys
>>> s = 'please answer my question'
>>> sys.getsizeof(s)        # Python object size in bytes, including interpreter overhead
58
>>> len(s.encode('utf-8'))  # actual payload size in bytes
25
```
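Putting the parameter table to work, a minimal smoke-test invocation might look like the following; the 1,000,000-record count here is illustrative, just a quick sanity check before the full 100-million-record runs:

```bash
# Send 1M 512-byte messages with no rate cap and print client-side metrics at the end
$ kafka-producer-perf-test.sh --topic test_kafka \
    --num-records 1000000 \
    --record-size 512 \
    --throughput -1 \
    --print-metrics \
    --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092
```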
Producer parameter notes
The factors most likely to affect producer performance are the following (an example tying them together follows the list):
- thread: the number of test threads per machine;
- batch-size: the producer batch size (producer config batch.size, in bytes); larger batches amortize per-request overhead;
- ack: the acknowledgment policy. Whether a write is acknowledged only by the leader or also by followers matters a great deal for throughput;
- message-size: the size of a single message; a ceiling must be configured on both the producer and the broker, and the size also affects throughput;
- compression-codec: the compression method; the options are none, gzip, snappy, and lz4;
- partition: the number of partitions, tested mainly in combination with thread count;
- replication: the number of replicas;
- throughput: the target throughput, i.e. messages processed per unit time, which can affect message latency;
- linger.ms: how long the producer waits for additional records before flushing a partially full batch.
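As a reference for where each knob lives, here is a sketch of a run that sets them all explicitly; the values are illustrative, not recommendations. Note that kafka-producer-perf-test.sh itself is single-threaded, so the thread dimension is exercised by launching several instances in parallel:

```bash
# acks, batch.size, linger.ms and compression.type are producer configs passed via --producer-props;
# partition and replication counts are properties of the topic itself
$ kafka-producer-perf-test.sh --topic test_kafka \
    --num-records 100000000 \
    --record-size 512 \
    --throughput 30000 \
    --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092 \
        acks=1 batch.size=20000 linger.ms=10 compression.type=none
```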
Broker parameter notes
The broker-side settings most relevant to this workload are the following (a sample server.properties fragment follows the list):
- num.replica.fetchers: the number of threads used to replicate messages from leaders. If replicas keep dropping in and out of the ISR, or followers cannot catch up with the leader, increase this value, but usually not beyond the number of CPU cores + 1;
- num.io.threads: the number of threads the broker uses for disk I/O. These threads can spend time waiting on I/O at peak load, so the value should be generous; a common guideline is twice the number of CPU cores, and at most three times;
- num.network.threads: the maximum number of threads the broker uses to process messages. Much like the producer/consumer threads, they handle network I/O and read/write buffer data with essentially no I/O waits; a common guideline is CPU cores + 1;
- log.flush.interval.messages: flush data to disk after this many messages have been written;
- log.flush.interval.ms: flush data to disk after this much time has elapsed.
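For an 8-core broker like ours, those guidelines translate into roughly the following server.properties fragment. Treat the numbers as a starting point to validate under load, not as tuned values; by default Kafka leaves flushing to the OS page cache, so the flush intervals are shown commented out:

```properties
# 8 cores: network threads ~ cores + 1, io threads ~ 2x cores, fetchers <= cores + 1
num.network.threads=9
num.io.threads=16
num.replica.fetchers=2
# Uncomment to force periodic flushes instead of relying on the page cache
#log.flush.interval.messages=100000
#log.flush.interval.ms=1000
```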
Producer load tests
1. batch-size test
- Test script
```bash
$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=10000 --throughput 30000
100000000 records sent, 29999.895000 records/sec (14.65 MB/sec), 5.69 ms avg latency, 522.00 ms max latency, 1 ms 50th, 2 ms 95th, 208 ms 99th, 349 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 --throughput 30000
100000000 records sent, 29999.895000 records/sec (14.65 MB/sec), 6.44 ms avg latency, 637.00 ms max latency, 1 ms 50th, 2 ms 95th, 228 ms 99th, 353 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=40000 --throughput 30000
100000000 records sent, 29999.868001 records/sec (14.65 MB/sec), 8.12 ms avg latency, 489.00 ms max latency, 1 ms 50th, 4 ms 95th, 252 ms 99th, 354 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 --throughput 30000
100000000 records sent, 29999.877001 records/sec (14.65 MB/sec), 9.03 ms avg latency, 630.00 ms max latency, 1 ms 50th, 13 ms 95th, 261 ms 99th, 357 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=80000 --throughput 30000
100000000 records sent, 29999.904000 records/sec (14.65 MB/sec), 9.84 ms avg latency, 531.00 ms max latency, 1 ms 50th, 34 ms 95th, 267 ms 99th, 355 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
10000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 5.69 |
20000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 6.44 |
40000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 8.12 |
60000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 9.03 |
80000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 9.84 |

- Server load (charts omitted)
- Conclusion
As we increase batch-size, with uncompressed messages the throughput holds steady at 30,000 msg/s (14.65 MB/s); average latency grows with batch size but stays under 10 ms. Server CPU usage ranged from 5%-20% at batch-size 10000 and settled at 5%-15% from batch-size 20000 onward.
2. ack test
- Test script
```bash
$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=0 --throughput 30000
100000000 records sent, 29999.877001 records/sec (14.65 MB/sec), 3.47 ms avg latency, 456.00 ms max latency, 0 ms 50th, 1 ms 95th, 150 ms 99th, 278 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=1 --throughput 30000
100000000 records sent, 29999.886000 records/sec (14.65 MB/sec), 6.48 ms avg latency, 488.00 ms max latency, 1 ms 50th, 2 ms 95th, 226 ms 99th, 349 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=-1 --throughput 30000
100000000 records sent, 29999.886000 records/sec (14.65 MB/sec), 35.50 ms avg latency, 939.00 ms max latency, 2 ms 50th, 308 ms 95th, 631 ms 99th, 763 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
20000 | 0 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 3.47 |
20000 | 1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 6.48 |
20000 | -1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 35.50 |

- Server load (charts omitted)
- Conclusion
Under the current configuration, with uncompressed messages: acks=0 is the fastest but least safe; acks=-1 (equivalent to acks=all) is the slowest but safest; acks=1 offers a good balance of safety and performance. Briefly, with acks=0 the producer does not wait for any broker acknowledgment; with acks=1 it waits only for the partition leader to write the record; with acks=-1 it waits for all in-sync replicas, so durability additionally depends on min.insync.replicas.
3. message-size test
- Test script
```bash
$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=-1 --throughput 30000
100000000 records sent, 29999.886000 records/sec (14.65 MB/sec), 34.43 ms avg latency, 913.00 ms max latency, 2 ms 50th, 298 ms 95th, 623 ms 99th, 755 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka --num-records 100000000 --record-size 386 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=20000 acks=-1 --throughput 30000
100000000 records sent, 29999.913000 records/sec (11.04 MB/sec), 28.42 ms avg latency, 802.00 ms max latency, 2 ms 50th, 252 ms 95th, 527 ms 99th, 631 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
20000 | -1 | 512 | none | 4 | 3 | 30000 | 14.65 | 29999 | 34.43 |
20000 | -1 | 386 | none | 4 | 3 | 30000 | 11.04 | 29999 | 28.42 |

- Server load (charts omitted)
- Conclusion
With a 126-byte difference in message size, average latency differs by about 6 ms, and server load is essentially the same.
4. partition test
- Test script
```bash
# 1. Create the topics
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf1 --partitions 1 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf3 --partitions 3 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf5 --partitions 5 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf7 --partitions 7 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf9 --partitions 9 --replication-factor 1

# 2. Produce data
$ kafka-producer-perf-test.sh --topic test_kafka_perf1 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59989.261922 records/sec (29.29 MB/sec), 47.66 ms avg latency, 616.00 ms max latency, 1 ms 50th, 290 ms 95th, 349 ms 99th, 444 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf3 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59992.105039 records/sec (29.29 MB/sec), 36.14 ms avg latency, 632.00 ms max latency, 1 ms 50th, 285 ms 95th, 454 ms 99th, 529 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf5 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59994.408521 records/sec (29.29 MB/sec), 15.82 ms avg latency, 573.00 ms max latency, 1 ms 50th, 140 ms 95th, 324 ms 99th, 397 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf7 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59994.264548 records/sec (29.29 MB/sec), 16.00 ms avg latency, 731.00 ms max latency, 1 ms 50th, 139 ms 95th, 323 ms 99th, 417 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf9 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59999.448005 records/sec (29.30 MB/sec), 16.90 ms avg latency, 870.00 ms max latency, 1 ms 50th, 143 ms 95th, 336 ms 99th, 552 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
60000 | 1 | 512 | none | 1 | 1 | 60000 | 29.29 | 59989 | 47.66 |
60000 | 1 | 512 | none | 3 | 1 | 60000 | 29.29 | 59992 | 36.14 |
60000 | 1 | 512 | none | 5 | 1 | 60000 | 29.29 | 59994 | 15.82 |
60000 | 1 | 512 | none | 7 | 1 | 60000 | 29.29 | 59994 | 16.00 |
60000 | 1 | 512 | none | 9 | 1 | 60000 | 29.30 | 59999 | 16.90 |

- Server load (charts omitted)
- Conclusion
With 5 brokers, average latency improves as partitions are added and is best once the partition count equals the broker count; beyond that point the numbers hold basically steady.
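If a topic was created with too few partitions, the count can be raised in place; note that it can only be increased, and the key-to-partition mapping changes when it grows. A sketch against the zookeeper-based CLI used above:

```bash
# Grow test_kafka_perf1 from 1 to 5 partitions to match the broker count
$ kafka-topics.sh --alter --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf1 --partitions 5
```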
5. replication test
- Test script
```bash
# 1. Create the topics
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf11 --partitions 4 --replication-factor 1
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf22 --partitions 4 --replication-factor 2
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_kafka_perf33 --partitions 4 --replication-factor 3

# 2. Produce data
$ kafka-producer-perf-test.sh --topic test_kafka_perf11 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59999.556003 records/sec (29.30 MB/sec), 22.97 ms avg latency, 649.00 ms max latency, 1 ms 50th, 213 ms 95th, 359 ms 99th, 416 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf22 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59999.664002 records/sec (29.30 MB/sec), 20.35 ms avg latency, 680.00 ms max latency, 1 ms 50th, 187 ms 95th, 338 ms 99th, 416 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_kafka_perf33 --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=60000 acks=1 --throughput 60000
100000000 records sent, 59999.628002 records/sec (29.30 MB/sec), 24.17 ms avg latency, 651.00 ms max latency, 1 ms 50th, 214 ms 95th, 392 ms 99th, 525 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
60000 | 1 | 512 | none | 4 | 1 | 60000 | 29.30 | 59999 | 22.97 |
60000 | 1 | 512 | none | 4 | 2 | 60000 | 29.30 | 59999 | 20.35 |
60000 | 1 | 512 | none | 4 | 3 | 60000 | 29.30 | 59999 | 24.17 |

- Server load (charts omitted)
- Conclusion
Replication is the number of copies we keep of each partition; a factor of 2-4 is generally advisable. We use 3, which guarantees high availability of the data without wasting too much storage.
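Changing the replication factor of an existing topic requires a partition reassignment plan rather than a simple alter. A hedged sketch follows; the broker ids in the replica lists are illustrative and must match your cluster:

```bash
# plan.json maps each partition of test_kafka_perf11 to three replicas
$ cat > plan.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"test_kafka_perf11","partition":0,"replicas":[1,2,3]},
  {"topic":"test_kafka_perf11","partition":1,"replicas":[2,3,4]},
  {"topic":"test_kafka_perf11","partition":2,"replicas":[3,4,5]},
  {"topic":"test_kafka_perf11","partition":3,"replicas":[4,5,1]}
]}
EOF
$ kafka-reassign-partitions.sh --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --reassignment-json-file plan.json --execute
```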
6. throughput test
- Test script
```bash
# Create the topic
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic test_throughout --partitions 4 --replication-factor 3

# Produce data
$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 100000
100000000 records sent, 99998.700017 records/sec (48.83 MB/sec), 78.71 ms avg latency, 614.00 ms max latency, 1 ms 50th, 337 ms 95th, 390 ms 99th, 483 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 200000
100000000 records sent, 199994.800135 records/sec (97.65 MB/sec), 171.74 ms avg latency, 771.00 ms max latency, 152 ms 50th, 417 ms 95th, 506 ms 99th, 602 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 400000
100000000 records sent, 371560.740892 records/sec (181.43 MB/sec), 160.21 ms avg latency, 684.00 ms max latency, 115 ms 50th, 443 ms 95th, 544 ms 99th, 597 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 600000
100000000 records sent, 370573.499548 records/sec (180.94 MB/sec), 160.82 ms avg latency, 743.00 ms max latency, 117 ms 50th, 429 ms 95th, 530 ms 99th, 581 ms 99.9th.

$ kafka-producer-perf-test.sh --topic test_throughout --num-records 100000000 --record-size 512 --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 batch.size=1000000 acks=1 --throughput 800000
100000000 records sent, 365582.592419 records/sec (178.51 MB/sec), 163.78 ms avg latency, 665.00 ms max latency, 123 ms 50th, 435 ms 95th, 540 ms 99th, 616 ms 99.9th.
```
- Results

batch-size | ack | message-size(bytes) | compression-codec | partition | replication | throughput | MB/s | MsgNum/s | avg latency(ms) |
---|---|---|---|---|---|---|---|---|---|
1000000 | 1 | 512 | none | 4 | 3 | 100000 | 48.83 | 99998 | 78.71 |
1000000 | 1 | 512 | none | 4 | 3 | 200000 | 97.65 | 199994 | 171.74 |
1000000 | 1 | 512 | none | 4 | 3 | 400000 | 181.43 | 371560 | 160.21 |
1000000 | 1 | 512 | none | 4 | 3 | 600000 | 180.94 | 370573 | 160.82 |
1000000 | 1 | 512 | none | 4 | 3 | 800000 | 178.51 | 365582 | 163.78 |

- Server load (charts omitted)
- Conclusion
With partition=4 and replication=3, throughput rises with the requested rate up to 400K msg/s; past that point, raising the target actually lowers throughput slightly, because under high concurrency I/O becomes the bottleneck and requests begin to block. The ceiling in these runs was roughly 370K msg/s (~180 MB/s).
Conclusion
For the producer, acks=1 gives a solid middle ground between performance and reliability. A batch.size around 1,000,000 works well, with message sizes kept to about 2 KB per record (512 bytes per record in these tests); under those settings the cluster peaked at roughly 370K messages/s in our runs. Three to five partitions with 3 replicas deliver good performance while preserving high availability.
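As a convenience, here are the recommendations above condensed into a baseline you could start from; the topic name is hypothetical, and the numbers should be re-validated on your own hardware:

```bash
# 5 partitions (= broker count), 3 replicas, acks=1, ~1 MB batches
$ kafka-topics.sh --create --zookeeper tvm11:2181,tvm12:2181,tvm13:2181 --topic my_workload --partitions 5 --replication-factor 3
$ kafka-producer-perf-test.sh --topic my_workload --num-records 100000000 --record-size 512 \
    --throughput -1 \
    --producer-props bootstrap.servers=tvm11:9092,tvm12:9092,tvm13:9092,tvm14:9092,tvm15:9092 acks=1 batch.size=1000000
```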