For testing purposes, I need to simulate a client that generates 100,000 messages per second and sends them to a Kafka topic. Is there a tool or approach that can help me generate these messages?
The existing answers (e.g., kafka-producer-perf-test.sh) are useful for performance testing, but much less so when you need more than a single stream of raw bytes: for example, realistic data with nested structures, or data spread across multiple topics that have some relationship to each other. For those cases, I'd look at the alternatives below.
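That said, for the raw-throughput requirement in the question, the stock tool is a good fit. A minimal sketch, assuming a broker at localhost:9092 and a pre-created topic named test-topic (both placeholders for your setup):

```
# Throttled to ~100,000 records/sec of 100-byte messages;
# pass --throughput -1 to produce as fast as possible instead.
bin/kafka-producer-perf-test.sh \
  --topic test-topic \
  --num-records 10000000 \
  --record-size 100 \
  --throughput 100000 \
  --producer-props bootstrap.servers=localhost:9092
```

The tool prints the achieved throughput and latency percentiles, so it also tells you whether your setup actually sustains the target rate.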
Update Dec 2020: As of today, I recommend using https://github.com/MichaelDrogalis/voluble. Some background: the author is the product manager at Confluent for Kafka Streams and ksqlDB, and the author/developer of http://www.onyxplatform.org/.
From the Voluble README:
- Creating realistic data by integrating with Java Faker.
- Cross-topic relationships
- Populating both keys and values of records
- Making both primitive and complex/nested values
- Bounded or unbounded streams of data
- Tombstoning
Voluble ships as a Kafka connector to make it easy to scale and change serialization formats. You can use Kafka Connect through its REST API or integrated with ksqlDB. In this guide, I demonstrate using the latter, but the configuration is the same for both. I leave out Connect-specific configuration like serializers and tasks that need to be configured for any connector.
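Here's a minimal sketch of such a ksqlDB connector declaration, adapted from the conventions in the Voluble README; the topic names (owners, cats), the specific Faker expressions, and the throttle value are all illustrative:

```sql
CREATE SOURCE CONNECTOR voluble_demo WITH (
  'connector.class' = 'io.mdrogalis.voluble.VolubleSourceConnector',

  -- 'owners' topic: primitive keys and value attributes, generated
  -- from Java Faker expressions
  'genkp.owners.with'        = '#{Internet.uuid}',
  'genv.owners.name.with'    = '#{Name.full_name}',

  -- 'cats' topic: each record's 'owner' field matches an existing key
  -- in 'owners', which is how Voluble expresses cross-topic relationships
  'genk.cats.name.with'      = '#{FunnyName.name}',
  'genv.cats.owner.matching' = 'owners.key',

  -- slow generation down; drop this to produce as fast as possible
  'global.throttle.ms' = '100'
);
```

Each `#{...}` expression is a Java Faker directive, which is what the "realistic data" bullet above refers to; the `matching` directive is what produces the cross-topic relationships.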
Old answer (2016): I'd suggest taking a look at https://github.com/josephadler/eventsim, which will produce more "realistic" synthetic data (yeah, I am aware of the irony of what I just said :-P):
Eventsim is a program that generates event data for testing and demos. It's written in Scala, because we are big data hipsters (at least sometimes). It's designed to replicate page requests for a fake music website (picture something like Spotify); the results look like real usage data, but are totally fake. You can configure the program to create as much data as you want: data for just a few users for a few hours, or data for a huge number of users over many years. You can write the data to files, or pipe it out to Apache Kafka.

You can use the fake data for product development, correctness testing, demos, performance testing, training, or in any other place where a stream of real-looking data is useful. You probably shouldn't use this data to research machine learning algorithms, and definitely shouldn't use it to understand how real people behave.