I have a distributed system running on AWS EC2 instances. My cluster has around 2000 nodes. I want to introduce a stream processing model that can process the metadata periodically published by each node (CPU usage, memory usage, I/O, etc.). My system only cares about the latest data, and it is OK with missing a couple of data points while the processing model is down. So I picked Hazelcast Jet, an in-memory processing framework with great performance. I have a couple of questions about it:
- What is the best way to deploy Hazelcast Jet to multiple EC2 instances?
- How to ingest data from thousands of sources? The sources push data instead of being pulled.
- How to configure the client so that it knows where to submit jobs?
It would be super useful if there were a comprehensive example I could learn from.
What is the best way to deploy Hazelcast Jet to multiple EC2 instances?
1. Download and unzip the Hazelcast Jet distribution on each machine:

   $ wget https://download.hazelcast.com/jet/hazelcast-jet-3.1.zip
   $ unzip hazelcast-jet-3.1.zip
   $ cd hazelcast-jet-3.1

2. Go to the lib directory of the unzipped distribution and download the hazelcast-aws module:

   $ cd lib
   $ wget https://repo1.maven.org/maven2/com/hazelcast/hazelcast-aws/2.4/hazelcast-aws-2.4.jar

3. Edit bin/common.sh to add the module to the classpath. Towards the end of the file is the line

   CLASSPATH="$JET_HOME/lib/hazelcast-jet-3.1.jar:$CLASSPATH"

   Duplicate this line and replace -jet-3.1 with -aws-2.4.

4. Edit config/hazelcast.xml to enable AWS cluster discovery (the details are in the hazelcast-aws documentation). In this step you'll have to deal with IAM roles, EC2 security groups, regions, etc. There's also a best practices guide for AWS deployment. A programmatic sketch of this configuration follows the list.

5. Start the cluster with jet-start.sh.
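For reference, here is a minimal sketch of the equivalent programmatic member configuration. It assumes the IMDG 3.12 API that Jet 3.1 embeds, where AwsConfig takes generic key/value properties; the region, security group, and tag values are placeholders you'd replace with your own:

import com.hazelcast.config.AwsConfig;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.config.JetConfig;

public class StartJetMember {
    public static void main(String[] args) {
        JetConfig jetConfig = new JetConfig();
        JoinConfig join = jetConfig.getHazelcastConfig().getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false); // multicast is not available on EC2

        // Let members find each other through EC2 tags (placeholder values)
        AwsConfig awsConfig = join.getAwsConfig();
        awsConfig.setEnabled(true);
        awsConfig.setProperty("region", "us-east-1");
        awsConfig.setProperty("security-group-name", "jet-sg");
        awsConfig.setProperty("tag-key", "Cluster");
        awsConfig.setProperty("tag-value", "jet");

        JetInstance jet = Jet.newJetInstance(jetConfig);
    }
}

When you start members with jet-start.sh you'd express the same settings in config/hazelcast.xml instead; the programmatic form is mainly useful if you embed Jet in your own process.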
How to configure the client so that it knows where to submit jobs?
A straightforward approach is to specify the public IPs of the machines where Jet is running, for example:
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName("jet"); // default group name of a Jet cluster
clientConfig.getNetworkConfig().addAddress("54.224.63.209", "34.239.139.244");
JetInstance jet = Jet.newJetClient(clientConfig);
However, depending on your AWS setup, these addresses may not be stable, so you can configure the client to discover the members through the same AWS mechanism instead; see the sketch below.
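Here is a minimal sketch of that client-side discovery setup. The strategy class com.hazelcast.aws.AwsDiscoveryStrategy ships with the hazelcast-aws module; the region and tag values are placeholders, and use-public-ip assumes the client runs outside the VPC:

import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.DiscoveryStrategyConfig;
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;

public class JetClientWithDiscovery {
    public static void main(String[] args) {
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.getGroupConfig().setName("jet");
        // Enable the discovery SPI on the client
        clientConfig.setProperty("hazelcast.discovery.enabled", "true");

        // Discover members via the hazelcast-aws plugin instead of fixed IPs
        DiscoveryStrategyConfig aws =
                new DiscoveryStrategyConfig("com.hazelcast.aws.AwsDiscoveryStrategy");
        aws.addProperty("region", "us-east-1");   // placeholder
        aws.addProperty("tag-key", "Cluster");    // placeholder
        aws.addProperty("tag-value", "jet");      // placeholder
        aws.addProperty("use-public-ip", "true"); // client outside the VPC
        clientConfig.getNetworkConfig().getDiscoveryConfig()
                    .addDiscoveryStrategyConfig(aws);

        JetInstance jet = Jet.newJetClient(clientConfig);
        System.out.println("Connected; cluster size = "
                + jet.getHazelcastInstance().getCluster().getMembers().size());
    }
}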
How to ingest data from thousands of sources? The sources push data instead of being pulled.
I think your best option for this is to put the data into a Hazelcast IMap and use a mapJournal source to get the update events from it. Since you only care about the latest data, have each node put its metrics under its own key: the map then holds just the latest snapshot per node, while the event journal streams every update to Jet. Note that the event journal must be enabled for the map in the member configuration.
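A minimal sketch of that approach with the Jet 3.1 Pipeline API; the map name "metrics", the node key, and the JSON payload are made-up placeholders, and the map's event journal is assumed to be enabled in config/hazelcast.xml:

import com.hazelcast.core.IMap;
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.JournalInitialPosition;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class MetricsStream {
    public static void main(String[] args) {
        JetInstance jet = Jet.newJetClient();

        // Producer side: each EC2 node puts its latest metrics under its own
        // key. A put overwrites the previous value, so the map holds only the
        // latest snapshot per node, while the event journal still records
        // every update for the streaming source.
        IMap<String, String> metrics = jet.getMap("metrics");
        metrics.put("node-0042", "{\"cpu\":0.71,\"mem\":0.58}");

        // Consumer side: stream the update events out of the map's journal.
        Pipeline p = Pipeline.create();
        p.drawFrom(Sources.<String, String>mapJournal("metrics",
                JournalInitialPosition.START_FROM_CURRENT))
         .withoutTimestamps()
         .drainTo(Sinks.logger());

        jet.newJob(p).join();
    }
}

START_FROM_CURRENT also fits your tolerance for losing a few data points: after a restart the job simply resumes from the newest events rather than replaying history.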
Source: https://stackoverflow.com/questions/58093358/hazelcast-jet-deployment-and-data-ingestion