问题
I am new to Cassandra was going though this Article explaining sharding and replication and I am stuck at a point that is -
I have a cluster with 6 Cassandra nodes configured at my local machine. I create a new keyspace "TestKeySpace" with replication factor as 6 and a table in keyspace "employee" and primary key is auto-increment-number named RID. I am not able to understand how this data will be partitioned and replicated. What I want to know is since I am keeping my replication factor to be 6, and data will be distributed on multiple nodes, then will each node will be having exactly same data as the other nodes or not?
What If my cluster has following configuration -
Number of nodes - 6 (n1, n2 ,n3, n4, n5 and n6).
replication_factor - 3.
How can I determine that for any one node (let say n1), on which other two nodes the data is replicated and which other nodes are behaving as different shards.
Thanks in Advance.
Regards, Vibhav
PS - If anybody down votes this question kindly do mention in comments what went wrong.
回答1:
I will explain this with simple example. A keyspace in cassandra is equivalent to database schema name in RDBMS.
First create a keyspace -
CREATE KEYSPACE MYKEYSPACE WITH REPLICATION = {
'class' : 'SimpleStrategy',
'replication_factor' : 3
};
Lets create a simple table -
CREATE TABLE USER_BY_USERID(
userid int,
name text,
email text,
PRIMARY KEY(userid, name)
) WITH CLUSTERING ORDER BY(name DESC);
In this example, userid
is your partition key and name is clustering key. Partition is also called row key, this key determines on which node row will be saved.
Your first question -
I am not able to understand how this data will be partitioned?
Data will be partitioned based on your partition key. By default C* uses Murmur3partitioner
. You can change the partitioner in cassandra.yaml configuration file. How partitions happens, is also depends on your configuration. You can specify range of tokens for each node, for example take a look at below cassandra.yaml configuration file. I have specified 6 node form your question.
cassandra.yaml for Node 0:
cluster_name: 'MyCluster'
initial_token: 0
seed_provider:
- seeds: "198.211.xxx.0"
listen_address: 198.211.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
cassandra.yaml for Node 1:
cluster_name: 'MyCluster'
initial_token: 3074457345618258602
seed_provider:
- seeds: "198.211.xxx.0"
listen_address: 192.241.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
cassandra.yaml for Node 2:
cluster_name: 'MyCluster'
initial_token: 6148914691236517205
seed_provider:
- seeds: "198.211.xxx.0"
listen_address: 37.139.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
.......Node3 ...... Node4 ....
cassandra.yaml for Node 5:
cluster_name: 'MyCluster'
initial_token: {some large number}
seed_provider:
- seeds: "198.211.xxx.0"
listen_address: 37.139.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
lets take this insert statement -
INSERT INTO USER_BY_USERID VALUES(
1,
"Darth Veder",
"darthveder@star-wars.com"
);
Partitioner will calculate the hash of the PARTITION key (in above example userid - 1), and decides which node this row will be saved. Lets say calculated hash is something 12345, this row will be saved at Node 0 (look for the initial_token value for Node0 in above configuration).
Complete cassandra.yaml configuration configCassandra_yaml_r
You can go through this deployCalcTokens to know how to generate tokens.
Second question -
how data gets replicated?
Depending on your replication strategy and replication factor, the data gets replicated on each node. you have to specify Replication factor and replication strategy while creating keyspace.
For example, in above example, I have used SimpleStrategy
as replication strategy. This strategy is suitable for small cluster. For geologically distributed application you can use NetworkTopologyStrategy
. replication_factor specifies, how many copies of a row to be created, in this example three copies of each row will be created. With simple strategy, cassandra will use clockwise direction to copy the row.
In above example, the row is saved at Node0 and the same node gets copied on Node1 and Node2. Let's take another example -
INSERT INTO USER_BY_USERID VALUES(
448454,
"Obi wan kenobi",
"obiwankenobi@star-wars.com"
);
For user id 448454, the calculated hash is say 3074457345618258609, so this row will be save at Node2 (look for the initial_token value for node 2 in above configuration) and also get copied in clockwise direction to Node3 and Node4 (remember we have specified replication factor of 3, so only three copies Noe2, Node3, Node4).
Hope this helps.
来源:https://stackoverflow.com/questions/41587806/cassandra-sharding-and-replication