I've set up an AWS instance with Cassandra on it and also set up an auto scaling group to spin up another 4-8 instances depending on alarms. But how does Cassandra know w
Another issue with auto scaling is that there is no instant gratification. You cannot really see the benefit of the new node until the cluster rebalances, and this could take a long time depending on your cluster.
While the rebalance is in progress, the original nodes have to stream data to the new node, so you end up putting additional load on them, which defeats the purpose of adding capacity.
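To get a feel for how long "long" can be, here is a rough back-of-envelope sketch. It assumes data ends up evenly spread across nodes and that the new node receives data at one steady rate. Both are simplifications, and the function name and all the numbers are made up for illustration.

```python
def estimate_bootstrap_hours(total_data_gb, current_nodes, stream_mb_per_s):
    """Rough time for one new node to receive its share of the data.

    Assumptions (not from Cassandra itself): data is evenly distributed,
    so the new node ends up holding total / (current_nodes + 1), and it
    receives that data at a constant aggregate rate.
    """
    if current_nodes < 1 or stream_mb_per_s <= 0:
        raise ValueError("need at least one node and a positive stream rate")
    share_gb = total_data_gb / (current_nodes + 1)
    seconds = share_gb * 1024 / stream_mb_per_s  # GB -> MB, then MB / (MB/s)
    return seconds / 3600


# Hypothetical cluster: 5000 GB on disk, 4 nodes, 25 MB/s effective streaming.
hours = estimate_bootstrap_hours(5000, 4, 25)
print(f"roughly {hours:.1f} hours before the new node is fully loaded")
```

With figures like these, the new node would spend hours joining, which is far slower than the load spike that triggered the scaling alarm is likely to last.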
The best option for auto-discovery in Cassandra is seed nodes: 'anchor' nodes that are supposed to always be there when a new node shows up, and that can be queried for the cluster's node list whenever it is needed.
So you ship every node with a list of seed nodes in its config file (including the seeds themselves), and once a node comes up, it gets the node list from a seed. This, of course, requires the seed nodes to be static and always running (and, for redundancy, you must have more than one seed node). The seeds should be listed by IP address as well (to avoid problems with DNS).
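As a minimal sketch of what "ship every node with a list of seeds" could look like, the helper below renders the `seed_provider` section of `cassandra.yaml` from a list of seed IPs. You could run it from a launch script before starting Cassandra. The function name and the example addresses are hypothetical, and the two-seed minimum is my own check reflecting the redundancy advice above, not something Cassandra enforces.

```python
import ipaddress


def render_seed_provider(seed_ips):
    """Return the seed_provider section of cassandra.yaml as text."""
    if len(seed_ips) < 2:
        # Our own rule, following the "more than one seed" advice.
        raise ValueError("list at least two seed nodes for redundancy")
    for ip in seed_ips:
        ipaddress.ip_address(ip)  # raises ValueError for hostnames or typos
    seeds = ",".join(seed_ips)
    return (
        "seed_provider:\n"
        "  - class_name: org.apache.cassandra.locator.SimpleSeedProvider\n"
        "    parameters:\n"
        f"      - seeds: \"{seeds}\"\n"
    )


# Hypothetical static seed addresses; every node (seeds included) gets the same list.
print(render_seed_provider(["10.0.0.11", "10.0.0.12"]))
```

The seeds themselves should sit outside the auto scaling group, otherwise the addresses baked into this file stop being valid as soon as the group replaces an instance.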
Nonetheless, I don't think auto-scaling Cassandra would be a good thing. Cassandra partitions its data (rows) across nodes, and every time you add or remove a node, it needs to repartition and redistribute rows, which, depending on how big your data is, takes quite a long time (and may demand other administrative actions, like repairs). Even if you have enough replicas to afford a sudden node loss (which is what WILL occur using auto-scaling), that's messy. First, because Cassandra won't automatically decommission nodes: the cluster will know the node is unavailable, but it just waits for it to come back and tries to keep the cluster as healthy as possible in the meantime (including hinted handoff, a mechanism that saves the writes meant for the unavailable node on other nodes for some period).
So you would need to watch your nodes and manage those ups and downs from outside. And you may not even have time to decommission one node and put everything (your data) back in place before another one comes up and goes down again, and all that could really screw up your cluster.
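To give an idea of what "managing it from outside" means, here is a small sketch that scans the text printed by `nodetool status` for nodes marked down and collects their address and host ID (the ID is what `nodetool removenode` expects). The sample output is made up, and the parsing assumes the usual two-letter status column (`UN`, `DN`, and so on). Treat it as an illustration, not a tested operations tool.

```python
import re

UUID_RE = re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}")


def find_down_nodes(status_output):
    """Return (address, host_id) pairs for nodes whose status starts with D."""
    down = []
    for line in status_output.splitlines():
        fields = line.split()
        # Node rows start with a code like UN or DN; skip headers and rulers.
        if len(fields) < 2 or not re.fullmatch(r"[UD][NLJM]", fields[0]):
            continue
        if fields[0][0] != "D":
            continue
        match = UUID_RE.search(line)
        down.append((fields[1], match.group(0) if match else None))
    return down


# Made-up sample in the shape of `nodetool status` output.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Tokens  Owns   Host ID                               Rack
UN  10.0.0.11  1.2 GiB  256     33.4%  11111111-1111-1111-1111-111111111111  rack1
DN  10.0.0.12  1.1 GiB  256     33.3%  22222222-2222-2222-2222-222222222222  rack1
"""

for address, host_id in find_down_nodes(SAMPLE):
    # Only print the command; deciding when to really remove a node is the hard part.
    print(f"{address} is down; candidate command: nodetool removenode {host_id}")
```

Note that the script only reports. Something (or someone) still has to decide whether the node is gone for good or just restarting, and that is exactly the judgment an auto scaling group won't make for you.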
Well, maybe there are some people out there doing this, but according to my knowledge and experience with Cassandra, it is not simple or magic enough to be auto-scaled the way you would a web application, and you would probably end up losing data and having a very inconsistent and unstable system.