I have been using Pig with my Cassandra data to do all kinds of amazing feats of groupings that would be almost impossible to write imperatively. I am using DataStax's int
Setting pig.noSplitCombination = true takes me to the other extreme: with this flag I started getting 769 map tasks.
You can set cassandra.input.split.size to something less than 64k, which is the default split size, so that you get more splits. How many rows per node are there for the CQL table? Can you post your table schema?
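For illustration, here is a minimal sketch of setting that property from inside the Pig script. It assumes that Pig's SET passes the key through to the Hadoop job configuration and that no split_size is given in the LOAD URL. The value 10000 and the names ks and cf are placeholders, not values from this thread.

```pig
-- Assumption: request roughly 10,000 rows per split instead of the 64k default.
SET cassandra.input.split.size 10000;

-- 'ks' and 'cf' are placeholder keyspace and table names.
rows = LOAD 'cql://ks/cf' USING CqlStorage();
```

A smaller value should produce more splits and therefore more map tasks, so it is worth adjusting gradually while watching the task count.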
Add split_size to the URL parameters.
For CassandraStorage use the following parameters: cassandra://[username:password@]<keyspace>/<columnfamily>[?slice_start=<start>&slice_end=<end>[&reversed=true][&limit=1][&allow_deletes=true][&widerows=true][&use_secondary=true][&comparator=<comparator>][&split_size=<size>][&partitioner=<partitioner>][&init_address=<address>][&rpc_port=<port>]]
For CqlStorage use the following parameters: cql://[username:password@]<keyspace>/<columnfamily>[?[page_size=<size>][&columns=<col1,col2>][&output_query=<prepared_statement>][&where_clause=<clause>][&split_size=<size>][&partitioner=<partitioner>][&use_secondary=true|false][&init_address=<host>][&rpc_port=<port>]]
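As a rough sketch of what that looks like in a script, split_size goes into the query string of the LOAD location. The keyspace ks, the table cf, and the value 10000 are placeholders chosen for illustration, and the bare storage class names assume a DSE-style setup where they are already registered.

```pig
-- CqlStorage: placeholder keyspace 'ks', table 'cf', and split size.
cql_rows = LOAD 'cql://ks/cf?split_size=10000' USING CqlStorage();

-- CassandraStorage takes the same parameter in its own URL scheme.
thrift_rows = LOAD 'cassandra://ks/cf?split_size=10000' USING CassandraStorage();
```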
You should set pig.noSplitCombination = true. You can do this in one of three places.
When invoking the script:
dse pig -Dpig.noSplitCombination=true /path/to/script.pig
In the Pig script itself:
SET pig.noSplitCombination true;
table = LOAD 'cql://ks/cf' USING CqlStorage();
Or permanently in /etc/dse/pig/pig.properties. Uncomment:
pig.noSplitCombination=true
Otherwise, Pig may combine all of your input splits into a single one, which shows up in the log as "Total input paths (combined) to process : 1".