Can anyone tell me what --split-by and --boundary-query are used for in Sqoop?
sqoop import --connect jdbc:mysql://localhost/my --username user --passw
--split-by : Specifies the column of the table used to generate splits for the import, that is, the column whose values decide how the rows are divided among the parallel map tasks when the data is imported into your cluster. Choosing a good split column can improve import performance by achieving greater parallelism. If --split-by is not given, the primary key of the input table is used to create the splits.
Reason to use : Sometimes the primary key doesn't have an even distribution of values between its min and max values (which are used to create the splits if --split-by is not given). In such a situation you can specify another column whose values are more evenly distributed, so that the splits are balanced and the import is efficient.
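To make the skew problem concrete, here is a rough Python sketch (not Sqoop's actual splitter code) that cuts the range between min and max into equal-width pieces and counts how many rows land in each piece. The function names and the sample ids are made up for illustration.

```python
def split_ranges(lo, hi, num_mappers):
    """Cut [lo, hi] into num_mappers contiguous ranges of roughly equal width.

    Returns (start, end, end_inclusive) tuples; only the last range
    includes its end value.
    """
    points = [lo + (hi - lo) * i // num_mappers for i in range(num_mappers)]
    points.append(hi)
    return [
        (points[i], points[i + 1], i == num_mappers - 1)
        for i in range(num_mappers)
    ]


def rows_per_range(values, ranges):
    """Count how many values fall into each range."""
    counts = []
    for start, end, inclusive in ranges:
        if inclusive:
            counts.append(sum(1 for v in values if start <= v <= end))
        else:
            counts.append(sum(1 for v in values if start <= v < end))
    return counts


# Evenly spread key: 100 ids from 1 to 100.
even_ids = list(range(1, 101))
# Skewed key: 99 ids from 1 to 99 plus a single outlier at 1000.
skewed_ids = list(range(1, 100)) + [1000]

print(rows_per_range(even_ids, split_ranges(1, 100, 4)))     # [24, 25, 25, 26]
print(rows_per_range(skewed_ids, split_ranges(1, 1000, 4)))  # [99, 0, 0, 1]
```

With the skewed key, one mapper ends up doing almost all the work while two of them get nothing, which is the case where picking a different --split-by column pays off.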
--boundary-query : By default Sqoop uses the query select min(<split-by>), max(<split-by>) from <table name> to find the boundaries for creating splits. In some cases this query is not optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
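Reason to use : If the import is still not performing well with a suitable --split-by column, for example because the default min/max query is slow, you can use --boundary-query to supply the boundaries yourself and improve performance further.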
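As a small illustration of what "any query returning two numeric columns" can look like, here is a Python sketch using an in-memory sqlite3 database as a stand-in for the real source database. The orders and import_limits tables are hypothetical, and the boundaries helper is only for this example; it is not part of Sqoop.

```python
import sqlite3

# In-memory stand-in for the source database (sqlite3 here, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 101)],
)
# Hypothetical bookkeeping table that already stores the id range.
conn.execute("CREATE TABLE import_limits (lo INTEGER, hi INTEGER)")
conn.execute("INSERT INTO import_limits VALUES (1, 100)")


def boundaries(query):
    """Run a boundary-style query and return its (min, max) pair."""
    row = conn.execute(query).fetchone()
    if row is None or len(row) != 2:
        raise ValueError("a boundary query must return one row with two columns")
    return row


# The shape of the default query: scan the split column for min and max.
default_style = boundaries("SELECT MIN(id), MAX(id) FROM orders")
# A known lower bound, so only the max has to be looked up.
fixed_lower = boundaries("SELECT 1, MAX(id) FROM orders")
# Boundaries read from a small precomputed table instead of the big one.
precomputed = boundaries("SELECT lo, hi FROM import_limits")

print(default_style, fixed_lower, precomputed)  # (1, 100) (1, 100) (1, 100)
```

All three return the same pair here; the point is only that the second and third forms can be cheaper on a large table. Whatever query you pass, the boundaries should still cover every row you want imported.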
--split-by is used to distribute the rows of the table across the mappers uniformly, i.e. say you have 100 records with unique primary key values and 4 mappers: --split-by (primary key column) will help to distribute your data set evenly among the mappers.
$CONDITIONS is a placeholder used by the Sqoop process; internally Sqoop replaces it with a unique condition expression for each map task to fetch that task's part of the data set. If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS, e.g., one mapper may execute "select bla from foo WHERE (id >= 0 AND id < 10000)", the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)", and so on.
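The substitution can be sketched in a few lines of Python. This is a simplified model under my own assumptions (integer split column, equal-width ranges, last range inclusive), not Sqoop's real implementation, and mapper_queries is a made-up name.

```python
def mapper_queries(query, column, lo, hi, num_mappers):
    """Return one query per mapper with $CONDITIONS replaced by a range test."""
    points = [lo + (hi - lo) * i // num_mappers for i in range(num_mappers)]
    points.append(hi)
    queries = []
    for i in range(num_mappers):
        # Only the last range includes its upper bound, so no row is read twice.
        upper_op = "<=" if i == num_mappers - 1 else "<"
        condition = (
            f"({column} >= {points[i]} AND {column} {upper_op} {points[i + 1]})"
        )
        queries.append(query.replace("$CONDITIONS", condition))
    return queries


for q in mapper_queries("select bla from foo WHERE $CONDITIONS", "id", 0, 30000, 3):
    print(q)
# select bla from foo WHERE (id >= 0 AND id < 10000)
# select bla from foo WHERE (id >= 10000 AND id < 20000)
# select bla from foo WHERE (id >= 20000 AND id <= 30000)
```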
Sqoop allows you to import data in parallel, and --split-by and --boundary-query give you more control over how that is done. If you're just importing a table then it'll use the PRIMARY KEY; however, if you're doing a more advanced query, you'll need to specify the column to do the parallel split on.
For example:
sqoop import \
--connect 'jdbc:mysql://.../...' \
--direct \
--username uname --password pword \
--hive-import \
--hive-table query_import \
--boundary-query 'SELECT 0, MAX(id) FROM a' \
--query 'SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND $CONDITIONS' \
--num-mappers 3 \
--split-by a.id \
--target-dir /data/import \
--verbose
--boundary-query lets you specify an optimized query to get the min and max. Otherwise Sqoop will attempt to run MIN(a.id), MAX(a.id) on your --query statement.
The result (if min=0, max=30) is 3 queries, roughly equivalent to the following, that get run in parallel:
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 0 AND 10;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 11 AND 20;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 21 AND 30;
Also, if we specify the --query value within double quotes (" "), we need to precede $CONDITIONS with a backslash (\) so that the shell does not expand it as a shell variable:
--query "select * from table where id=5 AND \$CONDITIONS"
or else use single quotes:
--query 'select * from table where id=5 AND $CONDITIONS'
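If you want to see what the shell actually does with each quoting style, here is a small Python sketch that hands the three variants to /bin/sh and prints what comes out. It assumes a POSIX shell at /bin/sh and that no CONDITIONS environment variable is set (the helper removes it to be safe); shell_expand is a made-up helper name.

```python
import os
import subprocess


def shell_expand(quoted_arg):
    """Return the argument as /bin/sh would pass it on after expansion."""
    # Drop CONDITIONS from the environment so the result is predictable.
    env = {k: v for k, v in os.environ.items() if k != "CONDITIONS"}
    result = subprocess.run(
        ["/bin/sh", "-c", "printf '%s' " + quoted_arg],
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return result.stdout


# Double quotes, no backslash: the shell expands $CONDITIONS to nothing.
print(repr(shell_expand('"id=5 AND $CONDITIONS"')))    # 'id=5 AND '
# Double quotes with a backslash: the literal text survives.
print(repr(shell_expand('"id=5 AND \\$CONDITIONS"')))  # 'id=5 AND $CONDITIONS'
# Single quotes: nothing is expanded.
print(repr(shell_expand("'id=5 AND $CONDITIONS'")))    # 'id=5 AND $CONDITIONS'
```

In the first case the placeholder never reaches Sqoop, which is why the backslash or the single quotes are needed.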
--split-by, in short: used to partition the data so that it can be imported in parallel, which improves performance.