Question
I'm trying to import a large set of data from one DB to another (MSSQL to MySQL). The transformation does this: get a subset of the data, check whether each row is an update or an insert by comparing hashes, map the data, and insert it into the MySQL DB with an API call. For the moment the subset part is strictly manual. Is there a way to set up Pentaho to do it for me, some kind of iteration? The query I'm using to get the subset is:
select t1.*
from (
    select *, ROW_NUMBER() over (order by id) as RowNum
    from mytable
) t1
where RowNum between @offset and @offset + @limit;
Is there a way for PDI to set the offset and reiterate the whole process?
Thanks
Answer 1:
You can (despite the warnings) create a loop in a parent job, incrementing the offset variable in a Javascript step on each iteration. I've used such a setup to consume web services with an unknown number of results, shifting the offset each time after getting a full page and stopping when I got fewer results.
Setting up the variables
In the job properties, define the parameters Offset and Limit, so you can (re)start at any offset, or even invoke the job from the command line with a specific offset and limit. It could be done with a Set Variables step too, but parameters do all the same things, plus you can set defaults for testing.
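For instance, a command-line run with Kitchen might look like this (the job file path is hypothetical; -param passes named parameters to the job):

kitchen.sh -file=/path/to/main_job.kjb -param:Offset=0 -param:Limit=10000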
Processing in the transformation
The main transformation(s) should have "pass parameter values to subtransformation" enabled, as it is by default.
Inside the transformation, you start with a Table Input step that uses variable substitution, putting ${Offset} and ${Limit} where you had @offset and @limit.
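With the query from the question, the Table Input SQL would then look something like this (the "Replace variables in script" option must be checked):

select t1.*
from (
    select *, ROW_NUMBER() over (order by id) as RowNum
    from mytable
) t1
where RowNum between ${Offset} and ${Offset} + ${Limit};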
The stream from the Table Input then goes on to the actual processing, but is also copied to a Group By step to count the rows. Leave the group field empty and add an aggregate field that counts all rows. Check the box to always give back a result row, so the count (0) still flows downstream when a batch is empty.
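A minimal sketch of the Group By settings, assuming the count field is named NumRows (dialog labels from memory, so treat them as approximate):

The fields that make up the group: (leave empty)
Aggregates: Name = NumRows, Subject = id (any field that is never null), Type = Number of Values (N)
[x] Always give back a result row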
Send the stream from Group By to a Set Variables step and set the NumRows variable in the scope of the parent job.
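The Set Variables step might then be configured like this (scope name as it appears in the step dialog):

Field name: NumRows, Variable name: NumRows, Scope: Valid in the parent job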
Looping back
In the main job, go from the transformation to a Simple Evaluation step to compare the NumRows variable to the Limit. If NumRows is smaller than ${Limit}, you've reached the last batch: success!
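A sketch of what the Simple Evaluation settings might look like (field labels as in the step dialog, to the best of my recollection):

Evaluate: Variable
Variable name: ${NumRows}
Type: Number
Success condition: If value is smaller than
Value: ${Limit}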
If not, proceed to a Javascript step that increments the Offset, like this:
// read the current values; radix 10 avoids octal/auto-detect parsing quirks
var offset = parseInt(parent_job.getVariable("Offset"), 10);
var limit = parseInt(parent_job.getVariable("Limit"), 10);
// advance to the next batch
offset = offset + limit;
parent_job.setVariable("Offset", offset);
// the last statement must evaluate to true so the step reports success
true;
The job flow then proceeds through the Dummy step and runs the transformation again with the new offset value.
Notes
- Unlike a transformation, you can set and use a variable within the same job.
- The JS step needs "true;" as the last statement so it reports success to the job.
Source: https://stackoverflow.com/questions/58616643/pentaho-data-integration-import-large-dataset-from-db