Using SSIS, How do I find the cities with the largest population?

后端 未结 2 602
野的像风
野的像风 2020-12-12 01:22

I have a dataflow task with information that looks something like this:

Province | City    | Population
-------------------------------
Ontario  | Toronto |          


        
相关标签:
2条回答
  • 2020-12-12 02:10

    Completely agree with @PaulStock that aggregates are best left to source systems. An aggregate in SSIS is a fully blocking component much like a sort and I've already made my argument on that point.

    But there are times when doing those operations in the source system just aren't going to work. The best I've been able to come up with is to basically double process the data. Yes, ick but I was never able to find a way to pass a column through unaffected. For Min/Max scenarios, I'd want that as an option but obviously something like a Sum would make it hard for the component to know what the "source" row it'd tie to.

    2005

    A 2005 implementation would look like this. Your performance is not going to be good, in fact a few orders of magnitude from good as you'll have all these blocking transforms in there in addition to having to reprocess your source data.

    2005 aggregate

    Merge join 2005 Merge Inner Join

    2008

    In 2008, you have the option of using the Cache Connection Manager which would help eliminate the blocking transformations, at least where it matters, but you're still going to have to pay the cost of double processing your source data.

    Drag two data flows onto the canvas. The first will populate the cache connection manager and should be where the aggregate takes place.

    Warm cache connection manager

    Now that the cache has the aggregated data in there, drop a lookup task in your main data flow and perform a lookup against the cache.

    General lookup tab

    General lookup tab

    Select the cache connection manager

    Connection manager lookup tab

    Map the appropriate columns

    Columns lookup tab

    Great success Great success 2008 data flow

    Script task

    The third approach I can think of, 2005 or 2008, is to write it your own self. As a general rule, I try to avoid the script tasks but this is a case where it probably makes sense. You will need to make it an asynchronous script transformation but simply handle your aggregations in there. More code to maintain but you can save yourself the trouble of reprocessing your source data.

    Finally, as a general caveat, I'd investigate what the impact of ties will do to your solution. For this data set, I would expect something like Guelph to suddenly swell and tie Toronto but if it did, what should the package do? Right now, both will result in 2 rows for Ontario but is that the intended behaviour? Script, of course, allows you to define what happens in the case of ties. You could probably stand the 2008 solution on its head by caching the "normal" data and using that as your lookup condition and using the aggregates to pull back just one of the ties. 2005 can probably do the same just by putting the aggregate as the left source for the merge join

    Edits

    Jason Horner had a good idea in his comment. A different approach would be to use a multicast transformation and perform the aggregation in one stream and bring it back together. I couldn't figure out how to make it work with a union all but we could use sorts and merge join much like in the above. This is probably a better approach as it saves us the trouble of reprocessing the source data.

    enter image description here

    0 讨论(0)
  • 2020-12-12 02:11

    Instead of using the Aggregate transformation, could you use a SQL query instead?

    SELECT
        p.province,
        p.city,
        p.[population]
    FROM
        temp_pop P
        JOIN ( SELECT
                province,
                [population] = MAX([POPULATION])
               FROM
                temp_pop
               GROUP BY
                province
             ) AS M ON p.province = M.province AND
                       p.[population] = M.[population]
    
    0 讨论(0)
提交回复
热议问题