Efficiently merge large object datasets having multiple matching keys

刺人心 2021-01-27 07:29

In a PowerShell script, I have two data sets, each with multiple columns. Not all of these columns are shared.

For example, data set 1:

A B    XY   ZY
- -    --   --
1 val1 foo1 bar1
2 val2 foo2 bar2
...

and data set 2:

A B    ABC  GH
- -    ---  --
3 val3 foo3 bar3
4 val4 foo4 bar4
...

I want to merge both data sets into one, matching rows on the shared key columns A and B and keeping every column, with unmatched rows getting blanks for the other set's columns.
2 Answers
  • 2021-01-27 07:51

    I agree with @Matt. Use a hashtable -- something like the code below. This should run in m + 2n time rather than mn.

    Timings on my system

    Original solution (from the question):

    #10    TotalSeconds      :   0.07788
    #100   TotalSeconds      :   0.37937
    #1000  TotalSeconds      :   5.25092
    #10000 TotalSeconds      : 242.82018
    #20000 TotalSeconds      : 906.01584
    

    This definitely looks like O(n^2).

    Solution below (this answer):

    #10    TotalSeconds      :  0.094
    #100   TotalSeconds      :  0.425
    #1000  TotalSeconds      :  3.757
    #10000 TotalSeconds      : 45.652
    #20000 TotalSeconds      : 92.918
    

    This looks linear.

    Solution

    I used three techniques to increase the speed:

    1. Change over to a hashtable. This allows constant-time lookups, so you don't need nested loops. This is the only change really needed to go from O(n^2) to linear time. It does have the disadvantage of more setup work, so the advantage of linear time won't be seen until the loop count is large enough to pay for that setup.
    2. Use an ArrayList instead of a native array. Adding an item to a native array requires the whole array to be reallocated and every item copied, so building up results with += is also an O(n^2) operation. Since that copying is done at the engine level, the constant is very small, so it won't make a difference until much later (see the small sketch after this list).
    3. Use PSObject.Copy() to create the new object. This is a small optimization compared to the other two, but it cut the run time in half for me.
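
    As a quick illustration of point 2 (an added sketch, not part of the original answer):

    # Micro-benchmark: += on a native array vs. ArrayList.Add()
    $n = 20000

    Measure-Command {
        $native = @()
        foreach ($i in 1..$n) { $native += $i }        # reallocates and copies the array on every append: O(n^2)
    }

    Measure-Command {
        $list = [System.Collections.ArrayList]::new()
        foreach ($i in 1..$n) { [void]$list.Add($i) }  # amortized O(1) per append
    }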

    --

    function Get-Hash{
        param(
            [Parameter(Mandatory=$true)]
            [object]$InputObject,
            [Parameter()]
            [string[]]$Properties    
        )
    
        $arr = [System.Collections.ArrayList]::new()
    
        foreach($p in $Properties) { [void]$arr.Add($InputObject.$p) }  # .Add keeps the ArrayList; += would quietly rebuild it as a native array each time
    
        return ( $arr -join ':' )
    }
    
    function Merge-Objects{
        param(
            [Parameter(Mandatory=$true)]
            [object[]]$Dataset1,
            [Parameter(Mandatory=$true)]
            [object[]]$Dataset2,
            [Parameter()]
            [string[]]$Properties
        )
    
        $results = [System.Collections.ArrayList]::new()
    
        # Work out which properties exist on only one side (these get $null-padded on the other)
        $ds1props = $Dataset1 | gm -MemberType Properties
        $ds2props = $Dataset2 | gm -MemberType Properties
        $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
        $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
    
    
        # Index dataset2 by composite key for constant-time lookups
        $hash = @{}
        $Dataset2 | % { $hash.Add( (Get-Hash $_ $Properties), $_) }   # note: .Add throws if dataset2 has duplicate keys
    
        foreach ($row in $dataset1) {
    
            $key = Get-Hash $row $Properties
    
            $tempObject = $row.PSObject.Copy()
    
            if ($hash.containskey($key)) {
                $r2 = $hash[$key]
    
                $hash.remove($key)
                $ds2propsNotInDs1Props | % {
                    $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $r2.$($_.Name)
                }
    
            } else {
                $ds2propsNotInDs1Props | % {
                    $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
                }
            }
            [void]$results.Add($tempObject)
        }
    
        foreach ($row in $hash.values ) {
            # add missing dataset2 objects and extend
            $tempObject = $row.PSObject.Copy()
    
            $ds1propsNotInDs2Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
    
            [void]$results.Add($tempObject)
        }
    
        $results
    }
    
    ########
    
    $dsLength = 10000
    $dataset1 = 0..$dsLength | %{
        New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
    }
    $dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
        New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
    }
    
    Measure-Command -Expression {
    
        $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 
    
    }
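
    To spot-check the merged result afterwards (an added sketch; with these inputs a full merge yields $dsLength * 1.5 + 1 rows):

    $data.Count                                               # 15001 for $dsLength = 10000
    $data | Select-Object -First 3 | Format-Table A, B, XY, ZY, ABC, GH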
    
  • 2021-01-27 07:54

    I have had a lot of doubts about incorporating a binary search (a hash table) into my Join-Object cmdlet (see also: In Powershell, what's the best way to join two tables into one?), because there are a few issues to overcome that are conveniently left out of the example in this question.

    Unfortunately, I can't compete with the performance of @mhhollomon's solution:

    dsLength Steve1 Steve2 mhhollomon Join-Object
    -------- ------ ------ ---------- -----------
          10     19    129         21          50
         100    145    915        158         329
        1000   2936   9646       1575        3355
        5000  56129  69558       5814       12653
       10000 183813  95472      14740       25730
       20000 761450 265061      36822       80644
    

    But I think that I can add some value:

    Not right

    Hash keys are strings, which means that you need to cast the related properties to strings, which is a little questionable, simply because:

    $Left -eq $Right ≠ "$Left" -eq "$Right"
    

    In most cases it will work, especially when the source is a .csv file, but it might go wrong, e.g. when the data comes from a cmdlet where $Null means something other than an empty string (''): $Null -eq '' is $False, but "$Null" -eq '' is $True. Therefore, I recommend explicitly representing $Null keys, e.g. with a control character.
    And as property values could easily contain a colon (:), I would also recommend using a control character for separating (joining) multiple keys, as in the sketch below.
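
    A minimal sketch of such a key builder (the function name and character choices are illustrative, not taken from Join-Object):

    function Get-JoinKey {
        param(
            [Parameter(Mandatory=$true)]
            [object]$InputObject,
            [Parameter(Mandatory=$true)]
            [string[]]$Properties
        )
        $Separator = [char]27   # ESC - very unlikely to occur inside real property values
        $NullToken = [char]26   # SUB - stands in for $Null, so $Null and '' produce different keys
        $parts = foreach ($p in $Properties) {
            if ($null -eq $InputObject.$p) { $NullToken } else { [string]$InputObject.$p }
        }
        $parts -join $Separator
    }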

    Also right

    There is another pitfall in using a hash table, which actually doesn't have to be an issue: what if the left side ($dataset1) and/or the right side ($dataset2) has multiple matches for the same key? Take e.g. the following data sets:

    $dataset1 = ConvertFrom-SourceTable '

        A B    XY    ZY  
        - -    --    --  
        1 val1 foo1  bar1
        2 val2 foo2  bar2
        3 val3 foo3  bar3
        4 val4 foo4  bar4
        4 val4 foo4a bar4a
        5 val5 foo5  bar5
        6 val6 foo6  bar6
    '
    

    $dataset2 = ConvertFrom-SourceTable '

        A B    ABC   GH  
        - -    ---   --  
        3 val3 foo3  bar3
        4 val4 foo4  bar4
        5 val5 foo5  bar5
        5 val5 foo5a bar5a
        6 val6 foo6  bar6
        7 val7 foo7  bar7
        8 val8 foo8  bar8
    '
    

    In this case, I would expect an outcome similar to a SQL full join, and no "Item has already been added. Key in dictionary" error:

    $Dataset1 | FullJoin $dataset2 -On A, B | Format-Table
    
    A B    XY    ZY    ABC   GH
    - -    --    --    ---   --
    1 val1 foo1  bar1
    2 val2 foo2  bar2
    3 val3 foo3  bar3  foo3  bar3
    4 val4 foo4  bar4  foo4  bar4
    4 val4 foo4a bar4a foo4  bar4
    5 val5 foo5  bar5  foo5  bar5
    5 val5 foo5  bar5  foo5a bar5a
    6 val6 foo6  bar6  foo6  bar6
    7 val7             foo7  bar7
    8 val8             foo8  bar8
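
    One way to accommodate duplicate keys with a plain hash table (a sketch of the general idea, not the actual Join-Object implementation) is to let each entry hold a list of rows and emit the cross product of the matches:

    # Index the right side as key -> list-of-rows instead of key -> single row
    $hash = @{}
    foreach ($row in $dataset2) {
        $key = Get-JoinKey $row -Properties A, B   # key builder sketched above
        if (-not $hash.ContainsKey($key)) {
            $hash[$key] = [System.Collections.Generic.List[object]]::new()
        }
        $hash[$key].Add($row)
    }

    # Each left row is then merged once per matching right row
    foreach ($row in $dataset1) {
        $key = Get-JoinKey $row -Properties A, B
        if ($hash.ContainsKey($key)) {
            foreach ($match in $hash[$key]) {
                # ... emit $row extended with $match's extra properties ...
            }
        }
    }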
    

    Only right

    As you might have figured out, there is no reason to put both sides in a hash table; only one side needs to be indexed. Instead, you might consider streaming the left side (rather than choking the input). In the example in the question, both datasets are loaded directly into memory first, which is hardly a realistic use case. It is more common that your data comes from somewhere else, e.g. remotely from Active Directory, where you might be able to look up each incoming object in the hash table concurrently, before the next object comes in. The same goes for the next cmdlet in the pipeline: it might start processing the output directly and doesn't have to wait until your cmdlet is finished (note that data is released from the Join-Object cmdlet as soon as it is ready). In such a case, measuring the performance with Measure-Command requires a completely different approach...
    See also: Computer Programming: Is the PowerShell pipeline sequential mode more memory efficient? Why or why not?
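
    A minimal sketch of that streaming shape (the function is hypothetical and reuses the Get-JoinKey sketch above): only the right side is indexed up front, and each left row is matched and emitted as soon as it arrives from the pipeline.

    function Merge-Streaming {
        param(
            [Parameter(Mandatory=$true, ValueFromPipeline=$true)]
            [object]$InputObject,
            [Parameter(Mandatory=$true)]
            [object[]]$Right,
            [Parameter(Mandatory=$true)]
            [string[]]$On
        )
        begin {
            # Index only the right side before any input arrives
            $hash = @{}
            foreach ($row in $Right) { $hash[(Get-JoinKey $row -Properties $On)] = $row }
        }
        process {
            # Each left row is handled - and released downstream - as it comes in
            $out = $InputObject.PSObject.Copy()
            $match = $hash[(Get-JoinKey $InputObject -Properties $On)]
            # ... extend $out with $match's extra properties here ...
            $out
        }
    }

    # e.g. stream straight from the source instead of buffering it first:
    # Get-ADUser -Filter * | Merge-Streaming -Right $dataset2 -On A, B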
