In a PowerShell script, I have two data sets that have multiple columns. Not all of these columns are shared.
For example, data set 1:
A B XY ZY
- - -- --
I agree with @Matt. Use a hashtable -- something like the below. This should run in m + 2n rather than mn time.
Timings on my system
Original solution above:
#10 TotalSeconds : 0.07788
#100 TotalSeconds : 0.37937
#1000 TotalSeconds : 5.25092
#10000 TotalSeconds : 242.82018
#20000 TotalSeconds : 906.01584
This definitely looks O(n^2).
Solution below:
#10 TotalSeconds : 0.094
#100 TotalSeconds : 0.425
#1000 TotalSeconds : 3.757
#10000 TotalSeconds : 45.652
#20000 TotalSeconds : 92.918
This looks linear.
Solution

I used three techniques to increase the speed:

- Index the second dataset in a hashtable keyed on the join properties, so each lookup is O(1) instead of a scan.
- Remove every matched entry from the hashtable, so that what remains afterwards is exactly the rows that occur only in the second dataset.
- Collect the results in an ArrayList instead of growing an array with +=.
function Get-Hash {
    param(
        [Parameter(Mandatory = $true)]
        [object]$InputObject,
        [Parameter()]
        [string[]]$Properties
    )
    # build a composite key from the requested property values
    # (use .Add() rather than += so the ArrayList is not copied to a new array each time)
    $arr = [System.Collections.ArrayList]::new()
    foreach ($p in $Properties) { [void]$arr.Add($InputObject.$p) }
    return ($arr -join ':')
}
function Merge-Objects {
    param(
        [Parameter(Mandatory = $true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory = $true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )
    $results = [System.Collections.ArrayList]::new()
    # property lists of each dataset, and the properties unique to each side
    $ds1props = $Dataset1 | Get-Member -MemberType Properties
    $ds2props = $Dataset2 | Get-Member -MemberType Properties
    $ds1propsNotInDs2Props = $ds1props | Where-Object { $_.Name -notin $ds2props.Name }
    $ds2propsNotInDs1Props = $ds2props | Where-Object { $_.Name -notin $ds1props.Name }

    # index dataset2 by its key for O(1) lookups
    $hash = @{}
    $Dataset2 | ForEach-Object { $hash.Add((Get-Hash $_ $Properties), $_) }

    foreach ($row in $Dataset1) {
        $key = Get-Hash $row $Properties
        $tempObject = $row.PSObject.Copy()
        if ($hash.ContainsKey($key)) {
            # matched: copy over the dataset2-only properties, then drop the
            # entry so only unmatched dataset2 rows remain in the hashtable
            $r2 = $hash[$key]
            $hash.Remove($key)
            $ds2propsNotInDs1Props | ForEach-Object {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $r2.$($_.Name)
            }
        } else {
            # unmatched: pad the dataset2-only properties with $null
            $ds2propsNotInDs1Props | ForEach-Object {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
        }
        [void]$results.Add($tempObject)
    }

    # whatever is left in the hashtable exists only in dataset2: extend with $null
    foreach ($row in $hash.Values) {
        $tempObject = $row.PSObject.Copy()
        $ds1propsNotInDs2Props | ForEach-Object {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
        [void]$results.Add($tempObject)
    }
    $results
}
########
$dsLength = 10000
$dataset1 = 0..$dsLength | ForEach-Object {
    New-Object psobject -Property @{ A = $_; B = "val$_"; XY = "foo$_"; ZY = "bar$_" }
}
$dataset2 = ($dsLength / 2)..($dsLength * 1.5) | ForEach-Object {
    New-Object psobject -Property @{ A = $_; B = "val$_"; ABC = "foo$_"; GH = "bar$_" }
}
Measure-Command -Expression {
    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A", "B"
}
I have had a lot of doubts about incorporating a binary search (a hash table) into my Join-Object cmdlet (see also: In PowerShell, what's the best way to join two tables into one?), as there are a few issues to overcome that are conveniently left out of the example in the question.
Unfortunately, I can't compete with the performance of @mhhollomon's solution:
dsLength Steve1 Steve2 mhhollomon Join-Object
-------- ------ ------ ---------- -----------
10 19 129 21 50
100 145 915 158 329
1000 2936 9646 1575 3355
5000 56129 69558 5814 12653
10000 183813 95472 14740 25730
20000 761450 265061 36822 80644
But I think that I can add some value:

Hash keys are strings, which means that you need to cast the related properties to strings, which is a little questionable, simply because:

$Left -eq $Right ≠ "$Left" -eq "$Right"

In most cases this will work, especially when the source is a .csv file, but it might go wrong, e.g. when the data comes from a cmdlet where $Null means something other than an empty string (''). Therefore, I recommend explicitly defining $Null keys, e.g. with a control character.

And as property values could easily contain a colon (:), I would also recommend using a control character for separating (joining) multiple keys.
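To make both points concrete, here is a minimal sketch (my own names, not part of either answer's code) of a key builder that uses control characters both as the separator and as a $Null sentinel:

```powershell
# Sketch: a composite-key builder where $Null and '' produce different keys,
# and values containing colons cannot collide with the separator.
$Separator = [char]1   # separator unlikely to appear in real data
$NullToken = [char]0   # distinguishes $Null from an empty string

function Get-JoinKey {
    param(
        [Parameter(Mandatory = $true)][object]$InputObject,
        [Parameter(Mandatory = $true)][string[]]$Properties
    )
    $parts = foreach ($p in $Properties) {
        $value = $InputObject.$p
        if ($null -eq $value) { $NullToken } else { [string]$value }
    }
    $parts -join $Separator
}

# $Null and '' now yield different keys:
$a = [pscustomobject]@{ A = $null; B = 'x' }
$b = [pscustomobject]@{ A = '';    B = 'x' }
(Get-JoinKey $a 'A','B') -eq (Get-JoinKey $b 'A','B')   # False
```

With a plain colon join, both objects above would have collapsed into the same key ':x'.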
There is another pitfall in using a hash table, which actually doesn't have to be an issue: what if the left side ($dataset1) and/or the right side ($dataset2) contains multiple matches? Take e.g. the following data sets:
$dataset1 =
ConvertFrom-SourceTable '
A B XY ZY
- - -- --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
4 val4 foo4a bar4a
5 val5 foo5 bar5
6 val6 foo6 bar6
'
$dataset2 =
ConvertFrom-SourceTable '
A B ABC GH
- - --- --
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
5 val5 foo5a bar5a
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
'
In this case, I would expect an outcome similar to that of a SQL join, and no "Item has already been added. Key in dictionary" error:
$Dataset1 | FullJoin $dataset2 -On A, B | Format-Table
A B XY ZY ABC GH
- - -- -- --- --
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
4 val4 foo4a bar4a foo4 bar4
5 val5 foo5 bar5 foo5 bar5
5 val5 foo5 bar5 foo5a bar5a
6 val6 foo6 bar6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8
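One way to avoid the "Item has already been added" error is to let each hash entry hold a list of rows rather than a single row, so duplicate keys produce one output row per left/right pair. A small self-contained sketch (not the actual Join-Object implementation, and inner-join only for brevity), using a subset of the data above:

```powershell
# Sample rows with duplicate keys on both sides (4:val4 on the left, 5:val5 on the right)
$dataset1 = @(
    [pscustomobject]@{ A = 4; B = 'val4'; XY = 'foo4';  ZY = 'bar4'  }
    [pscustomobject]@{ A = 4; B = 'val4'; XY = 'foo4a'; ZY = 'bar4a' }
    [pscustomobject]@{ A = 5; B = 'val5'; XY = 'foo5';  ZY = 'bar5'  }
)
$dataset2 = @(
    [pscustomobject]@{ A = 4; B = 'val4'; ABC = 'foo4';  GH = 'bar4'  }
    [pscustomobject]@{ A = 5; B = 'val5'; ABC = 'foo5';  GH = 'bar5'  }
    [pscustomobject]@{ A = 5; B = 'val5'; ABC = 'foo5a'; GH = 'bar5a' }
)

# Group the right-side rows per key: each hashtable value is a List of rows
$rightIndex = @{}
foreach ($row in $dataset2) {
    $key = ($row.A, $row.B) -join ':'
    if (-not $rightIndex.ContainsKey($key)) {
        $rightIndex[$key] = [System.Collections.Generic.List[object]]::new()
    }
    $rightIndex[$key].Add($row)
}

# On a match, combine the left row with every buffered right row
$joined = foreach ($row in $dataset1) {
    $key = ($row.A, $row.B) -join ':'
    if ($rightIndex.ContainsKey($key)) {
        foreach ($match in $rightIndex[$key]) {
            [pscustomobject]@{
                A = $row.A; B = $row.B; XY = $row.XY; ZY = $row.ZY
                ABC = $match.ABC; GH = $match.GH
            }
        }
    }
}
$joined.Count   # 4: two left rows match key 4:val4, two right rows match 5:val5
```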
As you might have figured out, there is no reason to put both sides in a hash table, but you might consider streaming the left side (rather than choking the input). In the example in the question, both datasets are loaded directly into memory, which is hardly a real use case. It is more common that your data comes from somewhere else, e.g. remotely from Active Directory, where you might be able to search for each incoming object in the hash table concurrently, before the next object comes in. The same goes for the following cmdlet: it might directly start processing the output and doesn't have to wait until your cmdlet is finished (note that the data is released from the Join-Object cmdlet as soon as it is ready). In such a case, measuring the performance using Measure-Command requires a completely different approach...
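The streaming idea can be sketched as follows (a simplified illustration with made-up names, not the actual Join-Object implementation: it supports only one match per key and wraps, rather than merges, the matched pair):

```powershell
# Sketch: index the pre-loaded right side once in begin{}, then handle each
# left-side object in process{} as it arrives from the pipeline.
function Join-Streamed {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [object]$InputObject,
        [Parameter(Mandatory = $true)]
        [object[]]$Right,
        [Parameter(Mandatory = $true)]
        [string[]]$On
    )
    begin {
        # runs once, before any pipeline input arrives
        $index = @{}
        foreach ($row in $Right) {
            $key = @(foreach ($p in $On) { [string]$row.$p }) -join ':'
            $index[$key] = $row
        }
    }
    process {
        # runs per incoming object; the result is emitted immediately, so
        # downstream cmdlets can start working before the input is exhausted
        $key = @(foreach ($p in $On) { [string]$InputObject.$p }) -join ':'
        if ($index.ContainsKey($key)) {
            [pscustomobject]@{ Left = $InputObject; Right = $index[$key] }
        }
    }
}

# usage: $dataset1 | Join-Streamed -Right $dataset2 -On A, B
```

Because process{} emits each result as soon as it is produced, memory stays bounded by the right-side index, not by the size of the left-side stream.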
See also: Computer Programming: Is the PowerShell pipeline sequential mode more memory efficient? Why or why not?