Which operator provides quicker output: -match, -contains, or Where-Object for large CSV files?

Submitted by 旧时模样 on 2021-01-29 10:55:28

Question


I am trying to build logic that queries 4 large CSV files against 1 CSV file. Particularly finding an AD object against 4 domains and store them in variable for attribute comparison.

I have imported all the files into separate variables and tried the 3 code samples below to get the desired output, but each takes longer to complete than expected.

CSV import:

$AllMainFile = Import-csv c:\AllData.csv
# The input file contains:
# EmployeeNumber,Name,Domain
# Z001,ABC,Test.com
# Z002,DEF,Test.com
# Z003,GHI,Test1.com
# Z001,ABC,Test2.com


$AAA = Import-csv c:\AAA.csv
# The input file contains:
# EmployeeNumber,Name,Domain
# Z001,ABC,Test.com
# Z002,DEF,Test.com
# Z003,GHI,Test1.com
# Z001,ABC,Test2.com
# Z004,JKL,Test.com

$BBB = Import-Csv C:\BBB.csv
$CCC = Import-Csv C:\CCC.csv
$DDD = Import-Csv c:\DDD.csv

Sample code 1:

foreach ($x in $AllMainFile) {
    $AAAoutput += $AAA | ? {$_.employeeNumber -eq $x.employeeNumber}
    $BBBoutput += $BBB | ? {$_.employeeNumber -eq $x.employeeNumber}
    $CCCoutput += $CCC | ? {$_.employeeNumber -eq $x.employeeNumber}
    $DDDoutput += $DDD | ? {$_.employeeNumber -eq $x.employeeNumber}

    if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
        #### My Other script execution code here
    } else {
        #### My Other script execution code here
    }
}

Sample code 2 (just replacing with -match instead of Where-Object):

foreach ($x in $AllMainFile) {
    $AAAoutput += $AAA -match $x.EmployeeNumber
    $BBBoutput += $BBB -match $x.EmployeeNumber
    $CCCoutput += $CCC -match $x.EmployeeNumber
    $DDDoutput += $AllMainFile -match $x.EmployeeNumber

    if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
        #### My Other script execution code here
    } else {
        #### My Other script execution code here
    }
}

Sample code 3 (just replacing with -contains operator):

foreach ($x in $AllMainFile) {
    foreach ($c in $AAA){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$AAAoutput += $c}}
    foreach ($c in $BBB){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$BBBoutput += $c}}
    foreach ($c in $CCC){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$CCCoutput += $c}}
    foreach ($c in $DDD){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$DDDoutput += $c}}

    if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
        #### My Other script execution code here
    } else {
        #### My Other script execution code here
    }
}

I expect the script to run as fast as possible when comparing and looking up all 4 CSV files against the 1 input file. Each file contains more than 1000k objects/rows with 5 columns.


Answer 1:


Performance

Before answering the question, I would like to clear the air about measuring the performance of PowerShell cmdlets. Native PowerShell is very good at streaming objects and can therefore save a lot of memory if streamed correctly (do not assign a stream to a variable or wrap it in parentheses). PowerShell is also capable of invoking almost any existing .NET method (like Add()) and technologies like LINQ.

The usual way of measuring the performance of a command is:

(Measure-Command {<myCommand>}).TotalMilliseconds

If you use this on native PowerShell streaming cmdlets, they appear to perform poorly compared with statements and .NET commands. It is often concluded that, for example, LINQ outperforms native PowerShell commands by well over a factor of a hundred. The reason is that LINQ is reactive and uses deferred (lazy) execution: it reports that the job is done but actually does the work at the moment you need a result (it also caches a lot of results, which is easiest to exclude from a benchmark by starting a new session). Native PowerShell, by contrast, is rather proactive: it passes each resolved item immediately back into the pipeline, and the next cmdlet (e.g. Export-Csv) can then finalize the item and release it from memory.
In other words, if you have a slow input (see: Advocating native PowerShell) or a large amount of data to process (e.g. larger than the available physical memory), it might be better and easier to use the native PowerShell approach.
In any case, if you are comparing results, you should test in practice and end-to-end, not just on data that is already available in memory.
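
As a rough illustration of the difference between measuring on in-memory data and measuring end-to-end, here is a minimal sketch. The temp file names, the generated sample rows and the filter are assumptions made up for this example, and the timings will vary per machine and PowerShell version:

```powershell
# Hypothetical temp files; sample data is generated so the sketch is self-contained.
$InFile  = Join-Path ([IO.Path]::GetTempPath()) 'measure-in.csv'
$OutFile = Join-Path ([IO.Path]::GetTempPath()) 'measure-out.csv'
1..10000 | ForEach-Object { [PSCustomObject]@{EmployeeNumber = "Z$_"; Name = "Name$_"} } |
    Export-Csv $InFile -NoTypeInformation

# In-memory only: the data is already loaded, so the import and export cost stays hidden.
$Data = Import-Csv $InFile
$InMemory = (Measure-Command {
    $Result = $Data | Where-Object { $_.EmployeeNumber -like '*7' }
}).TotalMilliseconds

# End-to-end: import, filter and export in one stream, the way the real script would run.
$EndToEnd = (Measure-Command {
    Import-Csv $InFile | Where-Object { $_.EmployeeNumber -like '*7' } |
        Export-Csv $OutFile -NoTypeInformation
}).TotalMilliseconds

"In-memory filter: $InMemory ms, end-to-end: $EndToEnd ms"
```

The second number is the one that reflects what a user of the script would actually wait for.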

Building a list

I agree that using the Add() method on a list is much faster than using +=, which concatenates the new item with the current array and then reassigns the result to the array variable.
But again, both approaches stall the pipeline because they collect all the data in memory, where you might be better off releasing the results to disk as you go.
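
For anyone who wants to see the difference between the two on their own machine, a small sketch along these lines should do. The item count is arbitrary, the `::new()` syntax assumes PowerShell 5 or later, and the gap may be smaller on recent PowerShell versions:

```powershell
$Count = 10000

# Array with +=: each addition builds a new array and copies the existing items into it.
$Array = @()
$ArrayMs = (Measure-Command {
    foreach ($i in 1..$Count) { $Array += $i }
}).TotalMilliseconds

# Generic list with Add(): the list grows in place.
$List = [System.Collections.Generic.List[int]]::new()
$ListMs = (Measure-Command {
    foreach ($i in 1..$Count) { $List.Add($i) }
}).TotalMilliseconds

"+= took $ArrayMs ms, Add() took $ListMs ms for $Count items"
```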

HashTables

You will probably find the biggest performance improvement in using a hash table, as hash tables are optimized for key lookups (a lookup takes roughly constant time, regardless of the size of the table).
As two collections need to be compared with each other, you can't stream both. But as explained, it is probably best and easiest to use a hash table for one side and compare it with each item streamed from the other side. Because you want to compare AllData with each of the other tables, it is best to index that table in memory (in the form of a hash table).

This is how I would do this:

$All = Import-Csv C:\AllData.csv        # The main file from the question
$Main = @{}
ForEach ($Item in $All) {
    $Main[$Item.EmployeeNumber] = @{MainName = $Item.Name; MainDomain = $Item.Domain}
}

ForEach ($Name in 'AAA', 'BBB', 'CCC', 'DDD') {
    Import-Csv "C:\$Name.csv" | Where-Object {$Main.ContainsKey($_.EmployeeNumber)} | ForEach-Object {
        [PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain} + $Main[$_.EmployeeNumber])
    } | Export-Csv "C:\Output$Name.csv"
}
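
To get a feel for why the index helps, here is a minimal in-memory sketch that compares a Where-Object scan with a hash table lookup. The sample rows and sizes are made up for illustration (they are not the data from the question), and the absolute timings will differ per machine:

```powershell
# Made-up sample data standing in for the large CSV files.
$MainRows  = 1..1000 | ForEach-Object { [PSCustomObject]@{EmployeeNumber = "Z$_"; Name = "Name$_"} }
$OtherRows = 1..200  | ForEach-Object { [PSCustomObject]@{EmployeeNumber = "Z$($_ * 5)"; Domain = 'Test.com'} }

# Linear scan: for every item, Where-Object walks the whole main list.
$ScanMs = (Measure-Command {
    $ScanHits = foreach ($Item in $OtherRows) {
        $MainRows | Where-Object { $_.EmployeeNumber -eq $Item.EmployeeNumber }
    }
}).TotalMilliseconds

# Hash table: the index is built once, after which each lookup is a single key access.
$Index = @{}
foreach ($Item in $MainRows) { $Index[$Item.EmployeeNumber] = $Item }
$HashMs = (Measure-Command {
    $HashHits = foreach ($Item in $OtherRows) {
        if ($Index.ContainsKey($Item.EmployeeNumber)) { $Index[$Item.EmployeeNumber] }
    }
}).TotalMilliseconds

"Where-Object scan: $ScanMs ms, hash table lookup: $HashMs ms"
```

The scan repeats its work for every item, so its cost grows with the product of both list sizes, while the hash table approach grows roughly with their sum.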

Addendum

Based on the comment (and the duplicates in the lists), it appears that a join on all keys is actually requested, not just on EmployeeNumber. For this you need to concatenate the relevant keys (separated by a separator that is not used in the data) and use the result as the key for the hash table.
It is not in the question, but from the comment it also appears that a full join is expected. The right-join part can be done by returning the right object when no match is found in the main table ($Main.ContainsKey($Key)). The left-join part is more complex, as you need to track ($InnerMain) which items in main have already been matched and return the leftover items at the end:

$All = Import-Csv C:\AllData.csv        # The main file from the question
$Main = @{}
$Separator = "`t"                       # Choose a separator that isn't used in any value
ForEach ($Item in $All) {
    $Key = $Item.EmployeeNumber, $Item.Name, $Item.Domain -Join $Separator
    $Main[$Key] = @{MainEmployeeNumber = $Item.EmployeeNumber; MainName = $Item.Name; MainDomain = $Item.Domain}    # What output is expected?
}

ForEach ($Name in 'AAA', 'BBB', 'CCC', 'DDD') {
    $InnerMain = @{}                    # Keys of $Main that have been matched for this file
    Import-Csv "C:\$Name.csv" | ForEach-Object {
        $Key = $_.EmployeeNumber, $_.Name, $_.Domain -Join $Separator
        If ($Main.ContainsKey($Key)) {
            $InnerMain[$Key] = $True    # Track the match by key, not by row position
            [PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain} + $Main[$Key])
        } Else {
            [PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain; MainEmployeeNumber = $Null; MainName = $Null; MainDomain = $Null})
        }
    } | Export-Csv "C:\Output$Name.csv"
    # Left-join part: append the main items that were not matched by this file
    $Main.Keys | Where-Object {!$InnerMain.ContainsKey($_)} | ForEach-Object {
        [PSCustomObject](@{EmployeeNumber = $Null; Name = $Null; Domain = $Null} + $Main[$_])
    } | Export-Csv "C:\Output$Name.csv" -Append
}
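
The composite key idea can be tried in isolation on a few in-memory rows. This is only a sketch: the two rows are taken from the sample in the question, and $Probe is a name introduced here for illustration:

```powershell
$Separator = "`t"                       # Assumed not to occur in any value
$Rows = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z001,ABC,Test.com
Z001,ABC,Test2.com
'@

# Index the rows on all three columns instead of on EmployeeNumber alone.
$Main = @{}
foreach ($Item in $Rows) {
    $Key = $Item.EmployeeNumber, $Item.Name, $Item.Domain -join $Separator
    $Main[$Key] = $Item
}

# Both rows survive because their domains differ; keyed on EmployeeNumber alone,
# the second row would have overwritten the first.
$Probe = 'Z001', 'ABC', 'Test2.com' -join $Separator
"Keys: $($Main.Count), probe found: $($Main.ContainsKey($Probe))"
```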

Join-Object

Just FYI, I have made a few improvements to the Join-Object cmdlet (use and installation are very simple, see: In Powershell, what's the best way to join two tables into one?), including easier chaining of multiple joins, which might come in handy for a request like this one. I still do not fully understand what exactly you are looking for, though (and have minor questions like: how could the domains differ in a domain column if it is an extract from one specific domain?).
I take the general description "Particularly finding an AD object against 4 domains and store them in variable for attribute comparison" as leading. I presume here that $AllMainFile is actually just an intermediate table consisting of a concatenation of all the tables concerned (and not really necessary, just confusing, as it might contain two types of duplicates: the employee numbers from the same domain and the employee numbers from other domains). If this is correct, you can simply omit this table by using the Join-Object cmdlet:

$AAA = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z001,ABC,Domain1
Z002,DEF,Domain2
Z003,GHI,Domain3
'@

$BBB = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z001,ABC,Domain1
Z002,JKL,Domain2
Z004,MNO,Domain4
'@

$CCC = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z005,PQR,Domain2
Z001,ABC,Domain1
Z001,STU,Domain2
'@

$DDD = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z005,VWX,Domain4
Z006,XYZ,Domain1
Z001,ABC,Domain3
'@

$AAA | FullJoin $BBB -On EmployeeNumber -Discern AAA |
    FullJoin $CCC -On EmployeeNumber -Discern BBB |
    FullJoin $DDD -On EmployeeNumber -Discern CCC,DDD | Format-Table

Result:

EmployeeNumber AAAName AAADomain BBBName BBBDomain CCCName CCCDomain DDDName DDDDomain
-------------- ------- --------- ------- --------- ------- --------- ------- ---------
Z001           ABC     Domain1   ABC     Domain1   ABC     Domain1   ABC     Domain3
Z001           ABC     Domain1   ABC     Domain1   STU     Domain2   ABC     Domain3
Z002           DEF     Domain2   JKL     Domain2
Z003           GHI     Domain3
Z004                             MNO     Domain4
Z005                                               PQR     Domain2   VWX     Domain4
Z006                                                                 XYZ     Domain1


Source: https://stackoverflow.com/questions/58457655/which-operator-provides-quicker-output-match-contains-or-where-object-for-larg
