Map CSV user data to a separate CSV

Submitted by 不羁的心 on 2019-12-11 16:15:03

Question


I have inherited a bit of a mess. I have multiple CSV files containing different user data, and I need to compile all of the information into one file without spending hours on it. The problem is that the files do not all contain the same users, and the users are not in the same order. Is there an easy way to pull fields from a second file into the first wherever the username matches? I'm just starting out, so I may not be describing this correctly.

For example: File 1

username,first,last,phone number
john.do,John,Doe,8888675309
jack.jo,Jack,Johnson,5378984687
harry.po,Harry,Potter,9876543219

File 2

username,first,last,email
john.do,John,Doe,john.squidwork@yahoo.com
sandy.mi,Sandy,Michaels,sandy.mi@hotelcalifornia.com
jack.jo,Jack,Johnson,bubbletoes@jackjohnson.net
harry.po,Harry,Potter,iluvmuggles@diagonalley.com

Answer 1:


Take it as you will, this should combine multiple CSV files. Note that it may not be fast, but it should be thorough.

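$CSVList = 'C:\Path\To\Users1.csv','C:\Path\To\Users2.csv','C:\Path\To\Users3.csv','C:\Path\To\Users4.csv','C:\Path\To\Users5.csv'
# The first file seeds the table; the remaining files are merged into it.
# Rows are indexed by the UserID column. Substitute your own key column (username in the sample files above).
$PrimaryTable = @{}
Import-CSV $CSVList[0] | %{$PrimaryTable.Add($_.UserID,$_)}
# Column names currently present on the merged rows.
$PrimaryKeys = $PrimaryTable.Values[0] | Get-Member -MemberType Properties | Select -ExpandProperty Name
ForEach($CSVFile in ($CSVList|Select -Skip 1)){
    $Users = Import-CSV $CSVFile
    $Keys = $Users[0] | Get-Member -MemberType Properties | Select -ExpandProperty Name
    # Add any columns this file introduces, as empty strings, to every row collected so far.
    $KeysToAdd = @{}
    $Keys|?{$_ -notin $PrimaryKeys}|%{$KeysToAdd.Add($_,"")}
    # Skip the call when the file brings no new columns.
    If($KeysToAdd.Count -gt 0){$PrimaryTable.Values|%{$_|Add-Member -NotePropertyMembers $KeysToAdd}}
    ForEach($User in $Users){
        If(!($User.UserID -in $PrimaryTable.Keys)){
            # New user: pad the row with the columns this file lacks, then add it.
            $PrimaryKeys | ?{$_ -notin $Keys} | %{add-member -InputObject $User -NotePropertyName $_ -NotePropertyValue ""}
            $PrimaryTable.Add($User.UserID,$User)
        }Else{
            # Known user: fill in only the fields that are still blank.
            $Keys | ?{[string]::IsNullOrWhiteSpace($PrimaryTable.($User.UserID).$_)} | %{$PrimaryTable.($User.UserID).$_ = $User.$_}
        }
    }
    # Refresh the column list before processing the next file.
    $PrimaryKeys = $PrimaryTable.Values[0] | Get-Member -MemberType Properties | Select -ExpandProperty Name
}

# Write the merged rows to a single file.
$PrimaryTable.Values|Export-CSV C:\Path\To\AllUserData.csv -NoTypeInformation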

That makes a hashtable indexed by UserID and populates it with the data from the first CSV file. For each additional file, it compares the file's properties with those already in the hashtable and adds any missing properties to every existing entry. It then goes through the file entry by entry: if the user isn't in the hashtable yet, it adds them; if they are, it fills in whatever blank properties it can.
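The two per-row steps (add the missing columns, then fill the blanks) can be tried in isolation on two in-memory rows. This is only a minimal sketch, assuming PowerShell v3 or later; it uses the question's username column and sample values rather than UserID, and the variable names are made up for illustration:

```powershell
# One row from each of the question's sample files, parsed in memory instead of read from disk.
$Row1 = 'username,first,last,phone number', 'john.do,John,Doe,8888675309' | ConvertFrom-Csv
$Row2 = 'username,first,last,email', 'john.do,John,Doe,john.squidwork@yahoo.com' | ConvertFrom-Csv

# Column names on each row.
$Names1 = $Row1.PSObject.Properties | ForEach-Object { $_.Name }
$Names2 = $Row2.PSObject.Properties | ForEach-Object { $_.Name }

# Step 1: add the columns that only the second row has, as empty strings.
$Names2 | Where-Object { $_ -notin $Names1 } |
    ForEach-Object { Add-Member -InputObject $Row1 -NotePropertyName $_ -NotePropertyValue '' }

# Step 2: copy a value across only where the merged row is still blank.
$Names2 | Where-Object { [string]::IsNullOrWhiteSpace($Row1.$_) } |
    ForEach-Object { $Row1.$_ = $Row2.$_ }

# $Row1 now carries both the phone number and the email.
$Row1
```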

Edit: OK, so you appear to be having issues with the -notin operator. The most likely reason is an older version of PowerShell. My first suggestion is to update to v3 or v4 of PowerShell, but I know that's not always an option, so I've edited the script to make it a little more backwards compatible, which should make it work for you... I hope. I did test the script above with 3 CSV files that all have the UserID field, each with 2 to 4 entries, and it worked exactly as I expected. (For the test I updated the paths in line 1 and commented out the last line, because I didn't feel like littering my hard drive with yet more files.) Anyway, the edited script is:

$CSVList = 'C:\Path\To\Users1.csv','C:\Path\To\Users2.csv','C:\Path\To\Users3.csv','C:\Path\To\Users4.csv','C:\Path\To\Users5.csv'
$PrimaryTable = @{}
# Index the first file by UserID (substitute your own key column, e.g. username).
Import-CSV $CSVList[0] | %{$PrimaryTable.Add($_.UserID,$_)}
# @(...)[0] picks one row; indexing .Values directly is not reliable on v2.
$PrimaryKeys = @($PrimaryTable.Values)[0] | Get-Member -MemberType Properties | Select -ExpandProperty Name
ForEach($CSVFile in ($CSVList|Select -Skip 1)){
    $Users = @(Import-CSV $CSVFile)
    $Keys = $Users[0] | Get-Member -MemberType Properties | Select -ExpandProperty Name
    # Columns this file introduces; add them, empty, to every row collected so far.
    $KeysToAdd = @($Keys|?{$PrimaryKeys -notcontains $_})
    ForEach($Row in @($PrimaryTable.Values)){
        # -NotePropertyName/-NotePropertyMembers are v3+; -MemberType NoteProperty also works on v2.
        $KeysToAdd|%{Add-Member -InputObject $Row -MemberType NoteProperty -Name $_ -Value ""}
    }
    ForEach($User in $Users){
        # ContainsKey replaces the -in operator, which does not exist before v3.
        If(!$PrimaryTable.ContainsKey($User.UserID)){
            $PrimaryKeys | ?{$Keys -notcontains $_} | %{Add-Member -InputObject $User -MemberType NoteProperty -Name $_ -Value ""}
            $PrimaryTable.Add($User.UserID,$User)
        }Else{
            # [string]::IsNullOrWhiteSpace needs .NET 4, which v2 does not load by default, so trim and compare instead.
            $Existing = $PrimaryTable[$User.UserID]
            $Keys | ?{("" + $Existing.$_).Trim() -eq ""} | %{$Existing.$_ = $User.$_}
        }
    }
    $PrimaryKeys = @($PrimaryTable.Values)[0] | Get-Member -MemberType Properties | Select -ExpandProperty Name
}

$PrimaryTable.Values|Export-CSV C:\Path\To\AllUserData.csv -NoTypeInformation

That should do what you want, and should work in older versions of PowerShell. Let me know if you get errors with it. Again though, my recommendation is to update PowerShell if you are running v2. You will be happier in the long run than you would be relying on workarounds.
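If you are not sure which version you are running, the built-in $PSVersionTable variable can tell you. A small sketch; the $Major variable and the messages are only illustrative:

```powershell
# $PSVersionTable was introduced in PowerShell 2.0; on 1.0 the variable is undefined.
$Major = if ($PSVersionTable) { $PSVersionTable.PSVersion.Major } else { 1 }

$Message = if ($Major -ge 3) {
    "PowerShell v$Major detected; -in and -notin are available."
} else {
    "PowerShell v$Major detected; use -contains and -notcontains instead."
}
$Message
```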




Answer 2:


Here is a function you can use to group data by some key. If a group has several different values for a property, the resulting object holds an array of all the values for that property:

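function Group-Data {
    param(
        [object[]]$Property
    )
    # Records every property name seen, in first-seen order, so all output objects share the same columns.
    $AllProperties=[ordered]@{}
    @(
        $input|Group-Object $Property|ForEach-Object {
            # Build one merged object per group: the begin block resets $Properties, the end block emits it.
            $_.Group|ForEach-Object {$Properties=@{}} {
                # Ignore properties whose value is empty.
                $_.PSObject.Properties|Where-Object Value|ForEach-Object {
                    if($Properties[$_.Name]){
                        # The property already has a value: append only if this value is new.
                        if($Properties[$_.Name] -notcontains $_.Value){
                            $Properties[$_.Name]=@($Properties[$_.Name];$_.Value)
                        }
                    }else{
                        $Properties[$_.Name]=$_.Value
                        $AllProperties[$_.Name]=$null
                    }
                }
            } {[PSCustomObject]$Properties}
        }
    )|Select-Object @($AllProperties.Keys)
}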

Here is a filter that joins the arrays in properties into strings. You need it because Export-Csv does not handle arrays in properties correctly.

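filter Join-Array {
    param(
        [string]$Separator=', '
    )
    # Replace each array-valued property with a single joined string.
    $_.PSObject.Properties|Where-Object Value -is Array|ForEach-Object {
        $_.Value=$_.Value -join $Separator
    }
    # Pass the modified object down the pipeline.
    $_
}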

You can use them like this:

Import-Csv File1.csv,File2.csv,File3.csv|Group-Data username|Join-Array|Export-Csv Result.csv -NoTypeInformation
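To see why the Join-Array step matters, here is what happens when a property holding an array is converted to CSV as-is. This is a small sketch using ConvertTo-Csv, which formats values the same way Export-Csv does; the row and its e-mail addresses are made up for illustration:

```powershell
# Hypothetical merged row whose email property holds two values.
$Row = [PSCustomObject]@{ username = 'john.do'; email = @('a@example.com', 'b@example.com') }

# Without joining, the array is written as its type name, not its contents.
$Raw = $Row | ConvertTo-Csv -NoTypeInformation

# Joining first (what Join-Array does for every array property) keeps the values.
$Row.email = $Row.email -join ', '
$Joined = $Row | ConvertTo-Csv -NoTypeInformation

$Raw[1]      # "john.do","System.Object[]"
$Joined[1]   # "john.do","a@example.com, b@example.com"
```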



Answer 3:


Data management can be messy, especially when you inherit a mess, which is most of the time.

One of the best tools to help you manage data is a database management system, aka a DBMS. That may, however, be overkill in your case. You may only need to do this operation once, until you have all the messy inherited data in one neat CSV file that you can keep up to date going forward. In that case, the learning curve for a full-blown DBMS may not be worth it.

There are three relational operators that give relational databases much of their power to process data at retrieval time. These operators are restrict (formerly called select), project, and join. If you can mimic these three operators in PS, you may be able to clean up your data in PS without invoking a DBMS.

PS already has a good operator that does what restrict does. It's Where-Object.

PS already has a good operator that does what project does. It's Select-Object.
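Restrict and project can be sketched on the question's File 1 data: Where-Object keeps the rows that match a condition, and Select-Object keeps only the named columns. The variable names and the filter condition are made up for illustration:

```powershell
# The question's File 1, parsed in memory instead of read from disk.
$Users = @(
    'username,first,last,phone number'
    'john.do,John,Doe,8888675309'
    'jack.jo,Jack,Johnson,5378984687'
    'harry.po,Harry,Potter,9876543219'
) | ConvertFrom-Csv

# Restrict: keep only the rows whose username starts with "j".
$Restricted = $Users | Where-Object { $_.username -like 'j*' }

# Project: keep only the username and phone number columns.
$Projected = $Users | Select-Object username, 'phone number'

$Restricted | Format-Table
$Projected | Format-Table
```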

Relational join is where it gets messy. As far as I know, there is no built-in Join-Object in PS. However, Bacon Bits provided a link to the Join-Object blog article, and it appears to be exactly what is needed if you want to create a Join-Object function of your own. Thanks, Bacon Bits. Some of the blog article is motivational: it explains why decomposing (splitting) tables is sometimes a good thing, and then motivates Join-Object for when you want the data all in one place. If you are an SQL jockey, you already know that stuff. But learning how to do it in PS is great.
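For a one-off merge, a join on a single key can also be sketched without a dedicated Join-Object function, by indexing one table in a hashtable. This is not the blog article's implementation, only a minimal left-join illustration on a subset of the question's sample data; the variable names are made up, and it assumes PowerShell v3 or later:

```powershell
$Phones = @(
    'username,first,last,phone number'
    'john.do,John,Doe,8888675309'
    'jack.jo,Jack,Johnson,5378984687'
) | ConvertFrom-Csv
$Emails = @(
    'username,first,last,email'
    'john.do,John,Doe,john.squidwork@yahoo.com'
    'sandy.mi,Sandy,Michaels,sandy.mi@hotelcalifornia.com'
) | ConvertFrom-Csv

# Index the right-hand table by the join key for fast lookups.
$EmailByUser = @{}
foreach ($Row in $Emails) { $EmailByUser[$Row.username] = $Row }

# Left join: every row of $Phones, plus the email column where a match exists.
$Joined = foreach ($Row in $Phones) {
    $Hit = $EmailByUser[$Row.username]
    [PSCustomObject]@{
        username       = $Row.username
        first          = $Row.first
        last           = $Row.last
        'phone number' = $Row.'phone number'
        email          = if ($Hit) { $Hit.email } else { '' }
    }
}
$Joined | Format-Table
```

Because this is a left join, sandy.mi (present only in the second table) does not appear in the result; the merge scripts in the other answers behave like a full outer join instead.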



Source: https://stackoverflow.com/questions/33445372/map-csv-userdata-to-separate-csv
