Importing XML into SQL Server but trying to make multiple entries if multiple results exist for a child element

Submitted by 断了今生、忘了曾经 on 2020-08-09 09:03:48

Question


I have the following XML file that I import regularly into my SQL Server table files_index.

Below are a small sample of the XML, my table layout, and the PowerShell code I use to insert the data.

I use a PowerShell script from this answer

I am using PowerShell because the file is 3.5 GB, which is too big for a bulk insert.

This solution works perfectly and has done for a while. However, I use the XML attribute Prod_ID, stored in the table column Prod_ID, to join my data with files supplied by suppliers. In most cases this is fine, but if you look at the XML, under the file node there is a child element called M_Prod_ID. It contains variations of the Prod_ID (used by different suppliers / countries) that all refer to the product in question.

Now to the question: if the file element has one or more M_Prod_ID children, how can I create multiple entries in my table for that file, each with all of its columns? When I join Prod_ID with my other tables, not all matches are returned, because certain products have multiple SKUs and I am only collecting the one in the Prod_ID attribute of the file node.

I hope I have explained this properly. Any suggestions on how to get these multiple results, or on another way of joining the data, are welcome. I did think of creating a new table holding every Prod_ID value and every M_Prod_ID value and joining through it, but I'm not sure that is the best solution. Thank you for reading this epic post.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ICECAT-interface SYSTEM "http://data.icecat.biz/dtd/files.index.dtd">
<!-- source: Icecat.biz 2019 -->
<ICECAT-interface xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://data.icecat.biz/xsd/files.index.xsd">
   <files.index Generated="20190309013133">
      <file path="export/level4/EN/1980.xml" Product_ID="1980" Updated="20190308212809" Quality="ICECAT" Supplier_id="1" Prod_ID="CHP310" Catid="714" On_Market="1" Model_Name="CHP310" Product_View="212121" HighPic="http://images.icecat.biz/img/gallery/img_1980_high_1493356129_7496_32689.jpg" HighPicSize="2758330" HighPicWidth="4134" HighPicHeight="5433" Date_Added="20051004000000" Limited="No">
         <M_Prod_ID>CHP310?5PK</M_Prod_ID>
         <M_Prod_ID>CHP310/61623</M_Prod_ID>
         <M_Prod_ID>CHP310/BUN</M_Prod_ID>
         <EAN_UPCS>
            <EAN_UPC Value="5705965480120" IsApproved="0" />
            <EAN_UPC Value="4250786102412" IsApproved="0" />
         </EAN_UPCS>
         <Country_Markets>
            <Country_Market Value="GB" />
            <Country_Market Value="PL" />
         </Country_Markets>
      </file>
      <file path="export/level4/EN/2205.xml" Product_ID="2205" Updated="20190308073831" Quality="SUPPLIER" Supplier_id="1" Prod_ID="C6487C" Catid="234" On_Market="1" Model_Name="C6487C" Product_View="71542" HighPic="http://images.icecat.biz/img/gallery/2205_7952931385.jpg" HighPicSize="88121" HighPicWidth="573" HighPicHeight="430" Date_Added="20050627000000" Limited="No">
         <M_Prod_ID>C6487C#ABP</M_Prod_ID>
         <EAN_UPCS>
            <EAN_UPC Value="0808736280969" IsApproved="1" />
            <EAN_UPC Value="0808736340168" IsApproved="1" />
         </EAN_UPCS>
         <Country_Markets>
            <Country_Market Value="DE" />
            <Country_Market Value="AU" />
            <Country_Market Value="CH" />
            <Country_Market Value="ZA" />
         </Country_Markets>
      </file>
   </files.index>
</ICECAT-interface>

Table layout in SQL Server:

CREATE TABLE [dbo].[files_index]
(
    [Product_ID] [int] NOT NULL,
    [path] [varchar](100) NULL,
    [Updated] [varchar](50) NULL,
    [Quality] [varchar](50) NULL,
    [Supplier_id] [int] NULL,
    [Prod_ID] [varchar](MAX) NULL,
    [Catid] [int] NULL,
    [On_Market] [int] NULL,
    [Model_Name] [varchar](max) NULL,
    [Product_View] [int] NULL,
    [HighPic] [varchar](max) NULL,
    [HighPicSize] [int] NULL,
    [HighPicWidth] [int] NULL,
    [HighPicHeight] [int] NULL,
    [Date_Added] [varchar](150) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

PowerShell script:

Set-ExecutionPolicy Unrestricted -scope Currentuser

[String]$global:connectionString = "Data Source=Apps2\Apps2;Initial Catalog=ICECAT;Integrated Security=SSPI";
[System.Data.DataTable]$global:dt = New-Object System.Data.DataTable;
[System.Xml.XmlTextReader]$global:xmlReader = New-Object System.Xml.XmlTextReader("C:\Scripts\icecat\files.index.xml");
[Int32]$global:batchSize = 50000;

Function Add-FileRow() {
    $newRow = $dt.NewRow();
    $null = $dt.Rows.Add($newRow);
    $newRow["Product_ID"] = $global:xmlReader.GetAttribute("Product_ID");
    $newRow["path"] = $global:xmlReader.GetAttribute("path");
    $newRow["Updated"] = $global:xmlReader.GetAttribute("Updated");
    $newRow["Quality"] = $global:xmlReader.GetAttribute("Quality");
    $newRow["Supplier_id"] = $global:xmlReader.GetAttribute("Supplier_id");
    $newRow["Prod_ID"] = $global:xmlReader.GetAttribute("Prod_ID");
    $newRow["Catid"] = $global:xmlReader.GetAttribute("Catid");
    $newRow["On_Market"] = $global:xmlReader.GetAttribute("On_Market");
    $newRow["Model_Name"] = $global:xmlReader.GetAttribute("Model_Name");
    $newRow["Product_View"] = $global:xmlReader.GetAttribute("Product_View");
    $newRow["HighPic"] = $global:xmlReader.GetAttribute("HighPic");
    $newRow["HighPicSize"] = $global:xmlReader.GetAttribute("HighPicSize");
    $newRow["HighPicWidth"] = $global:xmlReader.GetAttribute("HighPicWidth");
    $newRow["HighPicHeight"] = $global:xmlReader.GetAttribute("HighPicHeight");
    $newRow["Date_Added"] = $global:xmlReader.GetAttribute("Date_Added");
}


try
{
    # init data table schema
    $da = New-Object System.Data.SqlClient.SqlDataAdapter("SELECT * FROM files_index WHERE 0 = 1", $global:connectionString);
    $null = $da.Fill($global:dt);
    $bcp = New-Object System.Data.SqlClient.SqlBulkCopy($global:connectionString);
    $bcp.DestinationTableName = "dbo.files_index";

    $recordCount = 0;

    while($xmlReader.Read() -eq $true)
    {
        if(($xmlReader.NodeType -eq [System.Xml.XmlNodeType]::Element) -and ($xmlReader.Name -eq "file"))
        {
            Add-FileRow -xmlReader $xmlReader;
            $recordCount += 1;
            # flush a full batch to the server
            if(($recordCount % $global:batchSize) -eq 0)
            {
                $bcp.WriteToServer($dt);
                $dt.Rows.Clear();
                Write-Host "$recordCount file elements processed so far";
            }
        }
    }

    # write any remaining rows
    if($dt.Rows.Count -gt 0)
    {
        $bcp.WriteToServer($dt);
    }

    $bcp.Close();
    $xmlReader.Close();

    Write-Host "$recordCount file elements imported ";
}
catch
{
    throw;
}

Answer 1:


This should get you pretty far. It's completely untested, so please read the code, understand it, and make the appropriate changes to get it to work.

I've removed the function and inlined all the code into the loop instead; the function was too bulky for my taste. Now you should be able to see more clearly what's going on.

Effectively it's the exact same code two times, with a small extra step that adds self-references so you can query every product via its primary ID and its secondary IDs in the same way, as discussed in the comments.

$connectionString = "Data Source=Apps2\Apps2;Initial Catalog=ICECAT;Integrated Security=SSPI"
$batchSize = 50000

# set up [files_index] datatable & read schema from DB
$files_index_table = New-Object System.Data.DataTable
$files_index_adapter = New-Object System.Data.SqlClient.SqlDataAdapter("SELECT * FROM files_index WHERE 0 = 1", $connectionString)
$files_index_adapter.Fill($files_index_table) | Out-Null
$files_index_bcp = New-Object System.Data.SqlClient.SqlBulkCopy($connectionString)
$files_index_bcp.DestinationTableName = "dbo.files_index"
$files_index_count = 0

# set up [product_ids] datatable & read schema from DB
$product_ids_table = New-Object System.Data.DataTable
$product_ids_adapter = New-Object System.Data.SqlClient.SqlDataAdapter("SELECT * FROM product_ids WHERE 0 = 1", $connectionString)
$product_ids_adapter.Fill($product_ids_table) | Out-Null
$product_ids_bcp = New-Object System.Data.SqlClient.SqlBulkCopy($connectionString)
$product_ids_bcp.DestinationTableName = "dbo.product_ids"
$product_ids_count = 0

# main import loop
$xmlReader = New-Object System.Xml.XmlTextReader("C:\Scripts\icecat\files.index.xml")
while ($xmlReader.Read()) {
    # skip any XML nodes that aren't elements
    if ($xmlReader.NodeType -ne [System.Xml.XmlNodeType]::Element) { continue }

    # handle <file> elements
    if ($xmlReader.Name -eq "file") {
        $files_index_count++

        # remember current product ID, we'll need it when we hit the next <M_Prod_ID> element
        $curr_product_id = $xmlReader.GetAttribute("Product_ID")
        $is_new_file = $true

        $newRow = $files_index_table.NewRow()
        $newRow["Product_ID"] = $xmlReader.GetAttribute("Product_ID")
        $newRow["path"] = $xmlReader.GetAttribute("path")
        $newRow["Updated"] = $xmlReader.GetAttribute("Updated")
        $newRow["Quality"] = $xmlReader.GetAttribute("Quality")
        $newRow["Supplier_id"] = $xmlReader.GetAttribute("Supplier_id")
        $newRow["Prod_ID"] = $xmlReader.GetAttribute("Prod_ID")
        $newRow["Catid"] = $xmlReader.GetAttribute("Catid")
        $newRow["On_Market"] = $xmlReader.GetAttribute("On_Market")
        $newRow["Model_Name"] = $xmlReader.GetAttribute("Model_Name")
        $newRow["Product_View"] = $xmlReader.GetAttribute("Product_View")
        $newRow["HighPic"] = $xmlReader.GetAttribute("HighPic")
        $newRow["HighPicSize"] = $xmlReader.GetAttribute("HighPicSize")
        $newRow["HighPicWidth"] = $xmlReader.GetAttribute("HighPicWidth")
        $newRow["HighPicHeight"] = $xmlReader.GetAttribute("HighPicHeight")
        $newRow["Date_Added"] = $xmlReader.GetAttribute("Date_Added")
        $files_index_table.Rows.Add($newRow) | Out-Null

        if ($files_index_table.Rows.Count -eq $batchSize) {
            $files_index_bcp.WriteToServer($files_index_table)
            $files_index_table.Rows.Clear()
            Write-Host "$files_index_count <file> elements processed so far"
        }
    # handle <M_Prod_ID> elements
    } elseif ($xmlReader.Name -eq "M_Prod_ID") {
        $product_ids_count++

        # add self-reference row to the [product_ids] table
        # only for the first <M_Prod_ID> per <file> we need to do this
        if ($is_new_file) {
            $newRow = $product_ids_table.NewRow()
            $newRow["Product_ID"] = $curr_product_id  # from above
            $newRow["Alternative_ID"] = $curr_product_id
            $product_ids_table.Rows.Add($newRow) | Out-Null
            $is_new_file = $false
        }

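        # on the element node itself Value is empty; advance to the text node to read the ID
        $xmlReader.Read() | Out-Null
        $newRow = $product_ids_table.NewRow()
        $newRow["Product_ID"] = $curr_product_id  # from above
        $newRow["Alternative_ID"] = $xmlReader.Value
        $product_ids_table.Rows.Add($newRow) | Out-Null

        # -ge rather than -eq: two rows can be added in one pass, so the count may skip $batchSize
        if ($product_ids_table.Rows.Count -ge $batchSize) {
            $product_ids_bcp.WriteToServer($product_ids_table)
            $product_ids_table.Rows.Clear()
            Write-Host "$product_ids_count <M_Prod_ID> elements processed so far"
        }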
    }
}

# write any remaining rows to the server
if ($files_index_table.Rows.Count -gt 0) {
    $files_index_bcp.WriteToServer($files_index_table)
    $files_index_table.Rows.Clear()
}
Write-Host "$files_index_count <file> elements processed overall"

if ($product_ids_table.Rows.Count -gt 0) {
    $product_ids_bcp.WriteToServer($product_ids_table)
    $product_ids_table.Rows.Clear()
}
Write-Host "$product_ids_count <M_Prod_ID> elements processed overall"
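One detail worth checking when adapting the loop: while the reader is positioned on an element node, its Value property is empty, and the text of <M_Prod_ID> lives on the child text node that follows. The sketch below is not part of the original answer; it assumes a trimmed-down, in-memory stand-in for files.index.xml (the variable names are made up for illustration) and collects Value both before and after stepping onto the text node.

```powershell
# Assumption: a trimmed-down, in-memory stand-in for files.index.xml.
$sampleXml = @'
<files.index>
  <file Product_ID="1980" Prod_ID="CHP310">
    <M_Prod_ID>CHP310/61623</M_Prod_ID>
    <M_Prod_ID>CHP310/BUN</M_Prod_ID>
  </file>
</files.index>
'@

$stringReader = New-Object System.IO.StringReader -ArgumentList $sampleXml
$reader = [System.Xml.XmlReader]::Create($stringReader)

$valuesOnElement = @()   # Value while positioned on the <M_Prod_ID> element
$valuesOnText    = @()   # Value after advancing to the child text node

while ($reader.Read()) {
    # only non-empty <M_Prod_ID> elements are of interest here
    if ($reader.NodeType -ne [System.Xml.XmlNodeType]::Element) { continue }
    if ($reader.Name -ne 'M_Prod_ID' -or $reader.IsEmptyElement) { continue }

    $valuesOnElement += $reader.Value
    $null = $reader.Read()            # step onto the child text node
    $valuesOnText += $reader.Value
}
$reader.Close()

Write-Host "on element: [$($valuesOnElement -join ',')]"
Write-Host "on text:    [$($valuesOnText -join ',')]"
```

The first list should contain only empty strings and the second the two alternative IDs, which is why the import loop needs an extra Read() before it takes Value.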



Answer 2:


@Tomalak I managed to change your code and get it working. Thanks very much, I couldn't have done it without your help and guidance. The code could probably be cleaned up a bit, but I can't find any flaws in the data after a couple of days of testing. I ran it on a 3.6 GB file, which produced around 6.5 million rows in the files_index table and 7.4 million rows in the product_ids table, so I now have nearly a million additional SKUs that I can potentially match data on.

I changed it so that it also adds a row containing the Product_ID and Prod_ID to the product_ids table even if there is no M_Prod_ID child node; this made it easier to create a view to match the data. Code below:

$connectionString = "Data Source=Apps2\Apps2;Initial Catalog=ICECATtesting;Integrated Security=SSPI"
$batchSize = 100000

# set up [files_index] datatable & read schema from DB
$files_index_table = New-Object System.Data.DataTable
$files_index_adapter = New-Object System.Data.SqlClient.SqlDataAdapter("SELECT * FROM files_index WHERE 0 = 1", $connectionString)
$files_index_adapter.Fill($files_index_table) | Out-Null
$files_index_bcp = New-Object System.Data.SqlClient.SqlBulkCopy($connectionString)
$files_index_bcp.DestinationTableName = "dbo.files_index"
$files_index_count = 0

# set up [product_ids] datatable & read schema from DB
$product_ids_table = New-Object System.Data.DataTable
$product_ids_adapter = New-Object System.Data.SqlClient.SqlDataAdapter("SELECT * FROM product_ids WHERE 0 = 1", $connectionString)
$product_ids_adapter.Fill($product_ids_table) | Out-Null
$product_ids_bcp = New-Object System.Data.SqlClient.SqlBulkCopy($connectionString)
$product_ids_bcp.DestinationTableName = "dbo.product_ids"
$product_ids_count = 0

 # main import loop

$xmlReader = New-Object System.Xml.XmlTextReader("C:\Scripts\icecat\files.index.xml")
while ($xmlReader.Read()) {
    # skip any XML nodes that aren't elements
    if ($xmlReader.NodeType -ne [System.Xml.XmlNodeType]::Element) { continue }

    # handle <file> elements
    if ($xmlReader.Name -eq "file") {
        $files_index_count++

        # remember current product ID, we'll need it when we hit the next <M_Prod_ID> element;
        # also remember the Prod_ID from the file node
        $curr_product_id = $xmlReader.GetAttribute("Product_ID")
        $curr_prod_id = $xmlReader.GetAttribute("Prod_ID")

        $newRow = $files_index_table.NewRow()
        $newRow["Product_ID"] = $xmlReader.GetAttribute("Product_ID")
        $newRow["path"] = $xmlReader.GetAttribute("path")
        $newRow["Updated"] = $xmlReader.GetAttribute("Updated")
        $newRow["Quality"] = $xmlReader.GetAttribute("Quality")
        $newRow["Supplier_id"] = $xmlReader.GetAttribute("Supplier_id")
        $newRow["Prod_ID"] = $xmlReader.GetAttribute("Prod_ID")
        $newRow["Catid"] = $xmlReader.GetAttribute("Catid")
        $newRow["On_Market"] = $xmlReader.GetAttribute("On_Market")
        $newRow["Model_Name"] = $xmlReader.GetAttribute("Model_Name")
        $newRow["Product_View"] = $xmlReader.GetAttribute("Product_View")
        $newRow["HighPic"] = $xmlReader.GetAttribute("HighPic")
        $newRow["HighPicSize"] = $xmlReader.GetAttribute("HighPicSize")
        $newRow["HighPicWidth"] = $xmlReader.GetAttribute("HighPicWidth")
        $newRow["HighPicHeight"] = $xmlReader.GetAttribute("HighPicHeight")
        $newRow["Date_Added"] = $xmlReader.GetAttribute("Date_Added")
        $files_index_table.Rows.Add($newRow) | Out-Null

        # always add a [product_ids] row that maps the product to its own Prod_ID,
        # so a view can match all variants even when there is no <M_Prod_ID> child
        $newRow = $product_ids_table.NewRow()
        $newRow["Product_ID"] = $curr_product_id  # from above
        $newRow["Alternative_ID"] = $curr_prod_id
        $product_ids_table.Rows.Add($newRow) | Out-Null

        if ($files_index_table.Rows.Count -eq $batchSize) {
            $files_index_bcp.WriteToServer($files_index_table)
            $files_index_table.Rows.Clear()
            Write-Host "$files_index_count <file> elements processed so far"
        }
    # handle <M_Prod_ID> elements
    } elseif ($xmlReader.Name -eq "M_Prod_ID") {
        $product_ids_count++

        # move from the element node to its text node, which holds the ID
        $xmlReader.Read() | Out-Null
        $newRow = $product_ids_table.NewRow()
        $newRow["Product_ID"] = $curr_product_id  # from above
        $newRow["Alternative_ID"] = $xmlReader.Value
        $product_ids_table.Rows.Add($newRow) | Out-Null

        # -ge rather than -eq: rows are also added in the <file> branch, so the count can skip $batchSize
        if ($product_ids_table.Rows.Count -ge $batchSize) {
            $product_ids_bcp.WriteToServer($product_ids_table)
            $product_ids_table.Rows.Clear()
            Write-Host "$product_ids_count <M_Prod_ID> elements processed so far"
        }
    }
}

# write any remaining rows to the server
if ($files_index_table.Rows.Count -gt 0) {
    $files_index_bcp.WriteToServer($files_index_table)
    $files_index_table.Rows.Clear()
}
Write-Host "$files_index_count <file> elements processed overall"

if ($product_ids_table.Rows.Count -gt 0) {
    $product_ids_bcp.WriteToServer($product_ids_table)
    $product_ids_table.Rows.Clear()
}
 Write-Host "$product_ids_count <M_Prod_ID> elements processed overall"
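Neither answer shows the product_ids table or the view mentioned above, so here is a minimal sketch of what they might look like. Only the column names Product_ID and Alternative_ID come from the scripts; the column types, the view name files_index_all_ids, and the helper function Invoke-NonQuery are assumptions made for illustration, not the poster's actual schema.

```powershell
# Assumption: column types are guesses; only the column names
# Product_ID and Alternative_ID appear in the import scripts.
$createTableSql = @'
CREATE TABLE dbo.product_ids
(
    Product_ID     int          NOT NULL,
    Alternative_ID varchar(255) NOT NULL
);
'@

# Hypothetical view: one row per (file, known ID) pair, so supplier data
# can be joined on Alternative_ID whether it uses Prod_ID or an M_Prod_ID.
$createViewSql = @'
CREATE VIEW dbo.files_index_all_ids
AS
SELECT p.Alternative_ID, f.*
FROM dbo.product_ids AS p
INNER JOIN dbo.files_index AS f
    ON f.Product_ID = p.Product_ID;
'@

# Hypothetical helper: runs a single statement that returns no rows.
function Invoke-NonQuery {
    param(
        [Parameter(Mandatory = $true)] [string] $ConnectionString,
        [Parameter(Mandatory = $true)] [string] $Sql
    )
    $connection = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
    try {
        $connection.Open()
        $command = $connection.CreateCommand()
        $command.CommandText = $Sql
        $null = $command.ExecuteNonQuery()
    }
    finally {
        $connection.Dispose()
    }
}

# CREATE VIEW has to be the first statement in its batch, hence two calls:
# Invoke-NonQuery -ConnectionString $connectionString -Sql $createTableSql
# Invoke-NonQuery -ConnectionString $connectionString -Sql $createViewSql
```

With something like this in place, a supplier file could be joined to the view on Alternative_ID instead of to files_index on Prod_ID, which is presumably how the extra matches are picked up.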


Source: https://stackoverflow.com/questions/55083962/importing-xml-into-sql-server-but-trying-to-make-multiple-entries-if-multiple-re
