Importing a large (2.5 GB) XML file into SQL Server

予麋鹿 2020-12-22 11:43

Hi, I am trying to import a large XML file into a table on my SQL Server (2014).

I have used the code below for smaller files and thought it would be ok as this is a o

3 Answers
  • 2020-12-22 12:40

    Try this. It's another method that I have used for some time. It's pretty fast (though it could be faster). I pull a huge XML database from a gaming company every night, and this is how I fetch and import it.

     $xml  = new XMLReader();            
     $xml->open($xml_file); // file is your xml file you want to parse
     while($xml->read() && $xml->name != 'game') { ; } // get past the header to your first record (game in my case)
    
    while($xml->name == 'game') { // now while we are in this record               
                    $element        = new SimpleXMLElement($xml->readOuterXML());
                    $gameRec        = $this->createGameRecord($element, $os); // this is my function to reduce some clutter - and I use it elsewhere too
    
                    /* this looks confusing, but it is not. There are over 20 fields, and instead of typing them all out, I just made a string. */
                    $sql = "INSERT INTO $table (";
                    foreach($gameRec as $field=>$game){
                    $sql .= " $field,";
                    }
                    $sql = rtrim($sql, ",");
                    $sql .=") values (";
    
                    foreach($gameRec as $field=>$game) {
                        $sql .= " :$field,";               
                    }
                    $sql = rtrim($sql,",");
                    $sql .= ") ON DUPLICATE KEY UPDATE "; // online game doesn't have a gamerank - not my choice LOL, so I adjust that for here
    
                    switch ($os) {
                        // VALUES(col) reuses the value already bound in the INSERT part,
                        // so no raw XML data is concatenated into the SQL string
                        case 'pc' :
                        case 'mac': $sql .= "gamerank = VALUES(gamerank)"       ; break;
                        case 'pl' : $sql .= "playercount = VALUES(playercount)" ; break;
                        case 'og' :
                            $playercount = $this->getPlayerCount($gameRec['gameid']);
                            // cast so that only an integer can reach the SQL string
                            $sql .= "playercount = ".(int)$playercount['playercount'];
                            break;
    
                    }
    
    
                    try {
    
                        $stmt = $this->connect()->prepare($sql);
                        $stmt->execute($gameRec);
    
                    } catch (PDOException $e) {// Kludge
    
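                        // print_r() needs true as its second argument to return the dump instead of printing it immediately
                        echo 'os: '.$os.'<br/>table: '.$table.'<br/>XML LINK: '.$comprehensive_xml.'<br/>Current Record:<br/><pre>'.print_r($gameRec, true).'</pre><br/>'.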
                        'SQL: '.$sql.'<br/>';
                        die('Line:33<br/>Function: pullBFG()<BR/>Cannot add game record <br/>'.$e->getMessage());
    
                    }
    
                    // VERY IMPORTANT: do not forget these 2 lines, or the loop never advances to the next record and runs forever (it eventually locks up your system; I know, I've done it)
                    $xml->next('game');
                    unset($element);
                }// while there are games
    

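    This should get you started. Adjust "game" to match the record element in your XML, and trim out the parts that are specific to my setup. Note that ON DUPLICATE KEY UPDATE is MySQL syntax; on SQL Server you would need MERGE or a separate UPDATE instead.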
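For readers not using PHP, the same streaming pattern can be sketched in Python. This is only a minimal sketch under stated assumptions: records are `<game>` elements directly under the root, the column list is a fixed whitelist, and an in-memory SQLite database stands in for the real server. The `import_games` function and the `games` table are hypothetical names used for illustration.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def import_games(xml_source, conn, columns=("gameid", "gamename"),
                 table="games", tag="game"):
    """Stream records out of xml_source and insert them one at a time."""
    # Column names come from a fixed whitelist, never from the XML itself.
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table,
        ", ".join(columns),
        ", ".join(":" + name for name in columns),
    )
    context = ET.iterparse(xml_source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event != "end" or elem.tag != tag:
            continue
        # The element is complete at its "end" event, so its children can be read.
        record = {name: elem.findtext(name, default="") for name in columns}
        conn.execute(sql, record)  # values are bound as named parameters
        count += 1
        root.clear()  # drop finished records so memory use stays flat
    conn.commit()
    return count


# Usage with a tiny in-memory document and database (stand-ins for the real ones)
demo_xml = io.BytesIO(
    b"<games>"
    b"<game><gameid>1</gameid><gamename>Alpha</gamename></game>"
    b"<game><gameid>2</gameid><gamename>Beta</gamename></game>"
    b"</games>"
)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE games (gameid TEXT PRIMARY KEY, gamename TEXT)")
imported = import_games(demo_xml, demo_conn)
```

Clearing the root after each record plays the same role as `unset($element)` in the PHP loop: without it the parser keeps every processed record in memory.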

    Here is createGameRecord($element, $type='pc'). Basically it turns the element into an array that can be used elsewhere, and it makes adding the record to the db a single line, as seen above: $stmt->execute($gameRec); where $gameRec was returned from this function. PDO takes the array and binds each key to the matching named placeholder in the INSERT. delHardReturns() is another of my functions; it strips hard returns (\r, \n, etc.), which seem to mess up the SQL. I think SQL has a function for that, but I have not pursued it. Hope you find this useful.

    private function createGameRecord($element, $type='pc') {
                if( ($type == 'pc') || ($type == 'og') ) { // player count is handled separately
                    $game = array(
                        'gamename'                  => strval($element->gamename),
                        'gameid'                    => strval($element->gameid),                
                        'genreid'                   => strval($element->genreid),
                        'allgenreid'                => strval($element->allgenreid),
                        'shortdesc'                 => $this->delHardReturns(strval($element->shortdesc)),
                        'meddesc'                   => $this->delHardReturns(strval($element->meddesc)),
                        'bullet1'                   => $this->delHardReturns(strval($element->bullet1)),
                        'bullet2'                   => $this->delHardReturns(strval($element->bullet2)),
                        'bullet3'                   => $this->delHardReturns(strval($element->bullet3)),
                        'bullet4'                   => $this->delHardReturns(strval($element->bullet4)),
                        'bullet5'                   => $this->delHardReturns(strval($element->bullet5)),
                        'longdesc'                  => $this->delHardReturns(strval($element->longdesc)),
                        'foldername'                => strval($element->foldername),
                        'hasdownload'               => strval($element->hasdownload),
                        'hasdwfeature'              => strval($element->hasdwfeature),                             
                        'releasedate'               => strval($element->releasedate)
    
                    );
    
                    if($type === 'pc')  {
    
                        $game['hasvideo']           = strval($element->hasvideo);
                        $game['hasflash']           = strval($element->hasflash);
                        $game['price']              = strval($element->price); 
                        $game['gamerank']           = strval($element->gamerank);
                        $game['gamesize']           = strval($element->gamesize);
                        $game['macgameid']          = strval($element->macgameid);
                        $game['family']             = strval($element->family);
                        $game['familyid']           = strval($element->familyid);
                        $game['productid']          = strval($element->productid);
                        $game['pc_sysreqos']        = strval($element->systemreq->pc->sysreqos);
                        $game['pc_sysreqmhz']       = strval($element->systemreq->pc->sysreqmhz);
                        $game['pc_sysreqmem']       = strval($element->systemreq->pc->sysreqmem);
                        $game['pc_sysreqhd']        = strval($element->systemreq->pc->sysreqhd);
    
                        if(empty($game['gamerank'])) $game['gamerank'] = 99999;
    
                        $game['gamesize'] = $this->readableBytes((int)$game['gamesize']);  
    
    
                    }// dealing with PC type
    
                    if($type === 'og') {
                        $game['onlineiframeheight']              = strval($element->onlineiframeheight);
                        $game['onlineiframewidth']              = strval($element->onlineiframewidth); 
    
                    }
    
                    $game['releasedate']            = substr($game['releasedate'],0,10);
    
                } else { // any other type (e.g. 'pl'): only the player count is needed
    
                    $game['playercount']            = strval($element->playercount);
                    $game['gameid']                 = strval($element->gameid);
                } // end of the player-count-only branch
    
    
                return $game;
            }
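Since delHardReturns() is not shown above, here is a rough Python sketch of what such a helper and a trimmed-down record builder might look like. The names `del_hard_returns` and `create_game_record` are hypothetical, only a handful of the fields are included, and the behavior (collapsing runs of \r and \n into one space) is an assumption about what the original helper does.

```python
import re
import xml.etree.ElementTree as ET

_HARD_RETURNS = re.compile(r"[\r\n]+")


def del_hard_returns(text):
    """Collapse runs of carriage returns and newlines into a single space."""
    return _HARD_RETURNS.sub(" ", text).strip()


def create_game_record(element, record_type="pc"):
    """Flatten one <game> element into a dict keyed by column name."""
    if record_type in ("pc", "og"):
        record = {
            "gameid": element.findtext("gameid", default=""),
            "gamename": element.findtext("gamename", default=""),
            "shortdesc": del_hard_returns(element.findtext("shortdesc", default="")),
            # keep only the first 10 characters, like substr(..., 0, 10) above
            "releasedate": element.findtext("releasedate", default="")[:10],
        }
        if record_type == "pc":
            # nested values are reached with a path, like systemreq->pc->sysreqos
            record["pc_sysreqos"] = element.findtext("systemreq/pc/sysreqos", default="")
            # fall back to 99999 when the feed has no rank (missing or empty)
            record["gamerank"] = element.findtext("gamerank", default="") or "99999"
    else:
        # player-count feeds only carry these two fields
        record = {
            "playercount": element.findtext("playercount", default=""),
            "gameid": element.findtext("gameid", default=""),
        }
    return record


# Usage with a small hand-written element
demo = ET.fromstring(
    "<game><gameid>7</gameid><gamename>Alpha</gamename>"
    "<shortdesc>line one\r\nline two</shortdesc>"
    "<releasedate>2020-12-22T00:00:00</releasedate>"
    "<systemreq><pc><sysreqos>Windows</sysreqos></pc></systemreq></game>"
)
record = create_game_record(demo)
```

When the values are bound as parameters, as in the execute() call above, line breaks in the data should not affect the SQL itself, so a helper like this is mainly useful for tidying the stored text.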
    
  • 2020-12-22 12:44

    Updated: much faster. I did some research, and while my post above shows one (slow) method, I found one that works much faster, at least for me. I am posting it as a new answer because it is completely different from my previous post.

    LOAD XML LOCAL INFILE 'path/to/file.xml' INTO TABLE tablename ROWS IDENTIFIED BY '<xml-identifier>'
    

    Example

    <students>
        <student>
            <name>john doe</name>
            <boringfields>bla bla bla......</boringfields>
        </student>
    </students>
    

    Then the MySQL command would be:

    LOAD XML LOCAL INFILE 'path/to/students.xml' INTO TABLE tablename ROWS IDENTIFIED BY '<student>'
    

    The ROWS IDENTIFIED BY value must be wrapped in single quotes and include the angle brackets. When I switched to this method, the import went from roughly 12 minutes to roughly 30 seconds!

    One tip that worked for me: run DELETE FROM tablename before the load, otherwise the new rows are just appended to what is already in the table.

    Ref: https://dev.mysql.com/doc/refman/5.5/en/load-xml.html
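To make the row mapping concrete, here is a rough Python sketch of what `ROWS IDENTIFIED BY '<student>'` does conceptually: each matching element becomes one row, and its attribute names and child element names are matched to column names. This only illustrates the idea and is not how MySQL implements it; `rows_identified_by` is a made-up helper, and it loads the whole document, so it is for small samples only.

```python
import xml.etree.ElementTree as ET


def rows_identified_by(xml_text, tag):
    """Return one dict per <tag> element, keyed by attribute and child tag names."""
    rows = []
    for elem in ET.fromstring(xml_text).iter(tag):
        row = dict(elem.attrib)  # attributes map to columns
        for child in elem:       # so do child elements
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


students_xml = """
<students>
    <student>
        <name>john doe</name>
        <boringfields>bla bla bla......</boringfields>
    </student>
</students>
"""
rows = rows_identified_by(students_xml, "student")
```

Running a check like this on a small sample of the file is a cheap way to confirm that the element and column names line up before loading the full file.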

  • 2020-12-22 12:48

    The max size of an XML column value in SQL Server is 2GB. It will not be possible to import a 2.5GB file into a single XML column.

    UPDATE

    Since your underlying objective is to transform XML elements within the file into table rows, you don't need to stage the entire file contents into a single XML column. You can avoid the 2GB limitation, reduce memory requirements, and improve performance by shredding the XML in client code and using a bulk insert technique to insert batches of multiple rows.

    The example PowerShell script below uses an XmlTextReader to avoid reading the entire XML into a DOM and uses SqlBulkCopy to insert batches of many rows at once. The combination of these techniques should allow you to insert millions of rows in minutes rather than hours. These same techniques can be implemented in a custom app or an SSIS script task.

    I noticed that a couple of the table columns are declared as varchar(1), yet the corresponding XML attribute values contain many characters. You'll need to either expand the length of those columns or transform the source values.

    [String]$global:connectionString = "Data Source=YourServer;Initial Catalog=YourDatabase;Integrated Security=SSPI";
    [System.Data.DataTable]$global:dt = New-Object System.Data.DataTable;
    [System.Xml.XmlTextReader]$global:xmlReader = New-Object System.Xml.XmlTextReader("C:\FilesToImport\files.xml");
    [Int32]$global:batchSize = 10000;
    
    Function Add-FileRow() {
        $newRow = $dt.NewRow();
        $null = $dt.Rows.Add($newRow);
        $newRow["Product_ID"] = $global:xmlReader.GetAttribute("Product_ID");
        $newRow["path"] = $global:xmlReader.GetAttribute("path");
        $newRow["Updated"] = $global:xmlReader.GetAttribute("Updated");
        $newRow["Quality"] = $global:xmlReader.GetAttribute("Quality");
        $newRow["Supplier_id"] = $global:xmlReader.GetAttribute("Supplier_id");
        $newRow["Prod_ID"] = $global:xmlReader.GetAttribute("Prod_ID");
        $newRow["Catid"] = $global:xmlReader.GetAttribute("Catid");
        $newRow["On_Market"] = $global:xmlReader.GetAttribute("On_Market");
        $newRow["Model_Name"] = $global:xmlReader.GetAttribute("Model_Name");
        $newRow["Product_View"] = $global:xmlReader.GetAttribute("Product_View");
        $newRow["HighPic"] = $global:xmlReader.GetAttribute("HighPic");
        $newRow["HighPicSize"] = $global:xmlReader.GetAttribute("HighPicSize");
        $newRow["HighPicWidth"] = $global:xmlReader.GetAttribute("HighPicWidth");
        $newRow["HighPicHeight"] = $global:xmlReader.GetAttribute("HighPicHeight");
        $newRow["Date_Added"] = $global:xmlReader.GetAttribute("Date_Added");
    }
    
    try
    {
    
        # init data table schema
        $da = New-Object System.Data.SqlClient.SqlDataAdapter("SELECT * FROM dbo.files_index WHERE 0 = 1;", $global:connectionString);
        $null = $da.Fill($global:dt);
        $bcp = New-Object System.Data.SqlClient.SqlBulkCopy($global:connectionString);
        $bcp.DestinationTableName = "dbo.files_index";
    
        $recordCount = 0;
    
        while($xmlReader.Read() -eq $true)
        {
    
            if(($xmlReader.NodeType -eq [System.Xml.XmlNodeType]::Element) -and ($xmlReader.Name -eq "file"))
            {
                # Add-FileRow declares no parameters; it reads the current element from $global:xmlReader
                Add-FileRow;
                $recordCount += 1;
                if(($recordCount % $global:batchSize) -eq 0) 
                {
                    $bcp.WriteToServer($dt);
                    $dt.Rows.Clear();
                    Write-Host "$recordCount file elements processed so far";
                }
            }
    
        }
    
        if($dt.Rows.Count -gt 0)
        {
            $bcp.WriteToServer($dt);
        }
    
        $bcp.Close();
        $xmlReader.Close();
    
        Write-Host "$recordCount file elements imported";
    
    }
    catch
    {
        throw;
    }
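The same read-a-batch, flush-a-batch structure can be sketched in Python. This is a minimal sketch under stated assumptions: `executemany` on an in-memory SQLite database stands in for SqlBulkCopy, only three of the attributes are copied, and `<file>` elements sit directly under the root. `bulk_import` is a hypothetical name, and the table is named `files_index` after the one in the script.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# A subset of the attributes used in the script above
COLUMNS = ("Product_ID", "path", "Updated")


def bulk_import(xml_source, conn, batch_size=10000):
    """Stream <file> elements and insert their attributes in batches."""
    insert_sql = "INSERT INTO files_index ({}) VALUES ({})".format(
        ", ".join(COLUMNS), ", ".join("?" for _ in COLUMNS)
    )
    context = ET.iterparse(xml_source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    batch, total = [], 0
    for event, elem in context:
        if event != "end" or elem.tag != "file":
            continue
        # elem.get() returns None for a missing attribute, which is stored as NULL
        batch.append(tuple(elem.get(name) for name in COLUMNS))
        root.clear()  # release processed elements
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)  # one round of inserts per batch
            total += len(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany(insert_sql, batch)
        total += len(batch)
    conn.commit()
    return total


# Usage: three rows with a batch size of two exercises both flush paths
demo_xml = io.BytesIO(
    b'<files>'
    b'<file Product_ID="1" path="a.xml" Updated="2020-01-01"/>'
    b'<file Product_ID="2" path="b.xml" Updated="2020-01-02"/>'
    b'<file Product_ID="3" path="c.xml"/>'
    b'</files>'
)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE files_index (Product_ID TEXT, path TEXT, Updated TEXT)")
total_rows = bulk_import(demo_xml, demo_conn, batch_size=2)
```

The batch size trades memory for round trips; 10000 matches the script above but is not a tuned value.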
    