We export “records” to an XML file; one of our customers has complained that the file is too big for their other system to process. Therefore I need to split up the file,
"Is there a standard command-line tool that will work on Windows that does it?"
Yes. http://xponentsoftware.com/xmlSplit.aspx
There's no general-purpose solution to this, because there are so many different ways your source XML could be structured.
It's reasonably straightforward to build an XSLT transform that will output a slice of an XML document. For instance, given this XML:
<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>
you can output a copy of the file containing only data elements within a certain range with this XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="startPosition"/>
  <xsl:param name="endPosition"/>
  <!-- Identity transform: copy everything not matched by a more specific template. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="header">
    <xsl:copy>
      <xsl:apply-templates select="data"/>
    </xsl:copy>
  </xsl:template>
  <!-- Keep only the data elements inside the requested range. -->
  <xsl:template match="data">
    <xsl:if test="position() &gt;= $startPosition and position() &lt;= $endPosition">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
(Note, by the way, that because this is based on the identity transform, it works even if header isn't the top-level element.)
You still need to count the data elements in the source XML, and run the transform repeatedly with the values of $startPosition and $endPosition that are appropriate for the situation.
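As a rough sketch of that counting step, the Python below counts the data elements and works out the 1-based, inclusive position pairs to pass to the transform. The helper names (count_records, slice_ranges) and the chunk size of 4 are made up for illustration, and it assumes all data elements sit under a single header parent, as in the sample above. Running the XSLT itself with each pair is left to whatever processor you use.

```python
import xml.etree.ElementTree as ET


def count_records(xml_text, tag="data"):
    """Count elements named `tag` anywhere in the document."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter(tag))


def slice_ranges(total, per_file):
    """Yield 1-based inclusive (startPosition, endPosition) pairs."""
    for start in range(1, total + 1, per_file):
        yield start, min(start + per_file - 1, total)


# Same shape as the sample document above.
sample = """<header>
<data rec="1"/><data rec="2"/><data rec="3"/>
<data rec="4"/><data rec="5"/><data rec="6"/>
</header>"""

total = count_records(sample)
ranges = list(slice_ranges(total, 4))  # 4 records per output file (arbitrary)
for start, end in ranges:
    print(f"startPosition={start} endPosition={end}")
```

Each printed pair corresponds to one run of the stylesheet and therefore one output file.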
Using UltraEdit, based on the script at https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704
All I added was some XML header and footer bits. The first and last files need to be fixed manually (or remove the root element from your source first).
// from https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704
var FoundsPerFile = 200; // Global setting for number of found split strings per file.
var SplitString = "</letter>"; // String where to split. The split occurs after next character.
var xmlHead = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';
var xmlRootStart = '<letters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" letterCode="OA01" >';
var xmlRootEnd = '</letters>';
/* Find the tab index of the active document */
// Copied from http://www.ultraedit.com/forums/viewtopic.php?t=4571
function getActiveDocumentIndex () {
   var tabindex = -1; /* start value */
   for (var i = 0; i < UltraEdit.document.length; i++)
   {
      if (UltraEdit.activeDocument.path==UltraEdit.document[i].path) {
         tabindex = i;
         break;
      }
   }
   return tabindex;
}
if (UltraEdit.document.length) { // Is any file open?
   // Set working environment required for this job.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
   UltraEdit.ueReOn();
   // Move cursor to top of active file and run the initial search.
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=false;
   // If the string to split is not found in this file, do nothing.
   if (UltraEdit.activeDocument.findReplace.find(SplitString)) {
      // This file is probably the correct file for this script.
      var FileNumber = 1; // Counts the number of saved files.
      var StringsFound = 1; // Counts the number of found split strings.
      var NewFileIndex = UltraEdit.document.length;
      /* Get the path of the current file to save the new
         files in the same directory as the current file. */
      var SavePath = "";
      var LastBackSlash = UltraEdit.activeDocument.path.lastIndexOf("\\");
      if (LastBackSlash >= 0) {
         LastBackSlash++;
         SavePath = UltraEdit.activeDocument.path.substring(0,LastBackSlash);
      }
      /* Get active file index in case of more than 1 file is open and the
         current file does not get back the focus after closing the new files. */
      var FileToSplit = getActiveDocumentIndex();
      // Always use clipboard 9 for this script and not the Windows clipboard.
      UltraEdit.selectClipboard(9);
      // Split the file after every x found split strings until source file is empty.
      while (1) {
         while (StringsFound < FoundsPerFile) {
            if (UltraEdit.document[FileToSplit].findReplace.find(SplitString)) StringsFound++;
            else {
               UltraEdit.document[FileToSplit].bottom();
               break;
            }
         }
         // End the selection of the find command.
         UltraEdit.document[FileToSplit].endSelect();
         // Move the cursor right to include the next character and unselect the found string.
         UltraEdit.document[FileToSplit].key("RIGHT ARROW");
         // Select from this cursor position everything to top of the file.
         UltraEdit.document[FileToSplit].selectToTop();
         // Is the file not already empty?
         if (UltraEdit.document[FileToSplit].isSel()) {
            // Cut the selection and paste it into a new file.
            UltraEdit.document[FileToSplit].cut();
            UltraEdit.newFile();
            UltraEdit.document[NewFileIndex].setActive();
            UltraEdit.activeDocument.paste();
            /* Add line termination on the last line and remove automatically added indent
               spaces/tabs if auto-indent is enabled if the last line is not already terminated. */
            if (UltraEdit.activeDocument.isColNumGt(1)) {
               UltraEdit.activeDocument.insertLine();
               if (UltraEdit.activeDocument.isColNumGt(1)) {
                  UltraEdit.activeDocument.deleteToStartOfLine();
               }
            }
            // add headers and footers
            UltraEdit.activeDocument.top();
            UltraEdit.activeDocument.write(xmlHead);
            UltraEdit.activeDocument.write(xmlRootStart);
            UltraEdit.activeDocument.bottom();
            UltraEdit.activeDocument.write(xmlRootEnd);
            // Build the file name for this new file.
            var SaveFileName = SavePath + "LETTER";
            if (FileNumber < 10) SaveFileName += "0";
            SaveFileName += String(FileNumber) + ".raw.xml";
            // Save the new file and close it.
            UltraEdit.saveAs(SaveFileName);
            UltraEdit.closeFile(SaveFileName,2);
            FileNumber++;
            StringsFound = 0;
            /* Delete the line termination in the source file
               if last found split string was at end of a line. */
            UltraEdit.document[FileToSplit].endSelect();
            UltraEdit.document[FileToSplit].key("END");
            if (UltraEdit.document[FileToSplit].isColNumGt(1)) {
               UltraEdit.document[FileToSplit].top();
            } else {
               UltraEdit.document[FileToSplit].deleteLine();
            }
         } else break;
         UltraEdit.outputWindow.write("Progress " + SaveFileName);
      } // Loop executed until source file is empty!
      // Close source file without saving and re-open it.
      var NameOfFileToSplit = UltraEdit.document[FileToSplit].path;
      UltraEdit.closeFile(NameOfFileToSplit,2);
      /* The following code line could be commented if the source
         file is not needed anymore for further actions. */
      UltraEdit.open(NameOfFileToSplit);
      // Free memory and switch back to Windows clipboard.
      UltraEdit.clearClipboard();
      UltraEdit.selectClipboard(0);
   }
}
First, download the foxe XML editor from http://www.firstobject.com/foxe242.zip
Then watch the video at http://www.firstobject.com/xml-splitter-script-video.htm, which explains how the split code works.
That page also has the script (it starts with split()). Copy it, create a "New Program" from the "File" menu in the XML editor, paste the code, and save it. The code is:
split()
{
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "**50MB.xml**", MDF_READFILE );
  int nObjectCount = 0, nFileCount = 0;
  while ( xmlInput.FindElem("//**ACT**") )
  {
    if ( nObjectCount == 0 )
    {
      ++nFileCount;
      xmlOutput.Open( "**piece**" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( "**root**" );
      xmlOutput.IntoElem();
    }
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == **5** )
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}
Change the fields marked in bold (or with ** **) to suit your needs. (This is also explained on the video page.)
In the XML editor window, right-click and choose RUN (or simply press F9). The output bar in the window shows the number of files that were generated.
Note:
The input file name can be a full path such as "C:\\Users\\AUser\\Desktop\\a_xml_file.xml" (note the doubled backslashes),
and the output file name can be "C:\\Users\\AUser\\Desktop\\anoutputfolder\\piece" + nFileCount + ".xml"
There is nothing built in that can handle this situation easily.
Your approach sounds reasonable, though I would probably start with a "skeleton" document containing the elements that need to be repeated and generate several documents with the "records".
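One possible reading of the "skeleton" idea, sketched in Python with only the standard library: copy the root element without its records, then fill a fresh copy with each batch of records. The function name split_records, the letters/letter element names, and the batch size are illustrative assumptions, not part of any existing tool. The sketch loads the whole document into memory, so a very large export would need a streaming parser instead.

```python
import copy
import xml.etree.ElementTree as ET


def split_records(xml_text, record_tag, per_file):
    """Return XML strings, each holding up to `per_file` records
    inside a copy of the original root element (the skeleton)."""
    root = ET.fromstring(xml_text)
    records = root.findall(record_tag)  # direct children of the root only

    # Skeleton: the root with its attributes and any non-record children.
    skeleton = copy.deepcopy(root)
    for rec in skeleton.findall(record_tag):
        skeleton.remove(rec)

    pieces = []
    for i in range(0, len(records), per_file):
        piece = copy.deepcopy(skeleton)
        piece.extend(records[i:i + per_file])
        pieces.append(ET.tostring(piece, encoding="unicode"))
    return pieces


# Hypothetical input: five records under a root that carries an attribute.
sample = (
    '<letters letterCode="OA01">'
    '<letter id="1"/><letter id="2"/><letter id="3"/>'
    '<letter id="4"/><letter id="5"/>'
    '</letters>'
)

pieces = split_records(sample, "letter", 2)
print(len(pieces), "pieces")
```

Each string in pieces is a well-formed document on its own and can be written to a separate file.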
Update:
After a bit of digging, I found this article describing a way to split files using XSLT.
xml_split - split huge XML documents into smaller chunks
http://www.perlmonks.org/index.pl?node_id=429707
http://metacpan.org/pod/XML::Twig