A way to split a huge XML file into smaller xml files with XSL

倾然丶 夕夏残阳落幕 提交于 2019-12-11 07:08:09

问题


I get a huge XML file containing a list of TV broadcasts. And I have to split it up into small files containing all broadcasts for one day only. I managed to to that but have two problems with the xml header and a node being there multiple times.

The structure of the XML is the following:

<?xml version="1.0" encoding="UTF-8"?>
<broadcasts>
    <broadcast>
    <id>4637445812</id>
    <week>39</week>
    <date>2009-09-22</date>
    <time>21:45:00:00</time>
        ... (some more)
    </broadcast>
    ... (long list of broadcast nodes)
</broadcasts>

My XSL looks like this:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:redirect="http://xml.apache.org/xalan/redirect"
        extension-element-prefixes="redirect"
        version="1.0">
    <!-- mark the CDATA escaped tags -->
    <xsl:output method="xml" cdata-section-elements="title text"
        indent="yes" omit-xml-declaration="no" />

    <xsl:template match="broadcasts">
        <xsl:apply-templates />
    </xsl:template>

    <xsl:template match="broadcast">
    <!-- Build filename PRG_YYYYMMDD.xml -->
    <xsl:variable name="filename" select="concat(substring(date,1,4),substring(date,6,2))"/>
    <xsl:variable name="filename" select="concat($filename,substring(date,9,2))" />
    <xsl:variable name="filename" select="concat($filename,'.xml')" />
    <redirect:write select="concat('PRG_',$filename)" append="true">    

        <schedule>  
        <broadcast program="TEST">
            <!-- format timestamp in specific way -->
            <xsl:variable name="tmstmp" select="concat(substring(date,9,2),'/')"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,substring(date,6,2))"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,'/')"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,substring(date,1,4))"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,' ')"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,substring(time,1,5))"/>

            <timestamp><xsl:value-of select="$tmstmp"/></timestamp>
            <xsl:copy-of select="title"/>
            <text><xsl:value-of select="subtitle"/></text>

            <xsl:variable name="newVps" select="concat(substring(vps,1,2),substring(vps,4,2))"/>
            <xsl:variable name="newVps" select="concat($newVps,substring(vps,7,2))"/>
            <xsl:variable name="newVps" select="concat($newVps,substring(vps,10,2))"/>
            <vps><xsl:value-of select="$newVps"/></vps>                    
            <nextday>false</nextday>               
        </broadcast>      
        </schedule>
    </redirect:write>
    </xsl:template> 
</xsl:stylesheet>

My output XMLs are like this:

PRG_20090512.xml:

<?xml version="1.0" encoding="UTF-8"?>
  <schedule>
    <broadcast program="TEST">
      <timestamp>01/03/2010 06:00</timestamp>
      <title><![CDATA[TELEKOLLEG  Geschichte ]]></title>
      <text><![CDATA[Giganten in Fernost]]></text>
      <vps>06000000</vps>
      <nextday>false</nextday>
    </broadcast>
  </schedule>
<?xml version="1.0" encoding="UTF-8"?> <!-- don't want this -->
  <schedule>  <!-- don't want this -->
    <broadcast program="TEST">
      <timestamp>01/03/2010 06:30</timestamp>
      <title><![CDATA[Die chemische Bindung]]></title>
      <text/>
      <vps>06300000</vps>
      <nextday>false</nextday>
    </broadcast>
  </schedule>
<?xml version="1.0" encoding="UTF-8"?>
...and so on

I can put in omit-xml-declaration="yes" in the output declaration, but the I don't have any xml header. I tried to put in a check if the tag is already in the output, but failed to select nodes in the output...

This is what I tried:

<xsl:choose>
  <xsl:when test="count(schedule) = 0"> <!-- schedule needed -->   
    <schedule>
      <broadcast>
    ...
  <xsl:otherwise> <!-- no schedule needed -->
    <broadcast>
    ...

Thanks for any help, as I'm unaware how to handle that. ;( YeTI


回答1:


Write a single file at a time, containing all broadcasts for that date.

This becomes a problem of grouping the input elements by date. As Xalan is XSLT 1.0, you do this with keys.

We define a key to group broadcasts by date. The we select each broadcast that is the first of its date. Then select all the broadcasts for the same date using the key function.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:redirect="http://xml.apache.org/xalan/redirect"
                extension-element-prefixes="redirect"
                version="1.0">

    <!-- mark the CDATA escaped tags --> 
    <xsl:output method="xml" cdata-section-elements="title text" indent="yes" omit-xml-declaration="no" />

    <xsl:key name="date" match="broadcast" use="date" />

    <xsl:template match="broadcasts">
        <xsl:apply-templates select="broadcast[generate-id(.)=generate-id(key('date',date)[1])]"/>
    </xsl:template>

    <xsl:template match="broadcast">
        <!-- Build filename PRG_YYYYMMDD.xml -->
        <xsl:variable name="filename" select="concat(substring(date,1,4),substring(date,6,2))"/>
        <xsl:variable name="filename" select="concat($filename,substring(date,9,2))" />
        <xsl:variable name="filename" select="concat($filename,'.xml')" />

        <redirect:write select="concat('PRG_',$filename)" append="true">        

            <schedule>
                <xsl:apply-templates select="key('date',date)" mode="broadcast" />
            </schedule>

        </redirect:write>

    </xsl:template>

    <xsl:template match="broadcast" mode="broadcast">
        <broadcast program="TEST">
            <!-- format timestamp in specific way -->
            <xsl:variable name="tmstmp" select="concat(substring(date,9,2),'/')"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,substring(date,6,2))"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,'/')"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,substring(date,1,4))"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,' ')"/>
            <xsl:variable name="tmstmp" select="concat($tmstmp,substring(time,1,5))"/>

            <timestamp><xsl:value-of select="$tmstmp"/></timestamp>
            <xsl:copy-of select="title"/>
            <text><xsl:value-of select="subtitle"/></text>

            <xsl:variable name="newVps" select="concat(substring(vps,1,2),substring(vps,4,2))"/>
            <xsl:variable name="newVps" select="concat($newVps,substring(vps,7,2))"/>
            <xsl:variable name="newVps" select="concat($newVps,substring(vps,10,2))"/>
            <vps><xsl:value-of select="$newVps"/></vps>                                     
            <nextday>false</nextday>                             
        </broadcast>
    </xsl:template>

</xsl:stylesheet>



回答2:


Wrap your schedule elements with a unique parent and see if that makes the problem go away.

I'm not familiar with this particular problem, but my guess is that it's caused by your trying to generate XML documents with multiple top level elements. Every XML document must have exactly one top level element (a stupid requirement if you ask me, e.g. it makes XML unsuitable for logfiles, but that's the way it is).



来源:https://stackoverflow.com/questions/2424209/a-way-to-split-a-huge-xml-file-into-smaller-xml-files-with-xsl

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!