Complex XML to TSV using XSLT

Posted by 风格不统一 on 2020-01-04 15:54:31

Question


I have found a couple of previous questions that address parts of my problem (see here and here), but I'm having trouble integrating them. I have a set of XML records that I want to transform to tab-delimited format. However, not all the XML records have all fields, and some contain multiple instances of a field.

Two sample XML records:

<?xml version="1.0" encoding="UTF-8" ?>
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
    <marc:record>
        <marc:leader>02179 am a  002893u     </marc:leader>
        <marc:controlfield tag="001">12789</marc:controlfield>
        <marc:controlfield tag="005">20120521</marc:controlfield>
        <marc:controlfield tag="007">cuuuu---auuuu</marc:controlfield>
        <marc:controlfield tag="008">120521s||||    xx      o     0   u ||| |</marc:controlfield>
        <marc:datafield tag="020" ind1=" " ind2=" ">
            <marc:subfield code="a">9789089640574</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="100" ind1="1" ind2=" ">
            <marc:subfield code="a">Rooij van ,Robert</marc:subfield>
            <marc:subfield code="4">aut</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="245" ind1="1" ind2=" ">
            <marc:subfield code="a">New Perspectives on Games and Interaction</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="260" ind1=" " ind2=" ">
            <marc:subfield code="b">Amsterdam University Press</marc:subfield>
            <marc:subfield code="c">2008</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="300" ind1=" " ind2=" ">
            <marc:subfield code="a">1 electronic resource (330 p.)</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="520" ind1=" " ind2=" ">
            <marc:subfield code="a">This volume is a collection of papers ...</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="650" ind1=" " ind2="0">
            <marc:subfield code="a">Mathematics</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="650" ind1=" " ind2="0">
            <marc:subfield code="a">Philosophy (General)</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="650" ind1=" " ind2="0">
            <marc:subfield code="a">Economic theory. Demography</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="653" ind1=" " ind2=" ">
            <marc:subfield code="a">Economics</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="653" ind1=" " ind2=" ">
            <marc:subfield code="a">Philosophy</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="653" ind1=" " ind2=" ">
            <marc:subfield code="a">Mathematics</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="653" ind1=" " ind2=" ">
            <marc:subfield code="a">Economie</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="653" ind1=" " ind2=" ">
            <marc:subfield code="a">Filosofie</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="653" ind1=" " ind2=" ">
            <marc:subfield code="a">Wiskunde</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="700" ind1="1" ind2=" ">
            <marc:subfield code="a">Apt ,Krzysztof</marc:subfield>
            <marc:subfield code="4">aut</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="856" ind1="4" ind2="0">
            <marc:subfield code="u">http://www.doabooks.org/doab?func=fulltext&amp;rid=12789</marc:subfield>
            <marc:subfield code="z">Description of rights in Directory of Open Access Books (DOAB): Attribution Non-commercial (CC by-nc)</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="856" ind1="4" ind2="0">
            <marc:subfield code="u">http://www.oapen.org/download?type=document&amp;docid=340074</marc:subfield>
        </marc:datafield>
    </marc:record>
    <marc:record>
        <marc:leader>01452 am a  001933u     </marc:leader>
        <marc:controlfield tag="001">15497</marc:controlfield>
        <marc:controlfield tag="005">20140217</marc:controlfield>
        <marc:controlfield tag="007">cuuuu---auuuu</marc:controlfield>
        <marc:controlfield tag="008">140217s||||    xx      o     0   u ||| |</marc:controlfield>
        <marc:datafield tag="020" ind1=" " ind2=" ">
            <marc:subfield code="a">9788867050673</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="100" ind1="1" ind2=" ">
            <marc:subfield code="a">Emanuele Haus</marc:subfield>
            <marc:subfield code="4">aut</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="245" ind1="1" ind2=" ">
            <marc:subfield code="a">Dynamics of an elastic satellite with internal friction.</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="260" ind1=" " ind2=" ">
            <marc:subfield code="b">Ledizioni - LediPublishing</marc:subfield>
            <marc:subfield code="c">2013</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="300" ind1=" " ind2=" ">
            <marc:subfield code="a">1 electronic resource ( p.)</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="520" ind1=" " ind2=" ">
            <marc:subfield code="a">n this thesis, we study the dynamics...</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="546" ind1=" " ind2=" ">
            <marc:subfield code="a">english</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="650" ind1=" " ind2="0">
            <marc:subfield code="a">Mathematics</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="856" ind1="4" ind2="0">
            <marc:subfield code="u">http://www.doabooks.org/doab?func=fulltext&amp;rid=15497</marc:subfield>
            <marc:subfield code="z">Description of rights in Directory of Open Access Books (DOAB): Attribution Non-commercial Share Alike (CC by-nc-sa)</marc:subfield>
        </marc:datafield>
        <marc:datafield tag="856" ind1="4" ind2="0">
            <marc:subfield code="u">http://www.ledizioni.it/stag/wp-content/uploads/2014/02/tesi_haus.pdf</marc:subfield>
        </marc:datafield>
    </marc:record>
</marc:collection>

I've been trying to adapt the XSLT from this previous answer, with little luck so far:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xpath-default-namespace="http://www.loc.gov/MARC21/slim">
    <xsl:output method="text"/>
    <xsl:variable name="delimiter" select="'&#09;'"/>

    <xsl:strip-space elements="*"/>
    <xsl:output method="text"/>

    <xsl:key name="field" 
      match="/collection/record/datafield/subfield" 
      use="concat(../@tag,@code)"/>

    <!-- variable containing the first occurrence of each field -->
    <xsl:variable name="allFields"
        select="/collection/record/datafield/subfield
                [generate-id()
                 =generate-id(key('field', 
                                   concat(../@tag,@code))[1])]" />

    <xsl:template match="/">

        <xsl:for-each select="$allFields">
            <xsl:sort select="substring(concat(../@tag,@code),1,3)"
                      data-type="number"/>
            <xsl:value-of select="concat(../@tag,@code)" />
            <xsl:if test="position() &lt; last()">
                <xsl:value-of select="$delimiter" />
            </xsl:if>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>
        <xsl:apply-templates select="*/*" />
    </xsl:template>

    <xsl:template match="*">
        <xsl:variable name="this" select="." />

        <xsl:for-each select="$allFields">
            <xsl:sort 
              select="substring(concat(../@tag,@code),1,3)" 
              data-type="number"/>
            <xsl:value-of 
              select="$this/*[@code = current()/@code]" />
            <xsl:if test="position() &lt; last()">
                <xsl:value-of select="$delimiter" />
            </xsl:if>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>

In the output I'm trying to achieve, the header would consist of the leader followed by the unique values of @tag (concatenated with subfield/@code for subfields), sorted in ascending order by tag:

leader  001 005 007 008 020a    100a    1004    245a    260b    260c    300a    520a    546a    650a    653a    700a    7004    856u    856z

If a record has multiple values for a single field/subfield combination, I want to concatenate them, separated by a pipe character, for example:

653a
Economics|Philosophy|Mathematics

However, if a record is missing a particular field, I want to just output a tab character, to keep everything aligned.

Full sample TSV output:

leader  001 005 007 008 020a    100a    1004    245a    260b    260c    300a    520a    546a    650a    653a    700a    7004    856u    856z                                        
02179 am a  002893u         12789   20120521    cuuuu---auuuu   120521s||||    xx      o     0   u ||| |    9789089640574   Rooij van ,Robert   aut New Perspectives on Games and Interaction   Amsterdam University Press  2008    1 electronic resource (330 p.)  This volume is a collection of papers       Mathematics|Philosophy (General)|Economic theory. Demography    Economics|Philosophy|Mathematics|Economie|Filosofie|Wiskunde    Apt ,Krzysztof  aut http://www.doabooks.org/doab?func=fulltext&amp;rid=12789|http://www.oapen.org/download?type=document&amp;docid=340074   Description of rights in Directory of Open Access Books (DOAB): Attribution Non-commercial (CC by-nc)
01452 am a  001933u         15497   20140217    cuuuu---auuuu   140217s||||    xx      o     0   u ||| |    9788867050673   Emanuele Haus   aut Dynamics of an elastic satellite with internal friction.    Ledizioni - LediPublishing  2013    1 electronic resource ( p.) In this thesis, we study the dynamics of an elastic body    english Mathematics             http://www.doabooks.org/doab?func=fulltext&amp;rid=15497|http://www.ledizioni.it/stag/wp-content/uploads/2014/02/tesi_haus.pdf  Description of rights in Directory of Open Access Books (DOAB): Attribution Non-commercial Share Alike (CC by-nc-sa)                                        

Answer 1:


I would suggest you try it this way:

XSLT 2.0

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:marc="http://www.loc.gov/MARC21/slim"
exclude-result-prefixes="marc">
<xsl:output method="text" encoding="UTF-8"/>

<xsl:variable name="fields">
    <xsl:for-each-group select="/marc:collection/marc:record/marc:datafield" group-by="@tag">
        <xsl:sort select="@tag"/>
            <xsl:for-each select="marc:subfield">
                <xsl:sort/>
                <field tag="{current-grouping-key()}" code="{@code}">a</field>
            </xsl:for-each>
    </xsl:for-each-group>
</xsl:variable>

<xsl:template match="/">
    <!-- header -->
    <xsl:for-each select="$fields/field">
        <xsl:value-of select="@tag"/>
        <xsl:value-of select="@code"/>
        <xsl:if test="position()!=last()">
            <xsl:text>&#9;</xsl:text>
        </xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
    <!-- data -->
    <xsl:for-each select="marc:collection/marc:record">
        <xsl:variable name="current-record" select="." />
        <xsl:for-each select="$fields/field">
            <xsl:value-of select="$current-record/marc:datafield[@tag=current()/@tag]/marc:subfield[@code=current()/@code]" separator="|"/>
            <xsl:if test="position()!=last()">
                <xsl:text>&#9;</xsl:text>
            </xsl:if>
        </xsl:for-each>
        <xsl:if test="position()!=last()">
            <xsl:text>&#10;</xsl:text>
        </xsl:if>
    </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

The result, when applied to your example input:

020a    100a    1004    245a    260c    260b    300a    520a    546a    650a    653a    700a    7004    856z    856u
9789089640574   Rooij van ,Robert   aut New Perspectives on Games and Interaction   2008    Amsterdam University Press  1 electronic resource (330 p.)  This volume is a collection of papers ...       Mathematics|Philosophy (General)|Economic theory. Demography    Economics|Philosophy|Mathematics|Economie|Filosofie|Wiskunde    Apt ,Krzysztof  aut Description of rights in Directory of Open Access Books (DOAB): Attribution Non-commercial (CC by-nc)   http://www.doabooks.org/doab?func=fulltext&rid=12789|http://www.oapen.org/download?type=document&docid=340074
9788867050673   Emanuele Haus   aut Dynamics of an elastic satellite with internal friction.    2013    Ledizioni - LediPublishing  1 electronic resource ( p.) n this thesis, we study the dynamics... english Mathematics             Description of rights in Directory of Open Access Books (DOAB): Attribution Non-commercial Share Alike (CC by-nc-sa)    http://www.doabooks.org/doab?func=fulltext&rid=15497|http://www.ledizioni.it/stag/wp-content/uploads/2014/02/tesi_haus.pdf

Note: I couldn't figure out the role of the "leader" in either the input or the output.
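
If the leader and the control fields are wanted as columns too, as the header in the question suggests, one possible variant is sketched below. It is only a sketch under assumptions: the variable names $leaves and $columns are made up for illustration, the column list is collected from every record rather than from the first datafield of each tag, and the columns are sorted as plain strings, so 1004 will most likely come before 100a (the exact order depends on the processor's default collation).

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:marc="http://www.loc.gov/MARC21/slim"
  exclude-result-prefixes="xs marc">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- every value-carrying node: control fields and subfields (hypothetical name) -->
  <xsl:variable name="leaves"
    select="/marc:collection/marc:record/(marc:controlfield | marc:datafield/marc:subfield)"/>

  <!-- distinct column keys across all records: the tag for a control field,
       tag + code for a subfield; sorted as plain strings -->
  <xsl:variable name="columns" as="xs:string*">
    <xsl:perform-sort select="distinct-values($leaves/concat(@tag, ../@tag, @code))">
      <xsl:sort select="."/>
    </xsl:perform-sort>
  </xsl:variable>

  <xsl:template match="/">
    <!-- header row -->
    <xsl:value-of select="'leader', $columns" separator="&#9;"/>
    <xsl:text>&#10;</xsl:text>
    <!-- one row per record -->
    <xsl:for-each select="marc:collection/marc:record">
      <xsl:variable name="rec" select="."/>
      <xsl:value-of select="marc:leader"/>
      <xsl:for-each select="$columns">
        <!-- the tab is always written, so a missing field still yields an empty cell -->
        <xsl:text>&#9;</xsl:text>
        <xsl:value-of
          select="$rec/(marc:controlfield | marc:datafield/marc:subfield)
                      [concat(@tag, ../@tag, @code) = current()]"
          separator="|"/>
      </xsl:for-each>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because the header and the data rows both iterate over the same $columns sequence, the two cannot drift out of alignment.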




Answer 2:


You say "if a record is missing a particular field" -- from this I infer that you must have a list of the fields you want to export. (All of MARC? Every theoretically possible field from 000 to 999? Only you can say, and you haven't said.) If you don't have a list of the fields you want to export, then your problem statement is self-contradictory and you need to understand the problem better.

Let us say, for example, that you want to export the fields listed in the variable $fields.

<xsl:variable name="fields" as="xs:string*"
  select="tokenize('001 005 007 008 020 
                    100 245 260 260 300 
                    520 546 650 653 700 
                    856', '\s+')"/>

Your current problem is that your output is being shaped by the fields present in the input, in what many XSLT programmers call a 'push' stylesheet. You want the output to be shaped by the list of fields in $fields, not by the input -- you want what those XSLT programmers call a 'pull' stylesheet. Pull stylesheets are common when we are preparing data for non-XML systems like spreadsheets, which aren't very good about variations in structure; they are also common among procedural programmers who know no other way to think about problems. Both of these lead some XSLT programmers to look down their noses a bit at pull stylesheets, but if you have described your problem correctly, a pull stylesheet is what you need.

From what has been said so far, you should be able to see that your problem is that the template for / is constructing the output by processing the input, with <xsl:apply-templates select="*/*" />. If the input has no 546 fields, there is no opportunity to insert a tab where they would have appeared, without a lot of unnecessary effort.

You want to replace the current apply-templates, which iterates over the grand-children, with a construct that iterates over the field numbers in $fields, and for each field number emits a tab and any other appropriate information, where the other appropriate information depends on whether fields with that number are present in the input or not. In XSLT 3.0 you will be able to apply templates to a sequence of values, so you could write <xsl:apply-templates select="$fields"/>, but in 2.0, that's not an option. Options available in 2.0 include:

  • Represent $fields not as a sequence of strings but as a sequence of elements; call <xsl:apply-templates select="$fields"/> to iterate over the desired field numbers. You will need to remember to pass in a node from the input document (the root is a good choice), so you can get back into it from the template for the field number.

  • Call a named template with $fields as a parameter; in the named template, pick off the first field number from the list, process it, and then call the same named template recursively, with the remainder of the list. If there is no first field number, the sequence of field numbers is empty, and you're done.

  • Write a recursive function that works in the same way as the named template just described.

  • Write a function that handles one field number for one MARC record, and call it from an XPath for expression:

    <xsl:template match="marc:record">
      ...
      <xsl:sequence select="for $fn in $fields
         return my:one-field-one-record($fn, .)
         "/>
      ...
    </xsl:template>
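
To make the last option concrete, here is a minimal sketch of what such a function might look like. Several things in it are assumptions rather than part of the answer above: the namespace URI bound to the my prefix is a placeholder, the short $fields list is chosen only for illustration, and joining repeated values with a pipe follows the question's requirement. Note that with tag-only field numbers, all subfields of a tag end up in the same cell.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:marc="http://www.loc.gov/MARC21/slim"
  xmlns:my="http://example.org/my"
  exclude-result-prefixes="xs marc my">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- illustrative list of field numbers; replace with the fields you need -->
  <xsl:variable name="fields" as="xs:string*"
    select="('001', '245', '546', '650')"/>

  <!-- one cell: every value of field $fn in record $rec, joined with '|';
       returns the empty string when the record has no such field -->
  <xsl:function name="my:one-field-one-record" as="xs:string">
    <xsl:param name="fn" as="xs:string"/>
    <xsl:param name="rec" as="element(marc:record)"/>
    <xsl:sequence select="string-join(
        $rec/(marc:controlfield | marc:datafield)[@tag = $fn]
            /(self::marc:controlfield | marc:subfield)
            /normalize-space(.),
        '|')"/>
  </xsl:function>

  <xsl:template match="/">
    <!-- header row: the field numbers themselves -->
    <xsl:value-of select="string-join($fields, '&#9;')"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates select="marc:collection/marc:record"/>
  </xsl:template>

  <xsl:template match="marc:record">
    <!-- one cell per requested field, present in the input or not -->
    <xsl:value-of select="string-join(
        for $fn in $fields return my:one-field-one-record($fn, .),
        '&#9;')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Since the row is built from $fields and not from the input, a record without a 546 field simply contributes an empty string for that position, and string-join still writes the surrounding tabs.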
    



Answer 3:


This is possible in XSLT 1.0 as well.

The following solution is built around a document-wide list of unique tags and iterating that list for every record. In effect this allows outputting delimiters even when a particular tag is not present in a record.

<xsl:stylesheet version="1.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:marc="http://www.loc.gov/MARC21/slim"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>
  <xsl:output method="text" encoding="Windows-1252" />

  <xsl:param name="hDelim" select="'&#x9;'" /><!-- horizontal delimiter (between columns) -->
  <xsl:param name="vDelim" select="'&#xA;'" /><!-- vertical delimiter (between rows) -->
  <xsl:param name="sDelim" select="'|'" /><!-- subfield delimiter -->

  <!-- group tags by @tag + @code -->
  <xsl:key name="kAllTags" match="marc:controlfield | marc:subfield" use="
    concat(@tag, ../@tag, @code)
  " />
  <!-- group tags by record ID +  @tag + @code -->
  <xsl:key name="kRecordTags" match="marc:controlfield | marc:subfield" use="
    concat(generate-id(ancestor::marc:record), ':', @tag|../@tag, @code)
  " />
  <!-- build a list of unique tags to iterate over -->
  <xsl:variable name="uniqueTags" select="
    (//marc:controlfield | //marc:subfield)[
      generate-id() = generate-id(key('kAllTags', concat(@tag | ../@tag, @code)))
    ]
  " />

  <xsl:template match="marc:collection">
    <!-- write header line -->
    <xsl:text>leader</xsl:text>
    <xsl:value-of select="$hDelim" />

    <xsl:apply-templates select="$uniqueTags" mode="head">
      <xsl:sort select="concat(@tag|../@tag, @code)" />
    </xsl:apply-templates>
    <xsl:value-of select="$vDelim" />

    <!-- write individual records -->
    <xsl:apply-templates select="marc:record" />
  </xsl:template>

  <xsl:template match="marc:record">
    <xsl:variable name="recordId" select="generate-id()" />

    <xsl:value-of select="marc:leader" />
    <xsl:value-of select="$hDelim" />

    <!-- for each unique tag, find the fields that have that tag on this record -->
    <!-- iterate in the same sorted order as the header, so the data columns line up with it -->
    <xsl:for-each select="$uniqueTags">
      <xsl:sort select="concat(@tag|../@tag, @code)" />
      <xsl:variable name="tagKey" select="concat($recordId, ':', @tag|../@tag, @code)" />
      <xsl:apply-templates select="key('kRecordTags', $tagKey)" mode="data" />
      <xsl:if test="position() != last()"><xsl:value-of select="$hDelim" /></xsl:if>
    </xsl:for-each>
    <xsl:if test="position() != last()"><xsl:value-of select="$vDelim" /></xsl:if>
  </xsl:template>

  <xsl:template match="marc:controlfield | marc:subfield" mode="head">
    <xsl:value-of select="concat(@tag|../@tag, @code)" />
    <xsl:if test="position() != last()"><xsl:value-of select="$hDelim" /></xsl:if>
  </xsl:template>

  <xsl:template match="marc:controlfield | marc:subfield" mode="data">
    <xsl:value-of select="normalize-space()" />
    <xsl:if test="position() != last()"><xsl:value-of select="$sDelim" /></xsl:if>
  </xsl:template>
</xsl:stylesheet>

With your input data, this stylesheet generates:

leader  001 005 007 008 020a    1004    100a    245a    260b    260c    300a    520a    546a    650a    653a    7004    700a    856u    856z
02179 am a  002893u         12789   20120521    cuuuu---auuuu   120521s|||| xx o 0 u ||| |  9789089640574   aut Rooij van ,Robert   New Perspectives on Games and Interaction   Amsterdam University Press  2008    1 electronic resource (330 p.)  This volume is a collection of papers ...       Mathematics|Philosophy (General)|Economic theory. Demography    Economics|Philosophy|Mathematics|Economie|Filosofie|Wiskunde    aut Apt ,Krzysztof  http://www.doabooks.org/doab?func=fulltext&rid=12789|http://www.oapen.org/download?type=document&docid=340074   Description of rights in Directory of Open Access Books (DOAB): Attribution Non-commercial (CC by-nc)
01452 am a  001933u         15497   20140217    cuuuu---auuuu   140217s|||| xx o 0 u ||| |  9788867050673   aut Emanuele Haus   Dynamics of an elastic satellite with internal friction.    Ledizioni - LediPublishing  2013    1 electronic resource ( p.) n this thesis, we study the dynamics... english Mathematics             http://www.doabooks.org/doab?func=fulltext&rid=15497|http://www.ledizioni.it/stag/wp-content/uploads/2014/02/tesi_haus.pdf  Description of rights in Directory of Open Access Books (DOAB): Attribution Non-commercial Share Alike (CC by-nc-sa)


Source: https://stackoverflow.com/questions/27319143/complex-xml-to-tsv-using-xslt
