Extracting textual content from XML documents using XSLT [closed]

孤街醉人 submitted on 2019-12-25 18:57:16

Question


How can I extract the textual content of an XML document, preferably using XSLT?

For a fragment such as this,

<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>

the desired result is:

textual content, textual content, textual content

What is the best output format (table, CSV, etc.) for processing the content further, for example in text mining?

Thanks

Update

To extend the question: how can the content of each record be extracted separately? For example, for the XML below:

<Records>
<record id="1">
    <tag1>textual co</tag1>
    <tag2>textual con</tag2>
    <tag2>textual cont</tag2>
</record>
<record id="2">
    <tag1>some text</tag1>
    <tag2>some tex</tag2>
    <tag2>some te</tag2>
</record>
</Records>

The desired result should look like this:

(textual co, textual con, textual cont) , (some text, some tex, some te)

or some other format better suited to further processing.


Answer 1:


You can use the following XSLT:

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
    <xsl:apply-templates select="//text()"/>
</xsl:template>
<xsl:template match="text()">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
</xsl:transform>

And for the update in the question, you can use the following XSLT:

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/*">
    <xsl:apply-templates/>
</xsl:template>
<xsl:template match="*">(<xsl:apply-templates select=".//text()"/>)<xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
<xsl:template match="text()">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
</xsl:template>
</xsl:transform>
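
On the question of output format: for further processing such as text mining, one line per record with a fixed delimiter is usually easier to consume than the parenthesized form. The following is only a sketch, assuming tab-separated output is acceptable. The leading id column is my own addition and not something the question asked for. Since normalize-space() collapses any tabs or line breaks inside a value into single spaces, the delimiters stay unambiguous.

```xml
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8"/>

<xsl:template match="/Records">
    <!-- One output line per record -->
    <xsl:for-each select="record">
        <!-- First column: the record's id attribute (an assumed convention) -->
        <xsl:value-of select="@id"/>
        <!-- Remaining columns: each child element's text, preceded by a tab -->
        <xsl:for-each select="*">
            <xsl:text>&#9;</xsl:text>
            <xsl:value-of select="normalize-space(.)"/>
        </xsl:for-each>
        <!-- Terminate the line -->
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>
</xsl:transform>
```

For the Records input from the question this gives two lines, one starting with 1 and one starting with 2, each followed by the three field values separated by tabs.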



Answer 2:


Just an (updated) answer for the first part of the question. For the input in the question, the following XSLT

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="record">
    <xsl:for-each select="child::*">
      <xsl:value-of select="normalize-space()"/>
      <xsl:if test="position()!= last()">, </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>

has the result

textual content, textual content, textual content

The template matching record prints the whitespace-normalized value of each child element and appends ", " unless it is the last element.




Answer 3:


This is shorter and more generic in that it does not name any elements. It also exploits XSLT's built-in templates, which give the language default behaviour and reduce the amount of code you have to write. This assumes XSLT 1.0.

Below is a shorter variation of lingamurthyCS's answer that lets the built-in template rules handle the last text node. It's analogous to my previous answer.

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="*[position() != last()]">
    <xsl:value-of select="."/><xsl:text>,</xsl:text>    
</xsl:template>
</xsl:transform>
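
To make the reliance on built-in rules visible, here is a sketch of the same stylesheet with the two relevant XSLT 1.0 built-in rules written out explicitly. The explicit copies are redundant and are there only for illustration; the processor applies equivalent rules on its own when no other template matches.

```xml
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

<!-- Explicit copy of the built-in rule for the root and for elements:
     do nothing except recurse into the children -->
<xsl:template match="*|/">
    <xsl:apply-templates/>
</xsl:template>

<!-- Explicit copy of the built-in rule for text and attribute nodes:
     output the node's value -->
<xsl:template match="text()|@*">
    <xsl:value-of select="."/>
</xsl:template>

<!-- Any element that is not the last element among its siblings:
     output its value followed by a comma. The predicate gives this
     pattern a higher default priority than the plain * rule above. -->
<xsl:template match="*[position() != last()]">
    <xsl:value-of select="."/><xsl:text>,</xsl:text>
</xsl:template>
</xsl:transform>
```

The last child of record does not match the third template, so it falls through to the element rule, which recurses to its text node, which is then copied without a trailing comma.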

However, this particular job is better suited to XQuery.

Paste your XML into http://try.zorba.io/queries/xquery and just stick a /string-join(*,',') on the end of it, like so:

<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>/string-join(*,',')

Exercise for the OP to translate that into XSLT 2.0 if that is what they are using.
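
One possible translation of the string-join idea, as a sketch assuming an XSLT 2.0 processor (Saxon, for example) is available. The ", " separator with a space and the second template for the multi-record case are my own choices to match the output asked for in the question.

```xml
<xsl:transform version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<!-- Single record: join the string values of its child elements -->
<xsl:template match="/record">
    <xsl:value-of select="string-join(*, ', ')"/>
</xsl:template>

<!-- Several records: build one parenthesized group per record,
     then let the separator attribute join the groups -->
<xsl:template match="/Records">
    <xsl:value-of
        select="for $r in record
                return concat('(', string-join($r/*, ', '), ')')"
        separator=", "/>
</xsl:template>
</xsl:transform>
```

Neither template works under XSLT 1.0, which has no string-join() function, no for expression, and no separator attribute on xsl:value-of.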



Source: https://stackoverflow.com/questions/28032853/extracting-textual-content-from-xml-documents-using-xslt
