How to parse XML in Bash?

前端 未结 15 2616
情深已故
情深已故 2020-11-22 03:11

Ideally, what I would like to be able to do is:

cat xhtmlfile.xhtml |
getElementViaXPath --path=\'/html/head/title\' |
sed -e \'s%(^|</title&         
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle"
     style="display:block"
     data-ad-client="ca-pub-5408099190056760"
     data-ad-slot="7305827575"
     data-ad-format="auto"
     data-full-width-responsive="true"></ins>
<script>
     (adsbygoogle = window.adsbygoogle || []).push({});
</script>        </div>
                      <div class="relativetags">
              相关标签:
       </div>
      </div>
      
      <div class="fly-panel detail-box" id="flyReply">
        <fieldset class="layui-elem-field layui-field-title" style="text-align: center;">
          <legend>15条回答</legend>        </fieldset>

        <ul class="jieda" id="jieda">
                         				            <li data-id="111">
            <a name="item-1111111111"></a>
           
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="https://www.e-learn.cn/qa/u-104.html">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/01/small_000000104.jpg" alt=" ">
              </a>
              <div class="fly-detail-user">
                <a href="https://www.e-learn.cn/qa/u-104.html" class="fly-link">
                  <cite>走了就别回头了 </cite>       
                </a>
              </div>
                            <div class="detail-hits">
                <span>2020-11-22 03:22</span>
              </div>
            </div>
            <div class="detail-body jieda-body photos">
                                                                       
<p>While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.</p>

<p>Here is a python script which uses lxml for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.</p>

<h1>Example 1</h1>

<pre class="lang-python prettyprint-override"><code>#!/usr/bin/env python
import sys
from lxml import etree

tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]

#  a hack allowing to access the
#  default namespace (if defined) via the 'p:' prefix    
#  E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
#  an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if ns.keys() and None in ns:
    ns['p'] = ns.pop(None)
#   end of hack    

for e in tree.xpath(xpath_expression, namespaces=ns):
    if isinstance(e, str):
        print(e)
    else:
        print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))
</code></pre>

<p><code>lxml</code> can be installed with <code>pip install lxml</code>. On ubuntu you can use <code>sudo apt install python-lxml</code>.</p>

<h3>Usage</h3>

<pre class="lang-bash prettyprint-override"><code>python xpath.py myfile.xml "//mynode"
</code></pre>

<p><code>lxml</code> also accepts a URL as input:</p>

<pre class="lang-bash prettyprint-override"><code>python xpath.py http://www.feedforall.com/sample.xml "//link"
</code></pre>

<blockquote>
  <p><strong>Note</strong>: If your XML has a default namespace with no prefix (e.g. <code>xmlns=http://abc...</code>) then you have to use the <code>p</code> prefix (provided by the 'hack') in your expressions, e.g. <code>//p:module</code> to get the modules from a <code>pom.xml</code> file. In case the <code>p</code> prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.</p>
</blockquote>

<hr>

<h1>Example 2</h1>

<p>A one-off script which serves the narrow purpose of extracting module names from an apache maven file. Note how the node name (<code>module</code>) is prefixed with the default namespace <code>{http://maven.apache.org/POM/4.0.0}</code>:</p>

<p><em>pom.xml</em>:
</p>

<pre><code><?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modules>
        <module>cherries</module>
        <module>bananas</module>
        <module>pears</module>
    </modules>
</project>
</code></pre>

<p><em>module_extractor.py</em>:
</p>

<pre><code>from lxml import etree
for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
    print(e.text)
</code></pre>
                                                                        <div class="appendcontent">
                                                        </div>
            </div>
            <div class="jieda-reply">
              <span class="jieda-zan button_agree" type="zan" data-id='59702'>
                <i class="iconfont icon-zan"></i>
                <em>0</em>
              </span>
                 <span type="reply" class="showpinglun" data-id="59702">
                <i class="iconfont icon-svgmoban53"></i>
               讨论(0)
              </span>
              
                                                   
              <div class="jieda-admin">
                                                            </div>
            </div>
                      <div class="comments-mod "  style="display: none; float:none;padding-top:10px;" id="comment_59702">
                    <div class="areabox clearfix">

<form class="layui-form" action="">
               
            <div class="layui-form-item">
    <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label>
    <div class="layui-input-block" style="margin-left:90px;">
         <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" />
                        <input type='hidden' value='0' name='replyauthor' />
    </div>
    <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="59702">提交评论 </span></div>
  </div>
  
</form>
                    </div>
                    <hr>
                    <ul class="my-comments-list nav">
                        <li class="loading">
                        <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' />
                         加载中...
                        </li>
                    </ul>
                </div>
          </li>
          	          <li data-id="111">
            <a name="item-1111111111"></a>
           
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="https://www.e-learn.cn/qa/u-27.html">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/00/small_000000027.jpg" alt=" ">
              </a>
              <div class="fly-detail-user">
                <a href="https://www.e-learn.cn/qa/u-27.html" class="fly-link">
                  <cite>悲&欢浪女 </cite>       
                </a>
              </div>
                            <div class="detail-hits">
                <span>2020-11-22 03:26</span>
              </div>
            </div>
            <div class="detail-body jieda-body photos">
                                                                       
<p>Another command line tool is my new Xidel. It also supports XPath 2 and XQuery, contrary to the already mentioned xpath/xmlstarlet. </p>

<p>The title can be read like:</p>

<pre><code>xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt
</code></pre>

<p>And it also has a cool feature to export multiple variables to bash. For example</p>

<pre><code>eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )
</code></pre>

<p>sets <code>$title</code> to the title and <code>$imgcount</code> to the number of images in the file, which should be as flexible as parsing it directly in bash.</p>
                                                                        <div class="appendcontent">
                                                        </div>
            </div>
            <div class="jieda-reply">
              <span class="jieda-zan button_agree" type="zan" data-id='59699'>
                <i class="iconfont icon-zan"></i>
                <em>0</em>
              </span>
                 <span type="reply" class="showpinglun" data-id="59699">
                <i class="iconfont icon-svgmoban53"></i>
               讨论(0)
              </span>
              
                                                   
              <div class="jieda-admin">
                                                            </div>
            </div>
                      <div class="comments-mod "  style="display: none; float:none;padding-top:10px;" id="comment_59699">
                    <div class="areabox clearfix">

<form class="layui-form" action="">
               
            <div class="layui-form-item">
    <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label>
    <div class="layui-input-block" style="margin-left:90px;">
         <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" />
                        <input type='hidden' value='0' name='replyauthor' />
    </div>
    <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="59699">提交评论 </span></div>
  </div>
  
</form>
                    </div>
                    <hr>
                    <ul class="my-comments-list nav">
                        <li class="loading">
                        <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' />
                         加载中...
                        </li>
                    </ul>
                </div>
          </li>
          	          <li data-id="111">
            <a name="item-1111111111"></a>
           
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="https://www.e-learn.cn/qa/u-32.html">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/00/small_000000032.jpg" alt=" ">
              </a>
              <div class="fly-detail-user">
                <a href="https://www.e-learn.cn/qa/u-32.html" class="fly-link">
                  <cite>轮回少年 </cite>       
                </a>
              </div>
                            <div class="detail-hits">
                <span>2020-11-22 03:27</span>
              </div>
            </div>
            <div class="detail-body jieda-body photos">
                                                                       
<p>Yuzem's method can be improved by inversing the order of the <code><</code> and <code>></code> signs in the <code>rdom</code> function and the variable assignments, so that:</p>

<pre><code>rdom () { local IFS=\> ; read -d \< E C ;}
</code></pre>

<p>becomes:</p>

<pre><code>rdom () { local IFS=\< ; read -d \> C E ;}
</code></pre>

<p>If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the <code>while</code> loop.</p>
                                                                        <div class="appendcontent">
                                                        </div>
            </div>
            <div class="jieda-reply">
              <span class="jieda-zan button_agree" type="zan" data-id='59703'>
                <i class="iconfont icon-zan"></i>
                <em>0</em>
              </span>
                 <span type="reply" class="showpinglun" data-id="59703">
                <i class="iconfont icon-svgmoban53"></i>
               讨论(0)
              </span>
              
                                                   
              <div class="jieda-admin">
                                                            </div>
            </div>
                      <div class="comments-mod "  style="display: none; float:none;padding-top:10px;" id="comment_59703">
                    <div class="areabox clearfix">

<form class="layui-form" action="">
               
            <div class="layui-form-item">
    <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label>
    <div class="layui-input-block" style="margin-left:90px;">
         <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" />
                        <input type='hidden' value='0' name='replyauthor' />
    </div>
    <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="59703">提交评论 </span></div>
  </div>
  
</form>
                    </div>
                    <hr>
                    <ul class="my-comments-list nav">
                        <li class="loading">
                        <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' />
                         加载中...
                        </li>
                    </ul>
                </div>
          </li>
          	          <li data-id="111">
            <a name="item-1111111111"></a>
           
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="https://www.e-learn.cn/qa/u-29.html">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/00/small_000000029.jpg" alt=" ">
              </a>
              <div class="fly-detail-user">
                <a href="https://www.e-learn.cn/qa/u-29.html" class="fly-link">
                  <cite>甜味超标 </cite>       
                </a>
              </div>
                            <div class="detail-hits">
                <span>2020-11-22 03:33</span>
              </div>
            </div>
            <div class="detail-body jieda-body photos">
                                                                       
<p>You can do that very easily using only bash.
You only have to add this function:</p>

<pre><code>rdom () { local IFS=\> ; read -d \< E C ;}
</code></pre>

<p>Now you can use rdom like read but for html documents.
When called rdom will assign the element to variable E and the content to var C.</p>

<p>For example, to do what you wanted to do:</p>

<pre><code>while rdom; do
    if [[ $E = title ]]; then
        echo $C
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
</code></pre>
                                                                        <div class="appendcontent">
                                                        </div>
            </div>
            <div class="jieda-reply">
              <span class="jieda-zan button_agree" type="zan" data-id='59692'>
                <i class="iconfont icon-zan"></i>
                <em>0</em>
              </span>
                 <span type="reply" class="showpinglun" data-id="59692">
                <i class="iconfont icon-svgmoban53"></i>
               讨论(0)
              </span>
              
                                                   
              <div class="jieda-admin">
                                                            </div>
            </div>
                      <div class="comments-mod "  style="display: none; float:none;padding-top:10px;" id="comment_59692">
                    <div class="areabox clearfix">

<form class="layui-form" action="">
               
            <div class="layui-form-item">
    <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label>
    <div class="layui-input-block" style="margin-left:90px;">
         <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" />
                        <input type='hidden' value='0' name='replyauthor' />
    </div>
    <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="59692">提交评论 </span></div>
  </div>
  
</form>
                    </div>
                    <hr>
                    <ul class="my-comments-list nav">
                        <li class="loading">
                        <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' />
                         加载中...
                        </li>
                    </ul>
                </div>
          </li>
          	          <li data-id="111">
            <a name="item-1111111111"></a>
           
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="https://www.e-learn.cn/qa/u-79.html">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/00/small_000000079.jpg" alt=" ">
              </a>
              <div class="fly-detail-user">
                <a href="https://www.e-learn.cn/qa/u-79.html" class="fly-link">
                  <cite>北海茫月 </cite>       
                </a>
              </div>
                            <div class="detail-hits">
                <span>2020-11-22 03:33</span>
              </div>
            </div>
            <div class="detail-body jieda-body photos">
                                                                       
<p>Check out <strong>XML2</strong> from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.</p>
                                                                        <div class="appendcontent">
                                                        </div>
            </div>
            <div class="jieda-reply">
              <span class="jieda-zan button_agree" type="zan" data-id='59696'>
                <i class="iconfont icon-zan"></i>
                <em>0</em>
              </span>
                 <span type="reply" class="showpinglun" data-id="59696">
                <i class="iconfont icon-svgmoban53"></i>
               讨论(0)
              </span>
              
                                                   
              <div class="jieda-admin">
                                                            </div>
            </div>
                      <div class="comments-mod "  style="display: none; float:none;padding-top:10px;" id="comment_59696">
                    <div class="areabox clearfix">

<form class="layui-form" action="">
               
            <div class="layui-form-item">
    <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label>
    <div class="layui-input-block" style="margin-left:90px;">
         <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" />
                        <input type='hidden' value='0' name='replyauthor' />
    </div>
    <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="59696">提交评论 </span></div>
  </div>
  
</form>
                    </div>
                    <hr>
                    <ul class="my-comments-list nav">
                        <li class="loading">
                        <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' />
                         加载中...
                        </li>
                    </ul>
                </div>
          </li>
          	          <li data-id="111">
            <a name="item-1111111111"></a>
           
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="https://www.e-learn.cn/qa/u-119.html">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/01/small_000000119.jpg" alt=" ">
              </a>
              <div class="fly-detail-user">
                <a href="https://www.e-learn.cn/qa/u-119.html" class="fly-link">
                  <cite>無奈伤痛 </cite>       
                </a>
              </div>
                            <div class="detail-hits">
                <span>2020-11-22 03:33</span>
              </div>
            </div>
            <div class="detail-body jieda-body photos">
                                                                       
<p>This works if you are wanting XML attributes:</p>

<pre><code>$ cat alfa.xml
<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>

$ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh

$ . ./alfa.sh

$ echo "$stream"
H264_400.mp4
</code></pre>
                                                                        <div class="appendcontent">
                                                        </div>
            </div>
            <div class="jieda-reply">
              <span class="jieda-zan button_agree" type="zan" data-id='59704'>
                <i class="iconfont icon-zan"></i>
                <em>0</em>
              </span>
                 <span type="reply" class="showpinglun" data-id="59704">
                <i class="iconfont icon-svgmoban53"></i>
               讨论(0)
              </span>
              
                                                   
              <div class="jieda-admin">
                                                            </div>
            </div>
                      <div class="comments-mod "  style="display: none; float:none;padding-top:10px;" id="comment_59704">
                    <div class="areabox clearfix">

<form class="layui-form" action="">
               
            <div class="layui-form-item">
    <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label>
    <div class="layui-input-block" style="margin-left:90px;">
         <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" />
                        <input type='hidden' value='0' name='replyauthor' />
    </div>
    <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="59704">提交评论 </span></div>
  </div>
  
</form>
                    </div>
                    <hr>
                    <ul class="my-comments-list nav">
                        <li class="loading">
                        <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' />
                         加载中...
                        </li>
                    </ul>
                </div>
          </li>
          	          
   <div style="text-align: center">
          <div class="laypage-main">
     <strong>1</strong>
<a href="https://www.e-learn.cn/qa/q-19606/2.html">2</a>
<a href="https://www.e-learn.cn/qa/q-19606/3.html">3</a>
<a class="n" href="https://www.e-learn.cn/qa/q-19606/2.html">下一页</a>
           
           </div>
        </div>
                <style>
   
.laypage-main a, .laypage-main span {
    display: inline-block;
}
        </style>                  </ul>
        
        <div class="layui-form layui-form-pane">
          <form id="huidaform"  name="answerForm"  method="post">
            
            <div class="layui-form-item layui-form-text">
              <a name="comment"></a>
              <div class="layui-input-block">
            
    
<script type="text/javascript" src="https://www.e-learn.cn/qa/static/js/neweditor/ueditor.config.js"></script>
<script type="text/javascript" src="https://www.e-learn.cn/qa/static/js/neweditor/ueditor.all.js"></script>
<script type="text/plain" id="editor"  name="content"  style="width:100%;height:200px;"></script>                                 
<script type="text/javascript">
                                 var isueditor=1;
            var editor = UE.getEditor('editor',{
                //这里可以选择自己需要的工具按钮名称,此处仅选择如下五个
                toolbars:[['source','fullscreen',  '|', 'undo', 'redo', '|', 'bold', 'italic', 'underline', 'fontborder', 'strikethrough', 'removeformat', 'formatmatch', 'autotypeset', 'blockquote', 'pasteplain', '|', 'forecolor', 'backcolor', 'insertorderedlist', 'insertunorderedlist', 'selectall', 'cleardoc', '|', 'rowspacingtop', 'rowspacingbottom', 'lineheight', '|', 'customstyle', 'paragraph', 'fontfamily', 'fontsize', '|', 'indent', '|', 'justifyleft', 'justifycenter', 'justifyright', 'justifyjustify', '|', 'link', 'unlink', 'anchor', '|', 'simpleupload', 'insertimage', 'scrawl', 'insertvideo', 'attachment', 'map', 'insertcode', '|', 'horizontal', '|', 'preview', 'searchreplace', 'drafts']],
            
                initialContent:'',
                //关闭字数统计
                wordCount:false,
                zIndex:2,
                //关闭elementPath
                elementPathEnabled:false,
                //默认的编辑区域高度
                initialFrameHeight:250
                //更多其他参数,请参考ueditor.config.js中的配置项
                //更多其他参数,请参考ueditor.config.js中的配置项
            });
                        editor.ready(function() {
            	editor.setDisabled();
            	});
                            $("#editor").find("*").css("max-width","362px");
        </script>              </div>
            </div>
                          
    

        
         <div class="layui-form-item">
                <label for="L_vercode" class="layui-form-label">验证码</label>
                <div class="layui-input-inline">
                  <input type="text"  id="code" name="code"   value="" required lay-verify="required" placeholder="图片验证码" autocomplete="off" class="layui-input">
                </div>
                <div class="layui-form-mid">
                  <span style="color: #c00;"><img class="hand" src="https://www.e-learn.cn/qa/user/code.html" onclick="javascript:updatecode();" id="verifycode"><a class="changecode"  href="javascript:updatecode();"> 看不清?</a></span>
                </div>
              </div>
                                  <div class="layui-form-item">
                    <input type="hidden" value="19606" id="ans_qid" name="qid">
   <input type="hidden" id="tokenkey" name="tokenkey" value=''/>
                <input type="hidden" value="How to parse XML in Bash?" id="ans_title" name="title"> 
             
              <div class="layui-btn    layui-btn-disabled"  id="ajaxsubmitasnwer" >提交回复</div>
            </div>
          </form>
        </div>
      </div>
      <input type="hidden" value="19606" id="adopt_qid"	name="qid" /> 
      <input type="hidden" id="adopt_answer" value="0"	name="aid" />
    </div>
    <div class="layui-col-md4">
          
 <!-- 热门讨论问题 -->
     
 <dl class="fly-panel fly-list-one">
        <dt class="fly-panel-title">热议问题</dt>
            <!-- 本周热门讨论问题显示10条-->