How to parse XML in Bash?

前端 未结 15 2661
情深已故
情深已故 2020-11-22 03:11

Ideally, what I would like to be able to do is:

cat xhtmlfile.xhtml |
getElementViaXPath --path=\'/html/head/title\' |
sed -e \'s%(^|</title&>         
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle"
     style="display:block"
     data-ad-client="ca-pub-5408099190056760"
     data-ad-slot="7305827575"
     data-ad-format="auto"
     data-full-width-responsive="true"></ins>
<script>
     (adsbygoogle = window.adsbygoogle || []).push({});
</script>        </div>
      </div>
      
      <div class="fly-panel detail-box" id="flyReply">
        <fieldset class="layui-elem-field layui-field-title" style="text-align: center;">
          <legend>15条回答</legend>        </fieldset>

        <ul class="jieda" id="jieda">
                    <li data-id="111" class="jieda-daan">
            <a name="item-1111111111"></a>
            <div class="detail-about detail-about-reply">
                         <a class="fly-avatar" href="">
                <img src="https://www.e-learn.cn/qa/data/avatar/000/00/00/small_000000009.jpg" alt=" -上瘾入骨i ">
              </a>
              <div class="fly-detail-user">
                <a href="" class="fly-link">
                  <cite> -上瘾入骨i</cite>
                                             
                </a>
                
                <span>(楼主)</span>
            
              </div>              <div class="detail-hits">
                <span>2020-11-22 03:39</span>
              </div>

            </div>
            <div class="detail-body jieda-body photos">
              <p>          
<p>While it seems like "never parse XML, JSON... from bash without a proper tool" is sound advice, I disagree. If this is side job, it is waistfull to look for the proper tool, then learn it... Awk can do it in minutes. My programs have to work on all above mentioned and more kinds of data. Hell, I do not want to test 30 tools to parse 5-7-10 different formats I need if I can awk the problem in minutes. I do not care about XML, JSON or whatever! I need a single solution for all of them.</p>

<p>As an example: my SmartHome program runs our homes. While doing it, it reads plethora of data in too many different formats I can not control. I never use dedicated, proper tools since I do not want to spend more than minutes on reading the data I need. With FS and RS adjustments, this awk solution works perfectly for any textual format. But, it may not be the proper answer when your primary task is to work primarily with loads of data in that format!</p>

<p>The problem of parsing XML from bash I faced yesterday. Here is how I do it for any hierarchical data format. As a bonus - I assign data directly to the variables in a bash script.</p>

<p>To make thins easier to read, I will present solution in stages. From the OP test data, I created a file: test.xml</p>

<p>Parsing said XML in bash and extracting the data in 90 chars:</p>

<pre><code>awk 'BEGIN { FS="<|>"; RS="\n" }; /host|username|password|dbname/ { print $2, $4 }' test.xml
</code></pre>

<p>I normally use more readable version since it is easier to modify in real life as I often need to test differently:</p>

<pre><code>awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2,$4}' test.xml
</code></pre>

<p>I do not care how is the format called. I seek only the simplest solution. In this particular case, I can see from the data that newline is the record separator (RS) and <> delimit fields (FS). In my original case, I had complicated indexing of 6 values within two records, relating them, find when the data exists plus fields (records) may or may not exist. It took 4 lines of awk to solve the problem perfectly. So, adapt idea to each need before using it!</p>

<p>Second part simply looks it there is wanted string in a line (RS) and if so, prints out needed fields (FS). The above took me about 30 seconds to copy and adapt from the last command I used this way (4 times longer). And that is it! Done in 90 chars.</p>

<p>But, I always need to get the data neatly into variables in my script. I first test the constructs like so:</p>

<pre><code>awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2"=\""$4"\"" }' test.xml
</code></pre>

<p>In some cases I use printf instead of print. When I see everything looks well, I simply finish assigning values to variables. I know many think "eval" is "evil", no need to comment :) Trick works perfectly on all four of my networks for years. But keep learning if you do not understand why this may be bad practice! Including bash variable assignments and ample spacing, my solution needs 120 chars to do everything.</p>

<pre><code>eval $( awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2"=\""$4"\"" }' test.xml ); echo "host: $host, username: $username, password: $password dbname: $dbname"
</code></pre>
    </p>
             <div class="appendcontent">
                                                        </div>
            </div>
            <div class="jieda-reply">
              <span class="jieda-zan button_agree" type="zan" data-id='59705'>
                <i class="iconfont icon-zan"></i>
                <em>0</em>
              </span>
                   <span type="reply" class="showpinglun" data-id="59705">
                <i class="iconfont icon-svgmoban53"></i>
               讨论(0)
              </span>
                                                  
              
              <div class="jieda-admin">
                          
             
       
          
              </div>
                                       <div class="noreplaytext bb">
<center><div>   <a href="https://www.e-learn.cn/qa/q-19606.html">  查看其它15个回答
</a>
</div></center>
</div>            </div>
                         <div class="comments-mod "  style="display: none; float:none;padding-top:10px;" id="comment_59705">
                    <div class="areabox clearfix">

<form class="layui-form" action="">
               
            <div class="layui-form-item">
    <label class="layui-form-label" style="padding-left:0px;width:60px;">发布评论:</label>
    <div class="layui-input-block" style="margin-left:90px;">
         <input type="text" placeholder="不少于5个字" AUTOCOMPLETE="off" class="comment-input layui-input" name="content" />
                        <input type='hidden' value='0' name='replyauthor' />
    </div>
    <div class="mar-t10"><span class="fr layui-btn layui-btn-sm addhuidapinglun" data-id="59705">提交评论 </span></div>
  </div>
  
</form>
                    </div>
                    <hr>
                    <ul class="my-comments-list nav">
                        <li class="loading">
                        <img src='https://www.e-learn.cn/qa/static/css/default/loading.gif' align='absmiddle' />
                         加载中...
                        </li>
                    </ul>
                </div>
          </li>
                              			
        </ul>
        
        <div class="layui-form layui-form-pane">
          <form id="huidaform"  name="answerForm"  method="post">
            
            <div class="layui-form-item layui-form-text">
              <a name="comment"></a>
              <div class="layui-input-block">
            
    
<script type="text/javascript" src="https://www.e-learn.cn/qa/static/js/neweditor/ueditor.config.js"></script>
<script type="text/javascript" src="https://www.e-learn.cn/qa/static/js/neweditor/ueditor.all.js"></script>
<script type="text/plain" id="editor"  name="content"  style="width:100%;height:200px;"></script>                                 
<script type="text/javascript">
                                 var isueditor=1;
            var editor = UE.getEditor('editor',{
                //这里可以选择自己需要的工具按钮名称,此处仅选择如下五个
                toolbars:[['source','fullscreen',  '|', 'undo', 'redo', '|', 'bold', 'italic', 'underline', 'fontborder', 'strikethrough', 'removeformat', 'formatmatch', 'autotypeset', 'blockquote', 'pasteplain', '|', 'forecolor', 'backcolor', 'insertorderedlist', 'insertunorderedlist', 'selectall', 'cleardoc', '|', 'rowspacingtop', 'rowspacingbottom', 'lineheight', '|', 'customstyle', 'paragraph', 'fontfamily', 'fontsize', '|', 'indent', '|', 'justifyleft', 'justifycenter', 'justifyright', 'justifyjustify', '|', 'link', 'unlink', 'anchor', '|', 'simpleupload', 'insertimage', 'scrawl', 'insertvideo', 'attachment', 'map', 'insertcode', '|', 'horizontal', '|', 'preview', 'searchreplace', 'drafts']],
            
                initialContent:'',
                //关闭字数统计
                wordCount:false,
                zIndex:2,
                //关闭elementPath
                elementPathEnabled:false,
                //默认的编辑区域高度
                initialFrameHeight:250
                //更多其他参数,请参考ueditor.config.js中的配置项
                //更多其他参数,请参考ueditor.config.js中的配置项
            });
                        editor.ready(function() {
            	editor.setDisabled();
            	});
                            $("#editor").find("*").css("max-width","362px");
        </script>              </div>
            </div>
                          
    

        
         <div class="layui-form-item">
                <label for="L_vercode" class="layui-form-label">验证码</label>
                <div class="layui-input-inline">
                  <input type="text"  id="code" name="code"   value="" required lay-verify="required" placeholder="图片验证码" autocomplete="off" class="layui-input">
                </div>
                <div class="layui-form-mid">
                  <span style="color: #c00;"><img class="hand" src="https://www.e-learn.cn/qa/user/code.html" onclick="javascript:updatecode();" id="verifycode"><a class="changecode"  href="javascript:updatecode();"> 看不清?</a></span>
                </div>
              </div>
                                  <div class="layui-form-item">
                    <input type="hidden" value="19606" id="ans_qid" name="qid">
   <input type="hidden" id="tokenkey" name="tokenkey" value=''/>
                <input type="hidden" value="How to parse XML in Bash?" id="ans_title" name="title"> 
             
              <div class="layui-btn    layui-btn-disabled"  id="ajaxsubmitasnwer" >提交回复</div>
            </div>
          </form>
        </div>
      </div>
      <input type="hidden" value="19606" id="adopt_qid"	name="qid" /> 
      <input type="hidden" id="adopt_answer" value="0"	name="aid" />
    </div>
    <div class="layui-col-md4">
          
 <!-- 热门讨论问题 -->
     
 <dl class="fly-panel fly-list-one">
        <dt class="fly-panel-title">热议问题</dt>
            <!-- 本周热门讨论问题显示10条-->