How to get results from an XML SAX parser in Python

Backend · Unresolved · 4 · 1597
南方客 2021-02-09 12:42

I am working with an XML SAX parser to parse XML files. Below is my code.

XML file code:

    Registered Nurse-Epilepsy&         
        </div>
                      <div class="relativetags">
      <div class="fly-panel detail-box" id="flyReply">
        <fieldset class="layui-elem-field layui-field-title" style="text-align: center;">
          <legend>4 answers</legend>        </fieldset>

        <ul class="jieda" id="jieda">
                         				            <li data-id="111">
            <a name="item-1111111111"></a>
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="">
                <img src="" alt=" ">
              <div class="fly-detail-user">
                <a href="" class="fly-link">
                  <cite>既然无缘 </cite>       
                            <div class="detail-hits">
                <span>2021-02-09 13:01</span>
            <div class="detail-body jieda-body photos">
<p>You need to implement a characters handler too:</p>

<pre><code>def characters(self, content):
    # Called by the parser with a chunk of the text found inside an element
    print(content)
</code></pre>

<p>but this potentially gives you text in chunks instead of as one block per tag.</p>

<p>Do yourself a big favour though and use the ElementTree API instead; that API is far more Pythonic and easier to use than the XML DOM API.</p>

<pre><code>from xml.etree import ElementTree as ET

# Parse the whole file into a tree, then look up the element by its path
etree = ET.parse('/path/to/xml_file.xml')
jobtitle = etree.find('job/title').text
</code></pre>

<p>If all you want is a straight conversion to a dictionary, take a look at this handy ActiveState Python Cookbook recipe: Converting XML to dictionary and back. Note that it uses the ElementTree API as well.</p>
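<p>As a rough illustration of the idea (this is not the recipe mentioned above, just a minimal sketch), a flat element can be turned into a dictionary with a single comprehension. The sample document, its <code>job-code</code> value, and the helper name <code>element_to_dict</code> are assumptions made for the example; nested elements would need a recursive version.</p>

```python
from xml.etree import ElementTree as ET

# Hypothetical sample document; the real file layout may differ.
SAMPLE = """<job>
  <title>Registered Nurse-Epilepsy</title>
  <job-code>1234</job-code>
</job>"""


def element_to_dict(elem):
    """Map each direct child tag of elem to its stripped text ('' if empty)."""
    return {child.tag: (child.text or "").strip() for child in elem}


job = element_to_dict(ET.fromstring(SAMPLE))
print(job)
```

<p>Only direct children are converted here, which is enough for a flat record such as a single job entry.</p>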

<p>If you have a set of existing elements you want to look for, just use these in the <code>find()</code> method:</p>

<pre><code>from xml.etree import ElementTree as ET

fieldnames = [
    'title', 'job-code', 'detail-url', 'job-category', 'description',
    'summary', 'posted-date', 'location', 'address', 'city', 'state',
    'zip', 'country', 'company', 'name', 'url']
fields = {}

etree = ET.parse('/path/to/xml_file.xml')

for field in fieldnames:
    elem = etree.find(field)
    # Skip fields that are missing from the document or have no text
    if elem is not None and elem.text is not None:
        fields[field] = elem.text
</code></pre>
          	          <li data-id="111">
            <a name="item-1111111111"></a>
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="">
                <img src="" alt=" ">
              <div class="fly-detail-user">
                <a href="" class="fly-link">
                  <cite>情歌与酒 </cite>       
                            <div class="detail-hits">
                <span>2021-02-09 13:06</span>
            <div class="detail-body jieda-body photos">
<p>To get the text content of a node, you need to implement a characters method. E.g. </p>

<pre><code>import xml.sax.handler

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    super().__init__()
    self.curpath = []

  def startElement(self, name, attrs):
    print(name, attrs)

  def endElement(self, name):
    print('end ' + name)

  def characters(self, content):
    # Receives the text between tags, including whitespace-only chunks
    print(content)
</code></pre>

<p>Would output something like this (the object addresses vary between runs):</p>

<pre><code>job <xml.sax.xmlreader.AttributesImpl object at 0xb6d9baec>

title <xml.sax.xmlreader.AttributesImpl object at 0xb6d9bb0c>
Registered Nurse-Epilepsy
end title

job-code <xml.sax.xmlreader.AttributesImpl object at 0xb6d9bb2c>
end job-code

detail-url <xml.sax.xmlreader.AttributesImpl object at 0xb6d9bb2c>

end detail-url
</code></pre>

          	          <li data-id="111">
            <a name="item-1111111111"></a>
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="">
                <img src="" alt=" ">
              <div class="fly-detail-user">
                <a href="" class="fly-link">
                  <cite>礼貌的吻别 </cite>       
                            <div class="detail-hits">
                <span>2021-02-09 13:16</span>
            <div class="detail-body jieda-body photos">
<p>I would recommend using <code>xml.dom.pulldom</code>. It lets you read a document as a stream of SAX-style events and, when you find a node that you are interested in, load just that node into a DOM fragment.</p>

<p>Here is an article on using it with some examples:</p>
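<p>A minimal sketch of the pulldom approach described in this answer, assuming a document whose <code>job</code> elements each contain a <code>title</code>. The sample data (including the second job title) and the function name <code>iter_job_titles</code> are made up for illustration.</p>

```python
from xml.dom import pulldom

# Hypothetical sample; the second job title is invented for the example.
SAMPLE = """<jobs>
  <job><title>Registered Nurse-Epilepsy</title></job>
  <job><title>Staff Pharmacist</title></job>
</jobs>"""


def iter_job_titles(xml_text):
    """Yield the title text of every <job> element in xml_text."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "job":
            # Build a DOM subtree for this <job> only, not the whole document
            events.expandNode(node)
            title = node.getElementsByTagName("title")[0]
            # Join the text nodes in case the text arrived in several chunks
            yield "".join(
                child.data
                for child in title.childNodes
                if child.nodeType == child.TEXT_NODE
            )


titles = list(iter_job_titles(SAMPLE))
print(titles)
```

<p>For a file on disk, <code>pulldom.parse()</code> accepts a file name or an open stream in place of <code>parseString()</code>.</p>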
          	          <li data-id="111">
            <a name="item-1111111111"></a>
            <div class="detail-about detail-about-reply">
                              <a class="fly-avatar" href="">
                <img src="" alt=" ">
              <div class="fly-detail-user">
                <a href="" class="fly-link">
                  <cite>别跟我提以往 </cite>       
                            <div class="detail-hits">
                <span>2021-02-09 13:19</span>
            <div class="detail-body jieda-body photos">
<p>To get the content of an element, you need to override the <code>characters</code> method. Add this to your handler class:</p>

<pre><code>def characters(self, data):
    print(data)
</code></pre>

<p>Be careful with this, though: the parser is not required to give you all the data in a single chunk. You should use an internal buffer and read it when needed. In most of my xml/sax code I do something like this:</p>

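<pre><code>import xml.sax.handler

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self._charBuffer = []

    def _flushCharBuffer(self):
        # Join the buffered chunks and reset the buffer
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        # May be called several times per element, so collect the chunks
        self._charBuffer.append(data)
</code></pre>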
<p>... and then call the flush method on the end of elements where I need the data.</p>

<p>For your whole use case - assuming you have a file containing multiple job descriptions and want a list which holds the jobs with each job being a dictionary of the fields, do something like this:</p>

<pre><code>import xml.sax
import xml.sax.handler

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip()  # remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        # Collect text chunks until the current element ends
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if name == 'job':
            self._result.append({})

    def endElement(self, name):
        if name != 'job':
            self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml")  # a list of all jobs
</code></pre>

<p>If you just need to parse a single job at a time, you can simplify the list part and throw away the <code>startElement</code> method - just set _result to a dict and assign to it directly in <code>endElement</code>.</p>
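<p>One possible shape for that single-job variant is sketched below. The class name <code>SingleJobHandler</code> and the inline sample document are assumptions made for the example, and the buffer is flushed at every end tag so that whitespace between elements does not leak into the next field.</p>

```python
import io
import xml.sax
import xml.sax.handler


class SingleJobHandler(xml.sax.handler.ContentHandler):
    """Collect the fields of a single <job> element into one dict."""

    def __init__(self):
        super().__init__()
        self._char_buffer = []
        self._result = {}

    def characters(self, data):
        # Text may arrive in several chunks, so buffer it
        self._char_buffer.append(data)

    def endElement(self, name):
        # Flush the buffer at every end tag; store it for all tags but <job>
        text = "".join(self._char_buffer).strip()
        self._char_buffer = []
        if name != "job":
            self._result[name] = text

    def parse(self, source):
        xml.sax.parse(source, self)
        return self._result


# Hypothetical sample document standing in for a real job file
SAMPLE = (
    b"<job><title>Registered Nurse-Epilepsy</title>"
    b"<job-code></job-code></job>"
)
job = SingleJobHandler().parse(io.BytesIO(SAMPLE))
print(job)
```

<p>Because there is no list to append to, no <code>startElement</code> override is needed here.</p>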
        </ul>
    <div class="layui-col-md4">
 <!-- Hot discussion questions -->
 <dl class="fly-panel fly-list-one">
        <dt class="fly-panel-title">Hot topics</dt>
            <!-- Show 10 of this week's hot discussion questions -->