SAX Parser: Retrieving HTML tags from XML

Submitted by 半城伤御伤魂 on 2019-12-06 05:45:28
UVM

You can parse the HTML with SAX as long as it is well formed; well-formed HTML (XHTML) is also XML. There is a similar question on Stack Overflow that you can try: How to parse the html content in android using SAX PARSER

On start element: if the element is content, initialize your temp Str buffer. Otherwise, if content has already started, capture the current start element and its attributes and append them to the temp Str buffer.

On characters: if content has started, append the characters to the current string buffer.

On end element: if content has started, capture the end tag and append it to the string buffer.

My assumption:

The XML will have only one content tag.
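
The steps above can be sketched as a handler that rebuilds the markup inside <content> while the events arrive. This is only a minimal sketch under stated assumptions, not a definitive implementation: the class name ContentExtractor and the helpers extract and escape are made up for illustration, there is a single non-nested <content> element, and the parser is not namespace-aware (so qName is the plain tag name). Empty elements such as <br/> come back as <br></br>, and comments are not preserved.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class: rebuilds the markup found inside <content>.
class ContentExtractor extends DefaultHandler {
    private StringBuilder buffer;      // created when <content> starts
    private boolean inContent = false; // true between <content> and </content>
    private String result;             // inner markup of <content>, once seen

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (!inContent && "content".equals(qName)) {
            inContent = true;
            buffer = new StringBuilder();
        } else if (inContent) {
            // Re-serialize the start tag together with its attributes.
            buffer.append('<').append(qName);
            for (int i = 0; i < attributes.getLength(); i++) {
                buffer.append(' ').append(attributes.getQName(i)).append("=\"")
                      .append(escape(attributes.getValue(i)).replace("\"", "&quot;"))
                      .append('"');
            }
            buffer.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters now; ch is only valid during this call.
        if (inContent) {
            buffer.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inContent && "content".equals(qName)) {
            inContent = false;
            result = buffer.toString();
        } else if (inContent) {
            buffer.append("</").append(qName).append('>');
        }
    }

    // Re-escape the characters that the parser has already unescaped.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Parses the XML string and returns the markup inside <content>, or null if absent.
    static String extract(String xml) {
        try {
            ContentExtractor handler = new ContentExtractor();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling ContentExtractor.extract(xml) on a document whose <content> element holds well-formed markup returns that markup as a string. Copying into a StringBuilder inside characters() matters because the parser may deliver text in several chunks and may reuse its buffer.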

Don Roby

If the HTML is actually XHTML, you can parse it using SAX and extract the XHTML contents of the <content> tag, but not nearly this easily.

You would have to make your handler actually respond to the events raised by all the XHTML tags inside the <content> tag, and either build something resembling a DOM structure, which you could then serialize back out to XML form, or write directly, on the fly, into an XML string buffer that replicates the contents.

If you modify your XML so that the HTML inside the content tag is wrapped in a CDATA section, as suggested in How to parse the html content in android using SAX PARSER, something not too far from your code should indeed work.

But you can't just put the contents into your String tempStr variable in the characters method as you're doing. You'll need to have a startElement method that initializes a buffer for the string on seeing the <content> tag, collect into that buffer in the characters method, and put the result somewhere in the endElement for the <content> tag.
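
As an illustration of the CDATA variant, here is a minimal sketch of such a handler. The class name CdataContentHandler and the helper parse are hypothetical, and the sketch assumes the HTML inside <content> is wrapped in a CDATA section, so the parser reports it as plain character data rather than as element events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class: collects the text of <content>, CDATA included.
class CdataContentHandler extends DefaultHandler {
    private StringBuilder buffer; // non-null only while inside <content>
    private String content;       // final text of <content>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("content".equals(qName)) {
            buffer = new StringBuilder(); // initialize the buffer on <content>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text run, so always append.
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("content".equals(qName) && buffer != null) {
            content = buffer.toString(); // store the result on </content>
            buffer = null;
        }
    }

    // Parses the XML string and returns the text of <content>, or null if absent.
    static String parse(String xml) {
        try {
            CdataContentHandler handler = new CdataContentHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.content;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

With this shape, the HTML does not even have to be well formed, because the parser never looks inside the CDATA section.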

Khajavi

I found a solution this way:

Note: In this solution I want to get the HTML content between the <chapter> tags (<chapter> ... html content ... </chapter>).

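DefaultHandler handler = new DefaultHandler() {

    // True right after <chapter> starts, until the first characters() call.
    boolean chap = false;

    // Reference to the parser's character buffer and the chapter's offsets in it.
    // NOTE: this relies on the parser passing the same ch[] buffer to every
    // characters() call, which SAX does not guarantee.
    public char[] temp;
    int chapterStart;
    int chapterEnd;

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes)
            throws SAXException {

        System.out.println("Start Element :" + qName);

        if (qName.equalsIgnoreCase("chapter")) {
            chap = true;
        }
    }

    @Override
    public void endElement(String uri, String localName,
            String qName) throws SAXException {

        if (qName.equalsIgnoreCase("chapter")) {
            // Print everything between the first and last character event.
            System.out.println(new String(temp, chapterStart, chapterEnd - chapterStart));
        }
        System.out.println("End Element :" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {

        if (chap) {
            // First text inside <chapter>: remember the buffer and the start offset.
            temp = ch;
            chapterStart = start;
            chap = false;
        }
        // Always track the end of the most recent text run.
        chapterEnd = start + length;
    }

};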

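Update:

My code has a bug: the ch[] array that the parser passes to characters() is not the same in every situation (its length and contents vary), so offsets saved from one call cannot be relied on in a later call.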
