Groovy XMLSlurper issue

Submitted by 浪子不回头ぞ on 2019-12-11 07:00:06

Question


I want to use XmlSlurper to parse an HTML document that I read with HTTPBuilder. Initially I tried it this way:

def response = http.get(path: "index.php", contentType: TEXT)
def slurper = new XmlSlurper()
def xml = slurper.parse(response)

But this throws an exception:

java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

A known workaround is to serve the DTD files from a local cache. I found a simple class that should help with this:

class CachedDTD {
/**
 * Return DTD 'systemId' as InputSource.
 * @param publicId
 * @param systemId
 * @return InputSource for locally cached DTD.
 */
  def static entityResolver = [
          resolveEntity: { publicId, systemId ->
            try {
              String dtd = "dtd/" + systemId.split("/").last()
              Logger.getRootLogger().debug "DTD path: ${dtd}"
              new org.xml.sax.InputSource(CachedDTD.class.getResourceAsStream(dtd))
            } catch (e) {
              //e.printStackTrace()
              Logger.getRootLogger().fatal "Fatal error", e
              null
            }
          }
  ] as org.xml.sax.EntityResolver

}

My package tree looks as shown below:

I also slightly modified the code that parses the response, so it now looks like this:

def response = http.get(path: "index.php", contentType: TEXT)
def slurper = new XmlSlurper()
slurper.setEntityResolver(org.yuri.CachedDTD.entityResolver)
def xml = slurper.parse(response)

But now I get a java.net.MalformedURLException. The DTD path logged by the CachedDTD entityResolver is org/yuri/dtd/xhtml1-transitional.dtd, and I can't get it to work.
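
For comparison, here is a minimal sketch of a resolver that never goes to the network. It assumes Groovy 3 or later (for the groovy.xml.XmlSlurper import and the three-argument constructor) and a hypothetical dtd/ folder at the classpath root. The name offlineResolver and the empty-DTD fallback are illustrative choices, not part of the original code. The lookup goes through the class loader, so it does not depend on the package of any class, and a missing resource is handled explicitly instead of being passed to InputSource as null.

```groovy
import groovy.xml.XmlSlurper   // assumption: Groovy 3+; on Groovy 2.x drop this import
import org.xml.sax.EntityResolver
import org.xml.sax.InputSource

// Hypothetical resolver: looks for dtd/<file name> at the classpath root and
// falls back to an empty DTD, so the parser never fetches the DTD over HTTP.
def offlineResolver = { String publicId, String systemId ->
    String name = systemId.split('/').last()
    InputStream stream = Thread.currentThread().contextClassLoader
            .getResourceAsStream('dtd/' + name)
    InputSource source = (stream != null) ?
            new InputSource(stream) :
            new InputSource(new StringReader(''))   // empty DTD as a fallback
    source.systemId = systemId
    return source
} as EntityResolver

// Sample page with the same DOCTYPE that triggered the 503 above.
def page = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Test</title></head><body><p>Hi</p></body></html>'''

// validating = false, namespaceAware = false, allowDocTypeDeclaration = true
def slurper = new XmlSlurper(false, false, true)
slurper.setEntityResolver(offlineResolver)
def html = slurper.parseText(page)
println html.head.title.text()
```

With an empty DTD, named entities such as &nbsp; stay undefined, so a page that uses them still needs the real DTD files in the dtd/ folder.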


Answer 1:


There is an HTML parser, NekoHTML, that you can use in conjunction with XmlSlurper to address these problems:

http://sourceforge.net/projects/nekohtml/

Sample usage here:

http://groovy.codehaus.org/Testing+Web+Applications




Answer 2:


I was able to solve my parsing issue by using another XmlSlurper constructor:

public XmlSlurper(boolean validating, boolean namespaceAware, boolean allowDocTypeDeclaration)

like this:

def parser = new XmlSlurper(false, false, true)

For my XML, disabling validation (first parameter false) and allowing the DOCTYPE declaration (third parameter true) did the trick.
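
A small self-contained sketch of this constructor in use, assuming Groovy 3 or later for the groovy.xml.XmlSlurper import (on Groovy 2.x, groovy.util.XmlSlurper is imported by default and the import line can be dropped). The sample document is made up for illustration and declares its DTD inline, so nothing is fetched over the network.

```groovy
import groovy.xml.XmlSlurper   // assumption: Groovy 3+

// Made-up sample document whose DOCTYPE uses only an internal subset.
def doc = '''<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (title)>
  <!ELEMENT title (#PCDATA)>
]>
<note><title>hello</title></note>'''

// validating = false, namespaceAware = false, allowDocTypeDeclaration = true
def parser = new XmlSlurper(false, false, true)
def note = parser.parseText(doc)
println note.title.text()
```

Turning validation off does not by itself guarantee that an external DTD is never loaded. With the Xerces-based parser bundled in the JDK, calling parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false) before parsing is a commonly used way to skip it, although that feature is parser-specific.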

Note:



Source: https://stackoverflow.com/questions/3745240/groovy-xmlslurper-issue