Is Scala/Java not respecting w3 “excess dtd traffic” specs?

青春惊慌失措 2021-02-01 07:40

I'm new to Scala, so I may be off base on this, but I want to know whether the problem is in my code. Given the Scala file httpparse, simplified to:

object Http {
   import         


        
9 answers
  • 2021-02-01 08:21

    Without addressing the main problem for now: what do you expect to happen if the function request returns false below?

    def fetchAndParseURL(URL:String) = {      
      val (true, body) = Http request(URL)
    

    What will happen is that a scala.MatchError will be thrown, because the pattern (true, body) does not match a tuple whose first element is false. You could rewrite it this way, though:

    def fetchAndParseURL(URL:String) = (Http request(URL)) match {      
      case (true, body) =>      
        val xml = XML.load(body)
        "True"
      case _ => "False"
    }
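
    The difference between the two forms can be seen in isolation. This is a minimal sketch, not the asker's code: request here is a hypothetical stand-in for Http.request that only returns a (Boolean, String) pair, and the @unchecked ascription is there so the refutable pattern compiles without warnings on both Scala 2 and Scala 3.

```scala
object RefutablePatternDemo {
  // Hypothetical stand-in for Http.request: returns (succeeded, body).
  def request(ok: Boolean): (Boolean, String) =
    if (ok) (true, "<root/>") else (false, "")

  // Mirrors `val (true, body) = Http request(URL)`: the pattern only matches
  // when the first element is true; otherwise a scala.MatchError is thrown.
  def unsafeBody(ok: Boolean): String = {
    val (true, body) = (request(ok): @unchecked)
    body
  }

  // The match-based rewrite handles both outcomes explicitly.
  def safeBody(ok: Boolean): Option[String] = request(ok) match {
    case (true, body) => Some(body)
    case _            => None
  }
}
```

    Calling unsafeBody(false) throws, while safeBody(false) simply returns None.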
    

    Now, to fix the XML parsing problem, we'll disable DTD loading in the parser, as suggested by others:

    def fetchAndParseURL(URL:String) = (Http request(URL)) match {      
      case (true, body) =>
        val f = javax.xml.parsers.SAXParserFactory.newInstance()
        f.setNamespaceAware(false)
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        val MyXML = XML.withSAXParser(f.newSAXParser())
        val xml = MyXML.load(body)
        "True"
      case _ => "False"
    }
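
    One caveat worth checking before relying on this: as far as I know, in Xerces-based parsers (including the one bundled with the JDK) setting disallow-doctype-decl to true makes the parser reject any document that contains a DOCTYPE, rather than skip the DTD. The load-external-dtd feature set to false keeps the DOCTYPE legal but stops the download. The sketch below compares the two using only JDK classes; the object name, the sample document and the dtd.invalid host are made up for illustration, and it assumes the JDK's default SAX implementation is in use.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler
import scala.util.Try

object DtdFeatureDemo {
  val LoadExternalDtd = "http://apache.org/xml/features/nonvalidating/load-external-dtd"
  val DisallowDoctype = "http://apache.org/xml/features/disallow-doctype-decl"

  // Sample document whose DOCTYPE points at a host that cannot resolve
  // (.invalid is a reserved TLD), so any attempt to fetch the DTD would fail.
  val doc: String =
    """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://dtd.invalid/xhtml1-strict.dtd">
      |<html><body/></html>""".stripMargin

  // Parses `doc` with a single feature applied and returns the root element name.
  def rootName(feature: String, value: Boolean): Try[String] = Try {
    val f = SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    f.setFeature(feature, value)
    var root = ""
    val handler = new DefaultHandler {
      override def startElement(uri: String, local: String, qName: String, atts: Attributes): Unit =
        if (root.isEmpty) root = qName
    }
    f.newSAXParser().parse(new InputSource(new StringReader(doc)), handler)
    root
  }
}
```

    With these assumptions, rootName(LoadExternalDtd, false) succeeds without touching the network, while rootName(DisallowDoctype, true) fails with a parse exception on the DOCTYPE line.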
    

    I put the MyXML code inside fetchAndParseURL only to keep the structure of the example as unchanged as possible. For actual use, I'd move it into a top-level object and make "parser" a def instead of a val, to avoid problems with mutable parsers:

    import scala.xml.Elem
    import scala.xml.factory.XMLLoader
    import javax.xml.parsers.SAXParser
    object MyXML extends XMLLoader[Elem] {
      override def parser: SAXParser = {
        val f = javax.xml.parsers.SAXParserFactory.newInstance()
        f.setNamespaceAware(false)
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        f.newSAXParser()
      }
    }
    

    Import the package it is defined in, and you are good to go.

  • 2021-02-01 08:27

    I've bumped into the SAME issue, and I haven't found an elegant solution (I'm thinking of posting the question to the Scala mailing list). Meanwhile, I found a workaround: implement your own SAXParserFactoryImpl so you can call f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true); yourself. The good thing is that it doesn't require any change to the Scala code base (I agree that it should be fixed, though). First I'm extending the default parser factory:

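    package mypackage;

    import javax.xml.parsers.ParserConfigurationException;
    import org.xml.sax.SAXNotRecognizedException;
    import org.xml.sax.SAXNotSupportedException;
    // Assumed import: the JDK-internal factory, as used in the Scala conversion posted in another answer.
    import com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl;

    public class MyXMLParserFactory extends SAXParserFactoryImpl {
        public MyXMLParserFactory() throws SAXNotRecognizedException, SAXNotSupportedException, ParserConfigurationException {
            super();
            // Turn off validation and DTD loading; DOCTYPE declarations stay allowed.
            super.setFeature("http://xml.org/sax/features/validation", false);
            super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false);
            super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
            super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        }
    }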
    

    Nothing special, I just want the chance to set the property.

    (Note: this is plain Java code; most probably you can write the same in Scala too.)

    And in your Scala code, you need to configure the JVM to use your new factory:

    System.setProperty("javax.xml.parsers.SAXParserFactory", "mypackage.MyXMLParserFactory");
    

    Then you can call XML.load without validation.

  • 2021-02-01 08:27

    There are two problems with what you are trying to do:

    • Scala's XML parser is trying to physically retrieve the DTD when it shouldn't. J-16 SDiZ seems to have some advice for this problem.
    • The Stack Overflow page you are trying to parse isn't XML. It's HTML 4 Strict.

    The second problem isn't really possible to fix in your Scala code. Even once you get around the DTD problem, you'll find that the source just isn't valid XML (empty tags aren't closed properly, for example).

    You have to either parse the page with something other than an XML parser, or investigate using a utility like tidy to convert the HTML to XML.
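
    The second point is easy to confirm with a few lines. This is only a sketch using the JDK's SAX parser; the object name and the sample markup are made up, and the first sample imitates HTML 4 by leaving br and p unclosed.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler
import scala.util.Try

object WellFormedCheck {
  // Returns true when `markup` parses as well-formed XML, false when the
  // parser reports a fatal error (DefaultHandler rethrows fatal errors).
  def isWellFormed(markup: String): Boolean = Try {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler())
  }.isSuccess
}
```

    Here isWellFormed("<html><body><br><p>text</body></html>") is false, while the same content with <br/> and a closing </p> is accepted. No parser setting changes that, which is why an HTML-aware tool is needed.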

  • 2021-02-01 08:32

    GClaramunt's solution worked wonders for me. My Scala conversion is as follows:

    package mypackage
    import org.xml.sax.{SAXNotRecognizedException, SAXNotSupportedException}
    import com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
    import javax.xml.parsers.ParserConfigurationException
    
    @throws(classOf[SAXNotRecognizedException])
    @throws(classOf[SAXNotSupportedException])
    @throws(classOf[ParserConfigurationException])
    class MyXMLParserFactory extends SAXParserFactoryImpl() {
        super.setFeature("http://xml.org/sax/features/validation", false)
        super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
        super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false)
        super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
    }
    

    As mentioned in the original post, it is necessary to place the following line somewhere in your code:

    System.setProperty("javax.xml.parsers.SAXParserFactory", "mypackage.MyXMLParserFactory")
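
    A possible variation, in case the com.sun.org.apache.xerces.internal package is not accessible on your JDK (recent versions encapsulate JDK internals): extend the public javax.xml.parsers.SAXParserFactory and delegate to the platform default. This is a sketch under assumptions, not a tested drop-in: NoDtdSAXParserFactory is a made-up name, and SAXParserFactory.newDefaultInstance() requires Java 9 or later.

```scala
import javax.xml.parsers.{SAXParser, SAXParserFactory}

// Hypothetical alternative to subclassing the JDK-internal SAXParserFactoryImpl:
// wrap the platform default factory and pre-set the DTD-related features.
class NoDtdSAXParserFactory extends SAXParserFactory {
  // newDefaultInstance() (Java 9+) ignores the javax.xml.parsers.SAXParserFactory
  // system property, so this does not recurse when this class is registered there.
  private val underlying: SAXParserFactory = SAXParserFactory.newDefaultInstance()
  underlying.setFeature("http://xml.org/sax/features/validation", false)
  underlying.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false)
  underlying.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

  // The three abstract members of SAXParserFactory, forwarded to the default factory.
  override def newSAXParser(): SAXParser = underlying.newSAXParser()
  override def setFeature(name: String, value: Boolean): Unit = underlying.setFeature(name, value)
  override def getFeature(name: String): Boolean = underlying.getFeature(name)

  // Forward the namespace setting too, since callers may set it before creating a parser.
  override def setNamespaceAware(aware: Boolean): Unit = underlying.setNamespaceAware(aware)
  override def isNamespaceAware(): Boolean = underlying.isNamespaceAware()
}
```

    It would be registered the same way, by passing the class's fully qualified name to System.setProperty("javax.xml.parsers.SAXParserFactory", ...).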
    
  • 2021-02-01 08:35

    This is a Scala problem. Native Java has an option to disable loading the DTD:

    f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    

    There is no equivalent in Scala.

    If you want to fix it yourself, check scala/xml/parsing/FactoryAdapter.scala and put the line in loadXML, at the marked position:

    def loadXML(source: InputSource): Node = {
      // create parser
      val parser: SAXParser = try {
        val f = SAXParserFactory.newInstance()
        f.setNamespaceAware(false)
        // <-- insert the setFeature line here
        f.newSAXParser()
      } catch {
        case e: Exception =>
          Console.err.println("error: Unable to instantiate parser")
          throw e
      }
    
  • 2021-02-01 08:37

    My knowledge of Scala is pretty poor, but couldn't you use ConstructingParser instead?

      val xml = new java.io.File("xmlWithDtd.xml")
      val parser = scala.xml.parsing.ConstructingParser.fromFile(xml, true)
      val doc = parser.document()
      println(doc.docElem)
    