I\'m new to Scala, so I may be off base on this, I want to know if the problem is my code. Given the Scala file httpparse, simplified to:
object Http {
import
Without addressing, for now, the problem, what do you expect to happen if the function request return false below?
def fetchAndParseURL(URL:String) = {
val (true, body) = Http request(URL)
What will happen is that an exception will be thrown. You could rewrite it this way, though:
def fetchAndParseURL(URL:String) = (Http request(URL)) match {
case (true, body) =>
val xml = XML.load(body)
"True"
case _ => "False"
}
Now, to fix the XML parsing problem, we'll disable DTD loading in the parser, as suggested by others:
def fetchAndParseURL(URL:String) = (Http request(URL)) match {
case (true, body) =>
val f = javax.xml.parsers.SAXParserFactory.newInstance()
f.setNamespaceAware(false)
f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
val MyXML = XML.withSAXParser(f.newSAXParser())
val xml = MyXML.load(body)
"True"
case _ => "False"
}
Now, I put that MyXML stuff inside fetchAndParseURL just to keep the structure of the example as unchanged as possible. For actual use, I'd separate it in a top-level object, and make "parser" into a def instead of val, to avoid problems with mutable parsers:
import scala.xml.Elem
import scala.xml.factory.XMLLoader
import javax.xml.parsers.SAXParser
object MyXML extends XMLLoader[Elem] {
override def parser: SAXParser = {
val f = javax.xml.parsers.SAXParserFactory.newInstance()
f.setNamespaceAware(false)
f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
f.newSAXParser()
}
}
Import the package it is defined in, and you are good to go.
I've bumped into the SAME issue, and I haven't found an elegant solution (I'm thinking into posting the question to the Scala mailing list) Meanwhile, I found a workaround: implement your own SAXParserFactoryImpl so you can set the f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true); property. The good thing is it doesn't require any code change to the Scala code base (I agree that it should be fixed, though). First I'm extending the default parser factory:
package mypackage;
public class MyXMLParserFactory extends SAXParserFactoryImpl {
public MyXMLParserFactory() throws SAXNotRecognizedException, SAXNotSupportedException, ParserConfigurationException {
super();
super.setFeature("http://xml.org/sax/features/validation", false);
super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false);
super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
}
}
Nothing special, I just want the chance to set the property.
(Note: that this is plain Java code, most probably you can write the same in Scala too)
And in your Scala code, you need to configure the JVM to use your new factory:
System.setProperty("javax.xml.parsers.SAXParserFactory", "mypackage.MyXMLParserFactory");
Then you can call XML.load without validation
There are two problems with what you are trying to do:
The second problem isn't really possible to fix in your scala code. Even once you get around the dtd problem, you'll find that the source just isn't valid XML (empty tags aren't closed properly, for example).
You have to either parse the page with something besides an XML parser, or investigate using a utility like tidy to convert the html to xml.
GClaramunt's solution worked wonders for me. My Scala conversion is as follows:
package mypackage
import org.xml.sax.{SAXNotRecognizedException, SAXNotSupportedException}
import com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
import javax.xml.parsers.ParserConfigurationException
@throws(classOf[SAXNotRecognizedException])
@throws(classOf[SAXNotSupportedException])
@throws(classOf[ParserConfigurationException])
class MyXMLParserFactory extends SAXParserFactoryImpl() {
super.setFeature("http://xml.org/sax/features/validation", false)
super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false)
super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
}
As mentioned his the original post, it is necessary to place the following line in your code somewhere:
System.setProperty("javax.xml.parsers.SAXParserFactory", "mypackage.MyXMLParserFactory")
This is a scala problem. Native Java has an option to disable loading the DTD:
f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
There are no equivalent in scala.
If you somewhat want to fix it yourself, check scala/xml/parsing/FactoryAdapter.scala
and put the line in
278 def loadXML(source: InputSource): Node = {
279 // create parser
280 val parser: SAXParser = try {
281 val f = SAXParserFactory.newInstance()
282 f.setNamespaceAware(false)
<-- insert here
283 f.newSAXParser()
284 } catch {
285 case e: Exception =>
286 Console.err.println("error: Unable to instantiate parser")
287 throw e
288 }
My knowledge of Scala is pretty poor, but couldn't you use ConstructingParser instead?
val xml = new java.io.File("xmlWithDtd.xml")
val parser = scala.xml.parsing.ConstructingParser.fromFile(xml, true)
val doc = parser.document()
println(doc.docElem)