When using HtmlUnit, how can I configure the underlying NekoHtml parser?

Posted by 旧城冷巷雨未停 on 2019-12-25 05:22:20

Question


I'm using HtmlUnit to scrape a webpage because of its JavaScript support. (I'd rather use Jsoup, but it has no JS support.)

The issue relates to a feature of the underlying NekoHtml parser: "http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe"

See: http://nekohtml.sourceforge.net/settings.html

This can apparently be enabled in Neko, but I'm using HtmlUnit. Is there a way to configure the underlying Neko parser that HtmlUnit uses so that this feature is enabled?

When attempting to run this code:

final WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(url.toString());

I'm getting this error:

Caused by: com.gargoylesoftware.htmlunit.ObjectInstantiationException: unable to create HTML parser
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:418)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:342)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:203)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
    at 
Caused by: org.xml.sax.SAXNotRecognizedException: Feature 'http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe' is not recognized.
    at org.apache.xerces.parsers.AbstractSAXParser.setFeature(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.<init>(HTMLParser.java:411)
    ... 41 more

Answer 1:


Try initializing the WebClient with Firefox behavior:

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);

Then enable JavaScript:

webClient.setJavaScriptEnabled(true);

It should work after that.




Answer 2:


Solved by building a custom BrowserVersion that carries the HTMLIFRAME_IGNORE_SELFCLOSING feature:

    BrowserVersionFeatures[] bvf = new BrowserVersionFeatures[1];
    bvf[0] = BrowserVersionFeatures.HTMLIFRAME_IGNORE_SELFCLOSING;
    BrowserVersion bv = new BrowserVersion(
            BrowserVersion.NETSCAPE, "5.0 (Windows; en-US)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8",
            (float) 3.6, bvf);

    WebClient webClient = new WebClient(bv);
    webClient.setJavaScriptEnabled(true);
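
If the same SAXNotRecognizedException comes back, it may help to check whether a given parser instance recognizes the feature ID at all. A "not recognized" result could point to a different NekoHTML version on the classpath than the one HtmlUnit expects, though that is only a guess. Below is a minimal sketch using only the JDK's SAX API. The class name FeatureProbe and its helper methods are made up for illustration and are not part of HtmlUnit or NekoHTML. The JDK's built-in XML parser stands in for the real one here; with NekoHTML on the classpath you would presumably pass an instance of org.cyberneko.html.parsers.SAXParser instead.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of HtmlUnit or NekoHTML.
final class FeatureProbe {

    // Feature ID taken from the stack trace in the question.
    static final String NEKO_IFRAME_FEATURE =
            "http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe";

    // Returns true if the reader knows the feature ID, false if it
    // rejects the ID with SAXNotRecognizedException.
    static boolean recognizesFeature(XMLReader reader, String featureId) {
        try {
            reader.getFeature(featureId); // read-only probe, changes nothing
            return true;
        } catch (SAXNotRecognizedException e) {
            return false;
        } catch (SAXNotSupportedException e) {
            // The ID is known, but its value cannot be read in this state.
            return true;
        }
    }

    // Stand-in reader: the JDK's default XML parser (an assumption for
    // this sketch; swap in the parser you actually want to inspect).
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = jdkReader();
        System.out.println("namespaces feature recognized: "
                + recognizesFeature(reader, "http://xml.org/sax/features/namespaces"));
        System.out.println("NekoHTML iframe feature recognized: "
                + recognizesFeature(reader, NEKO_IFRAME_FEATURE));
    }
}
```

The probe uses getFeature rather than setFeature so that it does not change the parser's configuration while checking.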


Source: https://stackoverflow.com/questions/11138875/when-using-htmlunit-how-can-i-configure-the-underlying-nekohtml-parser
