NSXMLDocumentTidyHTML doesn't tidy some XHTML validation errors

Submitted by 风流意气都作罢 on 2020-01-13 06:11:47

Question


I want to grab text from a list of web pages. I've done a bit of experimenting and found that the best way for my needs is via WebKit.

Once the source of the page has been grabbed, I want to strip out all the HTML tags, by using the technique in this comment.

Here's my code:

- (void)webView:(WebView *)sender didFinishLoadForFrame:(WebFrame *)frame {
    if (frame != [sender mainFrame]) {
        return;
    }
    NSError *theError = nil;
    NSString *content = [[[[sender mainFrame] dataSource] representation] documentSource];
    NSXMLDocument *theDocument = [[NSXMLDocument alloc] initWithXMLString:content options:NSXMLDocumentTidyHTML error:&theError];
    if (theDocument == nil) {
        NSLog(@"Could not parse page source: %@", theError);
        return;
    }
    // The stylesheet emits text only and drops the contents of <head> and <script>.
    NSString *theXSLTString = @"<?xml version='1.0' encoding='utf-8'?>\n<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml='http://www.w3.org/1999/xhtml'>\n<xsl:output method='text'/>\n<xsl:template match='xhtml:head'></xsl:template>\n<xsl:template match='xhtml:script'></xsl:template>\n</xsl:stylesheet>";
    // With method='text' the transform returns NSData rather than an NSXMLDocument.
    NSData *theData = [theDocument objectByApplyingXSLTString:theXSLTString arguments:nil error:&theError];
    NSString *theString = [[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding];
    NSLog(@"%@", theString);
}

This works fine on most pages. However, if a page doesn't validate correctly as XHTML, I sometimes get an error from my initWithXMLString: method.

That's fair enough - I'm asking it to tidy up the XHTML, so I'd expect it to report what problems it's encountered. But if there's a problem with the validation, it returns nil and an error rather than actually tidying up the XHTML.

One specific page that's causing the problem is the Ruby class documentation.

I've found that the excellent third-party HTML Tidy application can clean up this XHTML fine, but I'd expect NSXMLDocumentTidyHTML to be able to just add some quotes around cellpadding values. It's a fairly basic cleanup operation, and I'm not keen to add another dependency to my code base.

Is there something I'm missing with the way Cocoa cleans up XHTML? Or do I just need to bite the bullet and use HTML Tidy instead in my code?


Answer 1:


XHTML documents are treated as XML, so you may have better luck with the NSXMLDocumentTidyXML flag.
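
One way to try this suggestion without giving up the original behaviour is to attempt both tidy options in turn. The sketch below is only an illustration: the helper name ParseMarkupLeniently is hypothetical and not part of Cocoa, the code assumes ARC, and nothing in this thread confirms that NSXMLDocumentTidyXML actually succeeds on the page that fails.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): tries each tidy option in
// turn and returns the first document that parses, or nil if all of them fail.
static NSXMLDocument *ParseMarkupLeniently(NSString *source, NSError **outError) {
    // Order is an assumption: the asker's original flag first, then the
    // flag suggested in the answer.
    NSUInteger optionsToTry[] = { NSXMLDocumentTidyHTML, NSXMLDocumentTidyXML };
    NSError *lastError = nil;

    for (size_t i = 0; i < sizeof(optionsToTry) / sizeof(optionsToTry[0]); i++) {
        NSError *error = nil;
        NSXMLDocument *document = [[NSXMLDocument alloc] initWithXMLString:source
                                                                   options:optionsToTry[i]
                                                                     error:&error];
        if (document != nil) {
            return document;
        }
        lastError = error;  // remember why the most recent attempt failed
    }

    // Every option failed; hand the last error back to the caller.
    if (outError != NULL) {
        *outError = lastError;
    }
    return nil;
}
```

In the delegate method from the question, the call to initWithXMLString:options:error: could then be replaced with ParseMarkupLeniently(content, &theError). If both attempts return nil, the input is beyond what NSXMLDocument's built-in tidying repairs, and an external cleaner such as HTML Tidy may still be needed.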



Source: https://stackoverflow.com/questions/1032241/nsxmldocumenttidyhtml-doesnt-tidy-some-xhtml-validation-errors
