Question
I have an NSScanner object that scans through HTML documents for paragraph tags. It seems like the scanner stops at the first result it finds, but I need all the results in an array.
How can my code be improved to go through an entire document?
- (NSArray *)getParagraphs:(NSString *)html
{
    NSScanner *theScanner;
    NSString *text = nil;
    theScanner = [NSScanner scannerWithString:html];
    NSMutableArray *paragraphs = [[NSMutableArray alloc] init];
    // find start of tag
    [theScanner scanUpToString:@"<p>" intoString:NULL];
    if ([theScanner isAtEnd] == NO) {
        NSInteger newLoc = [theScanner scanLocation] + 10;
        [theScanner setScanLocation:newLoc];
        // find end of tag
        [theScanner scanUpToString:@"</p>" intoString:&text];
        [paragraphs addObject:text];
    }
    return text;
}
Answer 1:
Disclaimer: To parse HTML, it's better to use an HTML parser such as libxml's HTML 4 parser, especially when dealing with arbitrary, possibly malformed HTML. Since the question asks how to improve existing code that uses NSScanner, I provide the following example. It will work in most cases, but there are some corner cases where it won't. For serious HTML parsing, use an HTML parser.
Iterate until the scanner has exhausted all characters:
NSScanner *scanner = [NSScanner scannerWithString:html];
NSMutableArray *paragraphs = [[NSMutableArray alloc] init];
[scanner scanUpToString:@"<p" intoString:nil];
while (![scanner isAtEnd]) {
    [scanner scanUpToString:@">" intoString:nil];
    [scanner scanString:@">" intoString:nil];
    NSString *text = nil;
    [scanner scanUpToString:@"</p>" intoString:&text];
    if (text) { // if html contains empty paragraphs <p></p>, text could be nil
        [paragraphs addObject:text];
    }
    [scanner scanUpToString:@"<p" intoString:nil];
}
...
[paragraphs release];
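For reference, here is a minimal sketch of the same loop wrapped in a complete function that returns the collected array, which the code in the question does not do (it returns the last scanned string). The function name ParagraphsByScanning and the sample input are hypothetical, and the sketch assumes ARC, so there is no explicit release. One known corner case: any tag that starts with "<p", such as <pre>, is also matched.

```objc
#import <Foundation/Foundation.h>

// Hypothetical free function (the name is not from the original post); assumes ARC.
static NSArray *ParagraphsByScanning(NSString *html) {
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSMutableArray *paragraphs = [NSMutableArray array];

    // Move to the first opening tag that starts with "<p".
    [scanner scanUpToString:@"<p" intoString:nil];
    while (![scanner isAtEnd]) {
        // Skip the rest of the opening tag, including any attributes.
        [scanner scanUpToString:@">" intoString:nil];
        [scanner scanString:@">" intoString:nil];

        // Collect everything up to the closing tag; text stays nil for <p></p>.
        NSString *text = nil;
        [scanner scanUpToString:@"</p>" intoString:&text];
        if (text) {
            [paragraphs addObject:text];
        }

        // Advance to the next candidate tag, or to the end of the string.
        [scanner scanUpToString:@"<p" intoString:nil];
    }
    return paragraphs; // the whole array, not just the last string
}

int main(void) {
    @autoreleasepool {
        // Sample input made up for illustration.
        NSString *html = @"<p>One</p><div><p class=\"x\">Two</p></div>";
        NSLog(@"%@", ParagraphsByScanning(html));
    }
    return 0;
}
```

With the sample input, the function should collect the two strings One and Two. Note that NSScanner skips whitespace and newlines by default, so leading whitespace inside a paragraph is dropped.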
Answer 2:
Do not use a scanner to parse HTML (and don't use regular expressions, either... oh, the pain)*. The whole point of HTML is that it is a structured document designed to be traversed as a tree of nodes or objects. Pretty much the entire DOM (Document Object Model) based industry is built around this.
Just use an XML parser, as well-structured HTML is really just XML anyway. NSXMLDocument (or, if you need event-driven parsing, NSXMLParser) will work grand.
Or, if you have to deal with malformed HTML (i.e. arbitrary server sewage), use a proper HTML parser.
This question/answer describes exactly that, with a solid example.
*Not to mention that parsing HTML is a "solved problem" in the industry. There is no need to roll a new one.
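As a rough illustration of the NSXMLDocument route mentioned above, the following sketch parses well-formed markup and pulls out every p element with an XPath query. The function name ParagraphsFromXML and the sample input are hypothetical. It assumes ARC, macOS (NSXMLDocument is not available on iOS), and input that is already well-formed XML; malformed HTML will simply fail to parse here and calls for a proper HTML parser instead.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (the name is not from the original post); assumes ARC and macOS.
static NSArray *ParagraphsFromXML(NSString *markup) {
    NSError *error = nil;
    // Parse the string into a document tree; returns nil if the markup is not well-formed.
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:markup
                                                          options:0
                                                            error:&error];
    if (doc == nil) {
        NSLog(@"Parse failed: %@", error);
        return @[];
    }

    // "//p" selects every <p> element anywhere in the tree.
    NSArray *nodes = [doc nodesForXPath:@"//p" error:&error];
    NSMutableArray *paragraphs = [NSMutableArray array];
    for (NSXMLNode *node in nodes) {
        NSString *text = [node stringValue];
        if ([text length] > 0) {
            [paragraphs addObject:text];
        }
    }
    return paragraphs;
}

int main(void) {
    @autoreleasepool {
        // Sample input made up for illustration.
        NSString *markup = @"<html><body><p>Hello</p><p>Bye</p></body></html>";
        NSLog(@"%@", ParagraphsFromXML(markup));
    }
    return 0;
}
```

Because the parser builds the tree, attributes on the p tags and nesting inside other elements need no special handling, which is the main advantage over scanning for substrings.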
Source: https://stackoverflow.com/questions/6323677/nsscanner-loop-question