NSPredicate versus NSString: Which is better/faster for finding superstrings?

后端 未结 2 1159
夕颜
夕颜 2021-02-10 16:16

I have a large number of strings that I\'m searching to see if a given substring exists. It seems there are two reasonable ways to do this.

Option 1: Use the NSStr

2条回答
  •  攒了一身酷
    2021-02-10 16:22

    You should benchmark and time any solution that uses NSPredicate because in my experience NSPredicate can be very slow.

    For simplicity, I would go with a simple for(NSString *string in stringsArray) { } type of loop. The loop body would contain a simple rangeOfSubstring check. You might be able improve the performance of this by a few percent by using CFStringFind(), but you'll only see a benefit if you're searching through lots of strings. The advantage to using CFStringFind() is that you can avoid the (very small) Objective-C message dispatch overhead. Again, it's usually only a win to switch to that when you're search "a lot" of strings (for some always changing value of "a lot"), and you should always benchmark to be sure. Prefer the simpler Objective-C rangeOfString: way if you can.

    A much more complicated approach is to use the ^Blocks feature with the NSEnumerationConcurrent option. NSEnumerationConcurrent is only a hint that you'd like the enumeration to happen concurrently if possible, and an implementation is free to ignore this hint if it can't support concurrent enumeration. However, your standard NSArray is most likely going to implement concurrent enumeration. In practice, this has the effect of dividing up all the objects in the NSArray and splitting them across the available CPU's. You need to be careful about how to mutate state and objects that is accessed by the ^Block across multiple threads. Here's one potential way of doing it:

    // Be sure to #include 
    
    __block volatile OSSpinLock spinLock = OS_SPINLOCK_INIT;
    __block NSMutableArray *matchesArray = [NSMutableArray array];
    
    [stringsToSearchArray enumerateObjectsWithOptions:NSEnumerationConcurrent usingBlock:^(id obj, NSUInteger idx, BOOL *stop) {
        NSRange matchedRange = [obj rangeOfString:@"this"];
        if(matchedRange.location != NSNotFound) {
          OSSpinLockLock((volatile OSSpinLock * volatile)&spinLock);
          [matchesArray addObject:obj];
          OSSpinLockUnlock((volatile OSSpinLock * volatile)&spinLock);
        }
      }];
    
    // At this point, matchesArray will contain all the strings that had a match.
    

    This uses a lightweight OSSpinLock to make sure only one thread has access to and updates matchesArray at a time. You can use the same CFStringFind() suggestion from above here as well.

    Also, you should be aware that rangeOfString: won't, by itself, match on "word boundaries". In the example above, I used the word this, which would match the string A paleolithist walked in to the bar... even though it does not contain the word this.

    The simplest solution to this little wrinkle is to use an ICU regular expression and take advantage of it's "enhanced word breaking" functionality. To do this, you have a few options:

    • NSRegularExpression, currently only available on >4.2 or >4.3 iOS (I forget which).
    • RegexKitLite, via RegexKitLite-4.0.tar.bz2
    • NSPredicate, via SELF MATCHES '(?w)\b...\b'. The advantage to this is that it requires nothing extra (i.e., RegexKitLite) and is available on all(?) versions of Mac OS X, and iOS > 3.0.

    The following code shows how to use the enhanced word breaking functionality in ICU regular expressions via NSPredicate:

    NSString *searchForString = @"this";
    NSString *regexString = [NSString stringWithFormat:@".*(?w:\\b\\Q%@\\E\\b).*", searchForString];
    NSPredicate *wordBoundaryRegexPredicate = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", regexString];
    NSArray *matchesArray = [stringsToSearchArray filteredArrayUsingPredicate:wordBoundaryRegexPredicate];
    

    You can make the search case insensitive by replacing the (?w: in regexString with (?wi:.

    The regex, if you're interested, basically says

    • .*(?w:...).* says "match anything up to and after the (?w:...) part" (i.e., we're only interested in the (?w:...) part).
    • (?w:...) says "Turn on the ICU enhanced word breaking / finding feature inside the parenthesis".
    • \\b...\\b (which is really only a single backslash, any backslash has to be backslash escaped when it's inside a @"" string) says "Match at a word boundary".
    • \\Q...\\E says "Treat the text starting immediately after \Q and up to \E as literal text (think "Quote" and "End")". In other words, any characters in the "quoted literal text" do not have their special regex meaning.

    The reason for the \Q...\E is that you probably want to match the literal characters in searchForString. Without this, searchForString would be treated as part of the regex. As an example, if searchForString was this?, then without \Q...\E it would not match the literal string this?, but either thi or this, which is probably not what you want. :)

提交回复
热议问题