I have a large number of strings that I\'m searching to see if a given substring exists. It seems there are two reasonable ways to do this.
Option 1: Use the NSStr
You should benchmark and time any solution that uses NSPredicate
because in my experience NSPredicate
can be very slow.
For simplicity, I would go with a simple for(NSString *string in stringsArray) { }
type of loop. The loop body would contain a simple rangeOfSubstring
check. You might be able improve the performance of this by a few percent by using CFStringFind(), but you'll only see a benefit if you're searching through lots of strings. The advantage to using CFStringFind()
is that you can avoid the (very small) Objective-C message dispatch overhead. Again, it's usually only a win to switch to that when you're search "a lot" of strings (for some always changing value of "a lot"), and you should always benchmark to be sure. Prefer the simpler Objective-C rangeOfString:
way if you can.
A much more complicated approach is to use the ^Blocks feature with the NSEnumerationConcurrent
option. NSEnumerationConcurrent
is only a hint that you'd like the enumeration to happen concurrently if possible, and an implementation is free to ignore this hint if it can't support concurrent enumeration. However, your standard NSArray
is most likely going to implement concurrent enumeration. In practice, this has the effect of dividing up all the objects in the NSArray
and splitting them across the available CPU's. You need to be careful about how to mutate state and objects that is accessed by the ^Block across multiple threads. Here's one potential way of doing it:
// Be sure to #include
__block volatile OSSpinLock spinLock = OS_SPINLOCK_INIT;
__block NSMutableArray *matchesArray = [NSMutableArray array];
[stringsToSearchArray enumerateObjectsWithOptions:NSEnumerationConcurrent usingBlock:^(id obj, NSUInteger idx, BOOL *stop) {
NSRange matchedRange = [obj rangeOfString:@"this"];
if(matchedRange.location != NSNotFound) {
OSSpinLockLock((volatile OSSpinLock * volatile)&spinLock);
[matchesArray addObject:obj];
OSSpinLockUnlock((volatile OSSpinLock * volatile)&spinLock);
}
}];
// At this point, matchesArray will contain all the strings that had a match.
This uses a lightweight OSSpinLock
to make sure only one thread has access to and updates matchesArray
at a time. You can use the same CFStringFind()
suggestion from above here as well.
Also, you should be aware that rangeOfString:
won't, by itself, match on "word boundaries". In the example above, I used the word this
, which would match the string A paleolithist walked in to the bar...
even though it does not contain the word this
.
The simplest solution to this little wrinkle is to use an ICU regular expression and take advantage of it's "enhanced word breaking" functionality. To do this, you have a few options:
NSRegularExpression
, currently only available on >4.2 or >4.3 iOS (I forget which).NSPredicate
, via SELF MATCHES '(?w)\b...\b'
. The advantage to this is that it requires nothing extra (i.e., RegexKitLite) and is available on all(?) versions of Mac OS X, and iOS > 3.0.The following code shows how to use the enhanced word breaking functionality in ICU regular expressions via NSPredicate
:
NSString *searchForString = @"this";
NSString *regexString = [NSString stringWithFormat:@".*(?w:\\b\\Q%@\\E\\b).*", searchForString];
NSPredicate *wordBoundaryRegexPredicate = [NSPredicate predicateWithFormat:@"SELF MATCHES %@", regexString];
NSArray *matchesArray = [stringsToSearchArray filteredArrayUsingPredicate:wordBoundaryRegexPredicate];
You can make the search case insensitive by replacing the (?w:
in regexString
with (?wi:
.
The regex, if you're interested, basically says
.*(?w:...).*
says "match anything up to and after the (?w:...)
part" (i.e., we're only interested in the (?w:...)
part).(?w:...)
says "Turn on the ICU enhanced word breaking / finding feature inside the parenthesis".\\b...\\b
(which is really only a single backslash, any backslash has to be backslash escaped when it's inside a @""
string) says "Match at a word boundary".\\Q...\\E
says "Treat the text starting immediately after \Q
and up to \E
as literal text (think "Quote" and "End")". In other words, any characters in the "quoted literal text" do not have their special regex meaning.The reason for the \Q...\E
is that you probably want to match the literal characters in searchForString
. Without this, searchForString
would be treated as part of the regex. As an example, if searchForString
was this?
, then without \Q...\E
it would not match the literal string this?
, but either thi
or this
, which is probably not what you want. :)