> I'm using Python 3.5.2
> I have two lists
The solution described below uses a lot of memory: it stores all the text in a single string, trading RAM for a lower complexity level. If RAM is an issue, think twice before using it.
With join/split tricks you can avoid explicit loops entirely, which should speed up the algorithm:
1. Concatenate the sentences with a special delimiter that does not occur in any of them:

       merged_sentences = ' * '.join(sentences)

2. Compile a single regex for all the words you need to remove from the sentences, joined with the `|` "or" regex operator:

       regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)  # re.I is a case-insensitive flag

3. Substitute the words with the compiled regex and split on the special delimiter back into separate sentences:

       clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
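Putting the three steps together, a minimal runnable sketch (the `sentences` and `words` lists below are made-up placeholders, not data from the question):

```python
import re

# Hypothetical sample data; substitute your own lists.
sentences = ["Hello how are you", "Are you fine", "You look great"]
words = ["you", "are"]

# 1. Merge everything into one string with a delimiter absent from the sentences.
merged_sentences = ' * '.join(sentences)

# 2. One alternation regex for all the words; re.I makes the match case-insensitive.
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)

# 3. Strip all the words in a single pass, then split back into sentences.
clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

print(clean_sentences)  # the original sentences with the words removed (extra spaces remain)
```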
"".join
complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:
    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);
Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, versus the 2*O(N²) of the initial nested-loop approach.
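A rough way to check the scaling claim yourself is a timeit sketch like the one below (the data sizes are arbitrary and the absolute numbers will vary by machine):

```python
import re
import timeit

# Synthetic data, sizes chosen only for illustration.
sentences = ["the quick brown fox jumps over the lazy dog %d" % i for i in range(1000)]
words = ["fox", "dog", "lazy"] + ["filler%d" % i for i in range(100)]

def naive():
    # One re.sub per (sentence, word) pair: O(sentences * words) regex passes.
    result = []
    for sentence in sentences:
        for word in words:
            sentence = re.sub(r'\b%s\b' % word, "", sentence)
        result.append(sentence)
    return result

def join_split():
    # A single regex pass over one merged string, then one split.
    merged = ' * '.join(sentences)
    regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)
    return re.sub(regex, "", merged).split(' * ')

print("naive:      %.3fs" % timeit.timeit(naive, number=10))
print("join/split: %.3fs" % timeit.timeit(join_split, number=10))
```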
By the way, don't use multithreading. The GIL will serialize every operation: your task is strictly CPU-bound, so the GIL has almost no chance to be released, yet each thread keeps contending for it at every check interval, which adds overhead and can even stretch the run out indefinitely.
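If you do need parallelism for CPU-bound work like this, separate processes sidestep the GIL where threads cannot. A minimal multiprocessing sketch, assuming the join/split approach above (the worker count and chunking here are arbitrary choices, not requirements):

```python
import re
from multiprocessing import Pool

words = ["fox", "dog"]  # made-up word list
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)

def clean_chunk(chunk):
    # Each worker process applies the join/split trick to its own chunk.
    merged = ' * '.join(chunk)
    return re.sub(regex, "", merged).split(' * ')

if __name__ == '__main__':
    sentences = ["the quick brown fox %d" % i for i in range(100000)]
    # Split the work into 4 chunks, one per worker process.
    n = len(sentences) // 4 + 1
    chunks = [sentences[i:i + n] for i in range(0, len(sentences), n)]
    with Pool(4) as pool:
        cleaned = [s for part in pool.map(clean_chunk, chunks) for s in part]
    print(len(cleaned))
```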