Finding multiple substrings in a string without iterating over it multiple times

后端 未结 3 575
忘了有多久
忘了有多久 2021-01-04 21:18

I need to find if items from a list appear in a string, and then add the items to a different list. This code works:

data =[]
line = \'akhgvfalfhda.dhgfa.lidh         


        
相关标签:
3条回答
  • 2021-01-04 21:49

    One way I could think of to improve is:

    • Get all unique lengths of the words in _legal
    • Build a dictionary of words from line of those particular lengths using a sliding window technique. The complexity should be O( len(line)*num_of_unique_lengths ), this should be better than brute force.
    • Now look for each thing in the dictionary in O(1).

    Code:

    line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
    _legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
    ul = {len(i) for i in _legal}
    s=set()
    for l in ul:
        s = s.union({line[i:i+l] for i in range(len(line)-l)})
    print(s.intersection(set(_legal)))
    

    Output:

    {'thing1', 'fj la', 'thing2', 't5', '5'}
    
    0 讨论(0)
  • 2021-01-04 21:57

    The following should work quite efficiently. It implements the suggestion from @SomeDude given above

    lens=set([len(i) for i in _legal])
    d={}
    for k in lens:
        d[k]=[line[i:i+k] for i in range(len(line)-k)]
    s=set(sum(d.values(), []))
    result=list(s.intersection(set(_legal)))
    

    for the following data (i changed "line" a bit as it returned an empty list due to uppecase in Thing1 and Thing2)

    line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**thing1**aoufgyafkugafkjhafkjhflahfklh**thing2**dlfkhalfhafli...'
    _legal = ['thing1', 'thing2', 'thing3', 'thing4',...]
    

    Output is:

    print(result)
    
    ['thing2', 'thing1']
    

    Explanation: We saved all possible lengths of the words in substrings. For these lengths, we created all possible substrings in text (set s). Finally we found common items in s and substrings, which is the answer to the question.

    0 讨论(0)
  • 2021-01-04 22:08

    One approach is to build a very simple regex pattern, and use re.findall() to find/extract any matched words in the string.

    import re
    
    line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
    _legal = ['thing1', 'thing2', 'thing3', 'thing4']
    
    exp = re.compile(r'|'.join(_legal), re.IGNORECASE)
    exp.findall(line)
    
    >>> ['Thing1', 'Thing2']
    
    0 讨论(0)
提交回复
热议问题