Term split by hashtag of multiple words

伪装坚强ぢ 2021-02-11 05:49

I am trying to split a term that contains a hashtag made up of multiple words, such as "#I-am-great" or "#awesome-dayofmylife".
The output I am looking for is the hashtag split into its individual words.

2 answers
  • 2021-02-11 06:44

    All the commentators above are correct of course: A hashtag without spaces or other clear separators between the words (especially in English) is often ambiguous and cannot be parsed correctly in all cases.

    However, the idea of the word list is rather simple to implement and might yield useful (albeit sometimes wrong) results nevertheless, so I implemented a quick version of that:

    import re

    wordList = '''awesome day of my life because i am great something some
    thing things unclear sun clear'''.split()

    # Alternation of all known words; alternatives are tried left to right
    wordOr = '|'.join(wordList)

    def splitHashTag(hashTag):
      # Find each unbroken run of known words, then split the run into words
      for wordSequence in re.findall('(?:' + wordOr + ')+', hashTag):
        print(':', wordSequence)
        print(' '.join(re.findall(wordOr, wordSequence)))

    for hashTag in '''awesome-dayofmylife iamgreat something
    somethingsunclear'''.split():
      print('###', hashTag)
      splitHashTag(hashTag)
    

    This prints:

    ### awesome-dayofmylife
    : awesome
    awesome
    : dayofmylife
    day of my life
    ### iamgreat
    : iamgreat
    i am great
    ### something
    : something
    something
    ### somethingsunclear
    : somethingsunclear
    something sun clear
    

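    And as you can see, it falls into the trap qstebom has set for it: "somethingsunclear" comes out as "something sun clear", although "some things unclear" is just as valid a reading ;-)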
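
    To make the ambiguity concrete, here is a minimal sketch that lists every way a hashtag can be written as a sequence of known words, rather than only the first one found. The function name all_splits is made up for this illustration, and the word set is the same small list as above.

```python
def all_splits(text, words):
    """Return every way to write text as a sequence of known words."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in words:
            # Keep this prefix and recurse on whatever is left
            for rest in all_splits(text[i:], words):
                results.append([head] + rest)
    return results

words = set('''awesome day of my life because i am great something some
thing things unclear sun clear'''.split())

for split in all_splits('somethingsunclear', words):
    print(' '.join(split))
# prints:
# some thing sun clear
# some things unclear
# something sun clear
```

    All three readings are legal with this word list, so some extra criterion (word frequencies, for instance) would be needed to choose between them.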

    EDIT:

    Some explanations of the code above:

    The variable wordOr contains a string of all words, separated by a pipe symbol (|). In regular expressions that means "one of these words".

    The first findall is given a pattern that means "a sequence of one or more of these words", so it matches things like "dayofmylife". It returns all such sequences, so I iterate over them (for wordSequence in …). Within each sequence, a second findall then picks out the individual words, which are printed.
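
    One detail worth knowing: Python's re module tries the alternatives of a|b|c from left to right and takes the first that matches, not the longest. The result therefore depends on the order of the words in the list. The sketch below compares list order with a longest-first ordering; build_pattern is a hypothetical helper name and the shortened word list is chosen just for the demonstration.

```python
import re

def build_pattern(words):
    # Longest words first, so "something" is tried before "some";
    # re.escape guards against words containing regex metacharacters.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile('|'.join(re.escape(w) for w in ordered))

words = 'some thing things something sun clear unclear'.split()

in_list_order = re.compile('|'.join(words))
longest_first = build_pattern(words)

print(in_list_order.findall('somethingsunclear'))
# ['some', 'thing', 'sun', 'clear']
print(longest_first.findall('somethingsunclear'))
# ['something', 'sun', 'clear']
```

    Longest-first is still a greedy choice made one word at a time, so it can also pick a reading the author of the hashtag did not intend.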

  • 2021-02-11 06:46

    The problem can be broken down to several steps:

    1. Populate a list with English words
    2. Split the sentence into terms delimited by white-space.
    3. Treat terms starting with '#' as hashtags
    4. For each hashtag, find words by longest match by checking if they exist in the list of words.

    Here is one solution using this approach:

    # Returns a set of common English terms (words)
    def initialize_words():
        # Raw string so the backslash in the Windows path is taken literally
        with open(r'C:\wordlist.txt') as f: # A file containing common English words
            return {line.rstrip('\n') for line in f}


    def parse_sentence(sentence, wordlist):
        new_sentence = "" # output
        terms = sentence.split(' ')
        for term in terms:
            if term.startswith('#'): # this is a hashtag, parse it
                new_sentence += parse_tag(term, wordlist)
            else: # Just append the word
                new_sentence += term
            new_sentence += " "

        return new_sentence


    def parse_tag(term, wordlist):
        words = []
        # Remove hashtag, split by dash
        tags = term[1:].split('-')
        for tag in tags:
            word = find_word(tag, wordlist)
            while word is not None and len(tag) > 0:
                words.append(word)
                if len(tag) == len(word): # Special case for when eating rest of word
                    break
                tag = tag[len(word):]
                word = find_word(tag, wordlist)
        return " ".join(words)


    def find_word(token, wordlist):
        # Try the longest prefix first and shrink until a known word is found
        for i in range(len(token), 0, -1):
            if token[:i] in wordlist:
                return token[:i]
        return None


    wordlist = initialize_words()
    sentence = "big #awesome-dayofmylife because #iamgreat"
    print(repr(parse_sentence(sentence, wordlist)))
    

    It prints:

    'big awe some day of my life because i am great '
    

    You will have to remove the trailing space, but that's easy. :)
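
    One possible way to avoid the trailing space is to collect the pieces in a list and join them at the end. This is only a sketch: parse_sentence_joined and demo_expand are names invented for the example, and demo_expand is a stand-in for a real hashtag splitter such as parse_tag.

```python
def parse_sentence_joined(sentence, expand_tag):
    """Rebuild the sentence, expanding hashtags via expand_tag (a callable)."""
    pieces = []
    for term in sentence.split():
        if term.startswith('#'):
            pieces.append(expand_tag(term))
        else:
            pieces.append(term)
    # join puts a space only between pieces, so nothing trails at the end
    return ' '.join(pieces)


# Stand-in expander for the demo: drops '#' and turns dashes into spaces
def demo_expand(term):
    return term[1:].replace('-', ' ')


print(repr(parse_sentence_joined('big #I-am-great today', demo_expand)))
# 'big I am great today'
```

    Calling .rstrip() on the returned string would work as well; the join version simply never produces the extra space in the first place.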

    I got the wordlist from http://www-personal.umich.edu/~jlawler/wordlist.
