Term split by hashtag of multiple words


伪装坚强ぢ 2021-02-11 05:49

I am trying to split a term that contains a multi-word hashtag, such as "#I-am-great" or "#awesome-dayofmylife".
The output that I am looking for is the hashtag split into its individual words.

2 answers

逝去的感伤 (original poster)

2021-02-11 06:44
All the commenters above are correct, of course: a hashtag without spaces or other clear separators between the words (especially in English) is often ambiguous and cannot be parsed correctly in all cases.

However, the idea of the word list is rather simple to implement and might yield useful (albeit sometimes wrong) results nevertheless, so I implemented a quick version of that:
```
import re

# Small demo vocabulary; a real word list would be much larger.
wordList = '''awesome day of my life because i am great something some
thing things unclear sun clear'''.split()

# Alternation pattern meaning "any one of these words".
wordOr = '|'.join(wordList)

def splitHashTag(hashTag):
  # Find each run of consecutive known words, then split the run into words.
  for wordSequence in re.findall('(?:' + wordOr + ')+', hashTag):
    print(':', wordSequence)
    print(' '.join(re.findall(wordOr, wordSequence)))

for hashTag in '''awesome-dayofmylife iamgreat something
somethingsunclear'''.split():
  print('###', hashTag)
  splitHashTag(hashTag)
```
This prints:
```
### awesome-dayofmylife
: awesome
awesome
: dayofmylife
day of my life
### iamgreat
: iamgreat
i am great
### something
: something
something
### somethingsunclear
: somethingsunclear
something sun clear
```
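And as you can see, it falls into the trap qstebom set for it ;-) The tag "somethingsunclear" is split as "something sun clear", although the word list would also allow "some things unclear".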
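To make the ambiguity concrete, here is a minimal Python 3 sketch that lists every way a tag can be cut into words from a given list. The helper name `all_segmentations` is hypothetical and not part of the code above, and it uses plain recursion rather than regular expressions, so treat it as an illustration only.

```python
def all_segmentations(text, words):
    """Return every way to write `text` as a concatenation of `words`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for word in words:
        if text.startswith(word):
            # Keep this word and segment whatever is left after it.
            for rest in all_segmentations(text[len(word):], words):
                results.append([word] + rest)
    return results

# A subset of the word list used in the answer.
words = 'something some thing things unclear sun clear'.split()
segmentations = all_segmentations('somethingsunclear', words)
for segmentation in segmentations:
    print(' '.join(segmentation))
# something sun clear
# some thing sun clear
# some things unclear
```

All three readings are built from the same word list, so a pure word-list approach has no basis for preferring one; it would need extra information, such as word frequencies, to choose.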

EDIT:

Some explanations of the code above:

The variable wordOr contains a string of all words, separated by a pipe symbol (|). In regular expressions that means "one of these words".

The first findall gets a pattern which means "a sequence of one or more of these words", so it matches things like "dayofmylife". It finds all such sequences, so I iterate over them (for wordSequence in …). For each word sequence, I then search for every single word in it (again using findall) and print that word.
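The two findall steps described above can also be tried in isolation. This is a small Python 3 sketch with a shortened word list; the variable names `word_or`, `sequences` and `words` are chosen here for illustration only.

```python
import re

# Shortened word list, joined into an alternation: "one of these words".
word_or = '|'.join(['awesome', 'day', 'of', 'my', 'life'])

# Step 1: find every run of one or more known words in the tag.
sequences = re.findall('(?:' + word_or + ')+', 'awesome-dayofmylife')
print(sequences)  # ['awesome', 'dayofmylife']

# Step 2: split a single run into its individual words.
words = re.findall(word_or, 'dayofmylife')
print(words)  # ['day', 'of', 'my', 'life']
```

The group is written as `(?:...)` because `re.findall` returns only the group contents when a pattern contains a capturing group; with a non-capturing group it returns the whole matched run.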