Parsing a tweet to extract hashtags into an array

旧时难觅i 2020-12-03 05:33

I am having a heck of a time taking the information in a tweet including hashtags, and pulling each hashtag into an array using Python. I am embarrassed to even put what I

9 answers
  • 2020-12-03 06:00

    Suppose that you have to retrieve your #Hashtags from a sentence full of punctuation symbols. Let's say that #stackoverflow, #people and #helpful are terminated with different symbols; you want to retrieve them from the text, but you may want to avoid repetitions:

    >>> text = "I love #stackoverflow, because #people... are very #helpful! Are they really #helpful??? Yes #people in #stackoverflow are really really #helpful!!!"
    

    If you try set([i for i in text.split() if i.startswith("#")]) alone, you will get:

    set(['#helpful???',
     '#people',
     '#stackoverflow,',
     '#stackoverflow',
     '#helpful!!!',
     '#helpful!',
     '#people...'])
    

    which to my mind is redundant. A better solution uses regular expressions via the re module:

    >>> import re
    >>> set([re.sub(r"(\W+)$", "", j) for j in set([i for i in text.split() if i.startswith("#")])])
    set(['#people', '#helpful', '#stackoverflow'])
    

    Now it's ok for me.

    EDIT: UNICODE #Hashtags

    Add the re.UNICODE flag if you want to strip punctuation while still preserving accented letters and other non-ASCII word characters, which matters if the #Hashtags are not expected to be in English only (maybe this is only an Italian guy's nightmare, maybe not! ;-)). In Python 3, str patterns are Unicode-aware by default, so the flag is only needed in Python 2.

    For example:

    >>> text = u"I love #stackoverflòw, because #peoplè... are very #helpfùl! Are they really #helpfùl??? Yes #peoplè in #stackoverflòw are really really #helpfùl!!!"
    

    will be represented (Python 2 repr) as:

    u'I love #stackoverfl\xf2w, because #peopl\xe8... are very #helpf\xf9l! Are they really #helpf\xf9l??? Yes #peopl\xe8 in #stackoverfl\xf2w are really really #helpf\xf9l!!!'
    

    and you can retrieve your (correctly encoded) #Hashtags in this way:

    >>> set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])
    set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
    

    EDITx2: UNICODE #Hashtags and control for # repetitions

    If you want to control for multiple repetitions of the # symbol, as in (forgive me if the text example has become almost unreadable):

    >>> text = u"I love ###stackoverflòw, because ##################peoplè... are very ####helpfùl! Are they really ##helpfùl??? Yes ###peoplè in ######stackoverflòw are really really ######helpfùl!!!"
    u'I love ###stackoverfl\xf2w, because ##################peopl\xe8... are very ####helpf\xf9l! Are they really ##helpf\xf9l??? Yes ###peopl\xe8 in ######stackoverfl\xf2w are really really ######helpf\xf9l!!!'
    

    then you should replace these multiple occurrences with a single #. A possible solution is to add another nested set() around a re.sub() call that replaces each run of more than one # with a single #:

    >>> set([re.sub(r"#+", "#", k) for k in set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])])
    set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
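
    The three steps above (keep tokens that start with #, strip trailing punctuation, collapse repeated #) can also be written as one small function. This is only a sketch assuming Python 3, where str patterns already match Unicode word characters; the name unique_hashtags and the shortened sample text are made up for illustration.

```python
import re

def unique_hashtags(text):
    """Collect distinct hashtags from whitespace-separated tokens."""
    tags = set()
    for token in text.split():
        if not token.startswith("#"):
            continue
        token = re.sub(r"\W+$", "", token)  # drop trailing punctuation
        token = re.sub(r"#+", "#", token)   # collapse runs of '#'
        if len(token) > 1:                  # skip a bare '#'
            tags.add(token)
    return tags

sample = "I love ###stackoverflòw, because ##peoplè... are very #helpfùl! Really #helpfùl???"
print(sorted(unique_hashtags(sample)))
# prints ['#helpfùl', '#peoplè', '#stackoverflòw']
```

    Like the one-liners, this only sees hashtags at the start of a whitespace-separated token, so something like "(#tag)" would be missed.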
    
  • 2020-12-03 06:00

    A Twitter hashtag regular expression that requires at least one ASCII letter, so a bare # and all-digit tags such as #123 are skipped:

    import re
    text = "#promovolt #1st # promovolt #123"
    re.findall(r'\B#\w*[a-zA-Z]+\w*', text)
    
    # -> ['#promovolt', '#1st']
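
    Because [a-zA-Z] is ASCII-only, a tag written entirely in another script would be skipped by the pattern above. A possible variant, assuming Python 3 (where \w is Unicode-aware for str patterns), swaps in [^\W\d_] as a "any letter" class. The names HASHTAG_RE and find_hashtags are hypothetical.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor '_'",
# which in practice means a letter in any script.
HASHTAG_RE = re.compile(r"\B#\w*[^\W\d_]+\w*")

def find_hashtags(text):
    """Return hashtags that contain at least one letter."""
    return HASHTAG_RE.findall(text)

print(find_hashtags("#promovolt #1st # promovolt #123 #日本語 a#b"))
# prints ['#promovolt', '#1st', '#日本語']
```

    The \B in front still rejects a # glued to a preceding word character, as in "a#b".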
    

  • 2020-12-03 06:03

    A simple gist (better than the chosen answer) that also works with Unicode hashtags: https://gist.github.com/mahmoud/237eb20108b5805aed5f

  • 2020-12-03 06:05

    I had a lot of issues with Unicode languages.

    I have seen many ways to extract hashtags, but found that none of them handles every case.

    So I wrote a small piece of Python code to handle most cases. It works for me.

    def get_hashtagslist(string):
        ret = []
        s = ''
        hashtag = False
        for char in string:
            if char == '#':
                # a new '#' starts a tag and closes any tag still in progress
                hashtag = True
                if s:
                    ret.append(s)
                    s = ''
                continue
    
            # a delimiter ends the current tag: for '#happy,but i..' only 'happy' is kept
            if hashtag and char in [' ', '.', ',', '(', ')', ':', '{', '}']:
                if s:
                    ret.append(s)
                    s = ''
                hashtag = False
                continue
    
            if hashtag:
                s += char
    
        if s:
            ret.append(s)
    
        # drop duplicates and keep only tags between 2 and 19 characters long
        return list(set([word for word in ret if len(word) > 1 and len(word) < 20]))
    
  • 2020-12-03 06:07

    I extracted hashtags in a silly but effective way.

    def retrive(s):
        indice_t = []
        tags = []
        s = s.strip()
        # record the position of every '#'
        for i in range(len(s)):
            if s[i] == "#":
                indice_t.append(i)
        for i in range(len(indice_t)):
            index = indice_t[i]
            # a tag can extend at most to the next '#' or to the end of the string
            if i == len(indice_t)-1:
                boundary = len(s)
            else:
                boundary = indice_t[i+1]
            index += 1
            tmp_str = ''  # reset for every tag so text never leaks into the next one
            while index < boundary:
                # stop at the first punctuation or whitespace character
                if s[index] in "`~!@#$%^&*()-_=+[]{}|\\:;'\",.<>?/ \n\t":
                    break
                tmp_str += s[index]
                index += 1
            if tmp_str != '':
                tags.append(tmp_str)
        return tags
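
    For comparison, roughly the same scan can be expressed as a single regular expression built from the delimiter characters used in the function above (with the double quote included). This is a sketch, not a drop-in replacement; the names DELIMITERS, TAG_RE and retrieve_tags are invented for the example.

```python
import re

# Characters that end a tag; re.escape makes them safe inside a character class.
DELIMITERS = "`~!@#$%^&*()-_=+[]{}|\\:;'\",.<>?/ \n\t"
TAG_RE = re.compile("#([^" + re.escape(DELIMITERS) + "]+)")

def retrieve_tags(s):
    """Return the text after each '#', up to the first delimiter."""
    return TAG_RE.findall(s)

print(retrieve_tags("I love #stackoverflow, because #people... are #very_helpful!"))
# prints ['stackoverflow', 'people', 'very']
```

    Since "_" is in the delimiter set, "#very_helpful" is cut to "very", which matches how the loop-based version treats underscores.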
    
  • 2020-12-03 06:08
    >>> s="I love #stackoverflow because #people are very #helpful!"
    >>> [i  for i in s.split() if i.startswith("#") ]
    ['#stackoverflow', '#people', '#helpful!']
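
    Note that the last item keeps its trailing "!". One way to drop it, sketched here with only the standard library, is to strip trailing ASCII punctuation from each token with str.rstrip; the function name hashtags is just a placeholder.

```python
import string

def hashtags(s):
    """Return '#'-prefixed tokens with trailing ASCII punctuation removed."""
    tags = []
    for word in s.split():
        if word.startswith("#"):
            tag = word.rstrip(string.punctuation)  # '#helpful!' -> '#helpful'
            if tag:  # a bare '#' strips down to '' and is skipped
                tags.append(tag)
    return tags

print(hashtags("I love #stackoverflow because #people are very #helpful!"))
# prints ['#stackoverflow', '#people', '#helpful']
```

    string.punctuation covers ASCII only and includes "_", so a tag ending in an underscore or followed by non-ASCII punctuation would need a different rule.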
    