I am having a heck of a time taking the information in a tweet including hashtags, and pulling each hashtag into an array using Python. I am embarrassed to even put what I
Suppose that you have to retrieve your #Hashtags from a sentence full of punctuation symbols. Let's say that #stackoverflow, #people and #helpful are terminated with different symbols. You want to retrieve them from the text, but you may also want to avoid repetitions:
>>> text = "I love #stackoverflow, because #people... are very #helpful! Are they really #helpful??? Yes #people in #stackoverflow are really really #helpful!!!"
If you try set([i for i in text.split() if i.startswith("#")]) alone, you will get:
>>> set(['#helpful???',
'#people',
'#stackoverflow,',
'#stackoverflow',
'#helpful!!!',
'#helpful!',
'#people...'])
which, to my mind, is redundant. A better solution uses regular expressions via the re module:
>>> import re
>>> set([re.sub(r"(\W+)$", "", j) for j in set([i for i in text.split() if i.startswith("#")])])
>>> set(['#people', '#helpful', '#stackoverflow'])
Now it's ok for me.
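For readers on Python 3, the same idea can be written as a set comprehension. This is only a minimal sketch of the approach above, and the function name unique_hashtags is made up for illustration:

```python
import re

def unique_hashtags(text):
    """Return the distinct hashtags in text, with trailing punctuation removed."""
    # Keep whitespace-separated tokens that start with '#', then strip
    # any run of non-word characters from the end of each token.
    return {re.sub(r"\W+$", "", token)
            for token in text.split()
            if token.startswith("#")}

text = ("I love #stackoverflow, because #people... are very #helpful! "
        "Are they really #helpful??? Yes #people in #stackoverflow "
        "are really really #helpful!!!")
print(sorted(unique_hashtags(text)))
# prints ['#helpful', '#people', '#stackoverflow']
```

The set comprehension removes duplicates on its own, so the inner set() from the one-liner is not needed.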
EDIT: UNICODE #Hashtags
Add the re.UNICODE flag if you want to strip punctuation while still preserving letters with accents, apostrophes and other Unicode characters, which may be important if the #Hashtags are not expected to be in English only... maybe this is only an Italian guy's nightmare, maybe not! ;-)
For example:
>>> text = u"I love #stackoverflòw, because #peoplè... are very #helpfùl! Are they really #helpfùl??? Yes #peoplè in #stackoverflòw are really really #helpfùl!!!"
will be represented as the following unicode string:
>>> u'I love #stackoverfl\xf2w, because #peopl\xe8... are very #helpf\xf9l! Are they really #helpf\xf9l??? Yes #peopl\xe8 in #stackoverfl\xf2w are really really #helpf\xf9l!!!'
and you can retrieve your (correctly encoded) #Hashtags in this way:
>>> set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])
>>> set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
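The snippets above look like Python 2 (u"" literals, set([...]) output). In Python 3, str patterns are Unicode-aware by default, so the flag should not be needed there. The sketch below assumes Python 3 and uses re.ASCII only to show what goes wrong when \W is not Unicode-aware:

```python
import re

text = "I love #stackoverflòw, because #peoplè... are very #helpfùl!"

# On Python 3, \W applied to a str treats accented letters as word
# characters, so only the trailing punctuation is stripped.
tags = {re.sub(r"\W+$", "", t) for t in text.split() if t.startswith("#")}
print(sorted(tags))

# With ASCII-only matching, the trailing 'è' counts as a non-word
# character and is stripped together with the dots.
ascii_only = re.sub(r"\W+$", "", "#peoplè...", flags=re.ASCII)
print(ascii_only)
```

The second call returns '#peopl', which is the kind of damage the re.UNICODE flag avoids on Python 2.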
EDITx2: UNICODE #Hashtags and control for # repetitions
If you want to control for multiple repetitions of the # symbol, as in (forgive me if the text example has become almost unreadable):
>>> text = u"I love ###stackoverflòw, because ##################peoplè... are very ####helpfùl! Are they really ##helpfùl??? Yes ###peoplè in ######stackoverflòw are really really ######helpfùl!!!"
>>> u'I love ###stackoverfl\xf2w, because ##################peopl\xe8... are very ####helpf\xf9l! Are they really ##helpf\xf9l??? Yes ###peopl\xe8 in ######stackoverfl\xf2w are really really ######helpf\xf9l!!!'
then you should replace these multiple occurrences with a single #.
A possible solution is to wrap the expression in one more set() and call the sub() function to replace every run of more than one # with a single #:
>>> set([re.sub(r"#+", "#", k) for k in set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])])
>>> set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
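A different way to get the same result is to let a single pattern swallow the run of # characters and capture the word that follows. This is a sketch for Python 3, not the method used above; the name normalized_hashtags is hypothetical. Unlike the split()-based version, it also picks up a hashtag glued to the end of another token.

```python
import re

def normalized_hashtags(text):
    """Return distinct hashtags, each written with exactly one leading '#'."""
    # '#+' consumes any run of '#' characters; the group captures the
    # word characters that follow, so trailing punctuation is left out.
    return {"#" + word for word in re.findall(r"#+(\w+)", text)}

text = ("I love ###stackoverflòw, because ##################peoplè... "
        "are very ####helpfùl! Are they really ##helpfùl??? "
        "Yes ###peoplè in ######stackoverflòw are really really ######helpfùl!!!")
print(sorted(normalized_hashtags(text)))
# prints ['#helpfùl', '#peoplè', '#stackoverflòw']
```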
A Twitter hashtag regular expression that requires at least one letter, so purely numeric tags and a bare # are skipped:
import re
text = "#promovolt #1st # promovolt #123"
re.findall(r'\B#\w*[a-zA-Z]+\w*', text)
>>> ['#promovolt', '#1st']
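To reuse that pattern, it can be compiled once and wrapped in a small helper that also drops repeated tags while keeping first-seen order. This is a sketch under a few assumptions: Python 3.7+ (dict preserves insertion order), and the names HASHTAG_RE and find_hashtags are invented here.

```python
import re

# \B keeps '#' from matching in the middle of a word (e.g. "abc#def");
# [a-zA-Z]+ requires at least one ASCII letter somewhere in the tag.
HASHTAG_RE = re.compile(r"\B#\w*[a-zA-Z]+\w*")

def find_hashtags(text):
    """Return hashtags in order of first appearance, without duplicates."""
    return list(dict.fromkeys(HASHTAG_RE.findall(text)))

print(find_hashtags("#promovolt #1st # promovolt #123 #promovolt"))
# prints ['#promovolt', '#1st']
print(find_hashtags("abc#def"))
# prints []
```

Note that because of [a-zA-Z], a tag made only of digits and non-ASCII letters would not match this pattern.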
A simple gist (better than the chosen answer), which also works with Unicode hashtags: https://gist.github.com/mahmoud/237eb20108b5805aed5f
I had a lot of issues with Unicode languages.
I had seen many ways to extract hashtags, but found none of them handled all cases,
so I wrote a small piece of Python code to handle most of them. It works for me.
def get_hashtagslist(string):
    ret = []
    s = ''
    hashtag = False
    for char in string:
        if char == '#':
            # a new '#' starts a tag and closes any tag still being collected
            hashtag = True
            if s:
                ret.append(s)
                s = ''
            continue
        # take only the prefix of the hashtag if it contains one of these chars (e.g. for '#happy,but i..' it takes only 'happy')
        if hashtag and char in [' ','.',',','(',')',':','{','}'] and s:
            ret.append(s)
            s = ''
            hashtag = False
        if hashtag:
            s += char
    if s:
        ret.append(s)
    # keep unique tags between 2 and 19 characters long (checks len(word); the original tested len(ret))
    return list(set([word for word in ret if len(word)>1 and len(word)<20]))
I extracted hashtags in a silly but effective way.
def retrive(s):
    indice_t = []
    tags = []
    s = s.strip()
    # record the position of every '#'
    for i in range(len(s)):
        if s[i] == "#":
            indice_t.append(i)
    for i in range(len(indice_t)):
        index = indice_t[i]
        # a tag can extend at most to the next '#' or to the end of the string
        if i == len(indice_t)-1:
            boundary = len(s)
        else:
            boundary = indice_t[i+1]
        index += 1
        tmp_str = ''
        while index < boundary:
            # stop at the first punctuation or whitespace character
            if s[index] in "`~!@#$%^&*()-_=+[]{}|\\:;'"",.<>?/ \n\t":
                break
            tmp_str += s[index]
            index += 1
        # store the tag (without '#'), skipping empty ones such as a bare '#'
        if tmp_str != '':
            tags.append(tmp_str)
    return tags
>>> s="I love #stackoverflow because #people are very #helpful!"
>>> [i for i in s.split() if i.startswith("#") ]
['#stackoverflow', '#people', '#helpful!']
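As the output shows, the last tag keeps its trailing "!". One way to clean that up without regular expressions is to strip ASCII punctuation from the right of each token. This is a minimal sketch; the function name hashtags is made up, and string.punctuation covers ASCII symbols only (it includes "_", so trailing underscores are stripped too).

```python
import string

def hashtags(s):
    """Return tokens starting with '#', minus trailing ASCII punctuation."""
    tags = []
    for word in s.split():
        if word.startswith("#"):
            tag = word.rstrip(string.punctuation)
            if tag:  # a bare "#" strips down to an empty string
                tags.append(tag)
    return tags

s = "I love #stackoverflow because #people are very #helpful!"
print(hashtags(s))
# prints ['#stackoverflow', '#people', '#helpful']
```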