Question
In a sentence containing hashtags, such as a tweet, spaCy's tokenizer splits each hashtag into two tokens:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]
output:
[This, is, a, #, sentence, .]
I'd like to have hashtags tokenized as such:
[This, is, a, #sentence, .]
Is that possible?
Thanks
Answer 1:
- You can do some pre- and post-processing on the string to bypass the '#'-based tokenization; this is easy to implement. For example:
>>> import re
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = u'This is my twitter update #MyTopic'
>>> parsed = nlp(sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
>>> new_sentence = re.sub(r'#(\w+)',r'ZZZPLACEHOLDERZZZ\1',sentence)
>>> new_sentence
u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
>>> parsed = nlp(new_sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
>>> [x.replace(u'ZZZPLACEHOLDERZZZ','#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']
- You can try setting custom separators in spaCy's tokenizer. I am not aware of a method for doing that.
UPDATE: You can use a regex to find the span of each token you want to keep as a single token, and retokenize using the span.merge method as described here: https://spacy.io/docs/api/span#merge
Merge example:
>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer(r'#\w+',my_str)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
...     parsed.merge(start_idx=start,end_idx=end)
...
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>>
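The placeholder trick from the first bullet can be wrapped in a small helper so that the masking and unmasking always happen together. This is only a minimal sketch: `tokenize_keeping_hashtags` and `simple_tokenize` are hypothetical names, and `simple_tokenize` is a regex stand-in for a real tokenizer so that the example runs without spaCy.

```python
import re

PLACEHOLDER = "ZZZPLACEHOLDERZZZ"  # same marker string as in the answer above


def tokenize_keeping_hashtags(text, tokenize):
    """Tokenize `text` with `tokenize`, keeping each #hashtag as one token.

    `tokenize` is any callable that maps a string to a list of token strings
    (a hypothetical parameter, not part of spaCy).
    """
    # Hide the '#' so the tokenizer sees an ordinary word.
    masked = re.sub(r"#(\w+)", PLACEHOLDER + r"\1", text)
    # Put the '#' back in every token that carries the marker.
    return [tok.replace(PLACEHOLDER, "#") for tok in tokenize(masked)]


def simple_tokenize(text):
    """Stand-in tokenizer (assumption): runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize_keeping_hashtags("This is my twitter update #MyTopic", simple_tokenize)
print(tokens)
```

With spaCy, the second argument could be something like `lambda s: [t.text for t in nlp(s)]`. Note that this returns plain strings, not a Doc, so it only helps when the token texts are all you need.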
Answer 2:
This is more of an add-on to the great answer by @DhruvPathak and a shameless copy from the GitHub thread linked below (and the even better answer by @csvance). Since v2.0, spaCy has had the add_pipe method, which means you can define @DhruvPathak's great answer as a function and conveniently add that step to your nlp processing pipeline, as below.
The quoted code starts here:
import spacy

def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index,token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        # stop once a full pass over the doc has merged nothing
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc

nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)
doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'
The quoted code ends here. Check out "how to add hashtags to the part of speech tagger" #503 for the full thread.
PS: It's clear from reading the code, but for the copy-and-pasters: don't disable the parser, because the function relies on token.head :)
Answer 3:
I found this on GitHub; it uses spaCy's Matcher:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

doc = nlp('This is a #sentence. Here is another #hashtag. #The #End.')
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    hashtags.append(doc[start:end])

for span in hashtags:
    span.merge()

print([t.text for t in doc])
outputs:
['This', 'is', 'a', '#sentence', '.', 'Here', 'is', 'another', '#hashtag', '.', '#The', '#End', '.']
A list of the hashtags is also available in the hashtags list:
print(hashtags)
output:
[#sentence, #hashtag, #The, #End]
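If only the token strings matter, the rule that the Matcher pattern encodes ('#' followed by one ASCII token) can also be applied to a plain list of strings. The sketch below is not spaCy API: `merge_hashtag_tokens` is a hypothetical helper, and it uses `str.isascii()` (Python 3.7+) as a rough equivalent of the `IS_ASCII` attribute.

```python
def merge_hashtag_tokens(tokens):
    """Join each '#' with the token that follows it, if that token is ASCII.

    Works on a list of token strings, so no spaCy Doc is needed.
    """
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "#" and nxt is not None and nxt.isascii():
            merged.append(tok + nxt)  # '#', 'sentence' -> '#sentence'
            i += 2                    # skip the token that was absorbed
        else:
            merged.append(tok)
            i += 1
    return merged


tokens = ["This", "is", "a", "#", "sentence", ".", "#", "The", "#", "End", "."]
print(merge_hashtag_tokens(tokens))
```

Like the pattern it mirrors, the rule is loose: a '#' followed by a punctuation token would be merged as well, so a stricter check such as `nxt.isalnum()` may be preferable.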
Answer 4:
I spent quite a bit of time on this, so I figured I'd share what I came up with. Building a custom Tokenizer and adding the regex for hashtags to the default URL_PATTERN was the easiest solution for me. I additionally added a custom extension that identifies hashtags:
import re
import spacy
from spacy.language import Language
from spacy.tokenizer import Tokenizer
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm')

def create_tokenizer(nlp):
    # contains the regex to match all sorts of urls:
    from spacy.lang.tokenizer_exceptions import URL_PATTERN

    # spacy defaults: when the standard behaviour is required, they
    # need to be included when building a custom tokenizer
    prefix_re = spacy.util.compile_prefix_regex(Language.Defaults.prefixes)
    infix_re = spacy.util.compile_infix_regex(Language.Defaults.infixes)
    suffix_re = spacy.util.compile_suffix_regex(Language.Defaults.suffixes)

    # extending the default url regex with regex for hashtags with "or" = |
    hashtag_pattern = r'''|^(#[\w_-]+)$'''
    url_and_hashtag = URL_PATTERN + hashtag_pattern
    url_and_hashtag_re = re.compile(url_and_hashtag)

    # set a custom extension to match if token is a hashtag
    hashtag_getter = lambda token: token.text.startswith('#')
    Token.set_extension('is_hashtag', getter=hashtag_getter)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=url_and_hashtag_re.match
                     )

nlp.tokenizer = create_tokenizer(nlp)
doc = nlp("#spreadhappiness #smilemore so_great@good.com https://www.somedomain.com/foo")

for token in doc:
    print(token.text)
    if token._.is_hashtag:
        print("-> matches hashtag")

# returns: "#spreadhappiness -> matches hashtag #smilemore -> matches hashtag so_great@good.com https://www.somedomain.com/foo"
Answer 5:
I also tried several ways to prevent spaCy from splitting hashtags or hyphenated words like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the POS tagger and dependency parser have already based their decisions on the wrong tokens. Touching the infix, prefix, and suffix regexes is error-prone and complex, because you don't want your changes to produce side effects.
The simplest way is indeed, as pointed out before, to modify the token_match function of the tokenizer. This is a re.match function that identifies strings that should not be split. Instead of importing the specific URL pattern, I'd rather extend whatever spaCy's default is.
import re
import spacy
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load('en')

# get default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens);
# raw f-string so that \w is not treated as a string escape
re_token_match = rf"({re_token_match}|#\w+|\w+-\w+)"

# overwrite token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match

text = "@Pete: choose low-carb #food #eatsmart ;-) 😋👍"
doc = nlp(text)
print([token.text for token in doc])
This yields:
['@Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', '😋', '👍']
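One detail worth checking when extending token_match this way is anchoring. `re.match` only anchors at the start of the string, so an unanchored alternative such as `#\w+` also matches a chunk that has trailing punctuation. The sketch below uses only the two added alternatives (spaCy's default pattern is left out so that it runs with the standard library alone) and compares them with an anchored variant in the style of the `^(#[\w_-]+)$` pattern from the previous answer. Whether this changes the tokenization depends on how the tokenizer applies token_match to each chunk, which is an assumption about spaCy internals and not something stated above.

```python
import re

# The two alternatives added in the answer, without spaCy's default pattern.
unanchored = re.compile(r"(#\w+|\w+-\w+)")
# Same alternatives, but required to cover the whole chunk.
anchored = re.compile(r"^(#\w+|\w+-\w+)$")

chunks = ["#food", "low-carb", "#sentence.", "food"]
results = {
    chunk: (bool(unanchored.match(chunk)), bool(anchored.match(chunk)))
    for chunk in chunks
}

for chunk, (loose, strict) in results.items():
    print(f"{chunk!r}: unanchored={loose}, anchored={strict}")
```

Only "#sentence." differs between the two: the unanchored pattern accepts it because the match stops before the period, while the anchored one rejects it. If hashtags followed by punctuation come out with the punctuation attached, anchoring the added alternatives with `$` is one thing to try.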
Source: https://stackoverflow.com/questions/43388476/how-could-spacy-tokenize-hashtag-as-a-whole