tokenize

Python NLP - ValueError: could not convert string to float: 'UKN'

Posted by 天涯浪子 on 2020-04-18 12:33:15
Question: I'm trying to train a random forest regressor to predict the hourly wage of an employee given the job description supplied. Note, I've signed an NDA and cannot upload real data. The "observation" below is synthetic: sample_row = {'job_posting_id': 'id_01', 'buyer_vertical': 'Business Services', 'currency': 'USD', 'fg_onet_code': '43-9011.00', 'jp_title': 'Computer Operator', 'jp_description': "Performs information security-related risk and compliance activities, including but not limited to …
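The ValueError means a raw string category (the placeholder 'UKN' here) reached an estimator that only accepts numbers, so the string columns need to be encoded before fitting. A minimal sketch of one way to do that with scikit-learn; the toy rows, column choices and the hourly_wage target are illustrative assumptions, not taken from the original post:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real (NDA-protected) data; column names mirror sample_row
# and 'hourly_wage' is an assumed target column.
df = pd.DataFrame([
    {"buyer_vertical": "Business Services", "currency": "USD", "fg_onet_code": "43-9011.00",
     "jp_description": "Performs information security risk and compliance activities", "hourly_wage": 25.0},
    {"buyer_vertical": "Business Services", "currency": "USD", "fg_onet_code": "43-9061.00",
     "jp_description": "General office clerk handling filing and data entry", "hourly_wage": 18.0},
])

preprocess = ColumnTransformer(
    transformers=[
        # One-hot encode the categorical strings; values unseen at predict time
        # (e.g. a stray 'UKN') are ignored instead of raising an error.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["buyer_vertical", "currency", "fg_onet_code"]),
        # Turn the free-text job description into TF-IDF features.
        ("txt", TfidfVectorizer(), "jp_description"),
    ],
    remainder="drop",
)

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
])

model.fit(df.drop(columns=["hourly_wage"]), df["hourly_wage"])
print(model.predict(df.drop(columns=["hourly_wage"])))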

SpaCy intra-word hyphens: how to treat them as one word?

Posted by 天涯浪子 on 2020-04-11 06:31:23
Question: The following is the code provided as an answer to the question: import spacy from spacy.tokenizer import Tokenizer from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex import re nlp = spacy.load('en') infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)") infix_re = spacy.util.compile_infix_regex(infixes) def custom_tokenizer(nlp): return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer) nlp.tokenizer = custom_tokenizer(nlp) s1 = "Marketing …
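The snippet above goes in the opposite direction: it adds extra infix patterns, so spaCy splits on even more characters. To keep intra-word hyphens together, rebuild the infix rules without the hyphen-between-letters rule, along the lines of the example in the spaCy docs. A sketch, assuming a recent spaCy release and the en_core_web_sm model:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Rebuild the default infix patterns, leaving out the rule that splits on
# hyphens between letters, so hyphenated words stay as single tokens.
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("mother-in-law and state-of-the-art")])
# ['mother-in-law', 'and', 'state-of-the-art']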

Keras Tokenizer num_words doesn't seem to work

Posted by 北城以北 on 2020-03-17 08:31:28
Question: >>> t = Tokenizer(num_words=3) >>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"] >>> t.fit_on_texts(l) >>> t.word_index {'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4} I'd have expected t.word_index to contain just the top 3 words. What am I doing wrong? Answer 1: There is nothing wrong in what you are doing. word_index is computed the same way no matter how many most frequent words …
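As the answer says, word_index always holds the full vocabulary; num_words only takes effect when texts are converted to sequences or matrices, and it keeps word indices strictly below num_words (so num_words=3 keeps the top two words). A quick check, assuming the Tokenizer shipped with tensorflow.keras:

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["Hello, World! This is so&#$ fantastic!",
         "There is no other world like this one"]

t = Tokenizer(num_words=3)
t.fit_on_texts(texts)

print(t.word_index)                 # full vocabulary, regardless of num_words
print(t.texts_to_sequences(texts))
# [[1, 2], [1, 2]] here: only 'world' (1) and 'this' (2) survive the num_words cut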

Get bigrams and trigrams in word2vec Gensim

Posted by 不想你离开。 on 2020-02-26 07:23:54
Question: I am currently using unigrams in my word2vec model as follows: def review_to_sentences( review, tokenizer, remove_stopwords=False ): # Returns a list of sentences, where each sentence is a list of words # NLTK tokenizer to split the paragraph into sentences raw_sentences = tokenizer.tokenize(review.strip()) sentences = [] for raw_sentence in raw_sentences: # If a sentence is empty, skip it if len(raw_sentence) > 0: # Otherwise, call review_to_wordlist to get a list of words sentences.append( …
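A common way to add bigrams and trigrams is gensim's Phrases/Phraser: learn frequent collocations over the tokenized sentences and rewrite them as single joined tokens before training Word2Vec. A sketch, assuming gensim 4.x and that sentences is the list of word lists produced by review_to_sentences above; min_count and threshold are illustrative values:

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# `sentences` is the corpus built by review_to_sentences(...) in the question.
# Learn bigram collocations, then trigrams on top of the bigrammed corpus.
bigram = Phraser(Phrases(sentences, min_count=5, threshold=10))
trigram = Phraser(Phrases(bigram[sentences], min_count=5, threshold=10))

# Frequent collocations become single tokens such as 'customer_service'.
ngram_sentences = [trigram[bigram[sent]] for sent in sentences]

model = Word2Vec(ngram_sentences, vector_size=100, window=5, min_count=5, workers=4)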

How to split text into paragraphs using NLTK's nltk.tokenize.texttiling?

Posted by 随声附和 on 2020-02-23 05:32:45
Question: I found "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into TextTiling; however, I am unable to actually get back a text tokenized by paragraph / topic change as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html. When I feed my text into TextTiling, I get the same untokenized text back, but as a list, which is of no use to me. tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords …
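TextTiling can only place boundaries at existing blank-line paragraph breaks, so a text without "\n\n" separators comes back essentially unsegmented; with proper breaks, tokenize() returns multi-paragraph tiles, one per subtopic. A small sketch, assuming NLTK's stopwords corpus is available and that document.txt is a hypothetical file whose paragraphs are separated by blank lines:

import nltk
from nltk.tokenize.texttiling import TextTilingTokenizer

nltk.download("stopwords")  # TextTiling uses NLTK's English stopword list by default

tt = TextTilingTokenizer(w=20, k=10)

# The input must contain blank lines ("\n\n") between paragraphs; without them
# TextTiling cannot place any topic boundaries.
raw_text = open("document.txt", encoding="utf-8").read()  # hypothetical file

tiles = tt.tokenize(raw_text)
for i, tile in enumerate(tiles, 1):
    print(f"--- tile {i} ---")
    print(tile.strip())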

Sentence tokenization for texts that contain quotes

Posted by 半世苍凉 on 2020-02-01 03:59:05
Question: Code: from nltk.tokenize import sent_tokenize pprint(sent_tokenize(unidecode(text))) Output: ['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.', 'Finally they pushed you out of the cold emergency room.', 'I failed to protect you.', '"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.',] Input: After Du died …
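sent_tokenize knows nothing about quoted speech, so it splits inside the quotation. One workaround is to post-process its output and keep merging fragments until the double quotes balance. A heuristic sketch (it assumes straight " quotes, as produced by unidecode, and will mishandle nested quoting):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

def sent_tokenize_keep_quotes(text):
    """Merge sentence fragments so quoted speech stays inside one item."""
    merged, buffer, inside_quote = [], "", False
    for fragment in sent_tokenize(text):
        buffer = f"{buffer} {fragment}".strip() if buffer else fragment
        if fragment.count('"') % 2 == 1:   # an odd number of quotes flips the state
            inside_quote = not inside_quote
        if not inside_quote:               # only emit once the quotation is closed
            merged.append(buffer)
            buffer = ""
    if buffer:                             # trailing unbalanced quote: emit as-is
        merged.append(buffer)
    return merged

print(sent_tokenize_keep_quotes('He said: "I tried. I failed." Then he left.'))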

Tokenizing issue

Posted by 十年热恋 on 2020-01-25 10:47:05
Question: I am trying to tokenize a sentence as follows. Section <- c("If an infusion reaction occurs, interrupt the infusion.") df <- data.frame(Section) When I tokenize using tidytext and the code below, AA <- df %>% mutate(tokens = str_extract_all(df$Section, "([^\\s]+)"), locations = str_locate_all(df$Section, "([^\\s]+)"), locations = map(locations, as.data.frame)) %>% select(-Section) %>% unnest(tokens, locations) it gives me a result set as below (see image). How do I get the comma and the …
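The pattern ([^\s]+) treats every run of non-whitespace as one token, so "occurs," and "infusion." keep their punctuation attached. If the goal is to get the comma and the period as separate tokens with their own locations, a pattern such as \w+|[^\w\s] does that and can be dropped straight into str_extract_all / str_locate_all. Here is the same regex idea sketched in Python's re (note the offsets are 0-based here, whereas str_locate_all is 1-based):

import re

section = "If an infusion reaction occurs, interrupt the infusion."

# \w+ matches each word, [^\w\s] matches each punctuation mark on its own,
# so ',' and '.' become separate tokens with their own start/end offsets.
for m in re.finditer(r"\w+|[^\w\s]", section):
    print(f"{m.group()!r:15} start={m.start()} end={m.end()}")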

How to combine certain set of words into token in Elasticsearch?

Posted by 孤街醉人 on 2020-01-25 09:41:27
Question: For a string like "This is a beautiful day", I want to tokenize the string into the tokens "This, is, a, beautiful, day, beautiful day", where I can specify a certain set of words to combine - in this case only "beautiful" and "day". So far, I have used the shingle filter to produce a token list like this: "This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day". How can I further filter the token list above to produce my desired result? Here is my current code: shingle_filter = { …
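One way to end up with all unigrams plus only a whitelisted phrase is to leave the main field as plain unigrams and put the shingle output behind a keep token filter in a sub-field, so only the listed phrases survive there. A hedged sketch of such index settings, written as a Python dict for elasticsearch-py; the index name, filter names, and client version are assumptions:

from elasticsearch import Elasticsearch

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,
                },
                "phrase_whitelist": {        # only these shingles survive
                    "type": "keep",
                    "keep_words": ["beautiful day"],
                },
            },
            "analyzer": {
                "whitelisted_phrases": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_shingles", "phrase_whitelist"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",              # plain unigrams: this, is, a, beautiful, day
                "fields": {
                    "phrases": {             # whitelisted shingles only: "beautiful day"
                        "type": "text",
                        "analyzer": "whitelisted_phrases",
                    }
                },
            }
        }
    },
}

# Assumes a running cluster and an elasticsearch-py client that accepts body=.
es = Elasticsearch()
es.indices.create(index="my_index", body=index_body)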

How can spaCy tokenize a hashtag as a whole?

Posted by 喜夏-厌秋 on 2020-01-24 10:06:14
Question: In a sentence containing hashtags, such as a tweet, spaCy's tokenizer splits hashtags into two tokens: import spacy nlp = spacy.load('en') doc = nlp(u'This is a #sentence.') [t for t in doc] output: [This, is, a, #, sentence, .] I'd like to have hashtags tokenized like this: [This, is, a, #sentence, .] Is that possible? Thanks. Answer 1: You can do some pre- and post-string manipulation, which lets you bypass '#'-based tokenization and is easy to implement, e.g. >>> import re >>> import …
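An alternative that avoids touching the strings at all is to keep the default tokenizer and merge '#' with the following word afterwards via doc.retokenize(). A sketch, assuming spaCy 2+ and an installed English model such as en_core_web_sm:

import re
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a #sentence.")

# Find hashtag spans in the raw text and merge each one into a single token.
with doc.retokenize() as retokenizer:
    for match in re.finditer(r"#\w+", doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None:   # None would mean the span doesn't line up with token boundaries
            retokenizer.merge(span)

print([t.text for t in doc])   # [This, is, a, #sentence, .]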
