nltk

NLTK tokenizer and Stanford CoreNLP tokenizer cannot distinguish two sentences without a space after the period (.)

Submitted by 我的未来我决定 on 2021-02-09 08:17:29
Question: I have two sentences in my dataset:

    w1 = "I am Pusheen the cat.I am so cute." # no space after period
    w2 = "I am Pusheen the cat. I am so cute." # with space after period

When I use the NLTK tokenizers (both word and sent), NLTK cannot split "cat.I" into two tokens. Here is word tokenize:

    >>> nltk.word_tokenize(w1, 'english')
    ['I', 'am', 'Pusheen', 'the', 'cat.I', 'am', 'so', 'cute']
    >>> nltk.word_tokenize(w2, 'english')
    ['I', 'am', 'Pusheen', 'the', 'cat', '.', 'I', 'am', 'so', 'cute']

and sent tokenize >>
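A common workaround (not part of NLTK itself; the function name below is hypothetical) is to pre-insert a space after any period directly followed by an uppercase letter, so the tokenizer can see the sentence boundary. A minimal regex sketch:

```python
import re

def add_space_after_period(text):
    # Insert a space after a period that is immediately followed by an
    # uppercase letter, exposing the sentence boundary to the tokenizer.
    # Caveat: this heuristic also splits abbreviations like "U.S.A.";
    # decimals are unaffected because digits are not matched.
    return re.sub(r'\.(?=[A-Z])', '. ', text)

w1 = "I am Pusheen the cat.I am so cute."
print(add_space_after_period(w1))
# -> "I am Pusheen the cat. I am so cute."
```

Running the repaired string through nltk.word_tokenize or sent_tokenize afterwards then yields the same tokens as w2.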

Python: NLTK ValueError: A Lidstone probability distribution must have at least one bin?

Submitted by 我们两清 on 2021-02-08 11:28:55
Question: For a task, I am to use ConditionalProbDist with LidstoneProbDist as the estimator, adding +0.01 to the sample count for each bin. I thought the following line of code would achieve this, but it produces a ValueError:

    fd = nltk.ConditionalProbDist(fd, nltk.probability.LidstoneProbDist, 0.01)

I'm not sure how to format the arguments within ConditionalProbDist and haven't had much luck finding out how via Python's help feature or Google, so if anyone could set me right, it would be
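For intuition, the Lidstone estimate itself is simple. The sketch below is plain Python with a hypothetical function name, not NLTK's implementation; it shows the formula and why a distribution with zero bins is rejected:

```python
def lidstone_prob(count, total, gamma, bins):
    # Lidstone-smoothed estimate: P = (c + gamma) / (N + gamma * B),
    # where B is the number of bins (sample types). With B == 0 the
    # estimate is ill-defined, which is essentially what NLTK's
    # "must have at least one bin" ValueError guards against.
    if bins < 1:
        raise ValueError("A Lidstone probability distribution must have at least one bin.")
    return (count + gamma) / (total + gamma * bins)

# A word seen 2 times out of 10 tokens, over 5 word types (bins):
print(lidstone_prob(2, 10, 0.01, 5))  # close to the unsmoothed 2/10
```

So whatever the calling convention, the estimator must end up knowing a positive bin count for every condition.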

Find similar texts based on paraphrase detection [closed]

Submitted by 眉间皱痕 on 2021-02-08 10:32:21
(This question was closed as off-topic on Stack Overflow 6 years ago.)

Question: I am interested in finding similar content (text) based on paraphrasing. How do I do this? Are there any specific tools which can do this? In Python, preferably.

Answer 1: I believe the tool you are looking for is Latent Semantic Analysis. Given that my post is going to
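Full LSA needs an SVD library, but a plain bag-of-words cosine similarity is a common first baseline before moving on to LSA or dedicated paraphrase models. A stdlib-only sketch (the function name is hypothetical):

```python
import math
from collections import Counter

def cosine_sim(a, b):
    # Cosine similarity between two bag-of-words term-frequency vectors.
    # 1.0 means identical word distributions; 0.0 means no shared words.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine_sim("the cat sat on the mat", "a cat sat on a mat"))
```

Note this baseline only captures word overlap; paraphrases with little shared vocabulary are exactly what LSA-style methods try to catch.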

Python NetworkX error: module 'networkx.drawing' has no attribute 'graphviz_layout'

Submitted by 白昼怎懂夜的黑 on 2021-02-08 07:33:32
Question: I am teaching myself Python and NLTK for work, using the book "Natural Language Processing with Python" (www.nltk.org/book). I am stuck on Chapter 4, Section 4, part 8, on NetworkX. When I try to run example 4.15, it should draw a graph, but instead I get the following error message:

    AttributeError: module 'networkx.drawing' has no attribute 'graphviz_layout'

The culprit code line appears to be:

    >>> nx.draw_graphviz(graph, node_size = [16 * graph.degree(n) for n in graph], node_color = [graph

Modify NLTK word_tokenize to prevent tokenization of parenthesis

Submitted by 巧了我就是萌 on 2021-02-08 07:32:48
Question: I have the following main.py:

    #!/usr/bin/env python
    # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
    import nltk
    import string
    import sys
    for token in nltk.word_tokenize(''.join(sys.stdin.readlines())):
        #print token
        if len(token) == 1 and not token in string.punctuation or len(token) > 1:
            print token

The output is the following:

    ./main.py <<< 'EGR1(-/-) mouse embryonic fibroblasts'
    EGR1
    -/-
    mouse
    embryonic
    fibroblasts

I want to slightly change the tokenizer so
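One way to keep a parenthesized group such as (-/-) together as a single token is to bypass word_tokenize for those spans with a small regex tokenizer. This is a stdlib-only sketch, not NLTK's tokenizer, and the function name is hypothetical:

```python
import re

def simple_tokenize(text):
    # Match, in order of preference: a whole parenthesized group like
    # "(-/-)", then a run of word characters, then any single
    # non-space punctuation character.
    pattern = r'\([^)]*\)|\w+|[^\w\s]'
    return re.findall(pattern, text)

print(simple_tokenize('EGR1(-/-) mouse embryonic fibroblasts'))
# -> ['EGR1', '(-/-)', 'mouse', 'embryonic', 'fibroblasts']
```

Because the parenthesized alternative comes first in the pattern, "(-/-)" survives intact instead of being split into '(', '-/-', ')'.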

Django webapp (on an Apache2 server) hangs indefinitely when importing nltk in views.py

Submitted by ﹥>﹥吖頭↗ on 2021-02-08 06:47:49
Question: To elaborate a little more on the title, I'm having issues importing nltk for use in a Django web app. I've deployed the web app on an Apache2 server. When I import nltk in views.py, the web page refuses to load and eventually times out after a few minutes of loading. I've installed nltk using pip. I've used pip to install a number of other Python packages, which I've been able to reference without issue within Django. I haven't been able to find anything solid to explain why this would be
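One frequently cited cause for this symptom (an assumption here, since the question is truncated) is mod_wsgi running the Django app in a Python sub-interpreter, where some C-extension imports can deadlock. mod_wsgi's documented workaround is to force the main interpreter in the Apache virtual-host configuration:

```apache
# Run the Django app in the main Python interpreter rather than a
# sub-interpreter; some extension modules hang when imported elsewhere.
WSGIApplicationGroup %{GLOBAL}
```

If the hang persists after an Apache restart, checking the Apache error log for where the import stalls would be the next step.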