nlp

Generate misspelled words (typos)

Submitted by 别来无恙 on 2021-02-07 15:53:35

Question: I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data. Let's say I have a document containing the text:

    {"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog". In other words, I want to add noise to strings to generate misspelled words (typos). What would be a way of automatically generating words with typos?
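A minimal sketch of one way to do this (the edit operations and the single-edit default are illustrative choices, not from the question): apply random character-level edits (deletion, insertion, substitution, transposition) to each word.

    import random
    import string

    def add_typos(word, n_edits=1):
        """Return a misspelled variant of word via random character edits."""
        chars = list(word)
        for _ in range(n_edits):
            if len(chars) < 2:
                break
            op = random.choice(["delete", "insert", "substitute", "transpose"])
            i = random.randrange(len(chars))
            if op == "delete":
                del chars[i]
            elif op == "insert":
                chars.insert(i, random.choice(string.ascii_lowercase))
            elif op == "substitute":
                chars[i] = random.choice(string.ascii_lowercase)
            elif op == "transpose" and i < len(chars) - 1:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    random.seed(42)
    print([add_typos(w) for w in "the quick brown fox".split()])

Weighting edits by keyboard adjacency would produce more realistic typos, but uniform random edits are usually enough for a recall test.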

Tokenizing an HTML document

Submitted by 元气小坏坏 on 2021-02-07 14:23:38

Question: I have an HTML document and I'd like to tokenize it using spaCy while keeping HTML tags as single tokens. Here's my code:

    import spacy
    from spacy.symbols import ORTH
    nlp = spacy.load('en', vectors=False, parser=False, entity=False)
    nlp.tokenizer.add_special_case(u'<i>', [{ORTH: u'<i>'}])
    nlp.tokenizer.add_special_case(u'</i>', [{ORTH: u'</i>'}])
    doc = nlp('Hello, <i>world</i> !')
    print([e.text for e in doc])

The output is:

    ['Hello', ',', '<', 'i', '>', 'world</i', '>', '!']

If I put spaces …
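One possible workaround, a minimal sketch rather than the question's own fix: split HTML tags out with a regex first, then build a spaCy Doc from the pieces. This assumes a spaCy v2+ install with the en_core_web_sm model (the question's spacy.load('en', ...) call is v1-era).

    import re
    import spacy
    from spacy.tokens import Doc

    nlp = spacy.load("en_core_web_sm")
    TAG_RE = re.compile(r"</?\w+[^>]*>")

    def tokenize_keeping_tags(text):
        words = []
        # re.split with a capturing group keeps the tag matches in the output
        for part in re.split(r"(</?\w+[^>]*>)", text):
            if TAG_RE.fullmatch(part):
                words.append(part)  # keep the whole tag as one token
            elif part.strip():
                words.extend(t.text for t in nlp.tokenizer(part.strip()))
        return Doc(nlp.vocab, words=words)

    print([t.text for t in tokenize_keeping_tags("Hello, <i>world</i> !")])
    # ['Hello', ',', '<i>', 'world', '</i>', '!']

Pre-splitting sidesteps the tokenizer entirely for the tags, which is why the special-case approach (which only applies to whitespace-delimited chunks) is not needed here.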

Python NLTK WUP Similarity Score not unity for exact same word

Submitted by 人走茶凉 on 2021-02-07 12:52:05

Question: Simple code like the following gives a similarity score of 0.75 in both cases. As you can see, both words are exactly the same. To avoid any confusion I also compared a word with itself. The score refuses to budge from 0.75. What is going on here?

    from nltk.corpus import wordnet as wn
    actual = wn.synsets('orange')[0]
    predicted = wn.synsets('orange')[0]
    similarity = actual.wup_similarity(predicted)
    print similarity
    similarity = actual.wup_similarity(actual)
    print similarity

Answer 1: This is an interesting …
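For context, Wu-Palmer similarity is wup(a, b) = 2·depth(lcs(a, b)) / (depth(a) + depth(b)). If an NLTK version resolves the lowest common subsumer of a synset paired with itself to its parent rather than to the synset (or mixes depths from different hypernym paths), an identical pair scores d/(d+1) instead of 1, which is exactly 0.75 when d = 3. This is a plausible reading to verify against your NLTK version's source; a small diagnostic sketch (assumes the WordNet data is downloaded):

    from nltk.corpus import wordnet as wn

    s = wn.synsets('orange')[0]
    print(s, s.definition())  # check which synset [0] actually is

    # A synset can sit at several depths because it can have several
    # hypernym paths, so "depth" in the Wu-Palmer formula is not unique.
    print(s.min_depth(), s.max_depth())

    # Is the lowest common subsumer the synset itself, or a parent?
    print(s.lowest_common_hypernyms(s))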

How to use DBpedia properties to build a topic hierarchy?

Submitted by 佐手、 on 2021-02-07 08:19:59

Question: I am trying to build a topic hierarchy by following the two DBpedia properties mentioned below: the skos:broader property and the dcterms:subject property. My intention is, given a word, to identify its topics. For example, given the term 'support vector machine', I want to identify topics such as classification algorithm, machine learning, etc. However, I am sometimes a bit confused about how to build a topic hierarchy, as I am getting more than 5 URIs for subject and many URIs for broader …
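For reference, a minimal sketch of how these two properties are usually queried together (the SPARQLWrapper package, the public endpoint, and the example resource are assumptions, not from the question):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbr:  <http://dbpedia.org/resource/>
        PREFIX dct:  <http://purl.org/dc/terms/>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT ?subject ?broader WHERE {
          dbr:Support_vector_machine dct:subject ?subject .
          OPTIONAL { ?subject skos:broader ?broader }
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        # each category URI, plus its broader parent category if any
        print(row["subject"]["value"], "->", row.get("broader", {}).get("value"))

Walking skos:broader recursively from each dcterms:subject category yields a hierarchy; some pruning of overly generic categories is usually needed, because the Wikipedia category graph fans out very quickly.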

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

Submitted by 久未见 on 2021-02-07 06:54:32

Question: The following is how I understand the point of parameter sharing in RNNs: in regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) determines the number of parameters to learn. When processing image data, for example, the number of input units is the same across all training examples (usually a constant width × height × number of RGB channels). However, sequential input data like sentences can come in highly …
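A minimal numpy sketch of the sharing itself (sizes are arbitrary): the same three parameter arrays are applied at every time step, so the parameter count is independent of sequence length and no padding to a fixed length is required for the forward pass.

    import numpy as np

    rng = np.random.default_rng(0)
    input_dim, hidden_dim = 4, 8

    # ONE set of parameters, reused at every time step.
    W_x = rng.normal(size=(hidden_dim, input_dim))
    W_h = rng.normal(size=(hidden_dim, hidden_dim))
    b = np.zeros(hidden_dim)

    def rnn_forward(xs):
        h = np.zeros(hidden_dim)
        for x in xs:  # one step per token, for any sequence length
            h = np.tanh(W_x @ x + W_h @ h + b)
        return h

    print(rnn_forward(rng.normal(size=(5, input_dim))).shape)   # length-5 input
    print(rnn_forward(rng.normal(size=(12, input_dim))).shape)  # length-12 input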

Finding conditional probability of trigram in python nltk

Submitted by 你离开我真会死。 on 2021-02-07 06:26:05

Question: I have started learning NLTK and I am following a tutorial from here, where they find conditional probability using bigrams like this:

    import nltk
    from nltk.corpus import brown
    cfreq_brown_2gram = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

However, I want to find conditional probability using trigrams. When I try to change nltk.bigrams to nltk.trigrams I get the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "home/env/local/lib/python2.7 …
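The error arises because ConditionalFreqDist unpacks each item as a (condition, sample) pair, while nltk.trigrams yields 3-tuples. A common fix, sketched below assuming the Brown corpus is downloaded: condition on the first two words.

    import nltk
    from nltk.corpus import brown

    trigrams = nltk.trigrams(brown.words())
    cfd = nltk.ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in trigrams)

    # Turn the counts into a conditional probability estimate P(w3 | w1, w2).
    cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)
    print(cpd[("in", "the")].prob("morning"))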

FastText - Cannot load model.bin because the C++ extension failed to allocate memory

Submitted by 随声附和 on 2021-02-07 05:58:07

Question: I'm trying to use the FastText Python API (https://pypi.python.org/pypi/fasttext). However, from what I've read, this API can't load the newer .bin model files from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, as suggested in https://github.com/salestock/fastText.py/issues/115. I've tried everything that is suggested in that issue, and furthermore https://github.com/Kyubyong/wordvectors doesn't have the .bin for English, otherwise the problem would be solved. Does …
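One workaround often suggested elsewhere (an assumption, not confirmed by this thread): load the pretrained .bin with gensim (3.8+) instead of the salestock fasttext package. Note that the pretrained English model is several gigabytes on disk and needs a correspondingly large amount of free RAM, which may be the real constraint behind the allocation failure.

    from gensim.models.fasttext import load_facebook_model

    # Path to the pretrained English model from the fastText repo.
    model = load_facebook_model("wiki.en.bin")
    print(model.wv.most_similar("king")[:3])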