nlp

Generate misspelled words (typos)

Submitted by 别来无恙 on 2021-02-07 15:53:35

Question: I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data. Let's say I have a document containing the text:

    {"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog". In other words, I want to add noise to strings to generate misspelled words (typos). What would be a way of automatically generating words with typos?
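A minimal sketch of one way to do this (the edit operations and the single-edit default are illustrative choices, not from the question): apply random character-level edits (deletion, insertion, substitution, transposition) to each word.

    import random
    import string

    def add_typos(word, n_edits=1):
        """Return a misspelled variant of word via random character edits."""
        chars = list(word)
        for _ in range(n_edits):
            if len(chars) < 2:
                break
            op = random.choice(["delete", "insert", "substitute", "transpose"])
            i = random.randrange(len(chars))
            if op == "delete":
                del chars[i]
            elif op == "insert":
                chars.insert(i, random.choice(string.ascii_lowercase))
            elif op == "substitute":
                chars[i] = random.choice(string.ascii_lowercase)
            elif op == "transpose" and i < len(chars) - 1:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    random.seed(42)
    print([add_typos(w) for w in "the quick brown fox".split()])

Weighting edits by keyboard adjacency would produce more realistic typos, but uniform random edits are usually enough for a recall test.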

Tokenizing an HTML document

Submitted by 元气小坏坏 on 2021-02-07 14:23:38

Question: I have an HTML document and I'd like to tokenize it using spaCy while keeping HTML tags as single tokens. Here's my code:

    import spacy
    from spacy.symbols import ORTH
    nlp = spacy.load('en', vectors=False, parser=False, entity=False)
    nlp.tokenizer.add_special_case(u'<i>', [{ORTH: u'<i>'}])
    nlp.tokenizer.add_special_case(u'</i>', [{ORTH: u'</i>'}])
    doc = nlp('Hello, <i>world</i> !')
    print([e.text for e in doc])

The output is:

    ['Hello', ',', '<', 'i', '>', 'world</i', '>', '!']

If I put spaces …
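One possible workaround, a minimal sketch rather than the question's own fix: split HTML tags out with a regex first, then build a spaCy Doc from the pieces. This assumes a spaCy v2+ install with the en_core_web_sm model (the question's spacy.load('en', ...) call is v1-era).

    import re
    import spacy
    from spacy.tokens import Doc

    nlp = spacy.load("en_core_web_sm")
    TAG_RE = re.compile(r"</?\w+[^>]*>")

    def tokenize_keeping_tags(text):
        words = []
        # re.split with a capturing group keeps the tag matches in the output
        for part in re.split(r"(</?\w+[^>]*>)", text):
            if TAG_RE.fullmatch(part):
                words.append(part)  # keep the whole tag as one token
            elif part.strip():
                words.extend(t.text for t in nlp.tokenizer(part.strip()))
        return Doc(nlp.vocab, words=words)

    print([t.text for t in tokenize_keeping_tags("Hello, <i>world</i> !")])
    # ['Hello', ',', '<i>', 'world', '</i>', '!']

Pre-splitting sidesteps the tokenizer entirely for the tags, which is why the special-case approach (which only applies to whitespace-delimited chunks) is not needed here.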

Python NLTK WUP Similarity Score not unity for exact same word

Submitted by 人走茶凉 on 2021-02-07 12:52:05

Question: Simple code like the following gives a similarity score of 0.75 in both cases. As you can see, both words are exactly the same. To avoid any confusion I also compared a word with itself. The score refuses to budge from 0.75. What is going on here?

    from nltk.corpus import wordnet as wn
    actual = wn.synsets('orange')[0]
    predicted = wn.synsets('orange')[0]
    similarity = actual.wup_similarity(predicted)
    print similarity
    similarity = actual.wup_similarity(actual)
    print similarity

Answer 1: This is an interesting …
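For context, Wu-Palmer similarity is wup(a, b) = 2·depth(lcs(a, b)) / (depth(a) + depth(b)). If an NLTK version resolves the lowest common subsumer of a synset paired with itself to its parent rather than to the synset (or mixes depths from different hypernym paths), an identical pair scores d/(d+1) instead of 1, which is exactly 0.75 when d = 3. This is a plausible reading to verify against your NLTK version's source; a small diagnostic sketch (assumes the WordNet data is downloaded):

    from nltk.corpus import wordnet as wn

    s = wn.synsets('orange')[0]
    print(s, s.definition())  # check which synset [0] actually is

    # A synset can sit at several depths because it can have several
    # hypernym paths, so "depth" in the Wu-Palmer formula is not unique.
    print(s.min_depth(), s.max_depth())

    # Is the lowest common subsumer the synset itself, or a parent?
    print(s.lowest_common_hypernyms(s))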

How to use DBpedia properties to build a topic hierarchy?

Submitted by 佐手、 on 2021-02-07 08:19:59

Question: I am trying to build a topic hierarchy by following the two DBpedia properties mentioned below: the skos:broader property and the dcterms:subject property. My intention is, given a word, to identify its topics. For example, given the term 'support vector machine', I want to identify topics such as classification algorithm, machine learning, etc. However, I am sometimes a bit confused about how to build a topic hierarchy, as I am getting more than 5 URIs for subject and many URIs for broader …
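For reference, a minimal sketch of how these two properties are usually queried together (the SPARQLWrapper package, the public endpoint, and the example resource are assumptions, not from the question):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbr:  <http://dbpedia.org/resource/>
        PREFIX dct:  <http://purl.org/dc/terms/>
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT ?subject ?broader WHERE {
          dbr:Support_vector_machine dct:subject ?subject .
          OPTIONAL { ?subject skos:broader ?broader }
        }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        # each category URI, plus its broader parent category if any
        print(row["subject"]["value"], "->", row.get("broader", {}).get("value"))

Walking skos:broader recursively from each dcterms:subject category yields a hierarchy; some pruning of overly generic categories is usually needed, because the Wikipedia category graph fans out very quickly.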

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

Submitted by 久未见 on 2021-02-07 06:54:32

Question: The following is how I understand the point of parameter sharing in RNNs: in regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) determines the number of parameters to learn. When processing image data, for example, the number of input units is the same across all training examples (usually a constant width × height × number of RGB channels). However, sequential input data like sentences can come in highly …
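A minimal numpy sketch of the sharing itself (sizes are arbitrary): the same three parameter arrays are applied at every time step, so the parameter count is independent of sequence length and no padding to a fixed length is required for the forward pass.

    import numpy as np

    rng = np.random.default_rng(0)
    input_dim, hidden_dim = 4, 8

    # ONE set of parameters, reused at every time step.
    W_x = rng.normal(size=(hidden_dim, input_dim))
    W_h = rng.normal(size=(hidden_dim, hidden_dim))
    b = np.zeros(hidden_dim)

    def rnn_forward(xs):
        h = np.zeros(hidden_dim)
        for x in xs:  # one step per token, for any sequence length
            h = np.tanh(W_x @ x + W_h @ h + b)
        return h

    print(rnn_forward(rng.normal(size=(5, input_dim))).shape)   # length-5 input
    print(rnn_forward(rng.normal(size=(12, input_dim))).shape)  # length-12 input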

Finding conditional probability of trigram in python nltk

Submitted by 你离开我真会死。 on 2021-02-07 06:26:05

Question: I have started learning NLTK and I am following a tutorial from here, where they find conditional probability using bigrams like this:

    import nltk
    from nltk.corpus import brown
    cfreq_brown_2gram = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

However, I want to find conditional probability using trigrams. When I try to change nltk.bigrams to nltk.trigrams I get the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "home/env/local/lib/python2.7 …
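The error arises because ConditionalFreqDist unpacks each item as a (condition, sample) pair, while nltk.trigrams yields 3-tuples. A common fix, sketched below assuming the Brown corpus is downloaded: condition on the first two words.

    import nltk
    from nltk.corpus import brown

    trigrams = nltk.trigrams(brown.words())
    cfd = nltk.ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in trigrams)

    # Turn the counts into a conditional probability estimate P(w3 | w1, w2).
    cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)
    print(cpd[("in", "the")].prob("morning"))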

FastText - Cannot load model.bin because the C++ extension failed to allocate memory

Submitted by 随声附和 on 2021-02-07 05:58:07

Question: I'm trying to use the FastText Python API (https://pypi.python.org/pypi/fasttext). However, from what I've read, this API can't load the newer .bin model files from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, as suggested in https://github.com/salestock/fastText.py/issues/115. I've tried everything that is suggested in that issue, and furthermore https://github.com/Kyubyong/wordvectors doesn't have the .bin for English, otherwise the problem would be solved. Does …
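One workaround often suggested elsewhere (an assumption, not confirmed by this thread): load the pretrained .bin with gensim (3.8+) instead of the salestock fasttext package. Note that the pretrained English model is several gigabytes on disk and needs a correspondingly large amount of free RAM, which may be the real constraint behind the allocation failure.

    from gensim.models.fasttext import load_facebook_model

    # Path to the pretrained English model from the fastText repo.
    model = load_facebook_model("wiki.en.bin")
    print(model.wv.most_similar("king")[:3])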