nlp

Python: parse text from multiple txt files

Submitted by 无人久伴 on 2021-02-07 05:08:41

Question: Seeking advice on how to mine items from multiple text files to build a dictionary. This text file (https://pastebin.com/Npcp3HCM) was manually transformed into this required data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view There are thousands of such text files, and they may have different section headings, as shown in these examples: https://pastebin.com/wWSPGaLX https://pastebin.com/9Up4RWHu I started off by reading the files: from glob import glob; txtPth = '..

Add Dropout after loading the weights in Keras

Submitted by 余生长醉 on 2021-02-07 04:37:50

Question: I am doing a kind of transfer learning. What I have done is: first, train the model with a big dataset and save the weights; then train the model on my own dataset with the layers frozen. But I saw some overfitting, so I am trying to change the model's dropout and then load the weights, since the numbers change while the dropout changes. I am finding it difficult to change the dropout. My question, directly: is it possible to change the model's dropout while loading the weights? My
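In Keras specifically, Dropout layers carry no trainable weights, so a common route (hedged, since the asker's model code is not shown) is to rebuild the model with the new dropout rate and call `model.load_weights(...)` on it; the saved weights still fit because the set of weighted layers is unchanged. A framework-agnostic sketch of that invariant, using plain tuples in place of Keras layers (all layer names and shapes here are made up):

```python
def build_model(dropout_rate):
    """Hypothetical model description. Only layers with weights contribute
    to the saved weight set; Dropout has no weights, so changing its rate
    does not change which weights exist."""
    layers = [
        ("dense_1", {"kernel_shape": (784, 128)}),
        ("dropout_1", {"rate": dropout_rate}),   # no trainable weights
        ("dense_2", {"kernel_shape": (128, 10)}),
    ]
    weight_keys = [name for name, cfg in layers if "kernel_shape" in cfg]
    return layers, weight_keys

# Two models that differ only in dropout rate expose the same weight set,
# so weights saved from one can be loaded into the other.
_, keys_low = build_model(0.2)
_, keys_high = build_model(0.5)
assert keys_low == keys_high
```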

Difference between FastText .vec and .bin files

Submitted by 孤人 on 2021-02-06 09:45:11

Question: I recently downloaded the FastText pretrained model for English and got two files: wiki.en.vec and wiki.en.bin. I am not sure what the difference is between the two files. Answer 1: The .vec file contains only the aggregated word vectors, in plain text. The .bin file additionally contains the model parameters and, crucially, the vectors for all the n-grams. So if you want to encode words you did not train with using those n-grams (FastText's famous "subword information"), you need to find an API that can
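For the plain-text .vec file the answer mentions, the format is simple enough to parse by hand: a header line with the vocabulary size and dimension, then one whitespace-separated line per word. A minimal reader sketch (no FastText dependency; assumes the standard .vec layout, and `load_vec` is a hypothetical helper name):

```python
def load_vec(lines):
    """Parse FastText .vec format: a 'count dim' header line, then one
    'word v1 v2 ... vdim' line per word. Returns {word: [floats]}."""
    it = iter(lines)
    count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, vals = parts[0], [float(x) for x in parts[1:]]
        assert len(vals) == dim
        vectors[word] = vals
    assert len(vectors) == count
    return vectors

# usage: vectors = load_vec(open("wiki.en.vec", encoding="utf-8"))
```

The .bin file, by contrast, is a binary format holding the full model (including n-gram vectors), and is meant to be read through FastText itself rather than parsed directly.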

How to extract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python gensim?

Submitted by 荒凉一梦 on 2021-02-06 09:26:09

Question: LDA original output (unigram topics): topic1: scuba, water, vapor, diving; topic2: dioxide, plants, green, carbon. Required output (bigram topics): topic1: scuba diving, water vapor; topic2: green plants, carbon dioxide. Any idea? Answer 1: Given a dict called docs containing lists of words from documents, I can turn it into an array of words + bigrams (or also trigrams etc.) using nltk.util.ngrams or my own function, like this: from nltk.util import ngrams; for doc in docs: docs[doc] = docs[doc] + ["_".join(w
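The answer's truncated snippet joins each n-gram with an underscore and appends it to the document's token list, so the joined bigrams become ordinary "words" that LDA can pick up as topic terms. A stdlib-only equivalent of that step (no nltk needed; `add_ngrams` is a hypothetical helper name standing in for `nltk.util.ngrams`):

```python
def add_ngrams(tokens, n=2, sep="_"):
    """Append underscore-joined n-grams to a token list, mirroring the
    answer's nltk.util.ngrams approach with a zip-based equivalent."""
    grams = [sep.join(g) for g in zip(*(tokens[i:] for i in range(n)))]
    return tokens + grams

docs = {"d1": ["scuba", "diving", "water"]}
for doc in docs:
    docs[doc] = add_ngrams(docs[doc])
```

The augmented token lists are then fed to the dictionary/corpus construction as usual; the LDA topics can contain terms like "scuba_diving" alongside unigrams.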

Hierarchical Dirichlet Process: gensim topic number independent of corpus size

Submitted by 我的梦境 on 2021-02-06 02:35:47

Question: I am using the gensim HDP module on a set of documents:

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17

Why is the number of topics independent of corpus length? Answer 1: @user3907335 is exactly correct here: HDP will calculate as many topics as the
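The truncated answer is pointing at HDP's truncation level: gensim's HdpModel allocates topics up to a fixed truncation limit (150 by default), regardless of corpus size, and many of those topics carry negligible weight. A common follow-up is to rank topics by weight and keep only the significant ones; a sketch of that filtering step on plain (topic_id, weight) pairs (the 0.01 threshold is an arbitrary assumption, and the weights would come from, e.g., averaging per-document topic probabilities):

```python
def significant_topics(topic_weights, threshold=0.01):
    """Keep topic ids whose weight exceeds the threshold,
    ordered from heaviest to lightest."""
    ranked = sorted(topic_weights, key=lambda pair: -pair[1])
    return [t for t, w in ranked if w > threshold]

# hypothetical weights for 4 of the 150 allocated topics
weights = [(0, 0.40), (1, 0.30), (2, 0.005), (3, 0.02)]
```

With this kind of filter, the two corpora above would typically yield very different numbers of *significant* topics even though both report 150 allocated ones.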

Twitter sentiment analysis: useful features

Submitted by ◇◆丶佛笑我妖孽 on 2021-02-05 20:39:11

Question: I'm trying to implement sentiment-analysis functionality and am looking for useful features that can be extracted from tweet messages. The features I have in mind for now are: sentiment words, emoticons, exclamation marks, negation words, and intensity words (very, really, etc.). Are there any other useful features for this task? My goal is not only to detect whether a tweet is positive or negative but also to detect the level of positivity or negativity (say, on a scale from 0 to 100). Any inputs
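For surface features like those listed, a simple token-level extractor is enough to get started. A sketch covering several of them (the negation and intensifier word lists are illustrative stand-ins, not curated lexicons, and the emoticon regex only catches a few common faces):

```python
import re

NEGATIONS = {"not", "no", "never"}        # illustrative word lists,
INTENSIFIERS = {"very", "really", "so"}   # not curated lexicons

def tweet_features(text):
    """Count surface cues from one tweet as a feature dict."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return {
        "exclamations": text.count("!"),
        "emoticons": len(re.findall(r"[:;]-?[)(DP]", text)),
        "negations": sum(t in NEGATIONS for t in tokens),
        "intensifiers": sum(t in INTENSIFIERS for t in tokens),
        "all_caps": sum(t.isupper() and len(t) > 1 for t in text.split()),
    }
```

For the 0-100 intensity goal, counts like these are usually combined with lexicon scores (e.g. per-word polarity strengths) in a regression model rather than a binary classifier.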

How to restore punctuation using Python? [closed]

Submitted by 半腔热情 on 2021-02-04 21:57:26

Question: I would like to restore commas and full stops in text without punctuation. For example, let's take this sentence: "I am XYZ I want to execute I have a doubt". I would like to detect that there should be 1 comma and 1 full stop in the above example: "I am XYZ,
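A usual approach to this problem (not shown in the truncated question) is to treat it as per-token sequence labeling: convert punctuated training text into (token, following-punctuation) pairs, train a tagger on those labels, then predict the labels on unpunctuated input. A sketch of the model-agnostic data-preparation step (`to_labeled_tokens` is a hypothetical helper name):

```python
def to_labeled_tokens(punctuated):
    """Convert punctuated text into (token, label) pairs, where the label
    is the punctuation mark that follows the token ('O' for none)."""
    pairs = []
    for raw in punctuated.split():
        if raw and raw[-1] in ",.?!":
            pairs.append((raw[:-1], raw[-1]))
        else:
            pairs.append((raw, "O"))
    return pairs
```

Pairs like these can feed any sequence model; restoring punctuation then amounts to re-inserting each predicted non-'O' label after its token.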