nlp

Python: parse text from multiple txt files

Submitted by 无人久伴 on 2021-02-07 05:08:41

Question: Seeking advice on how to mine items from multiple text files to build a dictionary. This text file (https://pastebin.com/Npcp3HCM) was manually transformed into this required data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view There are thousands of such text files, and they may have different section headings, as shown in these examples: https://pastebin.com/wWSPGaLX https://pastebin.com/9Up4RWHu I started off by reading the files: from glob import glob; txtPth = '..

Add Dropout after loading the weights in Keras

Submitted by 余生长醉 on 2021-02-07 04:37:50

Question: I am doing a kind of transfer learning. What I have done is: first, train the model with a big dataset and save the weights; then train the model on my own dataset with the layers frozen. But I saw some overfitting, so I am trying to change the model's dropout and then load the weights, since the numbers change while the dropout changes. I am finding it difficult to change the dropout. My question, directly: is it possible to change the model's dropout while loading the weights? My
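In Keras specifically, Dropout layers carry no trainable weights, so a common route (hedged, since the asker's model code is not shown) is to rebuild the model with the new dropout rate and call `model.load_weights(...)` on it; the saved weights still fit because the set of weighted layers is unchanged. A framework-agnostic sketch of that invariant, using plain tuples in place of Keras layers (all layer names and shapes here are made up):

```python
def build_model(dropout_rate):
    """Hypothetical model description. Only layers with weights contribute
    to the saved weight set; Dropout has no weights, so changing its rate
    does not change which weights exist."""
    layers = [
        ("dense_1", {"kernel_shape": (784, 128)}),
        ("dropout_1", {"rate": dropout_rate}),   # no trainable weights
        ("dense_2", {"kernel_shape": (128, 10)}),
    ]
    weight_keys = [name for name, cfg in layers if "kernel_shape" in cfg]
    return layers, weight_keys

# Two models that differ only in dropout rate expose the same weight set,
# so weights saved from one can be loaded into the other.
_, keys_low = build_model(0.2)
_, keys_high = build_model(0.5)
assert keys_low == keys_high
```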

Difference between FastText .vec and .bin files

Submitted by 孤人 on 2021-02-06 09:45:11

Question: I recently downloaded the FastText pretrained model for English and got two files: wiki.en.vec and wiki.en.bin. I am not sure what the difference is between the two files. Answer 1: The .vec file contains only the aggregated word vectors, in plain text. The .bin file additionally contains the model parameters and, crucially, the vectors for all the n-grams. So if you want to encode words you did not train with using those n-grams (FastText's famous "subword information"), you need to find an API that can
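For the plain-text .vec file the answer mentions, the format is simple enough to parse by hand: a header line with the vocabulary size and dimension, then one whitespace-separated line per word. A minimal reader sketch (no FastText dependency; assumes the standard .vec layout, and `load_vec` is a hypothetical helper name):

```python
def load_vec(lines):
    """Parse FastText .vec format: a 'count dim' header line, then one
    'word v1 v2 ... vdim' line per word. Returns {word: [floats]}."""
    it = iter(lines)
    count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, vals = parts[0], [float(x) for x in parts[1:]]
        assert len(vals) == dim
        vectors[word] = vals
    assert len(vectors) == count
    return vectors

# usage: vectors = load_vec(open("wiki.en.vec", encoding="utf-8"))
```

The .bin file, by contrast, is a binary format holding the full model (including n-gram vectors), and is meant to be read through FastText itself rather than parsed directly.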

How to extract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python gensim?

Submitted by 荒凉一梦 on 2021-02-06 09:26:09

Question: LDA original output (unigram topics): topic1: scuba, water, vapor, diving; topic2: dioxide, plants, green, carbon. Required output (bigram topics): topic1: scuba diving, water vapor; topic2: green plants, carbon dioxide. Any idea? Answer 1: Given a dict called docs containing lists of words from documents, I can turn it into an array of words + bigrams (or also trigrams etc.) using nltk.util.ngrams or my own function, like this: from nltk.util import ngrams; for doc in docs: docs[doc] = docs[doc] + ["_".join(w
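The answer's truncated snippet joins each n-gram with an underscore and appends it to the document's token list, so the joined bigrams become ordinary "words" that LDA can pick up as topic terms. A stdlib-only equivalent of that step (no nltk needed; `add_ngrams` is a hypothetical helper name standing in for `nltk.util.ngrams`):

```python
def add_ngrams(tokens, n=2, sep="_"):
    """Append underscore-joined n-grams to a token list, mirroring the
    answer's nltk.util.ngrams approach with a zip-based equivalent."""
    grams = [sep.join(g) for g in zip(*(tokens[i:] for i in range(n)))]
    return tokens + grams

docs = {"d1": ["scuba", "diving", "water"]}
for doc in docs:
    docs[doc] = add_ngrams(docs[doc])
```

The augmented token lists are then fed to the dictionary/corpus construction as usual; the LDA topics can contain terms like "scuba_diving" alongside unigrams.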

Hierarchical Dirichlet Process: gensim topic number independent of corpus size

Submitted by 我的梦境 on 2021-02-06 02:35:47

Question: I am using the gensim HDP module on a set of documents:

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17

Why is the number of topics independent of corpus length? Answer 1: @user3907335 is exactly correct here: HDP will calculate as many topics as the
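The truncated answer is pointing at HDP's truncation level: gensim's HdpModel allocates topics up to a fixed truncation limit (150 by default), regardless of corpus size, and many of those topics carry negligible weight. A common follow-up is to rank topics by weight and keep only the significant ones; a sketch of that filtering step on plain (topic_id, weight) pairs (the 0.01 threshold is an arbitrary assumption, and the weights would come from, e.g., averaging per-document topic probabilities):

```python
def significant_topics(topic_weights, threshold=0.01):
    """Keep topic ids whose weight exceeds the threshold,
    ordered from heaviest to lightest."""
    ranked = sorted(topic_weights, key=lambda pair: -pair[1])
    return [t for t, w in ranked if w > threshold]

# hypothetical weights for 4 of the 150 allocated topics
weights = [(0, 0.40), (1, 0.30), (2, 0.005), (3, 0.02)]
```

With this kind of filter, the two corpora above would typically yield very different numbers of *significant* topics even though both report 150 allocated ones.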

Twitter sentiment analysis: useful features

Submitted by ◇◆丶佛笑我妖孽 on 2021-02-05 20:39:11

Question: I'm trying to implement sentiment-analysis functionality and am looking for useful features that can be extracted from tweet messages. The features I have in mind for now are: sentiment words, emoticons, exclamation marks, negation words, and intensity words (very, really, etc.). Are there any other useful features for this task? My goal is not only to detect whether a tweet is positive or negative but also to detect the level of positivity or negativity (say, on a scale from 0 to 100). Any inputs
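For surface features like those listed, a simple token-level extractor is enough to get started. A sketch covering several of them (the negation and intensifier word lists are illustrative stand-ins, not curated lexicons, and the emoticon regex only catches a few common faces):

```python
import re

NEGATIONS = {"not", "no", "never"}        # illustrative word lists,
INTENSIFIERS = {"very", "really", "so"}   # not curated lexicons

def tweet_features(text):
    """Count surface cues from one tweet as a feature dict."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return {
        "exclamations": text.count("!"),
        "emoticons": len(re.findall(r"[:;]-?[)(DP]", text)),
        "negations": sum(t in NEGATIONS for t in tokens),
        "intensifiers": sum(t in INTENSIFIERS for t in tokens),
        "all_caps": sum(t.isupper() and len(t) > 1 for t in text.split()),
    }
```

For the 0-100 intensity goal, counts like these are usually combined with lexicon scores (e.g. per-word polarity strengths) in a regression model rather than a binary classifier.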

How to restore punctuation using Python? [closed]

Submitted by 半腔热情 on 2021-02-04 21:57:26

Question: I would like to restore commas and full stops in text without punctuation. For example, let's take this sentence: "I am XYZ I want to execute I have a doubt". I would like to detect that there should be 1 comma and 1 full stop in the above example: "I am XYZ,
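A usual approach to this problem (not shown in the truncated question) is to treat it as per-token sequence labeling: convert punctuated training text into (token, following-punctuation) pairs, train a tagger on those labels, then predict the labels on unpunctuated input. A sketch of the model-agnostic data-preparation step (`to_labeled_tokens` is a hypothetical helper name):

```python
def to_labeled_tokens(punctuated):
    """Convert punctuated text into (token, label) pairs, where the label
    is the punctuation mark that follows the token ('O' for none)."""
    pairs = []
    for raw in punctuated.split():
        if raw and raw[-1] in ",.?!":
            pairs.append((raw[:-1], raw[-1]))
        else:
            pairs.append((raw, "O"))
    return pairs
```

Pairs like these can feed any sequence model; restoring punctuation then amounts to re-inserting each predicted non-'O' label after its token.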