Create a Corpus Containing the Vocabulary of Words

强颜欢笑 submitted on 2019-12-11 17:28:41

Question


I am calculating the inverse document frequency (inverse_document_frequency) for all the words in my documents dictionary, and I have to show the top 5 documents ranked by their score on queries. But I am stuck in the loops that create a corpus containing the vocabulary of words in the documents. Please help me improve my code. This block of code reads a file and removes punctuation and stop words from it:

from string import punctuation  # assumption: punctuation is string.punctuation

def wordList(doc):
    """
    1: Remove Punctuation
    2: Remove Stop Words
    3: return List of Words
    """
    # english_stopwords is defined elsewhere (not shown in the question)
    with open("C:\\Users\\Zed\\PycharmProjects\\ACL txt\\" + doc, 'r', encoding="utf8", errors='ignore') as file:
        text = file.read().strip()
    nopunc = [char for char in text if char not in punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in english_stopwords]

This block of code stores the names of all the files in my folder:

from pathlib import Path

file_names = []
for file in Path("ACL txt").rglob("*.txt"):
    file_names.append(file.name)

This block of code creates the dictionary of documents I am working on:

documents = {}
for i in file_names:
    documents[i] = wordList(i)

The code above works well and runs fast, but the following block takes a very long time to create the corpus. How can I improve it?

# create a corpus containing the vocabulary of words in the documents
corpus = []  # a list that will store words of the vocabulary
for doc in documents.values():  # iterate through documents
    for word in doc:  # go through each word in the current doc
        if word not in corpus:
            corpus.append(word)  # add word in corpus if not already added

This code creates a dictionary that stores the document frequency of each word in the corpus:

df_corpus = {}  # document frequency for every word in corpus
for word in corpus:
    k = 0  # initial document frequency set to 0
    for doc in documents.values():  # iterate through documents
        if word in doc:  # doc is already a list of words, so no split() is needed
            k += 1
    df_corpus[word] = k

It has been creating the corpus for 2 hours and is still running. Please help me improve my code. This is the data set I am working with: https://drive.google.com/open?id=1D1GjN_JTGNBv9rPNcWJMeLB_viy9pCfJ


Answer 1:


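How about making corpus a set instead of a list? Checking "word in corpus" on a list scans the whole list every time, while a set ignores duplicates on its own, so you won't need the additional if either.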

corpus = set()  # a set that will store words of the vocabulary
for doc in documents.values():  # iterate through documents
    corpus.update(doc)  # add the document's words; duplicates are ignored
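
The same idea probably helps with the document-frequency loop from the question, which scans every document once per vocabulary word. Here is a minimal sketch, assuming documents maps each file name to a list of words as in the question. The toy documents dict and the log(N / df) form of IDF are assumptions for illustration and do not come from the original post; other IDF variants add smoothing.

```python
import math
from collections import Counter

# Hypothetical stand-in for the `documents` dict built in the question:
# file name -> list of cleaned words.
documents = {
    "a.txt": ["corpus", "word", "word", "vocabulary"],
    "b.txt": ["corpus", "frequency"],
    "c.txt": ["corpus", "word"],
}

# Document frequency: count each word at most once per document
# by deduplicating the word list with set() before counting.
df_corpus = Counter()
for words in documents.values():
    df_corpus.update(set(words))

# The vocabulary is simply the set of keys seen by the counter.
corpus = set(df_corpus)

# One common IDF variant (an assumption here): log(N / df).
n_docs = len(documents)
idf = {word: math.log(n_docs / df) for word, df in df_corpus.items()}
```

This makes a single pass over each document instead of one pass per vocabulary word, so the work grows with the total number of words rather than with vocabulary size times the number of documents.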


Source: https://stackoverflow.com/questions/57983960/create-a-corpus-containing-the-vocabulary-of-words
