Question
I am calculating the inverse document frequency for every word in my dictionary of documents, and I have to show the top 5 documents ranked by their score against the queries. But my code gets stuck in its loops while creating the corpus that contains the vocabulary of words in the documents. Please help me improve my code. This block of code reads a file and removes punctuation and stop words from it:
def wordList(doc):
    """
    1: Remove Punctuation
    2: Remove Stop Words
    3: return List of Words
    """
    file = open("C:\\Users\\Zed\\PycharmProjects\\ACL txt\\"+doc, 'r', encoding="utf8", errors='ignore')
    text = file.read().strip()
    file.close()
    nopunc=[char for char in text if char not in punctuation]
    nopunc=''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in english_stopwords]
This block of code collects the names of all the files in my folder:
file_names=[]
for file in Path("ACL txt").rglob("*.txt"):
    file_names.append(file.name)
This block of code builds the dictionary of documents I am working with:
documents = {}
for i in file_names:
    documents[i]=wordList(i)
The code above works well and runs fast, but the following block takes a very long time to create the corpus. How can I improve it?
#create a corpus containing the vocabulary of words in the documents
corpus = [] # a list that will store words of the vocabulary
for doc in documents.values(): #iterate through documents
    for word in doc: #go through each word in the current doc
        if not word in corpus:
            corpus.append(word) #add word in corpus if not already added
This code creates a dictionary that stores the document frequency of each word in the corpus:
df_corpus = {} #document frequency for every word in corpus
for word in corpus:
    k = 0 #initial document frequency set to 0
    for doc in documents.values(): #iterate through documents
        if word in doc: #check if word in doc (doc is already a list of words, so no split() is needed)
            k+=1
    df_corpus[word] = k
It has been creating the corpus for 2 hours and is still running. Please help me improve my code. This is the data set I am working with: https://drive.google.com/open?id=1D1GjN_JTGNBv9rPNcWJMeLB_viy9pCfJ
Answer 1:
How about making corpus a set instead of a list? A set checks membership in roughly constant time and ignores duplicates, so you won't need the additional if either.
corpus = set() # a set that will store words of the vocabulary
for doc in documents.values(): #iterate through documents
    corpus.update(doc) #add every word of the doc; words already in the set are ignored
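The document-frequency loop in the question has the same kind of cost, since it rescans every document for every word. A minimal sketch of one way to avoid that, assuming `documents` maps file names to lists of words as in the question: count each word once per document with `collections.Counter`. The helper names `document_frequency` and `inverse_document_frequency`, the toy data, and the plain `log(N / df)` formula are illustrative assumptions, not something from the original post; other IDF variants add smoothing.

```python
import math
from collections import Counter


def document_frequency(documents):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for words in documents.values():
        # set(words) removes repeats, so each document counts a word only once
        df.update(set(words))
    return df


def inverse_document_frequency(df, n_docs):
    """Assumed variant: idf = log(N / df), with no smoothing."""
    return {word: math.log(n_docs / count) for word, count in df.items()}


# Toy data standing in for the real `documents` dictionary (hypothetical)
documents = {
    "a.txt": ["corpus", "word", "corpus"],
    "b.txt": ["word", "set"],
}

df_corpus = document_frequency(documents)
corpus = set(df_corpus)  # the vocabulary is simply the set of counted words
idf = inverse_document_frequency(df_corpus, len(documents))
```

This makes a single pass over the documents, so the work grows with the total number of words rather than with vocabulary size times number of documents.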
Source: https://stackoverflow.com/questions/57983960/create-a-corpus-containing-the-vocabulary-of-words