Removing stop words without using the NLTK corpus

倖福魔咒の posted on 2019-12-12 04:56:18

Question


I am trying to remove stop words from a text file without using nltk. I have three text files: f1, f2, and f3. f1 contains text, line by line; f2 contains the list of stop words; f3 is an empty file. I want to read f1 line by line, and each line word by word, and check whether each word is in f2 (the stop words). If the word is not a stop word, it should be written to f3. In the end, f3 should contain the same text as f1, line for line, but with the words listed in f2 removed.

f1 = open("file1.txt","r")
f2 = open("stop.txt","r")
f3 = open("file2.txt","w")

for line in f1:
    words = line.split()
    for word in words:
        t=word

for line in f2:
    w = line.split()
    for word in w:
        t1=w
        if t!=t1:
            f3.write(word)

f1.close()
f2.close()
f3.close()

This code is wrong. Can anyone change it so that it performs this task?

Thanks in advance.


Answer 1:


You can use sed on Linux to remove the stop words. This relies on bash process substitution and on GNU sed's \< and \> word-boundary anchors:

sed -f <(sed 's/.*/s|\\\<&\\\>||g/' stopwords.txt) all_lo.txt > all_remove1.txt
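For comparison, roughly the same whole-word deletion can be sketched in Python with the standard re module. This is only an illustration: the function name remove_stop_words_regex and the sample sentence are made up for this example, and \b is used as a stand-in for sed's \< and \> anchors.

```python
import re


def remove_stop_words_regex(text, stop_words):
    """Delete every whole-word occurrence of a stop word from text."""
    if not stop_words:
        return text
    # re.escape guards against stop words that contain regex metacharacters.
    # \b matches at a word boundary, similar to sed's \< and \> anchors.
    alternatives = "|".join(re.escape(word) for word in stop_words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub("", text)


cleaned = remove_stop_words_regex("the cat sat on the mat", ["the", "on"])
print(cleaned.split())  # ['cat', 'sat', 'mat']
```

Like the sed command, this deletes only the words themselves, so the surrounding spaces stay in the output unless you normalise them afterwards.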



Answer 2:


What I would personally do is loop through the list of stop words (f2) and append each word to a list in your script. Then, while reading f1, write out only the words that are not in that list. For example:

stoplist = []
file1 = open('f1.txt', 'r')
file2 = open('f2.txt', 'r')
file3 = open('f3.txt', 'a')  # append mode: writes go to the end of the file

# Collect every stop word from f2 into a list
for line in file2:
    for word in line.split():
        stoplist.append(word)

# Copy f1 to f3 line by line, skipping the stop words
for line in file1:
    kept = [word for word in line.split() if word not in stoplist]
    file3.write(' '.join(kept) + '\n')

file1.close()
file2.close()
file3.close()
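A variation on the same idea, as a minimal sketch: keep the stop words in a set (membership tests are faster than on a list) and let with blocks close the files. The helper names load_stop_words, strip_stop_words and filter_file are hypothetical, chosen only for this example.

```python
def load_stop_words(lines):
    """Return the set of all whitespace-separated words in the given lines."""
    return {word for line in lines for word in line.split()}


def strip_stop_words(lines, stop_words):
    """Yield each input line with its stop words removed."""
    for line in lines:
        yield " ".join(w for w in line.split() if w not in stop_words)


def filter_file(text_path, stop_path, out_path):
    """Write text_path to out_path, line for line, without the stop words."""
    with open(stop_path) as stop_file:
        stop_words = load_stop_words(stop_file)
    # Both files are closed automatically when the with block ends.
    with open(text_path) as src, open(out_path, "w") as dst:
        for cleaned in strip_stop_words(src, stop_words):
            dst.write(cleaned + "\n")
```

Calling filter_file('f1.txt', 'f2.txt', 'f3.txt') would process the three files from the question. Note that the comparison is case-sensitive and that punctuation stays attached to words, so "The" or "the," would not match a stop word "the".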



Answer 3:


Your first for loop is wrong: after for word in words: t=word, t does not hold all the words, only the last one. words is already a list, so you can work with it directly. Also, if your files contain multiple lines, you have to collect the words from every line, otherwise your list will not contain all of them. You can do it like this:

f1 = open("a.txt", "r")
f2 = open("b.txt", "r")
f3 = open("c.txt", "w")
first_words = []
second_words = []

# Collect every word of the text, across all lines
for line in f1:
    for w in line.split():
        first_words.append(w)

# Collect every stop word
for line in f2:
    for i in line.split():
        second_words.append(i)

# Build a new list instead of calling remove() on first_words while
# iterating over it, which would skip elements
kept_words = [word for word in first_words if word not in second_words]

for word in kept_words:
    f3.write(word)
    f3.write(' ')

f1.close()
f2.close()
f3.close()
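A general pitfall when deleting matches from a list: calling remove() on a list while a for loop is iterating over that same list shifts the remaining items, so the loop skips the element right after each removal and adjacent stop words can survive. The small sketch below uses made-up data and hypothetical function names to show the difference.

```python
def remove_in_place(words, stop_words):
    """Buggy on purpose: mutates the list it is iterating over."""
    words = list(words)  # copy so the caller's list is left untouched
    for word in words:
        if word in stop_words:
            words.remove(word)  # shifts later items left; the next one is skipped
    return words


def remove_by_filtering(words, stop_words):
    """Builds a new list, so every element is examined exactly once."""
    return [word for word in words if word not in stop_words]


sample = ["the", "the", "cat", "a", "an", "hat"]
stops = {"the", "a", "an"}
print(remove_in_place(sample, stops))      # ['the', 'cat', 'an', 'hat']
print(remove_by_filtering(sample, stops))  # ['cat', 'hat']
```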


Source: https://stackoverflow.com/questions/24593068/removing-stop-words-without-using-nltk-corpus
