Python - beautifulsoup, apply in every text file in folder and produce new text file

暖寄归人 2021-01-21 18:15

I am using the following Python BeautifulSoup code to remove HTML elements from a text file:

from bs4 import BeautifulSoup

with open("textFileWithHtml.txt")


        
2 Answers
  • 2021-01-21 18:44

    The glob module lets you list all the files in a directory:

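    import glob
    from bs4 import BeautifulSoup

    for path in glob.glob('*.txt'):
        # Read bytes so BeautifulSoup can detect the input encoding itself
        with open(path, 'rb') as markup:
            soup = BeautifulSoup(markup.read(), 'html.parser')

        # Text mode with an explicit encoding; writing .encode() bytes fails on Python 3
        with open("strip_" + path, "w", encoding='utf-8') as f:
            f.write(soup.get_text())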
    

    If you also want to do that recursively for every subfolder, check out os.walk.
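
    As a rough sketch of the recursive variant, os.walk can visit every subfolder and write a strip_ copy next to each .txt file. The names strip_tree, html_to_text, convert and prefix are made up for this illustration, and skipping files that already start with strip_ is an assumption so that a second run does not process its own output.

```python
import os


def html_to_text(raw):
    # Imported here so the directory walk below can be reused without bs4
    from bs4 import BeautifulSoup
    return BeautifulSoup(raw, "html.parser").get_text()


def strip_tree(root, convert=html_to_text, prefix="strip_"):
    """Walk root recursively and write a prefixed, converted copy of each .txt file."""
    written = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: only .txt files are inputs, and earlier outputs are skipped
            if not name.endswith(".txt") or name.startswith(prefix):
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, prefix + name)
            # Read bytes so the converter can deal with the input encoding
            with open(src, "rb") as markup:
                text = convert(markup.read())
            with open(dst, "w", encoding="utf-8") as f:
                f.write(text)
            written.append(dst)
    return written
```

    Calling strip_tree('.') would process the current directory and everything below it, and return the list of files it wrote.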

  • 2021-01-21 18:45

    I would leave that work to the shell: replace the hardcoded input file with file names passed in from outside through the argv array, and invoke the script inside a loop or with a wildcard pattern that matches many files, like:

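    from bs4 import BeautifulSoup
    import sys

    # File names come from the command line, e.g. expanded by the shell from *.txt
    for fi in sys.argv[1:]:
        # Read bytes so BeautifulSoup can detect the input encoding itself
        with open(fi, 'rb') as markup:
            soup = BeautifulSoup(markup.read(), 'html.parser')

        # Text mode with an explicit encoding; writing .encode() bytes fails on Python 3
        with open("strip_" + fi, "w", encoding='utf-8') as f:
            f.write(soup.get_text())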
    

    And run it like:

    python script.py *.txt
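
    One caveat: Unix-like shells expand *.txt before Python starts, but Windows cmd.exe passes the pattern through literally. A minimal sketch that covers both cases by expanding patterns inside the script; expand_args is a hypothetical helper name, not part of any library:

```python
import glob
import sys


def expand_args(args):
    """Expand wildcard patterns ourselves; pass plain file names through."""
    paths = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        # If nothing matches, keep the argument so a later open() reports the missing file
        paths.extend(matches if matches else [arg])
    return paths


if __name__ == "__main__":
    # Print the files that would be processed
    for fi in expand_args(sys.argv[1:]):
        print(fi)
```

    The loop over sys.argv[1:] in the script above could then iterate over expand_args(sys.argv[1:]) instead.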
    