I am using the following Python code, based on BeautifulSoup, to remove HTML elements from a text file:
from bs4 import BeautifulSoup
with open("textFileWithHtml.txt")
The glob module lets you list all the files in a directory that match a pattern:
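import glob
from bs4 import BeautifulSoup

for path in glob.glob('*.txt'):
    with open(path) as markup:
        soup = BeautifulSoup(markup.read(), 'html.parser')
    # Python 3: write text and let open() do the UTF-8 encoding.
    with open("strip_" + path, "w", encoding="utf-8") as f:
        f.write(soup.get_text())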
If you also want to do that for every subfolder recursively, check out os.walk.
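As a rough sketch of the recursive variant, os.walk can supply the paths and the same stripping step can run on each one. The helper names iter_txt_files, strip_file and strip_tree are made up for this example, and the choice of 'html.parser' and of skipping files that already start with "strip_" are assumptions, not something the question specifies:

```python
import os


def iter_txt_files(root):
    """Yield the path of every .txt file under root, recursing into subfolders."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Assumption: skip earlier output so a second run does not strip it again.
            if name.endswith(".txt") and not name.startswith("strip_"):
                yield os.path.join(dirpath, name)


def strip_file(path):
    """Write the text content of the HTML in path to a strip_ file beside it."""
    # Imported here so iter_txt_files can be reused without bs4 installed.
    from bs4 import BeautifulSoup

    with open(path) as markup:
        soup = BeautifulSoup(markup.read(), "html.parser")
    # Prefix the file name, not the whole path, so the output stays in the same folder.
    folder, name = os.path.split(path)
    out_path = os.path.join(folder, "strip_" + name)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(soup.get_text())
    return out_path


def strip_tree(root="."):
    """Strip every .txt file under root and return the list of files written."""
    return [strip_file(path) for path in iter_txt_files(root)]
```

Calling strip_tree('.') from the top-level folder would then process that folder and everything below it.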
I would leave that work to the shell: simply replace the hardcoded input file with file names taken from the command line
(sys.argv), and invoke the script inside a loop or with a wildcard pattern that matches many files, like:
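from bs4 import BeautifulSoup
import os
import sys

# Every command-line argument is treated as a file to strip.
for fi in sys.argv[1:]:
    with open(fi) as markup:
        soup = BeautifulSoup(markup.read(), "html.parser")
    # Prefix the file name, not the whole path, so "dir/a.txt" becomes "dir/strip_a.txt".
    folder, name = os.path.split(fi)
    # Python 3: write text and let open() do the UTF-8 encoding.
    with open(os.path.join(folder, "strip_" + name), "w", encoding="utf-8") as f:
        f.write(soup.get_text())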
And run it like:
python script.py *.txt
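One caveat: the wildcard is expanded by the shell, and the Windows cmd.exe prompt passes *.txt to the script unexpanded. A possible workaround, sketched here with a hypothetical helper called expand_args, is to run each argument through glob inside the script and fall back to the literal argument when nothing matches:

```python
import glob
import sys


def expand_args(args):
    """Expand wildcard patterns in args; keep arguments that match nothing as-is."""
    paths = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        # A plain file name matches itself; an unmatched argument is passed through
        # so the later open() call reports the missing file.
        paths.extend(matches if matches else [arg])
    return paths


# In the script, loop over this list instead of sys.argv[1:].
files = expand_args(sys.argv[1:])
```

On shells that already expand the pattern, each argument is an existing file name and comes back unchanged.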