Open every file/subfolder in directory and print results to .txt file

Posted by ♀尐吖头ヾ on 2020-01-24 19:15:20

Question


At the moment I am working with this code:

from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib


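@contextlib.contextmanager
def stdout2file(fname):
    # Temporarily send everything written via print() to the file fname.
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    try:
        yield
    finally:
        # Restore stdout and close the file even if the body raises.
        sys.stdout = sys.__stdout__
        f.close()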

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    with stdout2file("output.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
                soup = BeautifulSoup(contents, "html.parser")
                for item in soup.findAll("ix:nonfraction"):
                    if re.match(".*AuditFeesExpenses", item['name']):
                        print(file.split(os.path.sep)[-1], end="| ")
                        print(item['name'], end="| ")
                        print(item.get_text())
trade_spider()

So far this works perfectly, but now I am stuck on another issue. If I search a folder that contains only files and no subfolders, this works without problems. However, if I run this code on a folder that has subfolders, it doesn't work (it prints nothing!). Furthermore, I would like to have my results printed into a .txt file without the whole path in it. The result should look like:

Filename.html| RegEX Match| HTML text

I already get this result, but only in PyCharm and not in a separate .txt file.

To sum up, I have 2 questions:

  1. How can I also walk through subfolders in my defined Directory? -> would os.walk() be an option for that?
  2. How can I print my results into a .txt file? -> would sys.stdout work on that?

Any help on this issue is appreciated!

UPDATE: It only prints the results of the first file into my "output.txt" file (at least I think it is the first, as it is the last file in my only subfolder and recursive=True is set). Any idea why it is not looping through all the other files?

UPDATE_2: Question resolved! Final Code can be seen above!


Answer 1:


For walking through subdirectories, there are two options:

  1. Use ** with glob and the argument recursive=True (glob.glob('**/*.html', recursive=True)). This only works in Python 3.5+. I would also recommend using glob.iglob instead of glob.glob if the directory tree is large.

  2. Use os.walk and check the filenames (whether they end in ".html") manually or with fnmatch.filter.
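
The second option can be sketched roughly as follows. This is only a minimal illustration, not the asker's final code: the function name find_html_files is made up for the example, and it assumes that matching on the "*.html" pattern is all the filtering you need.

```python
import fnmatch
import os


def find_html_files(root):
    """Yield the path of every .html file below root, including subfolders.

    find_html_files is a hypothetical helper name used for illustration.
    """
    # os.walk visits root and every directory beneath it, one at a time.
    for dirpath, dirnames, filenames in os.walk(root):
        # fnmatch.filter keeps only the names that match the pattern.
        for name in fnmatch.filter(filenames, "*.html"):
            yield os.path.join(dirpath, name)
```

Each yielded path can be passed to open() as in the question, and os.path.basename(path) gives the file name without the directory part.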


Regarding printing into a file, there are again several ways:

  1. Just execute the script and redirect stdout, i.e. python3 myscript.py >myfile.txt

  2. Replace calls to print with a call to the .write() method of a file object in write mode.

  3. Keep using print, but give it the argument file=myfile where myfile is again a writable file object.
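
As a rough sketch of the third option: the helper name write_report and the shape of rows (tuples of file name, tag name and text) are assumptions made for this example, not part of the original code.

```python
def write_report(rows, out_path):
    """Write one 'filename| name| text' line per row to out_path.

    write_report and the (filename, tag_name, text) row layout are
    hypothetical, chosen only to illustrate print(..., file=...).
    """
    with open(out_path, "w", encoding="utf8") as out:
        for filename, tag_name, text in rows:
            # file=out sends this line to the file instead of the console.
            print(filename, tag_name, text, sep="| ", file=out)
```

Because the with block closes the file, nothing else has to be cleaned up, and sys.stdout is never touched.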

edit: Maybe the most unobtrusive method would be the following. First, include this somewhere:

import contextlib
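@contextlib.contextmanager
def stdout2file(fname):
    # Temporarily send everything written via print() to the file fname.
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    try:
        yield
    finally:
        # Restore stdout and close the file even if the body raises.
        sys.stdout = sys.__stdout__
        f.close()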

And then, in front of the line in which you loop over the files, add this line (and indent the loop accordingly):

with stdout2file("output.txt"):
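
As a side note, the standard library has shipped contextlib.redirect_stdout since Python 3.4, which does much the same job as a hand-written context manager. A minimal sketch, where run_with_stdout_in_file is a hypothetical wrapper name and func stands for whatever code does the printing:

```python
import contextlib


def run_with_stdout_in_file(fname, func):
    """Call func() while everything it prints goes into the file fname.

    run_with_stdout_in_file is a hypothetical name used for illustration.
    """
    # redirect_stdout restores the previous sys.stdout on exit, even on
    # errors, and the enclosing open() closes the file afterwards.
    with open(fname, "w", encoding="utf8") as f, contextlib.redirect_stdout(f):
        func()
```

With the code from the question, a call could look like run_with_stdout_in_file("output.txt", trade_spider), assuming trade_spider no longer does its own redirection.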


Source: https://stackoverflow.com/questions/37172980/open-every-file-subfolder-in-directory-and-print-results-to-txt-file
