How to use os.walk to only list text files

南楼画角 submitted on 2019-12-23 01:11:46

Question


This question was similar in that it addressed hidden file types. I am struggling with a related problem: I need to process only files that contain text, in folders holding many different file types (pictures, text, music). I am using os.walk, which lists EVERYTHING, including files without an extension, such as Icon files. I am on Linux and would be satisfied to filter for only txt files. One way is to check the filename extension, and this post explains nicely how that is done.
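
For reference, the extension check mentioned above might look roughly like this. It is a minimal sketch: the function name txt_files and the extensions parameter are made up for illustration and do not come from the linked post.

```python
import os

def txt_files(root, extensions=(".txt",)):
    """Yield paths under root whose names end with one of the given extensions."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Compare in lower case so that NOTES.TXT is matched as well.
            if filename.lower().endswith(extensions):
                yield os.path.join(dirpath, filename)
```

Calling txt_files('.') walks the current directory. Files such as Icon, which have no extension, are skipped whatever they contain.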

But this still leaves mislabeled files and files without an extension. There are hex values that uniquely identify file types, known as magic numbers or file signatures (see here and here). Unfortunately, magic numbers do not exist for text files (see here).
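
Signatures cannot confirm that a file is text, but they can at least rule out some common binary formats, even when the extension is wrong. The sketch below assumes a small hand-picked table; SIGNATURES and known_binary_type are illustrative names, and the table is far from complete.

```python
# A few well-known file signatures; illustrative only, not an exhaustive list.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
}

def known_binary_type(path):
    """Return a label if the file starts with a known signature, else None."""
    with open(path, "rb") as f:
        head = f.read(8)  # the longest signature in the table is 8 bytes
    for signature, label in SIGNATURES.items():
        if head.startswith(signature):
            return label
    return None
```

A result of None only means that no listed signature matched. It does not mean the file is text.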

One strategy that I have come up with is to parse the first bunch of characters and make sure they are words by doing a dictionary lookup (I am only dealing with English texts), then proceed with the full text processing only if that is true. This approach seems rather heavy and expensive (a bunch of dictionary lookups for each file). Another approach is simply to look for the word 'the', which is unlikely to be frequent in a data file but is commonly found in text files. But false negatives would cause me to lose text files for processing. I tried asking Google for the longest text without the word 'the' but had no luck with that.
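
The dictionary idea can be made fairly cheap by reading only the start of each file and testing words against a set held in memory. The following is a rough sketch under stated assumptions: COMMON_WORDS is a tiny stand-in for a real word list, and the sample_size and threshold values are untuned guesses.

```python
import re

# Hypothetical stand-in for a real dictionary: a handful of very common English words.
COMMON_WORDS = frozenset(
    "the of and to a in is that it for was on with as be at by this have from".split()
)

def looks_like_english(path, vocabulary=COMMON_WORDS, sample_size=4096, threshold=0.2):
    """Guess whether the start of a file is English text (illustrative thresholds)."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # Decode leniently so that stray bytes do not raise an exception.
    text = sample.decode("utf-8", errors="ignore").lower()
    words = re.findall(r"[a-z]+", text)
    if not words:
        return False
    # Set membership is a constant-time check on average, so this stays cheap.
    hits = sum(1 for word in words if word in vocabulary)
    return hits / len(words) >= threshold
```

With only function words in the vocabulary, a stockroom inventory would probably be rejected, which is the false-negative risk described above. Passing a full word list as vocabulary would reduce that risk at the cost of loading it once.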

I do not know if this is the appropriate forum for this kind of question; it is almost a question of AI rather than computer science or coding. It is not as difficult as gibberish detection. The texts may not be semantically or syntactically correct: they might just be words, like the inventory of a stockroom, but they might also be prose or poetry. I just do not want to process files that could be byte code, source code, or collections of alphanumeric characters that are not English words.


Answer 1:


You can use Python's mimetypes library to check whether a file is a plaintext file.

import os
import mimetypes

for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        if mimetypes.guess_type(filename)[0] == 'text/plain':
            print(os.path.join(dirpath, filename))

UPDATE: Since the mimetypes library uses only the file extension to determine the type of a file, it is not very reliable here, especially since you mentioned that some files are mislabeled or have no extension.
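
A quick way to see this limitation is to pass guess_type some names that do not exist on disk. The file names below are made up for illustration.

```python
import mimetypes

# guess_type looks only at the name; none of these files needs to exist.
for name in ("notes.txt", "README", "photo.jpg"):
    mime_type, encoding = mimetypes.guess_type(name)
    print(name, "->", mime_type)
```

The name without an extension yields None as its type, so such a file is never reported as text/plain, whatever it contains.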

For those cases you can use the magic library (which is not in the standard library unfortunately).

import os
import magic

mime = magic.Magic(mime=True)
for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        fullpath = os.path.join(dirpath, filename)
        if mime.from_file(fullpath) == 'text/plain':
            print(fullpath)

UPDATE 2: The above solution wouldn't catch files you would otherwise consider "plaintext" (e.g. XML files, source files, etc). The following solution should work in those cases:

import os
import magic

for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        fullpath = os.path.join(dirpath, filename)
        if 'text' in magic.from_file(fullpath):
            print(fullpath)

Let me know if any of these works for you.




Answer 2:


A pretty good heuristic is to look for null bytes at the beginning of the file. Text files don't typically have them, and binary files usually have lots of them. The code below checks that the first 1024 bytes contain no nulls. You can of course adjust how much of the file to read:

#!python3
import os

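def textfiles(root):
    """Yield files under root whose first 1024 bytes contain no null byte."""
    for path, dirs, files in os.walk(root):
        for file in files:
            fullname = os.path.join(path, file)
            with open(fullname, 'rb') as f:
                data = f.read(1024)
            # Iterating over bytes yields ints, so 0 matches a null byte.
            if 0 not in data:
                yield fullname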

for file in textfiles('.'):
    print(file)
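
One caveat with the null-byte heuristic: text saved as UTF-16 contains null bytes for ordinary ASCII characters, so it would be skipped. If such files matter, a variant could accept data that starts with a UTF-16 byte order mark. This is only a sketch; probably_text is a hypothetical helper that takes the bytes already read, and a binary file that happens to start with the same two bytes would be misclassified.

```python
import codecs

# Byte order marks that UTF-16 encoders commonly write at the start of a file.
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

def probably_text(data):
    """Apply the null-byte heuristic to a bytes sample, allowing UTF-16 with a BOM."""
    if data.startswith(UTF16_BOMS):
        return True
    return 0 not in data
```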


Source: https://stackoverflow.com/questions/35497473/how-to-use-os-walk-to-only-list-text-files
