I am new to Python. I wrote an algorithm to read 10 txt files in a folder and then write the first line of each of them into one txt outfile, but it doesn't work. I mean, afte…
Thanks to Eddo Hintoso for his detailed answer. I've slightly tweaked it to use yield rather than return, so it doesn't need to be mapped. I'm posting it here in case it is useful to anyone else who finds this post.
import glob

files = glob.glob("data/*.txt")

def map_first_lines(file_list):
    # generator: yields the first line of each file lazily
    for file in file_list:
        with open(file, 'r') as fd:
            yield fd.readline()

# print the first line of every file
for first_line in map_first_lines(files):
    print(first_line)
So another way to solve this particular problem:
import glob

def map_first_lines(file_list):
    # generator: yields the first line of each file
    for file in file_list:
        with open(file, 'rt') as fd:
            yield fd.readline()

def merge_first_lines(file_list, filename='first_lines.txt'):
    # write each first line to the output file, one per line
    with open(filename, 'w') as f:
        for line in map_first_lines(file_list):
            f.write("%s\n" % line.rstrip('\n'))

files = glob.glob("data/*.txt")
merge_first_lines(files)
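To sanity-check the result, you can read the merged file back; first_lines.txt is the default output name used above:

with open('first_lines.txt', 'rt') as f:
    print(f.read())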
Let's assume you have your files in the folder path
path = "/home/username/foldername/"
so you have all the files in the path folder. To read all the files in the folder you should use os or glob.
import os

path = "/home/username/foldername/"
savepath = "/home/username/newfolder/output.txt"   # output must be a file, not a folder

# open the output file once, then walk the input folder
with open(savepath, 'w') as outfile:
    for dirpath, subdirs, filenames in os.walk(path):
        for name in filenames:
            with open(os.path.join(dirpath, name)) as infile:
                # write only the first line of each file
                print(infile.readline().rstrip('\n'), file=outfile)
print("done")
Or, using glob, you can do it with fewer lines of code.
import glob

path = "/home/username/foldername/"
savepath = "/home/username/newfolder/output.txt"   # output must be a file, not a folder

# open the output file once so it is not re-created on every iteration
with open(savepath, 'w') as outfile:
    for filename in glob.glob(path + "*.txt"):
        with open(filename) as infile:
            # write only the first line of each file
            print(infile.readline().rstrip('\n'), file=outfile)
print("done")
Hope this works for you.
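On Python 3.4+, a pathlib-based sketch of the same idea is another option; the paths below are placeholders, not the actual paths from the question:

from pathlib import Path

folder = Path("/home/username/foldername")              # placeholder input folder
out_path = Path("/home/username/newfolder/output.txt")  # placeholder output file

with out_path.open("w") as outfile:
    for txt_file in sorted(folder.glob("*.txt")):
        # take only the first line of each text file
        with txt_file.open() as infile:
            outfile.write(infile.readline())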
Say you have 12 files in this folder called test, 10 of which are .txt files:
.../
    test/
        01.txt
        02.txt
        03.txt
        04.txt
        05.txt
        06.txt
        07.txt
        08.txt
        09.txt
        10.txt
        random_file.py
        this_shouldnt_be_here.sh
with each .txt file having its corresponding number as its first line, so 01.txt contains the first line 01, 02.txt contains the first line 02, and so on. You can do this in two ways:
os module
You can import the module os and use the listdir function to list all the files in that directory. It is important to note that all files in the list will be relative filenames:
>>> import os
>>> all_files = os.listdir("test/") # imagine you're one directory above test dir
>>> print(all_files) # won't necessarily be sorted
['08.txt', '02.txt', '09.txt', '04.txt', '05.txt', '06.txt', '07.txt', '03.txt', '01.txt', 'this_shouldnt_be_here.sh', '10.txt', 'random_file.py']
Now, you only want the .txt files, so with a bit of functional programming using the filter function and an anonymous function, you can easily filter them out without using standard for loops:
>>> txt_files = list(filter(lambda x: x[-4:] == '.txt', all_files))
>>> print(txt_files) # only text files
['08.txt', '02.txt', '09.txt', '04.txt', '05.txt', '06.txt', '07.txt', '03.txt', '01.txt', '10.txt']
glob module
Similarly, you can use the glob module and the glob.glob function to list all the text files in the directory, without using any of the functional programming above! The only difference is that glob will output the list with path prefixes, however you inputted the path.
>>> import glob
>>> txt_files = glob.glob("test/*.txt")
>>> print(txt_files)
['test/08.txt', 'test/02.txt', 'test/09.txt', 'test/04.txt', 'test/05.txt', 'test/06.txt', 'test/07.txt', 'test/03.txt', 'test/01.txt', 'test/10.txt']
What I mean by glob outputting the list according to how you input the relative or full path: for example, if you were in the test directory and you called glob.glob('./*.txt'), you would get a list like:
>>> glob.glob('./*.txt')
['./08.txt', './02.txt', './09.txt', ... ]
By the way, ./ means the current directory. Alternatively, you can just not prepend the ./, but the string representations will change accordingly:
>>> glob.glob("*.txt") # already in directory containing the text files
['08.txt', '02.txt', '09.txt', ... ]
Alright, now the problem with your code is that you are opening connections to all these files without ever closing them. Generally, the procedure for doing something with a file in Python is this:
fd = open(filename, mode)
fd.method # could be write(), read(), readline(), etc...
fd.close()
Now, the problem with this is that if something goes wrong in the second line where you call a method on the file, the file will never close and you're in big trouble.
To prevent this, we use what is called a context manager in Python, via the with keyword. This ensures the file will be closed with or without failures.
with open(filename, mode) as fd:
    fd.method
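Under the hood, the with block behaves roughly like a try/finally; here is a minimal sketch of the equivalent pattern, using a hypothetical filename:

filename = "example.txt"   # hypothetical file, just for illustration
fd = open(filename, 'rt')
try:
    first_line = fd.readline()
finally:
    fd.close()   # runs even if readline() raises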
readline()
As you probably know already, to extract the first line of a file, you simply have to open it and call the readline() method. We want to do this with all the text files listed in txt_files, and yes, you can do this with the functional map function, except this time we won't be writing an anonymous function (for readability):
>>> def read_first_line(file):
...     with open(file, 'rt') as fd:
...         first_line = fd.readline()
...     return first_line
...
>>> output_strings = list(map(read_first_line, txt_files)) # apply read_first_line to all text files
>>> print(output_strings)
['08\n', '02\n', '09\n', '04\n', '05\n', '06\n', '07\n', '03\n', '01\n', '10\n']
If you want output_strings to be sorted, just sort txt_files beforehand or sort output_strings itself. Both work:
output_strings = map(read_first_line, sorted(txt_files))
output_strings = sorted(map(read_first_line, txt_files))
So now you have a list of output strings, and the last thing you want to do is combine them:
>>> output_content = "".join(sorted(output_strings)) # sort and join the output strings without separators
>>> output_content # as a string
'01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n'
>>> print(output_content) # print as formatted
01
02
03
04
05
06
07
08
09
10
Now it's just a matter of writing this giant string to an output file! Let's call it outfile.txt:
>>> with open('outfile.txt', 'wt') as fd:
...     fd.write(output_content)
...
Then you're done! You're all set! Let's confirm it:
>>> with open('outfile.txt', 'rt') as fd:
...     print(fd.readlines())
...
['01\n', '02\n', '03\n', '04\n', '05\n', '06\n', '07\n', '08\n', '09\n', '10\n']
I'll be using the glob module so that it will always know which directory the paths are relative to, without the hassle of using absolute paths with the os module and whatnot.
import glob


def read_first_line(file):
    """Gets the first line from a file.

    Returns
    -------
    str
        the first line text of the input file
    """
    with open(file, 'rt') as fd:
        first_line = fd.readline()
    return first_line


def merge_per_folder(folder_path, output_filename):
    """Merges first lines of text files in one folder, and
    writes combined lines into new output file

    Parameters
    ----------
    folder_path : str
        String representation of the folder path containing the text files.
    output_filename : str
        Name of the output file the merged lines will be written to.
    """
    # make sure there's a slash to the folder path
    folder_path += "" if folder_path[-1] == "/" else "/"
    # get all text files
    txt_files = glob.glob(folder_path + "*.txt")
    # get first lines; map to each text file (sorted)
    output_strings = map(read_first_line, sorted(txt_files))
    output_content = "".join(output_strings)
    # write to file
    with open(folder_path + output_filename, 'wt') as outfile:
        outfile.write(output_content)
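For example, with the test/ folder from the walkthrough above, you could call it like this:

merge_per_folder("test/", "outfile.txt")
# test/outfile.txt should now contain the sorted first lines 01 through 10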
In that example, be careful not to open the output file inside the loop without closing the previous handle on each iteration; it is simpler to open it once, outside the loop, as shown in the sketch below.
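A minimal sketch of that pattern, with placeholder paths:

import glob

path = "/home/username/foldername/"           # placeholder input folder
savepath = "/home/username/first_lines.txt"   # placeholder output file

# open the output file once, outside the loop, so every first line goes
# into the same handle instead of re-opening (and truncating) the file
with open(savepath, 'w') as outfile:
    for name in glob.glob(path + "*.txt"):
        with open(name) as infile:
            outfile.write(infile.readline())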