I have a series of HTML files that are parsed into a single text file using Beautiful Soup. The HTML files are formatted such that their output is always three lines within
I'm not entirely sure what CSV library you're using, but it doesn't look like Python's built-in one. Anyway, here's how I'd do it:
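    import csv
    import itertools

    with open('extracted.txt', 'r') as in_file:
        # Generators keep everything lazy: one line is read at a time.
        stripped = (line.strip() for line in in_file)
        lines = (line for line in stripped if line)
        # The same iterator repeated three times yields consecutive triples.
        # Python 2 only: on Python 3 use the built-in zip instead of izip.
        grouped = itertools.izip(*[lines] * 3)
        # Nested so in_file is still open while grouped is consumed.
        with open('extracted.csv', 'w') as out_file:
            writer = csv.writer(out_file)
            writer.writerow(('title', 'intro', 'tagline'))
            writer.writerows(grouped)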
This works as a pipeline. It reads lines from the file, strips leading and trailing whitespace from each one, drops any lines that are then empty, groups the rest into threes, and then (after writing the CSV header) writes those groups to the CSV file.
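For what it's worth, here is a rough sketch of the same pipeline for Python 3, where itertools.izip no longer exists and the built-in zip is already lazy. The helper name group_lines and the sample text are made up for illustration, and it takes any iterable of lines so it can be tried without the real files.

```python
import csv
import io


def group_lines(lines, size=3):
    """Strip lines, drop blank ones, and yield tuples of `size` lines.

    group_lines is a hypothetical helper name, not part of any library.
    """
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    # The same iterator repeated `size` times: zip pulls `size` consecutive
    # items per tuple. A trailing incomplete group is silently dropped.
    return zip(*[non_empty] * size)


# Made-up sample standing in for extracted.txt.
sample = io.StringIO(
    "Title A\n\n  Intro A  \nTagline A\n\nTitle B\nIntro B\nTagline B\n"
)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(('title', 'intro', 'tagline'))
writer.writerows(group_lines(sample))
print(out.getvalue())
```

When writing to a real file on Python 3, open it with newline='' as the csv module documentation recommends, e.g. open('extracted.csv', 'w', newline='').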
To combine the last two columns as you mentioned in the comments, you could change the writerow call in the obvious way and the writerows call to:
    writer.writerows((title, intro + tagline) for title, intro, tagline in grouped)
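One thing to watch: intro + tagline glues the two strings together with nothing in between. If a separator is wanted, something like the sketch below might do. The sample rows, the single-space separator, and the 'text' header name are all assumptions made for the example.

```python
import csv
import io

# Made-up rows standing in for `grouped`.
grouped = [
    ('Title A', 'Intro A', 'Tagline A'),
    ('Title B', 'Intro B', 'Tagline B'),
]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(('title', 'text'))  # 'text' is a hypothetical header name
# Join the last two fields with a space instead of concatenating directly.
writer.writerows(
    (title, ' '.join((intro, tagline))) for title, intro, tagline in grouped
)
print(out.getvalue())
```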
Perhaps I didn't understand you correctly, but you can do:
    file = open("extracted.txt")
    # if you don't want to do .strip() again, just create a list of the stripped
    # lines first.
    lines = [line.strip() for line in file if line.strip()]
    for i, line in enumerate(lines):
        csv.SetCell(i % 3, line)
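Since SetCell is not part of Python's built-in csv module, here is a sketch of the same i % 3 idea using only plain lists: i % 3 picks the column and i // 3 the row. The function name rows_from_lines and the sample lines are hypothetical.

```python
def rows_from_lines(lines, width=3):
    """Place each non-blank line in column i % width of row i // width."""
    cleaned = [line.strip() for line in lines if line.strip()]
    rows = []
    for i, line in enumerate(cleaned):
        if i % width == 0:
            rows.append([])    # start a new row every `width` lines
        rows[-1].append(line)  # this line lands in column i % width
    return rows


print(rows_from_lines(["Title A\n", "\n", "Intro A\n", "Tagline A\n", "Title B\n"]))
```

Unlike the zip-based grouping, this keeps a trailing incomplete row rather than dropping it.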