Is there a pythonic way to figure out which rows in a CSV file contain headers and values and which rows contain trash and then get the headers/values rows into data frames?
This program might help. It is essentially a wrapper around the csv.reader()
object that filters out everything except the good data.
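import csv

import pandas as pd


def ignore_comments(rows, start_fn, end_fn, keep_initial):
    """Yield only the rows between a start row and an end row."""
    state = 'keep' if keep_initial else 'start'
    for row in rows:
        if state == 'start' and start_fn(row):
            state = 'keep'
            yield row  # the first header row is passed through
        elif state == 'keep':
            if end_fn(row):
                state = 'drop'
            else:
                yield row
        elif state == 'drop':
            if start_fn(row):
                state = 'keep'  # later header rows are not repeated


if __name__ == "__main__":
    with open('x.in', newline='') as fp:
        rows = csv.reader(fp, skipinitialspace=True)
        rows = ignore_comments(
            rows,
            lambda x: x == ['header1', 'header2', 'header3'],
            lambda x: x == [],
            False)
        # Recent pandas versions do not accept an iterator of rows in
        # read_csv, so build the frame directly. Values stay strings.
        header = next(rows)
        df = pd.DataFrame(list(rows), columns=header)
    print(df)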
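The two lambdas compare whole parsed rows, not raw text lines. A small sketch, using a made-up in-memory sample rather than the real x.in, of what csv.reader hands to those predicates:

```python
import csv
from io import StringIO

# Hypothetical sample: junk, a header, one data row, a blank line, more junk.
sample = "junk line\nheader1, header2, header3\n1, 2, 3\n\nmore junk\n"

# skipinitialspace=True strips the space that follows each comma.
rows = list(csv.reader(StringIO(sample), skipinitialspace=True))

header_row = rows[1]   # list of three strings, matched by the start predicate
blank_row = rows[3]    # csv.reader yields an empty list for a blank line
```

This is why the end predicate tests for `[]`: a blank line between blocks comes out of csv.reader as an empty list.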
Yes, there is a more Pythonic way to do this with pandas. Here is a quick demonstration that answers the question:
import pandas as pd
from io import StringIO  # on Python 2 this was: from StringIO import StringIO
# define an example to showcase the solution
st = """blah blah here's a test and
here's some information
you don't care about
even a little bit
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
here's some more garbage
that's different than the last one
this should make
life interesting
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15"""
# 1- read the data with pd.read_csv
# 2- skip malformed lines with on_bad_lines="skip" (pandas >= 1.3; older versions used error_bad_lines=False)
# 3- The header has to be the first row of the file. Since this is not the case here, define it manually with names=[...] and header=None.
data = pd.read_csv(StringIO(st), delimiter=",", names=["header1", "header2", "header3"], on_bad_lines="skip", header=None)
# the trash will be loaded as follows
# blah blah here's a test and NaN NaN
# let's drop these rows
data = data.dropna()
# remove the repeated header rows, i.e. rows whose first field contains "header"
mask = data["header1"].str.contains("header")
data = data[~mask]
print(data)
Now your DataFrame looks like this:
   header1 header2 header3
5        1       2       3
6        4       5       6
13       7       8       9
14      10      11      12
15      13      14      15
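Note that the remaining values are still strings, because each column originally mixed text and numbers, and that both blocks end up in a single frame. A minimal sketch, assuming the same three column names and a shortened copy of the sample text, that converts the values to integers and builds one data frame per header block (the names `is_header`, `block_id` and `frames` are illustrative, not part of pandas):

```python
from io import StringIO

import pandas as pd

# Shortened copy of the sample text: two blocks separated by junk lines.
st = """some junk
header1, header2, header3
1, 2, 3
4, 5, 6
more junk
header1, header2, header3
7, 8, 9"""

names = ["header1", "header2", "header3"]
data = pd.read_csv(StringIO(st), names=names, header=None,
                   skipinitialspace=True)

# A header row repeats the first column name exactly.
is_header = data["header1"].eq("header1")
# Number the blocks: the counter increases at every header row.
block_id = is_header.cumsum()

# Keep only complete rows that are not header rows.
keep = data.notna().all(axis=1) & ~is_header

# One frame per block, with integer values and a fresh index.
frames = [
    block.astype(int).reset_index(drop=True)
    for _, block in data[keep].groupby(block_id[keep])
]
```

Here `frames[0]` holds the rows of the first block and `frames[1]` those of the second. The `astype(int)` call assumes every kept value is a whole number; data with other types would need a different conversion.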