I have a CSV file that looks like this:
TEST
2012-05-01 00:00:00.203 ON 1
2012-05-01 00:00:11.203 OFF 0
2012-05-01 00:00:22.203 ON 1
2012-05-01 00:00
How can I skip the lines that start with TEST when loading this file?
When you get the row from the csv.reader and can be sure that the first element is a string, you can use:

if not row[0].startswith('TEST'):
    process(row)
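For example, a minimal sketch in context, assuming the data sits in a file called file.csv and is space-delimited as in the sample above; process is just a hypothetical stand-in for whatever you do with each row:

import csv

def process(row):
    # hypothetical placeholder for the real per-row handling
    print(row)

with open('file.csv', newline='') as f:
    for row in csv.reader(f, delimiter=' '):
        if not row[0].startswith('TEST'):  # drop the TEST marker lines
            process(row)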
Another option, since I just ran into this problem also:
import pandas as pd
import subprocess

# grep -n prefixes each matching line with its 1-based line number, e.g. "1:TEST"
grep = subprocess.check_output(['grep', '-n', '^TEST', filename]).decode().splitlines()
bad_lines = [int(s[:s.index(':')]) - 1 for s in grep]  # 0-based row numbers for skiprows
df = pd.read_csv(filename, skiprows=bad_lines)
It's less portable than @eumiro's (read: probably doesn't work on Windows) and requires reading the file twice, but has the advantage that you don't have to store the entire file contents in memory.
You could of course do the same thing as the grep in Python, but it'd probably be slower.
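A minimal sketch of that pure-Python variant, reusing the filename placeholder from the snippet above and assuming one extra pass over the file is acceptable:

import pandas as pd

bad_lines = []
with open(filename) as f:
    for i, line in enumerate(f):  # i is the 0-based row number that skiprows expects
        if line.startswith('TEST'):
            bad_lines.append(i)
df = pd.read_csv(filename, skiprows=bad_lines)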
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html?highlight=read_csv#pandas.io.parsers.read_csv
skiprows : list-like or integer
    Row numbers to skip (0-indexed) or number of rows to skip (int)
Pass skiprows=[0, 6] to skip the rows with "TEST".
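As a minimal sketch, assuming the file is called file.csv and, as in the sample, is space-separated with no header row (sep and header here are assumptions, not part of the quoted docs):

import pandas as pd

# rows 0 and 6 (0-indexed) are the lines starting with "TEST"
df = pd.read_csv('file.csv', skiprows=[0, 6], sep=' ', header=None)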
from cStringIO import StringIO
import pandas

s = StringIO()
with open('file.csv') as f:
    for line in f:
        if not line.startswith('TEST'):
            s.write(line)
s.seek(0)  # "rewind" to the beginning of the StringIO object
pandas.read_csv(s)  # with further parameters…
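For Python 3, where the cStringIO module no longer exists, the same approach should work by swapping the first import for the standard io module:

from io import StringIO  # Python 3 replacement for cStringIO.StringIO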