Python pandas: read csv with multiple tables repeated preamble


Is there a pythonic way to figure out which rows in a CSV file contain headers and values and which rows contain trash and then get the headers/values rows into data frames?

2 Answers

醉话见心

2021-01-13 15:08

This program might help. It is essentially a wrapper around the csv.reader() object that filters out the trash and passes only the good rows through.

import pandas as pd
import csv


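def ignore_comments(fp, start_fn, end_fn, keep_initial):
    """Yield only the wanted rows from fp.

    start_fn(row) returns True for a row that begins a block of good data;
    end_fn(row) returns True for a row that ends it. If keep_initial is
    True, rows are kept from the very first one.
    """
    state = 'keep' if keep_initial else 'start'
    for line in fp:
        if state == 'start' and start_fn(line):
            # First header row: keep it so it can serve as the column names
            state = 'keep'
            yield line
        elif state == 'keep':
            if end_fn(line):
                state = 'drop'
            else:
                yield line
        elif state == 'drop':
            # Later header rows restart a block but are not yielded again
            if start_fn(line):
                state = 'keep'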
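if __name__ == "__main__":

    with open('x.in', newline='') as fp:
        rows = csv.reader(fp, skipinitialspace=True)
        rows = ignore_comments(
            rows,
            lambda x: x == ['header1', 'header2', 'header3'],
            lambda x: x == [],
            False)

        # pd.read_csv expects a path or a file-like object, so build the
        # frame directly from the filtered rows: the first kept row is the
        # header, the rest are data. Values stay as strings.
        header = next(rows)
        df = pd.DataFrame(list(rows), columns=header)
    print(df)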
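
If you need one DataFrame per table instead of a single combined frame, the same state-machine idea can be extended. What follows is only a minimal sketch and not part of the program above: `split_tables` is a hypothetical helper name, and it assumes that every table starts with an identical header row and ends at a blank line or at the end of the file.

```python
import csv
import io

import pandas as pd


def split_tables(fp, header):
    """Yield one DataFrame per table found in the file-like object fp.

    Assumption for this sketch: each table starts with a row equal to
    `header` and ends at the first blank line or at end of file.
    """
    rows = None  # None means "currently outside a table"
    for row in csv.reader(fp, skipinitialspace=True):
        row = [cell.strip() for cell in row]
        if row == header:
            # A new header while still inside a table closes the old one
            if rows is not None:
                yield pd.DataFrame(rows, columns=header)
            rows = []
        elif rows is not None:
            if any(row):
                rows.append(row)
            else:
                # Blank line: the current table is finished
                yield pd.DataFrame(rows, columns=header)
                rows = None
    if rows is not None:
        yield pd.DataFrame(rows, columns=header)


sample = """some preamble
header1, header2, header3
1, 2, 3
4, 5, 6

more preamble
header1, header2, header3
7, 8, 9
"""

tables = list(split_tables(io.StringIO(sample),
                           ["header1", "header2", "header3"]))
for table in tables:
    print(table)
```

The cells are kept as strings here; convert them afterwards (for example with `astype(int)`) if every column is known to be numeric.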


盖世英雄少女心

2021-01-13 15:20

Yes, there is a more Pythonic way to do this with pandas. Here is a quick demonstration:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

# define an example to showcase the solution
st = """blah blah here's a test and
here's some information  
you don't care about  
even a little bit  
header1, header2, header3  
1, 2, 3  
4, 5, 6  

oh you have another test  
here's some more garbage  
that's different than the last one  
this should make  
life interesting  
header1, header2, header3  
7, 8, 9  
10, 11, 12  
13, 14, 15""" 

# 1- read the data with pd.read_csv
# 2- skip malformed lines (rows with too many fields) with on_bad_lines="skip" (pandas 1.3+; older versions used error_bad_lines=False)
# 3- The header has to be the first row of the file. Since this is not the case, define it manually with names=[...] and header=None.
data = pd.read_csv(StringIO(st), delimiter=",", names=["header1", "header2", "header3"], on_bad_lines="skip", header=None)

# the trash will be loaded as follows 
# blah blah here's a test and       NaN         NaN
# let's drop these rows 
data = data.dropna()

# remove the repeated header rows (those whose first column contains "header")
mask = data["header1"].str.contains("header")
data = data[~mask]
print(data)

Now your DataFrame looks like this:

   header1 header2 header3
5        1       2     3  
6        4       5     6  
13       7       8     9  
14      10      11    12  
15      13      14      15
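
Note that the values in this result are still strings, some with stray whitespace, and the gaps in the index show where each table began. As a hedged sketch that goes beyond the demonstration above (it assumes pandas 1.3 or later for `on_bad_lines`, and the `table` column name and the shortened sample text are my own choices for illustration), you could label each row with the table it came from before dropping the header rows, then strip and convert the values:

```python
from io import StringIO

import pandas as pd

text = """junk line
header1, header2, header3
1, 2, 3
4, 5, 6

more junk
header1, header2, header3
7, 8, 9
"""

cols = ["header1", "header2", "header3"]
raw = pd.read_csv(StringIO(text), names=cols, header=None,
                  skipinitialspace=True, on_bad_lines="skip")
raw = raw.dropna()  # junk lines have no commas, so they carry NaN values

# Each repeated header row marks the start of a new table, so a running
# count of header rows gives a table number for every row.
is_header = raw["header1"].str.strip().eq("header1")
raw = raw.assign(table=is_header.cumsum())

data = raw[~is_header].copy()
for col in cols:
    # Strip leftover whitespace and convert the strings to numbers
    data[col] = pd.to_numeric(data[col].str.strip())

# One DataFrame per table, keyed by table number (1, 2, ...)
tables = {
    table_id: group.drop(columns="table").reset_index(drop=True)
    for table_id, group in data.groupby("table")
}
print(tables[1])
print(tables[2])
```

This relies on junk lines never containing exactly two commas; a junk line that happens to split into three fields would survive `dropna()` and make `pd.to_numeric` raise.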
