Python pandas: read csv with multiple tables repeated preamble

再見小時候 2021-01-13 14:34

Is there a pythonic way to figure out which rows in a CSV file contain headers and values and which rows contain trash and then get the headers/values rows into data frames?

2 answers
  • 2021-01-13 15:08

    This program might help. It is essentially a wrapper around the csv.reader() object that filters out everything except the good rows.

    import csv
    import pandas as pd
    
    
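    def ignore_comments(fp, start_fn, end_fn, keep_initial):
        """Yield only the rows between a start marker and an end marker.

        start_fn(row) is true for a header row; end_fn(row) is true for the
        row that ends a table. The header is yielded only the first time.
        """
        state = 'keep' if keep_initial else 'start'
        for line in fp:
            if state == 'start' and start_fn(line):
                # First header: emit it and start keeping rows.
                state = 'keep'
                yield line
            elif state == 'keep':
                if end_fn(line):
                    # End of a table: drop rows until the next header.
                    state = 'drop'
                else:
                    yield line
            elif state == 'drop':
                if start_fn(line):
                    # Repeated header: resume keeping rows, but skip the header itself.
                    state = 'keep'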
    
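    if __name__ == "__main__":

        with open('x.in', newline='') as fp:
            reader = csv.reader(fp, skipinitialspace=True)
            rows = list(ignore_comments(
                reader,
                lambda row: row == ['header1', 'header2', 'header3'],
                lambda row: row == [],
                False))

        # The filter yields parsed rows, not text, so build the frame directly.
        # The first kept row is the header; all values are strings
        # (convert with pd.to_numeric if numbers are needed).
        df = pd.DataFrame(rows[1:], columns=rows[0])
        print(df)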
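
    Since the question is about a file with several tables, the same state-machine idea can be sketched so that it returns one data frame per table. This is only a sketch: split_tables is a hypothetical helper, the sample text is made up for illustration, and it assumes that every three-field row following a header is data.

```python
import csv
import io

import pandas as pd

HEADER = ["header1", "header2", "header3"]  # header row from the example above


def split_tables(lines, header=HEADER):
    """Return one DataFrame per header-delimited table (hypothetical helper)."""
    tables, current = [], None
    for row in csv.reader(lines, skipinitialspace=True):
        row = [cell.strip() for cell in row]
        if row == header:
            # A header row starts a new table.
            current = []
            tables.append(current)
        elif current is not None and len(row) == len(header):
            # Assumption: a row with the right field count after a header is data.
            current.append(row)
        else:
            # Blank line or preamble text: ignore rows until the next header.
            current = None
    return [pd.DataFrame(rows, columns=header).apply(pd.to_numeric)
            for rows in tables]


# Made-up sample in the same shape as the question describes.
sample = """some preamble
more preamble
header1, header2, header3
1, 2, 3
4, 5, 6

another preamble
header1, header2, header3
7, 8, 9
"""

frames = split_tables(io.StringIO(sample))
for frame in frames:
    print(frame)
```

    Keeping the tables separate avoids having to work out afterwards which rows belonged to which test.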
    
  • 2021-01-13 15:20

    Yes, there is a more pythonic way to do that with pandas. Here is a quick demonstration:

    import pandas as pd
    from io import StringIO
    
    # Define an example to showcase the solution
    st = """blah blah here's a test and
    here's some information  
    you don't care about  
    even a little bit  
    header1, header2, header3  
    1, 2, 3  
    4, 5, 6  
    
    oh you have another test  
    here's some more garbage  
    that's different than the last one  
    this should make  
    life interesting  
    header1, header2, header3  
    7, 8, 9  
    10, 11, 12  
    13, 14, 15""" 
    
    # 1- Read the data with pd.read_csv.
    # 2- Skip lines with too many fields using on_bad_lines="skip"
    #    (pandas 1.3+; older versions used error_bad_lines=False).
    # 3- read_csv expects the header on the first row. That is not the case here,
    #    so define the columns manually with names=[...] and header=None.
    data = pd.read_csv(StringIO(st), delimiter=",", names=["header1", "header2", "header3"], on_bad_lines="skip", header=None)
    
    # the trash will be loaded as follows 
    # blah blah here's a test and       NaN         NaN
    # let's drop these rows 
    data = data.dropna()
    
    # Remove the repeated header rows, whose first field contains "header"
    mask = data["header1"].str.contains("header", regex=False)
    data = data[~mask]
    print(data)
    

    Now your DataFrame looks like this:

       header1 header2 header3
    5        1       2     3  
    6        4       5     6  
    13       7       8     9  
    14      10      11    12  
    15      13      14    15
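
    If each table should end up in its own data frame, the repeated header rows can serve as separators before they are dropped. A minimal sketch, assuming the preamble lines contain no commas; the sample text and the variable names are made up for illustration:

```python
import io

import pandas as pd

NAMES = ["header1", "header2", "header3"]

# Made-up sample in the same shape as the example above.
sample = """some preamble
more preamble
header1, header2, header3
1, 2, 3
4, 5, 6

another preamble
header1, header2, header3
7, 8, 9
10, 11, 12
"""

raw = pd.read_csv(io.StringIO(sample), names=NAMES, header=None,
                  skipinitialspace=True)

# Each header row starts a new table, so a running count of header rows
# labels every row with the table it belongs to.
is_header = raw["header1"] == "header1"
table_id = is_header.cumsum()

# Drop the header rows and the preamble rows (which have NaN in columns 2 and 3).
rows = raw[~is_header].dropna()

tables = [
    group.reset_index(drop=True).apply(pd.to_numeric)
    for _, group in rows.groupby(table_id.loc[rows.index])
]

for table in tables:
    print(table)
```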
    