Pandas dataframe read_csv on bad data

庸人自扰 2020-12-02 22:13

I want to read in a very large csv (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, causing the program to crash.

3 Answers
  • 2020-12-02 22:19

    To get information about the rows causing errors, try using a combination of error_bad_lines=False and warn_bad_lines=True:

    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                            warn_bad_lines=True, error_bad_lines=False)
    

    error_bad_lines=False skips the error-causing rows, and warn_bad_lines=True prints the error details and row number, like this:

    'Skipping line 3: expected 4 fields, saw 3401\nSkipping line 4: expected 4 fields, saw 30...'
    

    If you want to save the warning messages (e.g. for further processing), you can redirect them to a file using contextlib (a sketch for reading the skipped row numbers back out of that log follows the code):

    import contextlib
    import pandas as pd
    
    with open(r'D:\Temp\log.txt', 'w') as log:
        with contextlib.redirect_stderr(log):
            dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', 
                                    warn_bad_lines=True, error_bad_lines=False)
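
    A small parsing sketch, assuming the log path and the "Skipping line N" message format shown above:

    import re

    # Sketch: pull the skipped row numbers out of the saved warning log
    with open(r'D:\Temp\log.txt') as log:
        bad_rows = [int(n) for n in re.findall(r'Skipping line (\d+)', log.read())]
    print(bad_rows)   # e.g. [3, 4, ...] for the message shown above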
    
  • 2020-12-02 22:31

    Here is my way of solving this problem. It is slow but works well: simply read the CSV file as a text file and go through each line. If a line has fewer commas than it should, skip that row; eventually, save the correct lines to a temporary file that pandas can read (see the usage sketch after the function).

    import itertools

    def bad_lines(path):
        num_columns = []

        # Sample every 5th line among the first 50 (skipping quoted ones)
        # to learn the expected number of commas per row
        with open(path) as f:
            for line in itertools.islice(f, 0, 50, 5):
                if line.count("'") == 0 and line.count('"') == 0:
                    num_columns.append(line.count(","))

        good_lines = []
        # Only filter if every sampled line agrees on the comma count
        if len(set(num_columns)) == 1:
            with open(path) as f:
                for line in f:
                    # Keep rows with at least the expected number of commas
                    if line.count(",") >= num_columns[0]:
                        good_lines.append(line)

        with open("temp.txt", "w") as text_file:
            text_file.write("".join(good_lines))

        return "temp.txt"
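
    A possible way to combine this with pandas, as a sketch (the data.csv filename is just a placeholder): run the helper first, then load the filtered copy it writes.

    import pandas as pd

    # Sketch: filter the raw file, then parse the cleaned copy
    clean_path = bad_lines("data.csv")   # "data.csv" is a placeholder path
    dataframe = pd.read_csv(clean_path, index_col=False)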
    
  • 2020-12-02 22:41

    Pass error_bad_lines=False to skip erroneous rows:

    error_bad_lines : boolean, default True. Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with the C parser.)
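
    A minimal sketch, reusing the placeholder filePath from above; note that in pandas 1.3 and later, error_bad_lines and warn_bad_lines are deprecated in favor of the single on_bad_lines parameter:

    import pandas as pd

    # Older spelling (pre pandas 1.3): silently drop malformed rows
    dataframe = pd.read_csv(filePath, error_bad_lines=False)

    # Newer spelling (pandas >= 1.3): 'skip' drops bad rows, 'warn' also reports them
    dataframe = pd.read_csv(filePath, on_bad_lines='skip')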
