Not reading all rows while importing csv into pandas dataframe

孤独总比滥情好 2021-01-18 11:46

I am trying the Kaggle challenge here, and unfortunately I am stuck at a very basic step. My limited Python knowledge is to blame for this. I am trying to read the data

1 answer
  • 2021-01-18 12:49

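    I think it is better to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False. From pandas 1.3 on, the equivalent of the latter is on_bad_lines="skip"; error_bad_lines was removed in pandas 2.0.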

    import pandas as pd
    import csv
    
    # on_bad_lines="skip" needs pandas 1.3+; on older versions use error_bad_lines=False instead
    test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, on_bad_lines="skip")
    
    print(test.shape)
    # (381422, 22)
    

    Note that the problematic rows will be skipped.
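    To see what "skipped" means here, the following is a minimal sketch on a tiny in-memory CSV. The sample data and the helper name read_skipping_bad_lines are made up for illustration and are not part of the Kaggle dataset. The helper tries the newer on_bad_lines="skip" keyword first and falls back to error_bad_lines=False on older pandas versions.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the row with Id 3 has one field too many.
SAMPLE = (
    "Id,Subject,Body\n"
    "1,hello,first\n"
    "2,re: hello,second\n"
    "3,oops,third,extra\n"
    "4,bye,fourth\n"
)


def read_skipping_bad_lines(source, **kwargs):
    """Read a CSV and drop rows that have too many fields (illustrative helper)."""
    try:
        # pandas 1.3 and later
        return pd.read_csv(source, quoting=csv.QUOTE_NONE,
                           on_bad_lines="skip", **kwargs)
    except TypeError:
        # older pandas does not know the on_bad_lines keyword
        return pd.read_csv(source, quoting=csv.QUOTE_NONE,
                           error_bad_lines=False, **kwargs)


df = read_skipping_bad_lines(io.StringIO(SAMPLE))
print(df.shape)          # (3, 3)
print(df["Id"].tolist())  # [1, 2, 4]
```

    Only rows with more fields than expected are dropped this way; rows with too few fields are kept and padded with NaN.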

    If you want to skip the email body data, you can use:

    import pandas as pd
    import csv
    
    # header=None with explicit names: the file's own header row is read as a data row
    names = ["Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
             "SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
             "MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
             "ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
             "ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
             "ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
             "ExtractedBodyText", "RawText"]
    
    # on_bad_lines="skip" needs pandas 1.3+; on older versions use error_bad_lines=False instead
    test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, sep=',',
                       on_bad_lines="skip", header=None, names=names)
    
    print(test.shape)
    
    # delete rows with NaN in column MetadataFrom
    test = test.dropna(subset=['MetadataFrom'])
    # delete header rows that ended up in the data
    test = test[test.MetadataFrom != 'MetadataFrom']
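
    The cleanup step can be tried on a small made-up example. This is only a sketch: the three-column sample and the function name drop_fragments_and_headers are hypothetical, and it assumes that body text spilling onto its own line leaves MetadataFrom empty, as in the approach above.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the body of email 1 continues on a second line,
# so that line is parsed as a row with only its first field filled.
SAMPLE = (
    "Id,MetadataFrom,ExtractedBodyText\n"
    "1,alice,first line of body\n"
    "continuation of the body\n"
    "2,bob,another body\n"
)
NAMES = ["Id", "MetadataFrom", "ExtractedBodyText"]


def drop_fragments_and_headers(df, key="MetadataFrom"):
    """Drop rows where `key` is missing, then rows that repeat the header."""
    df = df.dropna(subset=[key])
    return df[df[key] != key]


raw = pd.read_csv(io.StringIO(SAMPLE), quoting=csv.QUOTE_NONE,
                  header=None, names=NAMES)
clean = drop_fragments_and_headers(raw)

print(len(raw))                        # 4 (header + 2 emails + 1 fragment)
print(clean["MetadataFrom"].tolist())  # ['alice', 'bob']
```

    Because the header row was read as data, the Id column holds strings here; convert it with pd.to_numeric if numeric ids are needed.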
    