Python/Pandas CSV Parsing

南方客 2021-01-28 02:34

I used the JotForm Configurable List widget to collect data, but I am having trouble parsing the resulting data correctly. When I use

testdf = pd.read_csv (\"TestLoad.c         


        
2 Answers
  •  借酒劲吻你
    2021-01-28 02:57

    Here is the data I used, saved as csv1.csv:

    "Date","Information","Type"
    "2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones,  School: MCAA;","Old"
    "2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
    

    import csv
    import datetime as dt
    import pprint
    import re
    
    import pandas as pd
    
    records = [] #Construct a complete record for each person
    
    colon_pairs = r"""
        (\w+)   #Match a 'word' character, one or more times, captured in group 1, followed by..
        :       #A colon, followed by...
        \s*     #Whitespace, 0 or more times, followed by...
        (\w+)   #A 'word' character, one or more times, captured in group 2.
    """
    
    colon_pairs_per_person = 3
    
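    with open("csv1.csv", encoding='utf-8', newline='') as f: #newline='' is what the csv module docs recommend
        next(f) #skip header line
    
        for date, info, the_type in csv.reader(f):
            record = {} #start each row with an empty record so leftover pairs cannot leak into the next row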
            info_parser = re.finditer(colon_pairs, info, flags=re.X)
    
            for i, match_obj in enumerate(info_parser):
                key, val = match_obj.groups()
                record[key] = val
    
                if (i+1) % colon_pairs_per_person == 0: #then done with info for a person
                    record['Date'] = dt.datetime.strptime(date, '%Y-%m-%d') #So that you can sort the DataFrame rows by date.
                    record['Type'] = the_type
    
                    records.append(record)
                    record = {}
    
    pprint.pprint(records)
    df = pd.DataFrame(
            sorted(records, key=lambda record: record['Date'])
    )
    print(df)
    df.set_index('Date', inplace=True)
    print(df)
    
    --output:--
    [{'Date': datetime.datetime(2015, 12, 7, 0, 0),
      'First': 'Jim',
      'Last': 'Jones',
      'School': 'MCAA',
      'Type': 'Old'},
     {'Date': datetime.datetime(2015, 12, 7, 0, 0),
      'First': 'Jane',
      'Last': 'Jones',
      'School': 'MCAA',
      'Type': 'Old'},
     {'Date': datetime.datetime(2015, 12, 6, 0, 0),
      'First': 'Tom',
      'Last': 'Smith',
      'School': 'MCAA',
      'Type': 'New'},
     {'Date': datetime.datetime(2015, 12, 6, 0, 0),
      'First': 'Tammy',
      'Last': 'Smith',
      'School': 'MCAA',
      'Type': 'New'}]
    
            Date  First   Last School Type
    0 2015-12-06    Tom  Smith   MCAA  New
    1 2015-12-06  Tammy  Smith   MCAA  New
    2 2015-12-07    Jim  Jones   MCAA  Old
    3 2015-12-07   Jane  Jones   MCAA  Old
    
                First   Last School Type
    Date                                
    2015-12-06    Tom  Smith   MCAA  New
    2015-12-06  Tammy  Smith   MCAA  New
    2015-12-07    Jim  Jones   MCAA  Old
    2015-12-07   Jane  Jones   MCAA  Old
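
For comparison, here is a rough sketch of the same reshaping done inside pandas rather than with a manual `csv.reader` loop. It is an alternative, not the method used in the answer above. It reads the sample data from an in-memory string (the `sample` variable is only for the demo) and keeps the answer's assumption that every person contributes exactly three `key: value` pairs. The names `pairs`, `people`, and `result` are made up for illustration.

```python
import io

import pandas as pd

# Same sample data as above, held in a string so the sketch is self-contained.
sample = '''"Date","Information","Type"
"2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones,  School: MCAA;","Old"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
'''

df = pd.read_csv(io.StringIO(sample), parse_dates=["Date"])

# One output row per "key: value" match; the index is (original row, match number).
pairs = df["Information"].str.extractall(r"(?P<key>\w+):\s*(?P<val>\w+)")
pairs.index = pairs.index.set_names(["row", "match"])
pairs = pairs.reset_index()

# Assumption carried over from the answer: each person has exactly 3 pairs.
pairs_per_person = 3
pairs["person"] = pairs["match"] // pairs_per_person

# Turn the keys (First, Last, School) into columns, one row per person.
people = pairs.set_index(["row", "person", "key"])["val"].unstack("key").reset_index()
people.columns.name = None

# Bring Date and Type back from the original row each person came from.
people["Date"] = people["row"].map(df["Date"])
people["Type"] = people["row"].map(df["Type"])

result = (
    people.drop(columns=["row", "person"])
          .sort_values("Date", kind="mergesort")  # stable sort keeps people in file order within a date
          .set_index("Date")
)
print(result)
```

With the sample data this should give the same four rows as the final DataFrame above. If a form entry can have a different number of fields per person, the integer division by `pairs_per_person` will group the pairs incorrectly, so that assumption is worth checking against the real export first.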
    
