Splitting rows in CSV on several header rows

Asked by 一向 on 2021-01-03 15:52

I am very new to python, so please be gentle.

I have a .csv file, reported to me in this format, so I cannot do much about it:

ClientAccountID   Acc         


        
2 Answers
  • 2021-01-03 16:16

    If I've understood your question correctly, you have a single CSV file which contains multiple tables. Tables are delimited by header rows which always begin with the string "ClientAccountID".
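
    For illustration only, a file of that shape might look something like this (the column names are guesses based on the sample output further down, and the real data is not shown in the question; the columns would be tab-separated in the actual file):

    ClientAccountID    AccountAlias    CurrencyPrimary    FromDate
    SomeID             SomeAlias       SomeCurr           SomeDate
    OtherID            OtherAlias      OtherCurr          OtherDate
    ClientAccountID    AccountAlias    CurrencyPrimary    AssetClass
    SomeID             SomeAlias       SomeCurr           SomeClass
    OtherID            OtherAlias      OtherCurr          OtherClass
    AnotherID          AnotherAlias    AnotherCurr        AnotherClass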

    So the job is to read the CSV file into a list of lists-of-dictionaries. Each entry in the list corresponds to one of the tables in your CSV file.

    Here's how I'd do it:

    1. Break up the single CSV file with multiple tables into multiple files each with a single table. (These files could be in-memory.) Do this by looking for lines which start with "ClientAccountID".
    2. Read each of these files into a list of dictionaries using a DictReader.

    Here's some code to read the file into a list of StringIOs. (A StringIO is an in-memory file. It works by wrapping a string up into a file-like interface).

    from csv import DictReader
    from io import StringIO
    
    stringios = []
    
    with open('file.csv', 'r') as f:
        stringio = None
        for line in f:
            if line.startswith('ClientAccountID'):
                # A new table starts here: stash the finished buffer and open a new one.
                if stringio is not None:
                    stringio.seek(0)
                    stringios.append(stringio)
                stringio = StringIO()
            if stringio is not None:   # ignore anything before the first header row
                stringio.write(line)   # 'line' already ends in a newline
        if stringio is not None:
            stringio.seek(0)
            stringios.append(stringio)
    

    If we encounter a line starting with 'ClientAccountID', we put the current StringIO into the list and start writing to a new one. When you've finished, remember to add the last one to the list too. Don't forget (as I did, in an earlier version of this answer) to rewind the StringIO after you've written to it using stringio.seek(0).

    Now it's straightforward to loop over the StringIOs to get a table of dictionaries.

    data = [list(DictReader(x, delimiter='\t')) for x in stringios]
    

    For each file-like object in the list stringios, create a DictReader and read it into a list.
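
    As a quick usage sketch (assuming everything parsed and each table really does contain a ClientAccountID column):

    # data[0] is the first table; data[0][0] is its first row, a dict keyed by the header names
    print(data[0][0]['ClientAccountID'])
    
    # or walk every row of every table
    for table in data:
        for row in table:
            print(row['ClientAccountID'])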

    It's not too hard to modify this approach if your data is too big to fit into memory. Use generators instead of lists and do the processing line-by-line.
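
    Here's a minimal sketch of that line-by-line variant, as a generator that yields one row-dict at a time (the tab delimiter and the file name 'file.csv' are carried over from the code above):

    import csv
    
    def rows(path):
        """Yield each data row as a dict keyed by the most recent header row."""
        with open(path) as f:
            header = None
            for fields in csv.reader(f, delimiter='\t'):
                if fields and fields[0] == 'ClientAccountID':
                    header = fields            # a new table starts here
                elif header and any(fields):   # skip blank lines and anything before the first header
                    yield dict(zip(header, fields))
    
    # process one row at a time without building the whole list in memory
    for row in rows('file.csv'):
        print(row['ClientAccountID'])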

  • 2021-01-03 16:22

    If your data is not comma- or tab-delimited, you could use str.split and combine it with itertools.groupby to separate the header rows from the data rows:

    from itertools import groupby
    
    with open("test.txt") as f:
        # Group consecutive rows by whether or not they are header rows.
        grps, data = groupby(map(str.split, f), lambda x: x[0] == "ClientAccountID"), []
        for k, v in grps:
            if k:                               # a run of header rows
                names = next(v)                 # the header row itself
                vals = zip(*next(grps)[1])      # transpose the following data rows into columns
                data.append(dict(zip(names, vals)))
    
    from pprint import pprint as pp
    
    pp(data)
    

    Output:

    [{'AccountAlias': ('SomeAlias', 'OtherAlias'),
      'ClientAccountID': ('SomeID', 'OtherID'),
      'CurrencyPrimary': ('SomeCurr', 'OtherCurr'),
      'FromDate': ('SomeDate', 'OtherDate')},
     {'AccountAlias': ('SomeAlias', 'OtherAlias', 'AnotherAlias'),
      'AssetClass': ('SomeClass', 'OtherDate', 'AnotherDate'),
      'ClientAccountID': ('SomeID', 'OtherID', 'AnotherID'),
      'CurrencyPrimary': ('SomeCurr', 'OtherCurr', 'AnotherCurr')}]
    

    If it is tab-delimited, just swap str.split out for csv.reader (and import csv):

    import csv
    from itertools import groupby
    
    with open("test.txt") as f:
        grps, data = groupby(csv.reader(f, delimiter="\t"), lambda x: x[0] == "ClientAccountID"), []
        for k, v in grps:
            if k:
                names = next(v)
                vals = zip(*next(grps)[1])
                data.append(dict(zip(names, vals)))
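
    Note that this gives one dict per table, where each value is a tuple of column values. If you would rather have one dict per row (like the DictReader approach in the other answer), you could transpose each table afterwards; a small sketch, assuming data has the shape shown in the output above:

    # {'ClientAccountID': ('SomeID', 'OtherID'), ...} -> [{'ClientAccountID': 'SomeID', ...}, ...]
    row_tables = [
        [dict(zip(table, column)) for column in zip(*table.values())]
        for table in data
    ]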
    