I am very new to python, so please be gentle.
I have a .csv file, reported to me in this format, so I cannot do much about it:
ClientAccountID Acc
If I've understood your question correctly, you have a single CSV file which contains multiple tables. Tables are delimited by header rows which always begin with the string "ClientAccountID"
.
So the job is to read the CSV file into a list of lists-of-dictionaries. Each entry in the list corresponds to one of the tables in your CSV file.
Here's how I'd do it:
"ClientAccountID"
.DictReader
.Here's some code to read the file into a list of StringIOs. (A StringIO
is an in-memory file. It works by wrapping a string up into a file-like interface).
from csv import DictReader
from io import StringIO
stringios = []
with open('file.csv', 'r') as f:
stringio = None
for line in f:
if line.startswith('ClientAccountID'):
if stringio is not None:
stringio.seek(0)
stringios.append(stringio)
stringio = StringIO()
stringio.write(line)
stringio.write("\n")
stringio.seek(0)
stringios.append(stringio)
If we encounter a line starting with 'ClientAccountID'
, we put the current StringIO
into the list and start writing to a new one. When you've finished, remember to add the last one to the list too.
Don't forget (as I did, in an earlier version of this answer) to rewind the StringIO
after you've written to it using stringio.seek(0)
.
Now it's straightforward to loop over the StringIO
s to get a table of dictionaries.
data = [list(DictReader(x, delimiter='\t')) for x in stringios]
For each file-like object in the list stringios
, create a DictReader
and read it into a list.
It's not too hard to modify this approach if your data is too big to fit into memory. Use generators instead of lists and do the processing line-by-line.
If your data was not comma or tab delimited you could use str.split
, you can combine it with itertools.groupby
to delimit the headers and rows:
from itertools import groupby, izip, imap
with open("test.txt") as f:
grps, data = groupby(imap(str.split, f), lambda x: x[0] == "ClientAccountID"), []
for k, v in grps:
if k:
names = next(v)
vals = izip(*next(grps)[1])
data.append(dict(izip(names, vals)))
from pprint import pprint as pp
pp(data)
Output:
[{'AccountAlias': ('SomeAlias', 'OtherAlias'),
'ClientAccountID': ('SomeID', 'OtherID'),
'CurrencyPrimary': ('SomeCurr', 'OtherCurr'),
'FromDate': ('SomeDate', 'OtherDate')},
{'AccountAlias': ('SomeAlias', 'OtherAlias', 'AnotherAlias'),
'AssetClass': ('SomeClass', 'OtherDate', 'AnotherDate'),
'ClientAccountID': ('SomeID', 'OtherID', 'AnotherID'),
'CurrencyPrimary': ('SomeCurr', 'OtherCurr', 'AnotherCurr')}]
If it is tab delimited just change one line:
with open("test.txt") as f:
grps, data = groupby(csv.reader(f, delimiter="\t"), lambda x: x[0] == "ClientAccountID"), []
for k, v in grps:
if k:
names = next(v)
vals = izip(*next(grps)[1])
data.append(dict(izip(names, vals)))