I have a file.csv with ~15k rows that looks like this:
SAMPLE_TIME, POS, OFF, HISTOGRAM
2015-07-15 16:41:56, 0-0-0-0-3, 1,
Assuming your data is in a file called foo.csv, you could do the following. This was tested against Pandas 0.17
import pandas as pd
df = pd.read_csv('foo.csv', names=['sample_time', 'pos', 'off', 'histogram', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], skiprows=1)
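If typing out the column names by hand gets tedious, the list can be built instead. The following is only a sketch: the in-memory sample stands in for foo.csv, and `n_extra` is a hypothetical count of extra histogram columns that you would set to fit your own data.

```python
import io
import pandas as pd

# In-memory stand-in for foo.csv (shortened rows, for illustration only)
raw = io.StringIO(
    "SAMPLE_TIME,POS,OFF,HISTOGRAM\n"
    "2015-07-15 16:41:56,0-0-0-0-3,1,2,0,5,59\n"
    "2015-07-15 16:42:55,0-0-0-0-3,1,0,0\n"
)

# n_extra is an assumption: how many unnamed columns follow 'histogram'
n_extra = 3
names = ["sample_time", "pos", "off", "histogram"] + [str(i) for i in range(1, n_extra + 1)]

# skiprows=1 skips the original header; short rows are padded with NaN
df = pd.read_csv(raw, names=names, skiprows=1)
print(df)
```

Rows with fewer fields than names end up with NaN in the trailing columns.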
You can create columns based on the length of the first actual row:
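import pandas as pd

with open("out.txt") as f:
    # Read the header line and count the fields in the first data row
    h, ln = next(f), len(next(f).split(","))
    header = h.strip().split(",")
    # Rewind, then skip the header line so pandas starts at the data
    f.seek(0)
    next(f)
    # Add numeric names for the extra columns
    header += range(ln)
    print(pd.read_csv(f, names=header))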
Which will give you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 13 14 15 16 17 18 19 20 21 22
0 0 0 ... 0 0 0 0 0 NaN NaN NaN NaN NaN
1 0 0 ... 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0 0 ... 4 0 0 0 NaN NaN NaN NaN NaN NaN
3 0 0 ... 0 0 0 0 NaN NaN NaN NaN NaN NaN
[4 rows x 27 columns]
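The snippet above sizes the header from the first data row, so a longer row further down the file would not fit the generated names. One possible variation, sketched here with only the standard library, scans every row for the widest one. The helper name `widest_row` and the in-memory sample are assumptions for illustration.

```python
import csv
import io

def widest_row(lines):
    """Return the largest field count found in any CSV row."""
    return max(len(row) for row in csv.reader(lines))

# In-memory stand-in for the real file (shortened rows)
sample = io.StringIO(
    "SAMPLE_TIME,POS,OFF,HISTOGRAM\n"
    "2015-07-15 16:42:55,0-0-0-0-3,1,0,0,5\n"
    "2015-07-15 16:41:56,0-0-0-0-3,1,2,0,5,59,0\n"
)

header = next(sample).strip().split(",")
width = widest_row(sample)  # scans the remaining (data) lines

# Only name the columns beyond the ones the header already covers
names = header + list(range(width - len(header)))
print(names)
```

Subtracting the header length also avoids the surplus all-NaN columns that appear when the full row length is appended to the header.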
Or you could clean the file before passing to pandas:
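import pandas as pd
from tempfile import TemporaryFile

with open("in.csv") as f, TemporaryFile("w+") as t:
    # Copy the file with every space removed (this also joins the date and time)
    for line in f:
        t.write(line.replace(" ", ""))
    t.seek(0)
    # Count the fields in the last line read
    ln = len(line.strip().split(","))
    header = t.readline().strip().split(",")
    # Add numeric names for the extra columns
    header += range(ln)
    print(pd.read_csv(t, names=header))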
Which gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 4 5 ... 11 \
0 2015-07-1516:41:56 0-0-0-0-3 1 2 0 5 59 0 0 0 ... 0
1 2015-07-1516:42:55 0-0-0-0-3 1 0 0 5 9 0 0 0 ... 0
2 2015-07-1516:43:55 0-0-0-0-3 1 0 0 5 5 0 0 0 ... 0
3 2015-07-1516:44:56 0-0-0-0-3 1 2 0 5 0 0 0 0 ... 0
12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 NaN NaN NaN
1 50 0 NaN NaN NaN NaN NaN NaN NaN
2 0 4 0 0 0 NaN NaN NaN NaN
3 6 0 0 0 0 NaN NaN NaN NaN
[4 rows x 25 columns]
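Note that removing every space also glues the date to the time, as the SAMPLE_TIME column above shows. If only the spaces after the commas are a problem, `read_csv` has a `skipinitialspace` option that may be enough on its own. A minimal sketch, using a shortened in-memory row rather than the real file:

```python
import io
import pandas as pd

# In-memory stand-in for in.csv, with spaces after some commas
raw = io.StringIO(
    "SAMPLE_TIME, POS, OFF, HISTOGRAM\n"
    "2015-07-15 16:41:56, 0-0-0-0-3, 1, 2,0,5\n"
)

names = ["SAMPLE_TIME", "POS", "OFF", "HISTOGRAM", 0, 1]

# skipinitialspace drops spaces that directly follow a delimiter,
# so the space inside the timestamp is left alone
df = pd.read_csv(raw, names=names, skiprows=1, skipinitialspace=True)
print(df)
```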
Or, to drop the columns that are all NaN:
print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))
Gives you:
SAMPLE_TIME POS OFF HISTOGRAM 0 1 2 3 \
0 2015-07-15 16:41:56 0-0-0-0-3 1 2 0 5 59 0
1 2015-07-15 16:42:55 0-0-0-0-3 1 0 0 5 9 0
2 2015-07-15 16:43:55 0-0-0-0-3 1 0 0 5 5 0
3 2015-07-15 16:44:56 0-0-0-0-3 1 2 0 5 0 0
4 5 ... 8 9 10 11 12 13 14 15 16 17
0 0 0 ... 2 0 0 0 0 0 0 0 0 0
1 0 0 ... 2 0 0 0 50 0 NaN NaN NaN NaN
2 0 0 ... 2 0 0 0 0 4 0 0 0 NaN
3 0 0 ... 2 0 0 0 6 0 0 0 0 NaN
[4 rows x 22 columns]
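To see what `dropna(axis=1, how="all")` does in isolation, here is a tiny made-up frame (the column labels are arbitrary): only columns in which every value is missing are removed, while partly filled columns stay.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "16": [0.0, 0.0],   # fully populated
    "17": [0.0, nan],   # partly missing: kept
    "18": [nan, nan],   # entirely missing: dropped
})

trimmed = df.dropna(axis=1, how="all")
print(list(trimmed.columns))
```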
You can split the HISTOGRAM column into a new DataFrame and concat it to the original.
print(df)
SAMPLE_TIME, POS, OFF, \
0 2015-07-15 16:41:56 0-0-0-0-3, 1,
1 2015-07-15 16:42:55 0-0-0-0-3, 1,
2 2015-07-15 16:43:55 0-0-0-0-3, 1,
3 2015-07-15 16:44:56 0-0-0-0-3, 1,
HISTOGRAM
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0
#create new dataframe from column HISTOGRAM
h = pd.DataFrame([ x.split(',') for x in df['HISTOGRAM'].tolist()])
print(h)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 2 0 5 59 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
1 0 0 5 9 0 0 0 0 0 2 0 0 0 50 0 None None None None
2 0 0 5 5 0 0 0 0 0 2 0 0 0 0 4 0 0 0 None
3 2 0 5 0 0 0 0 0 0 2 0 0 0 6 0 0 0 0 None None
#append to original, rename 0 column
df = pd.concat([df, h], axis=1).rename(columns={0:'HISTOGRAM'})
print(df)
HISTOGRAM HISTOGRAM 1 2 3 4 5 ... 10 \
0 2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0, 2 0 5 59 0 0 ... 0
1 0,0,5,9,0,0,0,0,0,2,0,0,0,50,0, 0 0 5 9 0 0 ... 0
2 0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0, 0 0 5 5 0 0 ... 0
3 2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0 2 0 5 0 0 0 ... 0
11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 0 0
1 0 0 50 0 None None None None
2 0 0 0 4 0 0 0 None
3 0 0 6 0 0 0 0 None None
[4 rows x 24 columns]
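As the output above shows, renaming column 0 leaves two columns labelled HISTOGRAM. A possible alternative, sketched on a shortened two-row frame, uses `Series.str.split` with `expand=True` and gives the new columns their own prefix. The `H` prefix is just an illustrative choice.

```python
import pandas as pd

# Shortened stand-in for the frame shown above
df = pd.DataFrame({"HISTOGRAM": ["2,0,5,59", "0,0,5"]})

# expand=True returns one column per split piece; short rows are padded
h = df["HISTOGRAM"].str.split(",", expand=True)

# Prefix the new columns so no label collides with the original column
h.columns = ["H{}".format(i) for i in h.columns]

out = pd.concat([df, h], axis=1)
print(out)
```

The split values are still strings at this point, so convert them if you need numbers.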
How about this? I made a CSV from your sample data.
First, read in the lines:
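import csv
import pandas as pd

# Python 3: open in text mode with newline='' (Python 2 used 'rb')
with open('test.csv', newline='') as f:
    lines = list(csv.reader(f))
headers, values = lines[0], lines[1:]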
To generate nice header names, use this line:
headers = [i or ind for ind, i in enumerate(headers)]
Because of how (I assume) csv works, headers should contain a bunch of empty strings. Empty strings evaluate to False, so this comprehension substitutes the column's index for each column without a header.
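A quick illustration of that comprehension on a made-up header row, where the three trailing empty strings stand for the unnamed columns:

```python
# Hypothetical header row with three unnamed trailing columns
headers = ["SAMPLE_TIME", "POS", "OFF", "HISTOGRAM", "", "", ""]

# `name or index` keeps a non-empty name and falls back to the position
named = [name or index for index, name in enumerate(headers)]
print(named)  # ['SAMPLE_TIME', 'POS', 'OFF', 'HISTOGRAM', 4, 5, 6]
```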
Then just make a df:
df = pd.DataFrame(values,columns=headers)
which looks like:
SAMPLE_TIME POS OFF HISTOGRAM 4 5 6 7 8 9 \
0 15/07/2015 16:41 0-0-0-0-3 1 2 0 5 59 0 0 0
1 15/07/2015 16:42 0-0-0-0-3 1 0 0 5 9 0 0 0
2 15/07/2015 16:43 0-0-0-0-3 1 0 0 5 5 0 0 0
3 15/07/2015 16:44 0-0-0-0-3 1 2 0 5 0 0 0 0
... 12 13 14 15 16 17 18 19 20 21
0 ... 2 0 0 0 0 0 0 0 0 0
1 ... 2 0 0 0 50 0
2 ... 2 0 0 0 0 4 0 0 0
3 ... 2 0 0 0 6 0 0 0 0
[4 rows x 22 columns]
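This approach seems to rely on every line of the CSV having the same number of fields, for example because a spreadsheet export padded the short rows. If the file is ragged, the header and the data rows may not line up. One way around that, sketched here with a hypothetical `pad_rows` helper and shortened sample rows, is to pad everything to the widest row first.

```python
def pad_rows(rows, fill=""):
    """Pad every row with `fill` so all rows match the widest one."""
    width = max(len(row) for row in rows)
    return [row + [fill] * (width - len(row)) for row in rows]

# Shortened stand-in for list(csv.reader(f)) on a ragged file
lines = [
    ["SAMPLE_TIME", "POS", "OFF", "HISTOGRAM"],
    ["2015-07-15 16:41:56", "0-0-0-0-3", "1", "2", "0", "5"],
    ["2015-07-15 16:42:55", "0-0-0-0-3", "1", "0", "0"],
]

padded = pad_rows(lines)
# Same trick as above: empty header cells become their column index
headers = [name or index for index, name in enumerate(padded[0])]
values = padded[1:]
print(headers)
```

After this step, `headers` and every row in `values` have the same length, which is what the DataFrame constructor call above needs.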