I use the following code to create a numpy-ndarray. The file has 9 columns. I explicitly type each column:
dataset = np.genfromtxt(\"data.csv\", delimiter=\"
The question of how to deal with mixed data from genfromtxt
comes up often. People expect a 2d array, and instead get a 1d that they can't index by column. That's because they get a structured array - with different dtype for each column.
All the examples in the genfromtxt
doc show this:
>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])
But let me demonstrate how to access this kind of data
In [361]: txt=b"""A, 1,2,3
...: B,4,5,6
...: """
In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
In [363]: data
Out[363]:
array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)],
dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])
So my array has 2 records (check the shape), which are displayed as tuples in a list.
You access fields
by name, not by column number (do I need to add a structured array documentation link?)
In [364]: data['f0']
Out[364]:
array([b'A', b'B'],
dtype='|S1')
In [365]: data['f1']
Out[365]: array([1, 4])
In a case like this might be more useful if I choose a dtype
with 'subarrays'. This a more advanced dtype topic
In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
In [368]: data
Out[368]:
array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])],
dtype=[('f0', 'S1'), ('f1', '<f8', (3,))])
In [369]: data['f1']
Out[369]:
array([[ 1., 2., 3.],
[ 4., 5., 6.]])
The character column is still loaded as S1
, but the numbers are now in a 3 column array. Note that they are all float (or int).
In [371]: from scipy import stats
In [372]: stats.describe(data['f1'])
Out[372]: DescribeResult(nobs=2,
minmax=(array([ 1., 2., 3.]), array([ 4., 5., 6.])),
mean=array([ 2.5, 3.5, 4.5]),
variance=array([ 4.5, 4.5, 4.5]),
skewness=array([ 0., 0., 0.]),
kurtosis=array([-2., -2., -2.]))
import pandas as pd
import numpy as np
df_describe = pd.DataFrame(dataset)
df_describe.describe()
please note that dataset is your np.array to describe.
import pandas as pd
import numpy as np
df_describe = pd.DataFrame('your np.array')
df_describe.describe()
This is not a pretty solution, but it gets the job done. The problem is that by specifying multiple dtypes, you are essentially making a 1D-array of tuples (actually np.void
), which cannot be described by stats as it includes multiple different types, incl. strings.
This could be resolved by either reading it in two rounds, or using pandas with read_csv.
If you decide to stick to numpy
:
import numpy as np
a = np.genfromtxt('sample.txt', delimiter=",",unpack=True,usecols=range(1,9))
s = np.genfromtxt('sample.txt', delimiter=",",unpack=True,usecols=0,dtype='|S1')
from scipy import stats
for arr in a: #do not need the loop at this point, but looks prettier
print(stats.describe(arr))
#Output per print:
DescribeResult(nobs=6, minmax=(0.34999999999999998, 0.70999999999999996), mean=0.54500000000000004, variance=0.016599999999999997, skewness=-0.3049304880932534, kurtosis=-0.9943046886340534)
Note that in this example the final array has dtype
as float
, not int
, but can easily (if necessary) be converted to int using arr.astype(int)