How to take column-slices of dataframe in pandas

Anonymous (unverified), submitted on 2019-12-03 02:14:01

Question:

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.

Currently, I do the following:

data = pandas.read_csv('mydata.csv')

which gives something like:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))

I'd like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.

It is not possible to write something like

observations = data[:'c']
features = data['c':]

I'm not sure what the best method is. Do I need a pd.Panel?

By the way, I find DataFrame indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other hand, data['a':] is not permitted but data[0:] is. Is there a practical reason for this? This is really confusing when the columns are indexed by integers, given that data[0] != data[0:1].

Answer 1:

2017 Answer - pandas 0.20: .ix is deprecated. Use .loc

See the deprecation in the docs

.loc uses label-based indexing to select both rows and columns. The labels are the values of the index or the columns. Slicing with .loc includes the last element.

Let's assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat

.loc accepts the same slice notation that Python lists do, for both rows and columns. The slice notation is start:stop:step.

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar

# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz', None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo', 'bar', 'dat']]
# foo bar dat

You can slice by rows and columns at the same time. For instance, if you have 5 rows with labels v, w, x, y, z:

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y
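The snippets above assume a df that is never constructed. A minimal runnable sketch, assuming arbitrary integer values (only the row and column labels come from the answer), reproduces the main cases:

```python
import numpy as np
import pandas as pd

# Hypothetical data: labels are taken from the answer, values are arbitrary.
cols = ['foo', 'bar', 'quz', 'ant', 'cat', 'sat', 'dat']
rows = ['v', 'w', 'x', 'y', 'z']
df = pd.DataFrame(np.arange(35).reshape(5, 7), index=rows, columns=cols)

# Label slices include the stop label.
inclusive = list(df.loc[:, 'foo':'sat'].columns)

# A step can be combined with label bounds.
stepped = list(df.loc[:, 'foo':'cat':2].columns)

# A reversed range is empty unless the step is negative.
empty = list(df.loc[:, 'sat':'bar'].columns)
reversed_cols = list(df.loc[:, 'sat':'bar':-1].columns)

# Rows and columns sliced in one call.
block = df.loc['w':'y', 'foo':'ant':3]

print(inclusive)
print(stepped)
print(empty)
print(reversed_cols)
print(block)
```

Using integers from np.arange instead of random numbers makes it easy to see which cells each slice picked.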


Answer 2:

The DataFrame.ix index is what you want to be accessing. It's a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:

>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
          b         c         d         e
0  0.418762  0.042369  0.869203  0.972314
1  0.991058  0.510228  0.594784  0.534366
2  0.407472  0.259811  0.396664  0.894202
3  0.726168  0.139531  0.324932  0.906575

where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced
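As the 2017 answer notes, .ix was deprecated in pandas 0.20; it was later removed in pandas 1.0, so the call above fails on current versions. A rough equivalent with .loc, sketched here on the same kind of random frame (the variable names from_b, observations and features are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))

# Same selection as df.ix[:, 'b':], written with .loc.
from_b = df.loc[:, 'b':]

# The split asked for in the question: columns a-b and columns c-e.
# Label slices are inclusive, so ':'b'' stops after 'b'.
observations = df.loc[:, :'b']
features = df.loc[:, 'c':]

print(list(from_b.columns))
print(list(observations.columns))
print(list(features.columns))
```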



Answer 3:

Let's use the titanic dataset from the seaborn package as an example:

# Load dataset (pip install seaborn)
>> import seaborn as sns
>> titanic = sns.load_dataset('titanic')

using the column names

>> titanic.loc[:,['sex','age','fare']]

using the column indices

>> titanic.iloc[:,[2,3,6]]

using ix

>> titanic.ix[:,['sex','age','fare']]

or

>> titanic.ix[:,[2,3,6]]

using the reindex method

>> titanic.reindex(columns=['sex','age','fare'])
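One practical difference between the .loc and reindex variants shows up when a requested column does not exist. The sketch below uses a small made-up frame as a stand-in for a few titanic columns (the values are invented, and 'deck' is deliberately left out), and assumes pandas 1.0 or later for the .loc behaviour:

```python
import pandas as pd

# Hypothetical stand-in for a few titanic columns; the values are made up.
frame = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'age': [22.0, 38.0, 26.0],
    'fare': [7.25, 71.28, 7.93],
})

wanted = ['sex', 'age', 'deck']  # 'deck' is not a column of frame

# reindex does not fail: the missing column is created and filled with NaN.
reindexed = frame.reindex(columns=wanted)

# .loc with a list raises KeyError when any label is missing.
try:
    frame.loc[:, wanted]
    loc_raised = False
except KeyError:
    loc_raised = True

print(reindexed)
print(loc_raised)
```

Which behaviour is preferable depends on whether a missing column should be treated as an error or as missing data.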


Answer 4:

Also, given a DataFrame data as in your example, if you would like to extract columns a and d only (i.e. the 1st and the 4th column), the iloc method of the pandas DataFrame is what you need. All you need to know is the integer position of each column you would like to extract. For example:

>>> data.iloc[:,[0,3]]

will give you

          a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476
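If the positions are not known in advance, they can be looked up from the labels rather than counted by hand. A small sketch, assuming the same data frame as in the question and using Index.get_indexer:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))

# Translate column labels into integer positions.
positions = data.columns.get_indexer(['a', 'd'])

# Selecting by position and selecting by label give the same frame.
by_position = data.iloc[:, positions]
by_label = data[['a', 'd']]

print(positions)
print(by_position.equals(by_label))
```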


Answer 5:

You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]


Answer 6:

And if you came here looking for a way to slice two ranges of columns and combine them (like me), you can do something like:

op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print(op)

This will create a new dataframe with the first 899 columns (positions 0 through 898) and all columns from position 3593 onward (assuming you have some 4000 columns in your data set).
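An alternative that avoids building label lists is to concatenate the position ranges with NumPy's np.r_ and pass the result to iloc. This is only a sketch on a hypothetical 10-column frame (columns c0 to c9), not the approach used in the answer:

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 10 frame with columns c0..c9.
df = pd.DataFrame(np.arange(30).reshape(3, 10),
                  columns=[f'c{i}' for i in range(10)])

# np.r_ joins the two position ranges into a single integer array.
# The second range is given an explicit stop (the number of columns).
positions = np.r_[0:3, 7:df.shape[1]]
combined = df.iloc[:, positions]

# Same selection using the list-concatenation approach from the answer.
by_lists = df[list(df.columns[0:3]) + list(df.columns[7:])]

print(list(combined.columns))
print(combined.equals(by_lists))
```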



Answer 7:

Here's how you could use different methods for selective column slicing: by a list of labels, by a label range, by a position range, and by combining several position ranges.

In [37]: import pandas as pd
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))

In [44]: df
Out[44]:
          a         b         c         d         e         f         g
0  0.409038  0.745497  0.890767  0.945890  0.014655  0.458070  0.786633
1  0.570642  0.181552  0.794599  0.036340  0.907011  0.655237  0.735268
2  0.568440  0.501638  0.186635  0.441445  0.703312  0.187447  0.604305
3  0.679125  0.642817  0.697628  0.391686  0.698381  0.936899  0.101806

In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing
Out[45]:
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing
Out[46]:
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [47]: df.iloc[:, 0:3] ## index based column ranges slicing
Out[47]:
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

### with 2 different column ranges, index based slicing:

In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]:
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

