Question
I have the following problem. Let's say this is my CSV:
id f1 f2 f3
1 4 5 5
1 3 1 0
1 7 4 4
1 4 3 1
1 1 4 6
2 2 6 0
..........
So, I have rows which can be grouped by id, and I want to create a CSV like the one below as output.
f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4 5 5 3 1 0 7 4 4 1 4 6
So, I want to be able to choose the number of rows I grab to convert into columns (always starting from the first row of an id). In this case I grabbed 3 rows. I then also skip one or more rows (in this case just one skip) and take the final columns from the last row of the same id group. And for reasons, I want to use a data frame.
After struggling for 3-4 hours I found the solution given below, but it is very slow. I have about 700,000 rows and maybe around 70,000 id groups. The code below, at model=3, takes almost an hour on my 4GB 4-core Lenovo, and I need to go to model = maybe 10 or 15. I am still a novice in Python and I am sure there are several changes that would make this fast. Can someone explain in depth how I can improve the code?
Thanks a ton.
model: the number of rows to grab
import pandas as pd

# train data frame from reading the csv
train = pd.read_csv(filename)
# Get groups of rows with same id
csv_by_id = train.groupby('id')
modelTarget = { 'f1_t','f2_t','f3_t'}
# modelFeatures is a list of features I am interested in the csv.
# The csv actually has hundreds
modelFeatures = { 'f1', 'f2', 'f3' }
coreFeatures = list(modelFeatures) # cloning
selectedFeatures = list(modelFeatures) # cloning
newFeatures = list(selectedFeatures) # cloning
finalFeatures = list(selectedFeatures) # cloning
# Now create the column list depending on the number of rows I will grab from
for x in range(2, model+1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures
# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget)
# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)
for id_group in csv_by_id:
    # id_group is a tuple: the id itself, then a dataframe with the rows of that group
    group_data = id_group[1]
    # hmm - can this be better? I am picking up the rows I need, from the first row onwards
    df = group_data[coreFeatures][0:model]
    # initialize a list
    tmp = []
    # now keep adding the column values into the list
    for index, row in df.iterrows():
        tmp = tmp + list(row)
    # Wow, this one below surely should have something better.
    # I am picking up the feature column values from the last row of the group of rows for a particular id
    targetValues = group_data[list({'f1','f2','f3'})][len(group_data.index)-1:len(group_data.index)].values
    # Think this can be done more easily too? Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten())
    # converting the list to a dict
    tmpDict = dict(zip(selectedFeatures, tmp))
    # then the dict to a dataframe
    tmpDf = pd.DataFrame(tmpDict, index=[1])
    # I just could not find a better way of adding a dict or list directly into a dataframe,
    # and I went through lots and lots of blogs on this topic, including some on StackOverflow.
    # finally I append the frame to my main frame
    model_data = model_data.append(tmpDf)
# and write it
model_data.to_csv(wd+'model_data' + str(model) + '.csv',index=False)
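One idea I have not tried yet (just a sketch, untested) is to stop appending one-row frames and instead collect plain dicts in a list, building the data frame once at the end:

rows = []
for gid, group_data in csv_by_id:
    # first `model` rows, flattened row by row
    tmp = list(group_data[coreFeatures][0:model].values.flatten())
    # last row of the group supplies the target columns
    tmp += list(group_data[coreFeatures].iloc[-1])
    # assumes coreFeatures and selectedFeatures list the features in matching order
    rows.append(dict(zip(selectedFeatures, tmp)))
model_data = pd.DataFrame(rows, columns=selectedFeatures)

But I do not know whether this is the real bottleneck, hence the question.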
Answer 1:
Groupby is your friend.
This will scale very well, with only a small constant in the number of features; it will be roughly O(number of groups).
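The transcript below assumes these imports (not shown in the original session):

import numpy as np
from pandas import DataFrame, Series, concat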
In [28]: features = ['f1','f2','f3']
Create some test data: group sizes are 7-12, 70k groups.
In [29]: def create_df(i):
....: l = np.random.randint(7,12)
....: df = DataFrame(dict([ (f,np.arange(l)) for f in features ]))
....: df['A'] = i
....: return df
....:
In [30]: df = concat([ create_df(i) for i in range(70000) ])
In [39]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 629885 entries, 0 to 9
Data columns (total 4 columns):
f1 629885 non-null int64
f2 629885 non-null int64
f3 629885 non-null int64
A 629885 non-null int64
dtypes: int64(4)
Create a frame where you select the first 3 rows and the final row from each group. (Note that this WILL handle groups of size < 4; however, your final row may then overlap one of the head rows, so you may wish to do a groupby.filter to remedy this.)
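A minimal sketch of that filter (my addition, not part of the original session; it keeps only ids with more than 3 rows, so the tail row cannot be one of the head rows):

df = df.groupby('A').filter(lambda g: len(g) > 3)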
In [31]: groups = concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
# This step is necessary in pandas < master/0.14, as the returned fields
# will include the grouping field (the A); this is a bug/API issue
In [33]: groups = groups[features]
In [34]: groups.head(20)
Out[34]:
     f1  f2  f3
A
0 0   0   0   0
  1   1   1   1
  2   2   2   2
  7   7   7   7
1 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9
2 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
3 0   0   0   0
  1   1   1   1
  2   2   2   2
  8   8   8   8
4 0   0   0   0
  1   1   1   1
  2   2   2   2
  9   9   9   9

[20 rows x 3 columns]
In [38]: groups.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 280000 entries, (0, 0) to (69999, 9)
Data columns (total 3 columns):
f1 280000 non-null int64
f2 280000 non-null int64
f3 280000 non-null int64
dtypes: int64(3)
And pretty fast
In [32]: %timeit concat([df.groupby('A').head(3),df.groupby('A').tail(1)]).sort_index()
1 loops, best of 3: 1.16 s per loop
For further manipulation you usually should stop here and use this (as it's in a nice grouped format that's easy to deal with).
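For example (a sketch, not from the original answer), per-id aggregates fall straight out of level 0 of the MultiIndex:

per_id_means = groups.groupby(level=0).mean()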
If you want to translate this to a wide format:
In [35]: dfg = groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
In [36]: %timeit groups.groupby(level=0).apply(lambda x: Series(x.values.ravel()))
1 loops, best of 3: 14.5 s per loop
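As a possibly faster alternative (my sketch, not part of the original answer): once every id contributes exactly 4 rows (3 head + 1 tail, which the groupby.filter sketch above would guarantee), a plain numpy reshape avoids the per-group apply entirely, and the column renaming in the next step applies unchanged:

vals = groups.values.reshape(-1, len(features) * 4)
dfg = DataFrame(vals, index=groups.index.get_level_values(0)[::4])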
In [40]: dfg.columns = [ "{0}_{1}".format(f,i) for i in range(1,5) for f in features ]
In [41]: dfg.head()
Out[41]:
f1_1 f2_1 f3_1 f1_2 f2_2 f3_2 f1_3 f2_3 f3_3 f1_4 f2_4 f3_4
A
0 0 0 0 1 1 1 2 2 2 7 7 7
1 0 0 0 1 1 1 2 2 2 9 9 9
2 0 0 0 1 1 1 2 2 2 8 8 8
3 0 0 0 1 1 1 2 2 2 8 8 8
4 0 0 0 1 1 1 2 2 2 9 9 9
[5 rows x 12 columns]
In [42]: dfg.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70000 entries, 0 to 69999
Data columns (total 12 columns):
f1_1 70000 non-null int64
f2_1 70000 non-null int64
f3_1 70000 non-null int64
f1_2 70000 non-null int64
f2_2 70000 non-null int64
f3_2 70000 non-null int64
f1_3 70000 non-null int64
f2_3 70000 non-null int64
f3_3 70000 non-null int64
f1_4 70000 non-null int64
f2_4 70000 non-null int64
f3_4 70000 non-null int64
dtypes: int64(12)
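If you want the exact column scheme from the question (f1, f1_n, f1_n_n, f1_t, ...), the suffixes can be generated instead of numbered (my sketch, assuming model = 3 head rows):

model = 3
suffixes = [''] + ['_n' * i for i in range(1, model)] + ['_t']
dfg.columns = [f + s for s in suffixes for f in features]
dfg.to_csv('model_data3.csv', index=False)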
Source: https://stackoverflow.com/questions/23580009/data-processing-with-adding-columns-dynamically-in-python-pandas-dataframe