问题
I have a pandas DataFrame
like this:
df = pd.DataFrame({'custid':[1,2,3,4],
...: 'prod1':['jeans','tshirt','jacket','tshirt'],
...: 'prod1_hnode1':[1,2,3,2],
...: 'prod1_hnode2':[6,7,8,7],
...: 'prod2':['tshirt','jeans','jacket','shirt'],
...: 'prod2_hnode1':[2,1,3,4],
...: 'prod2_hnode2':[7,6,8,7]})
In [54]: df
Out[54]:
custid prod1 prod1_hnode1 prod1_hnode2 prod2 prod2_hnode1 \
0 1 jeans 1 6 tshirt 2
1 2 tshirt 2 7 jeans 1
2 3 jacket 3 8 jacket 3
3 4 tshirt 2 7 shirt 4
prod2_hnode2
0 7
1 6
2 8
3 7
How can I convert this to the following format:
dfnew = pd.DataFrame({'custid':[1,1,2,2,3,3,4,4],
...: 'prod':['prod1','prod2','prod1','prod2','prod1','prod2','prod1','prod2'],
...: 'rec':['jeans','tshirt','tshirt','jeans','jacket','jacket','tshirt','shirt'],
...: 'hnode1':[1,2,2,1,3,3,2,4],
...: 'hnode2':[6,7,7,6,8,8,7,7]})
In [56]: dfnew
Out[56]:
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
回答1:
Use:
- set_index by column
custid
- create
MultiIndex
in columns by split - replace
NaN
s in columns byrec
- stack by first level
- reset_index for columns from
MultiIndex
- rename column
df = df.set_index('custid')
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns={np.nan:'rec'})
cols = ['custid','hnode1','hnode2','prod','rec']
df = df.stack(0).reset_index().rename(columns={'level_1':'prod'}).reindex(columns=cols)
print (df)
custid hnode1 hnode2 prod rec
0 1 1 6 prod1 jeans
1 1 2 7 prod2 tshirt
2 2 2 7 prod1 tshirt
3 2 1 6 prod2 jeans
4 3 3 8 prod1 jacket
5 3 3 8 prod2 jacket
6 4 2 7 prod1 tshirt
7 4 4 7 prod2 shirt
回答2:
Here's another approach that should work, but uses repeated melt
s.
coln = df.dtypes.index # save some typing
df_long = pd.melt(
df, id_vars = "custid", value_vars = ["prod1", "prod2"],
var_name = "prod", value_name = "rec").assign(
hnode1 = pd.melt(df, id_vars = "custid",
value_vars = filter(lambda x: "hnode1" in x, coln))["value"],
hnode2 = pd.melt(df, id_vars = "custid",
value_vars = filter(lambda x: "hnode2" in x, coln))["value"])
print(df_long)
custid prod rec hnode1 hnode2
0 1 prod1 jeans 1 6
1 2 prod1 tshirt 2 7
2 3 prod1 jacket 3 8
3 4 prod1 tshirt 2 7
4 1 prod2 tshirt 2 7
5 2 prod2 jeans 1 6
6 3 prod2 jacket 3 8
7 4 prod2 shirt 4 7
You mention R in your comments. melt
from "data.table" should be able to handle this much more easily since you can melt multiple sets of columns at once, similar to how you might approach the problem with base R's reshape
function.
The base R approach might be something along the lines of:
reshape(df, direction = "long", idvar = "custid",
varying = list(c(2, 5), c(3, 6), c(4, 7)),
sep = "", times = c("prod1", "prod2"))
来源:https://stackoverflow.com/questions/48223769/melt-a-pandas-dataframe