问题
I am trying to process a dataframe. This includes creating new columns and updating their values based on the values in other columns. More concretely, I have a predefined "source" that I want to classify. This source can fall under three different categories 'source_dtp', 'source_dtot', and 'source_cash'. I want to add three new columns to the dataframe that are comprised of either 1's or 0's based on the original "source" column.
I am currently able to do this, it's just really slow...
Original column sample:
source
_id
AV4MdG6Ihowv-SKBN_nB DTP
AV4Mc2vNhowv-SKBN_Rn Cash 1
AV4MeisikOpWpLdepWy6 DTP
AV4MeRh6howv-SKBOBOn Cash 1
AV4Mezwchowv-SKBOB_S DTOT
AV4MeB7yhowv-SKBOA5b DTP
Desired output:
source_dtp source_dtot source_cash
_id
AV4MdG6Ihowv-SKBN_nB 1.0 0.0 0.0
AV4Mc2vNhowv-SKBN_Rn 0.0 0.0 1.0
AV4MeisikOpWpLdepWy6 1.0 0.0 0.0
AV4MeRh6howv-SKBOBOn 0.0 0.0 1.0
AV4Mezwchowv-SKBOB_S 0.0 1.0 0.0
AV4MeB7yhowv-SKBOA5b 1.0 0.0 0.0
This is my current approach, but it's very slow. I would much prefer a vectorized form of doing this but I don't know how - as the condition is very elaborate.
# For 'source' we will use the following classes:
source_cats = ['source_dtp', 'source_dtot', 'source_cash']
# [0, 0, 0] would imply 'other', hence no need for a fourth category
# add new features to dataframe, initializing to nan
for cat in source_cats:
data[cat] = np.nan
for row in data.itertuples():
# create series to hold the result per row e.g. [1, 0, 0] for `cash`
cat = [0, 0, 0]
index = row[0]
# to string as some entries are numerical
source_type = str(data.loc[index, 'source']).lower()
if 'dtp' in source_type:
cat[0] = 1
if 'dtot' in source_type:
cat[1] = 1
if 'cash' in source_type:
cat[2] = 1
data.loc[index, source_cats] = cat
I am using itertuples() as it proved faster than interrows().
Is there a faster way of achieving the same functionality as above?
EDIT: This is not just with regards to creating a one hot encoding. It boils down to updating the column values dependent on the value of another column. E.g. if I have a certain location_id
I want to update its respective longitude
and latitude
columns - based on that original id (without iterating in the way that I do above because it's really slow for large datasets).
回答1:
Another way to do this is to use pd.get_dummies on the dataframe. First put '_id' into the index.
source = source.set_index('_id')
df_out = pd.get_dummies(source).reset_index()
print(df_out)
Output:
_id source_Cash 1 source_DTOT source_DTP
0 AV4MdG6Ihowv-SKBN_nB 0 0 1
1 AV4Mc2vNhowv-SKBN_Rn 1 0 0
2 AV4MeisikOpWpLdepWy6 0 0 1
3 AV4MeRh6howv-SKBOBOn 1 0 0
4 AV4Mezwchowv-SKBOB_S 0 1 0
5 AV4MeB7yhowv-SKBOA5b 0 0 1
回答2:
You can use str.get_dummies
to get your OHEncodings.
c = df.source.str.get_dummies().add_prefix('source_').iloc[:, ::-1]
c.columns = c.columns.str.lower().str.split().str[0]
print(c)
source_dtp source_dtot source_cash
0 1 0 0
1 0 0 1
2 1 0 0
3 0 0 1
4 0 1 0
5 1 0 0
Next, concatenate c
with _id
using pd.concat
.
df = pd.concat([df._id, c], 1)
print(df)
_id source_dtp source_dtot source_cash
0 AV4MdG6Ihowv-SKBN_nB 1 0 0
1 AV4Mc2vNhowv-SKBN_Rn 0 0 1
2 AV4MeisikOpWpLdepWy6 1 0 0
3 AV4MeRh6howv-SKBOBOn 0 0 1
4 AV4Mezwchowv-SKBOB_S 0 1 0
5 AV4MeB7yhowv-SKBOA5b 1 0 0
Improvement! Now slightly smoother, thanks to Scott Boston's set_index
- reset_index
paradigm:
df = df.set_index('_id')\
.source.str.get_dummies().iloc[:, ::-1]
df.columns = df.columns.str.lower().str.split().str[0]
df = df.add_prefix('source_').reset_index()
print(df)
_id source_dtp source_dtot source_cash
0 AV4MdG6Ihowv-SKBN_nB 1 0 0
1 AV4Mc2vNhowv-SKBN_Rn 0 0 1
2 AV4MeisikOpWpLdepWy6 1 0 0
3 AV4MeRh6howv-SKBOBOn 0 0 1
4 AV4Mezwchowv-SKBOB_S 0 1 0
5 AV4MeB7yhowv-SKBOA5b 1 0 0
来源:https://stackoverflow.com/questions/46678400/is-there-a-faster-way-to-update-dataframe-column-values-based-on-conditions