问题
I need to research something about what are the most cost efficient movie genres. My problem is that the genres are provided all within one string:
This gives me about 300 different unique categories. How can I split these into about 12 original dummy genre columns so I can analyse each main genre?
回答1:
Thanks to Yong Wang who suggested the get_dummies function within pandas. We can shorten the code significantly:
df = pd.DataFrame({
'movie_id': range(5),
'gernes': [
'Action|Adventure|Fantasy|Sci-Fi',
'Action|Adventure|Fantasy',
'Action|Adventure|Thriller',
'Action|Thriller',
'Action|Adventure|Sci-Fi'
]
})
dummies = df['gernes'].str.get_dummies(sep='|')
final = pd.concat([df, dummies], axis=1)
Result:
movie_id gernes Action Adventure Fantasy Sci-Fi Thriller
0 0 Action|Adventure|Fantasy|Sci-Fi 1 1 1 1 0
1 1 Action|Adventure|Fantasy 1 1 1 0 0
2 2 Action|Adventure|Thriller 1 1 0 0 1
3 3 Action|Thriller 1 0 0 0 1
4 4 Action|Adventure|Sci-Fi 1 1 0 1 0
Original answer
One solution combining pandas and data preparation techniques from Machine Learning. Assuming you are on pandas v0.25 or later.
First, let's create a dataframe from your screenshot:
df = pd.DataFrame({
'movie_id': range(5),
'gernes': [
'Action|Adventure|Fantasy|Sci-Fi',
'Action|Adventure|Fantasy',
'Action|Adventure|Thriller',
'Action|Thriller',
'Action|Adventure|Sci-Fi'
]
})
movie_id gernes
0 0 Action|Adventure|Fantasy|Sci-Fi
1 1 Action|Adventure|Fantasy
2 2 Action|Adventure|Thriller
3 3 Action|Thriller
4 4 Action|Adventure|Sci-Fi
A movie can belong to multiple gernes. What we want is to seperate those gernes through a process called one-hot encoding. We define the categories (Action, Adventure, Thriller, etc.) and mark each movie as belonging to each category or not:
from sklearn.preprocessing import OneHotEncoder
s = df['gernes'].str.split('|').explode()
encoder = OneHotEncoder()
encoded = encoder.fit_transform(s.values[:, None])
one_hot_df = pd.DataFrame(encoded.toarray(), columns=np.ravel(encoder.categories_), dtype='int') \
.groupby(s.index) \
.sum()
Action Adventure Fantasy Sci-Fi Thriller
0 1 1 1 1 0
1 1 1 1 0 0
2 1 1 0 0 1
3 1 0 0 0 1
4 1 1 0 1 0
What it means is that the first movie belongs to the Action, Adventure, Fantasy and Sci-Fi but not Thriller categories, the second movie belongs to Action, Adventure and Fantasy and so on. The final stop is to combine them together:
final = pd.concat([df, one_hot_df], axis=1)
movie_id gernes Action Adventure Fantasy Sci-Fi Thriller
0 0 Action|Adventure|Fantasy|Sci-Fi 1 1 1 1 0
1 1 Action|Adventure|Fantasy 1 1 1 0 0
2 2 Action|Adventure|Thriller 1 1 0 0 1
3 3 Action|Thriller 1 0 0 0 1
4 4 Action|Adventure|Sci-Fi 1 1 0 1 0
来源:https://stackoverflow.com/questions/58033652/separating-categories-within-one-column-in-my-dataframe