Suppose I have a Pandas DataFrame like the one below and I'm encoding categorical_1 for training in scikit-learn:
data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
Isn't this a better answer?
import pandas as pd

data = pd.DataFrame({
    "values": [1, 2, 3, 4, 5, 6, 7],
    "categories": ["A", "A", "B", "B", "C", "C", "D"]
})
# Every category the encoding should cover, including ones absent from the data
possibilities = ["A", "B", "C", "D", "E", "F"]
exists = set(data["categories"])
difference = pd.Series([item for item in possibilities if item not in exists])
# Series.append was removed in pandas 2.0; pd.concat does the same job
target = pd.concat([data["categories"], difference], ignore_index=True)
dummies = pd.get_dummies(target)
# Drop the padding rows that were appended only to create the missing columns
dummies = dummies.iloc[:len(data)]
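For comparison, here is a minimal sketch of another way to get the same columns, assuming the full list of categories is known up front: cast the column to a categorical dtype that declares every possible value, then call pd.get_dummies. This is not part of the answer above. It reuses the same sample data and relies on get_dummies emitting a column for every declared category, including ones that never occur in the data.

```python
import pandas as pd

data = pd.DataFrame({
    "values": [1, 2, 3, 4, 5, 6, 7],
    "categories": ["A", "A", "B", "B", "C", "C", "D"],
})
# Same list of possible values as in the answer above
possibilities = ["A", "B", "C", "D", "E", "F"]

# Declare all possible categories on the dtype, so no padding rows are needed
cat_dtype = pd.CategoricalDtype(categories=possibilities)
categorical = data["categories"].astype(cat_dtype)

# One column per declared category; "E" and "F" are all zeros/False
dummies = pd.get_dummies(categorical)
```

Because the categories live on the dtype, the same cat_dtype can be applied to a test set so that both frames end up with identical columns in the same order.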