I'm having a problem with a data set that has 400,000 rows and 300 variables. I have to get dummy variables for a categorical variable with 3,000+ different items. At the e
Update: Starting with version 0.19.0, get_dummies returns an 8-bit integer rather than a 64-bit float, which will fix this problem in many cases and make the astype solution below unnecessary. See: get_dummies -- pandas 0.19.0
But in other cases, the sparse option described below may still be helpful.
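Since the default dummy dtype depends on the installed pandas version, it may be safer to check it, or to request it explicitly. A minimal sketch, assuming a pandas version that accepts the dtype argument of get_dummies (0.23 or later); the small items series is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up sample data, not the data set from the question.
items = pd.Series([1, 2, 3, 1, 2], name="itemID")

# Ask for the dummy dtype explicitly rather than relying on the
# version-specific default (assumes pandas 0.23+ for the dtype argument).
dummies = pd.get_dummies(items, prefix="itemID", dtype=np.uint8)

print(pd.__version__)   # the default dtype depends on this
print(dummies.dtypes)   # one uint8 column per distinct item
```

If the dtypes printed without the dtype argument are already one byte wide, the astype conversion below buys nothing extra.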
Original Answer: Here are a couple of possibilities to try. Both will reduce the memory footprint of the dataframe substantially, but you could still run into memory issues later. It's hard to predict, so you'll just have to try.
(Note that I am simplifying the output of info() below.)
import numpy as np
import pandas as pd
df = pd.DataFrame({'itemID': np.random.randint(1, 4, 100)})
pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_')], axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null float64
itemID__2 100 non-null float64
itemID__3 100 non-null float64
memory usage: 3.5 KB
Here's our baseline. Each dummy column takes up 800 bytes because the sample data has 100 rows and get_dummies appears to default to float64 (8 bytes per value). This seems like an unnecessarily inefficient way to store dummies, since a single bit per value would be enough, but there may be some reason for it that I'm not aware of.
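The 800-byte figure can be checked per column with memory_usage(). This is only a sketch: it passes dtype=np.float64 to reproduce the old float64 default on newer pandas versions (that argument is an assumption about the installed version, 0.23 or later).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"itemID": np.random.randint(1, 4, 100)})

# Force float64 so the result matches the pre-0.19.0 default described above.
dummies = pd.get_dummies(df["itemID"], prefix="itemID_", dtype=np.float64)

# Bytes per dummy column, index excluded: 100 rows * 8 bytes per float64.
per_column = dummies.memory_usage(index=False)
print(per_column)
```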
So, as a first attempt, just change to a one-byte integer. This doesn't seem to be an option for get_dummies, so it has to be done as a conversion with astype(np.int8).
pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_').astype(np.int8)],
axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null int8
itemID__2 100 non-null int8
itemID__3 100 non-null int8
memory usage: 1.5 KB
Each dummy column now takes up 1/8 of the memory it did before.
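To confirm the 1/8 ratio, and that the conversion does not change any values, something like the following should work. It is a sketch under the same assumption as before (dtype=np.float64 is passed only to recreate the old float64 dummies on a newer pandas).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"itemID": np.random.randint(1, 4, 100)})

# Recreate the old float64 dummies, then convert them to one-byte integers.
float_dummies = pd.get_dummies(df["itemID"], prefix="itemID_", dtype=np.float64)
int8_dummies = float_dummies.astype(np.int8)

# Total bytes used by the dummy columns, index excluded.
float_bytes = int(float_dummies.memory_usage(index=False).sum())
int8_bytes = int(int8_dummies.memory_usage(index=False).sum())
print(float_bytes, int8_bytes, float_bytes // int8_bytes)

# Dummies only hold 0 and 1, so nothing is lost in the conversion.
lossless = bool((float_dummies == int8_dummies).all().all())
print(lossless)
```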
Alternatively, you can use the sparse option of get_dummies.
pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_',sparse=True)],
axis=1).info()
itemID 100 non-null int32
itemID__1 100 non-null float64
itemID__2 100 non-null float64
itemID__3 100 non-null float64
memory usage: 2.0 KB
Fairly comparable savings. The info() output somewhat hides how the savings occur, but you can look at the memory usage value to see the total savings.
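To see where the sparse savings come from, you can look at the column dtypes and the density instead of info(). This is a rough sketch that assumes a reasonably recent pandas, where sparse dummies come back as ordinary columns with a sparse dtype and the .sparse accessor is available; the deterministic items series is made up so the numbers are reproducible.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the sample data: 99 rows cycling through items 1, 2, 3.
items = pd.Series(np.arange(99) % 3 + 1, name="itemID")

sparse_dummies = pd.get_dummies(
    items, prefix="itemID_", dtype=np.float64, sparse=True
)

# Each column has a sparse dtype: only the 1s (plus their positions) are stored.
print(sparse_dummies.dtypes)

# Fraction of values actually stored; with 3 equally frequent items, about 1/3.
density = sparse_dummies.sparse.density
print(density)

# Compare against the same data held densely.
sparse_bytes = int(sparse_dummies.memory_usage(index=False).sum())
dense_bytes = int(sparse_dummies.sparse.to_dense().memory_usage(index=False).sum())
print(sparse_bytes, dense_bytes)
```

The stored positions cost memory too, so the fewer 1s per column (that is, the more distinct items you have), the more the sparse option should pay off.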
Which of these will work better in practice will depend on your data, so you'll just need to give them each a try (or you could even combine them).
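One way to try the options side by side is to measure each on a scaled-down version of your data. The sketch below uses synthetic data (2,000 rows, 200 items, both numbers chosen arbitrarily) and a hypothetical helper total_bytes; it assumes a pandas version where get_dummies accepts both dtype and sparse.

```python
import numpy as np
import pandas as pd

# Synthetic data, far smaller than the question's: 2,000 rows, 200 distinct items.
n_rows, n_items = 2_000, 200
items = pd.Series(np.arange(n_rows) % n_items, name="itemID")


def total_bytes(frame):
    """Hypothetical helper: bytes used by all columns, index excluded."""
    return int(frame.memory_usage(index=False).sum())


options = {
    "dense int8": pd.get_dummies(items, dtype=np.int8),
    "sparse float64": pd.get_dummies(items, dtype=np.float64, sparse=True),
    # The two approaches combined: one-byte values stored sparsely.
    "sparse int8": pd.get_dummies(items, dtype=np.int8, sparse=True),
}

sizes = {name: total_bytes(frame) for name, frame in options.items()}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

On data like this, with many items and few 1s per column, the sparse variants should come out well ahead, but the ranking can differ on your own data, so measure before committing to one.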