I am working with a df and using numpy to transform data - including setting blanks (or '') to NaN. But when I write the df to csv - the output contains the string 'nan' as
In my situation, the culprit was np.where. When one of the two return values is a string and the other is numeric, NumPy coerces the whole result to a string dtype, so your np.nan is converted to the string 'nan'.
It's hard (for me) to see exactly what's going on under the hood, but I suspect the same thing happens with other NumPy functions that combine inputs of mixed types.
A minimal example:
import numpy as np
import pandas as pd
seq = [1, 2, 3, 4, np.nan]  # np.nan, since the np.NaN alias was removed in NumPy 2.0
same_type_seq = np.where("parrot"=="dead", 0, seq)
diff_type_seq = np.where("parrot"=="dead", "spam", seq)
pd.Series(seq).to_csv("vanilla_nan.csv", header=False) # as expected, last row is blank
pd.Series(same_type_seq).to_csv("samey_nan.csv", header=False) # also, blank
pd.Series(diff_type_seq).to_csv("nany_nan.csv", header=False) # nan instead of blank
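To see the coercion directly, you can inspect the dtype of each result. This is just a small sketch reusing the sequence from the example above; the names same_type and diff_type are mine, and the exact string width NumPy picks may vary.

```python
import numpy as np

seq = [1, 2, 3, 4, np.nan]

# Both branches are numeric, so the result stays a float array
same_type = np.where("parrot" == "dead", 0, seq)

# One branch is a string, so every element is converted to text
diff_type = np.where("parrot" == "dead", "spam", seq)

print(same_type.dtype)       # a float dtype
print(diff_type.dtype)       # a fixed-width unicode (string) dtype
print(repr(diff_type[-1]))   # the NaN is now the text 'nan'
```

Once the NaN has become text, pandas no longer treats it as missing, which is why to_csv writes it out literally.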
So how to get round this? I'm not too sure, but as a hacky workaround for small datasets, you can replace NaN in your original sequence with a token string before calling np.where, and then replace the token with np.nan afterwards:
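repl = "missing"  # placeholder token; must not collide with any real value
# Swap NaN for the token first so it survives the string coercion
hacky_seq = np.where("parrot"=="dead", "spam", [repl if np.isnan(x) else x for x in seq])
# Swap the token back to a real NaN so to_csv writes a blank
pd.Series(hacky_seq).replace({repl: np.nan}).to_csv("hacky_nan.csv", header=False)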
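Another option that might avoid the token round trip is to hand np.where an object-dtype array, so the values are not converted to text in the first place. This is only a sketch under the assumption that an object array is acceptable for your data (it is slower than a float array); obj_seq and kept are names I made up, and the CSV is written to an in-memory buffer just to keep the example self-contained.

```python
import io

import numpy as np
import pandas as pd

seq = [1, 2, 3, 4, np.nan]

# Assumption: object dtype is fine here; each element keeps its Python type
obj_seq = np.asarray(seq, dtype=object)

# With an object array as one branch, the result is object dtype too,
# so the NaN stays a real float NaN rather than the text 'nan'
kept = np.where("parrot" == "dead", "spam", obj_seq)

buf = io.StringIO()
pd.Series(kept).to_csv(buf, header=False)
csv_text = buf.getvalue()
print(csv_text)
```

If the data has already been coerced, replacing the string "nan" with np.nan on the Series before writing should also work, as long as "nan" can never be a legitimate value in that column.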