How can we replace missing values with 0.0 for a column in a DataFrame?
This is a shorter and more up-to-date answer, since Julia recently introduced the missing value.
using DataFrames
# Five random integers in 1:50 per column, with one missing value in column C
df = DataFrame(A=rand(1:50, 5), B=rand(1:50, 5), C=vcat(rand(1:50, 3), missing, rand(1:50)))
# replace! cannot be applied to a DataFrame directly, so convert to a Matrix first
df = DataFrame(replace!(Matrix(df), missing => 0), names(df))
There are a few different approaches to this problem (valid for Julia 1.x):
Probably the easiest approach is to use replace! or replace from base Julia. Here is an example with replace!:
julia> using DataFrames
julia> df = DataFrame(x = [1, missing, 3])
3×1 DataFrame
│ Row │ x       │
│     │ Int64⍰  │
├─────┼─────────┤
│ 1   │ 1       │
│ 2   │ missing │
│ 3   │ 3       │
julia> replace!(df.x, missing => 0);
julia> df
3×1 DataFrame
│ Row │ x      │
│     │ Int64⍰ │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ 0      │
│ 3   │ 3      │
However, note that at this point the type of column x still allows missing values:
julia> typeof(df.x)
Array{Union{Missing, Int64},1}
This is also indicated by the question mark following Int64 in column x when the data frame is printed out. You can change this by using disallowmissing! (from the DataFrames.jl package):
julia> disallowmissing!(df, :x)
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │
Alternatively, if you use replace (without the exclamation mark) as follows, then the output will already disallow missing values:
julia> df = DataFrame(x = [1, missing, 3]);
julia> df.x = replace(df.x, missing => 0);
julia> df
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │
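The difference between the two functions can also be checked on a plain vector, without DataFrames. The following is only a minimal sketch using Base Julia; the variable names v, w and u are made up for illustration:

```julia
# A plain vector standing in for a DataFrame column.
v = [1, missing, 3]                      # element type Union{Missing, Int}

# In-place replacement: the values change, but the container keeps
# its original element type, so missing is still allowed.
w = replace!(copy(v), missing => 0)
println(eltype(w))

# Out-of-place replacement: a new vector is allocated and its
# element type is narrowed because no missing values remain.
u = replace(v, missing => 0)
println(eltype(u))
```

This is why assigning the result of replace back to df.x gives a column that no longer allows missing, while replace! needs a follow-up disallowmissing!.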
You can use ismissing with logical indexing to assign a new value to all missing entries of an array:
julia> df = DataFrame(x = [1, missing, 3]);
julia> df.x[ismissing.(df.x)] .= 0;
julia> df
3×1 DataFrame
│ Row │ x      │
│     │ Int64⍰ │
├─────┼────────┤
│ 1   │ 1      │
│ 2   │ 0      │
│ 3   │ 3      │
Another approach is to use coalesce:
julia> df = DataFrame(x = [1, missing, 3]);
julia> df.x = coalesce.(df.x, 0);
julia> df
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │
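Since the question asks for 0.0 specifically, here is a small sketch of the same coalesce idea on a floating-point vector, using only Base Julia. The names col and filled are hypothetical and stand in for a Float64 column such as df.x:

```julia
# Stand-in for a floating-point column with one missing entry.
col = [1.5, missing, 3.0]                # element type Union{Missing, Float64}

# coalesce returns its first non-missing argument; the dot broadcasts
# it over every element, so each missing entry becomes 0.0.
filled = coalesce.(col, 0.0)

println(filled)
println(eltype(filled))
```

Because every resulting element is a Float64, the broadcast produces a vector whose element type no longer includes Missing.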
Both replace and coalesce can be used with the @transform macro from the DataFramesMeta.jl package:
julia> using DataFramesMeta
julia> df = DataFrame(x = [1, missing, 3]);
julia> @transform(df, x = replace(:x, missing => 0))
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │
julia> df = DataFrame(x = [1, missing, 3]);
julia> @transform(df, x = coalesce.(:x, 0))
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 0     │
│ 3   │ 3     │
The other answers are all pretty good. If you are a real speed junkie, the following might be for you (note that it relies on the older, NA-based versions of DataFrames):
# prepare example
using DataFrames
df = DataFrame(A = 1.0:10.0, B = 2.0:2.0:20.0)
df[ df[:A] %2 .== 0, :B ] = NA
df[:B].data[df[:B].na] = 0.0 # put the 0.0 into NAs
df[:B] = df[:B].data # with no NAs might as well use array
Create df with some NAs:
using DataFrames
df = DataFrame(A = 1.0:10.0, B = 2.0:2.0:20.0)
df[ df[:B] %2 .== 0, :A ] = NA
You'll see some NA in df. We now convert them to 0.0:
df[ isna(df[:A]), :A] = 0
EDIT: NaN → NA. Thanks @Reza