Question
I'm a beginner in R and I would like to know how to do the following task:
I want to replace the missing values in every column of my dataset with the median. However, for each column, the median should be computed within a certain category (given by another column). My dataset is as follows:
structure(list(Country = structure(1:5, .Label = c("Afghanistan",
"Albania", "Algeria", "Andorra", "Angola"), class = "factor"),
CountryID = 1:5, Continent = c(1L, 2L, 3L, 2L, 3L), Adolescent.fertility.rate.... = c(151L,
27L, 6L, NA, 146L), Adult.literacy.rate.... = c(28, 98.7,
69.9, NA, 67.4)), class = "data.frame", row.names = c(NA,
-5L))
So for each of the columns, I want to replace the missing values by the median of the values in the specific continent.
Answer 1:
We can use dplyr::mutate_at to replace NAs in each column (except Continent and the non-numeric column Country) with the median for its Continent group:
df <- structure(list(Country = structure(1:5, .Label = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"), class = "factor"),
CountryID = 1:5, Continent = c(1L, 2L, 3L, 2L, 3L),
Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)), class = "data.frame", row.names = c(NA, -5L))
library(dplyr)
df %>%
  group_by(Continent) %>%
  mutate_at(vars(-group_cols(), -Country), ~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>%
  ungroup()
Returns:
# A tibble: 5 x 5
  Country     CountryID Continent Adolescent.fertility.rate.... Adult.literacy.rate....
  <fct>           <int>     <int>                         <int>                   <dbl>
1 Afghanistan         1         1                           151                    28
2 Albania             2         2                            27                    98.7
3 Algeria             3         3                             6                    69.9
4 Andorra             4         2                            27                    98.7
5 Angola              5         3                           146                    67.4
Explanation:
First we group the data.frame df by Continent. Then we mutate all columns except the grouping column (and Country, which is not numeric) in the following way: if is.na is TRUE, we replace the value with the median, and since the data is grouped, this is the median for the Continent group (if the value is not NA, we keep it as it is). Finally we call ungroup for good measure to get back a 'normal' tibble.
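To make the grouping logic concrete, here is a minimal sketch of the same idea in base R using ave(), which applies a function within each group and returns the results in the original row order. This is not the method from the answer above, only an illustration of it. The shortened column names fertility and literacy and the helper fill_median are hypothetical stand-ins chosen for readability.

```r
# Toy copy of the question's data; `fertility` and `literacy` are
# hypothetical short names for the two rate columns.
df <- data.frame(
  Continent = c(1L, 2L, 3L, 2L, 3L),
  fertility = c(151L, 27L, 6L, NA, 146L),
  literacy  = c(28, 98.7, 69.9, NA, 67.4)
)

# Hypothetical helper: replace the NAs in one group's values
# with that group's median (computed without the NAs).
fill_median <- function(v) replace(v, is.na(v), median(v, na.rm = TRUE))

# Apply the helper per Continent to each value column.
value_cols <- c("fertility", "literacy")
df[value_cols] <- lapply(
  df[value_cols],
  function(col) ave(col, df$Continent, FUN = fill_median)
)

print(df)
```

If a group contains only NAs for some column, its median is itself NA, so those values stay missing.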
Answer 2:
Here is a solution using the library dplyr. I called your dataframe ww and renamed your column:
library('dplyr')
ww %>%
  rename(
    lit_rate = Adult.literacy.rate....
  ) %>%
  group_by(
    Continent
  ) %>%
  mutate(
    lit_rate = replace(
      lit_rate,
      is.na(lit_rate),
      median(lit_rate, na.rm = TRUE)
    )
  ) %>%
  ungroup()
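The pipeline above fills only one column. As a rough sketch of how the same replace() call could be applied to every numeric column at once, the version below uses across(), which supersedes mutate_at since dplyr 1.0.0. It assumes dplyr 1.0.0 or later is installed, and it rebuilds ww from the question's data so that it runs on its own. The result name filled is a hypothetical choice.

```r
library(dplyr)

# The question's data, stored under the name `ww` used in this answer.
ww <- data.frame(
  Country = factor(c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola")),
  CountryID = 1:5,
  Continent = c(1L, 2L, 3L, 2L, 3L),
  Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
  Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)
)

filled <- ww %>%
  group_by(Continent) %>%
  # across() skips the grouping column; where(is.numeric) skips Country.
  # CountryID is also selected, but it has no NAs and is left unchanged.
  mutate(across(
    where(is.numeric),
    ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))
  )) %>%
  ungroup()

print(filled)
```

Selecting with where(is.numeric) avoids listing columns by name, which is convenient when the real dataset has many more rate columns than this excerpt.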
Source: https://stackoverflow.com/questions/60564823/how-do-i-get-the-median-of-multiple-columns-in-r-with-conditions-according-to-a