How do I get the median of multiple columns in R with conditions (according to another column)

Submitted by 白昼怎懂夜的黑 on 2020-03-16 06:50:07

Question


I'm a beginner in R and I would like to know how to do the following task:

I want to replace the missing values in every column of my dataset with the median. However, for each column, the median should be computed within a category given by another column. My dataset is as follows:

structure(list(Country = structure(1:5, .Label = c("Afghanistan", 
"Albania", "Algeria", "Andorra", "Angola"), class = "factor"), 
    CountryID = 1:5, Continent = c(1L, 2L, 3L, 2L, 3L), Adolescent.fertility.rate.... = c(151L, 
    27L, 6L, NA, 146L), Adult.literacy.rate.... = c(28, 98.7, 
    69.9, NA, 67.4)), class = "data.frame", row.names = c(NA, 
-5L))

So for each column, I want to replace the missing values with the median of the values from the same continent.


Answer 1:


We can use dplyr::mutate_at to replace the NAs in each column (except the grouping column Continent and the non-numeric column Country) with the median of its Continent group:

df <- structure(list(Country = structure(1:5, .Label = c("Afghanistan",  "Albania", "Algeria", "Andorra", "Angola"), class = "factor"), 
               CountryID = 1:5, Continent = c(1L, 2L, 3L, 2L, 3L),
               Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
               Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)), class = "data.frame", row.names = c(NA, -5L))

library(dplyr)
df %>%
  group_by(Continent) %>% 
  mutate_at(vars(-group_cols(), -Country), ~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>% 
  ungroup()

Returns:

  # A tibble: 5 x 5
    Country     CountryID Continent Adolescent.fertility.rate.... Adult.literacy.rate....
    <fct>           <int>     <int>                         <int>                   <dbl>
  1 Afghanistan         1         1                           151                    28  
  2 Albania             2         2                            27                    98.7
  3 Algeria             3         3                             6                    69.9
  4 Andorra             4         2                            27                    98.7
  5 Angola              5         3                           146                    67.4

Explanation: First we group the data.frame df by Continent. Then we mutate every column except the grouping column (and Country, which is not numeric) as follows: where is.na is TRUE, we replace the value with the median, and because the data is grouped, that is the median for the Continent group; where the value is not NA, we keep it as is. Finally we call ungroup to get back a 'normal' tibble.
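
The same per-group replacement can also be sketched in base R with ave(), which applies a function within each level of a grouping vector. This is not part of the original answer: the helper name fill_group_median and the choice to leave CountryID untouched are assumptions made for illustration, and each column is converted to double so that a fractional median can be stored.

```r
# Sample data from the question, rebuilt with data.frame() for brevity.
df <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  CountryID = 1:5,
  Continent = c(1L, 2L, 3L, 2L, 3L),
  Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
  Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)
)

# Hypothetical helper: fill the NAs in x with the median of x within each group of g.
fill_group_median <- function(x, g) {
  x <- as.numeric(x)  # avoid integer/double mismatches when the median is fractional
  ave(x, g, FUN = function(v) replace(v, is.na(v), median(v, na.rm = TRUE)))
}

# Assumption: only the measurement columns are filled; identifiers are skipped.
value_cols <- setdiff(names(df), c("Country", "CountryID", "Continent"))

df_filled <- df
df_filled[value_cols] <- lapply(df[value_cols], fill_group_median, g = df$Continent)
df_filled
```

If every value in a group is missing, the group median is itself NA, so those entries stay NA in this sketch.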




Answer 2:


Here is a solution using the dplyr package. I called your data frame ww and renamed your column. Note that this example fills only the literacy rate column:

library('dplyr')
ww %>% 
  rename(
    lit_rate = Adult.literacy.rate....
  ) %>% 
  group_by(
    Continent
  ) %>% 
  mutate(
    lit_rate = replace(
      lit_rate,
      is.na(lit_rate),
      median(lit_rate, na.rm = TRUE)
    )
  ) %>% 
  ungroup()
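
To apply the same replace() call to several columns at once, one option is across(), assuming dplyr 1.0.0 or later is installed. The following is a sketch rather than part of the original answer: the selection where(is.numeric) & !CountryID and the as.numeric() conversion are assumptions made for illustration, and the grouping column is skipped by across() automatically.

```r
library(dplyr)

# Sample data from the question, named ww as in the answer above.
ww <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola"),
  CountryID = 1:5,
  Continent = c(1L, 2L, 3L, 2L, 3L),
  Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
  Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)
)

ww_filled <- ww %>%
  group_by(Continent) %>%
  mutate(across(
    # every numeric column except the identifier (assumption: CountryID is left alone)
    where(is.numeric) & !CountryID,
    # within each Continent group, overwrite NAs with that group's median
    ~ replace(as.numeric(.x), is.na(.x), median(.x, na.rm = TRUE))
  )) %>%
  ungroup()

ww_filled
```

Converting to double first keeps the column type consistent across groups when a group median is fractional.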


Source: https://stackoverflow.com/questions/60564823/how-do-i-get-the-median-of-multiple-columns-in-r-with-conditions-according-to-a
